From jm at jmason.org  Sat Mar  1 13:45:52 2003
From: jm at jmason.org (Justin Mason)
Date: Sat Mar  1 08:46:05 2003
Subject: [Spambayes] Graph results 
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	<20030301051552.4DB592DE8C@cashew.wolfskeep.com> 
Message-ID: <20030301134557.15F1216F16@jmason.org>


Alexander -- nice work!  Thanks for investigating this...

> 2. Spambayes continues to improve for a couple months,
>    but I'm starting to see an increase in errors after
>    about 4-5 months.  I don't know why this is; it might
>    be because spam is mutating, or it might be because
>    my definition of spam has been mutating.

Spam has definitely been mutating heavily in the last 4 months.

> Anyway, the next thing for me to really look at is the effect
> of aging...

As in expiration of tokens?  I thought SB didn't use that?
Or do you mean validity of trained results from >3 months ago...

--j.

From mhammond at skippinet.com.au  Sun Mar  2 02:29:26 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat Mar  1 10:30:28 2003
Subject: [Spambayes] "delete as spam" gives error in Outlook XP
In-Reply-To: <20030301065159.47330.qmail@web41305.mail.yahoo.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEHDODAA.mhammond@skippinet.com.au>

This is a known bug - please define a folder for filtering your spam mail
to - even if filtering is not enabled.

Mark.

> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of Chris Lopes
> Sent: Saturday, 1 March 2003 5:52 PM
> To: spambayes@python.org
> Subject: [Spambayes] "delete as spam" gives error in Outlook XP
>
>
> Hello,
>
> I am running Outlook 2002 SP-2 on Windows XP Pro SP1.
> I have spambayes 1.0a2 installed, along with python.org's python
> 2.2.2 with win32all-150
> installed.
> In order to install the add-in for outlook, I just ran addin.py
> from spambayes' outlook2000
> directory. The plugin installed fine, and I was able to train
> spambayes on a set of both spam and
> non-spam emails just fine.
>
> However, "Delete As Spam" does not work. It gives the following
> error visible from
> PythonWin's Trace Collector Debugging Tool when I click "Delete As Spam":
> pythoncom error: Python error invoking COM method.
> Traceback (most recent call last):
>   File "D:\Python22\lib\site-packages\win32com\server\policy.py",
> line 275, in _Invoke_
>     return self._invoke_(dispid, lcid, wFlags, args)
>   File "D:\Python22\lib\site-packages\win32com\server\policy.py",
> line 280, in _invoke_
>     return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args,
> None, None)
>   File "D:\Python22\lib\site-packages\win32com\server\policy.py",
> line 510, in _invokeex_
>     return apply(func, args)
>   File "D:\spambayes-1.0a2\Outlook2000\addin.py", line 305, in OnClick
>     spam_folder = msgstore.GetFolder(spam_folder_id)
>   File "D:\spambayes-1.0a2\Outlook2000\msgstore.py", line 223, in
> GetFolder
>     folder_id = self.NormalizeID(folder_id)
>   File "D:\spambayes-1.0a2\Outlook2000\msgstore.py", line 185, in
> NormalizeID
>     assert type(item_id) in [type(''), type(u'')], "What kind of
> ID is '%r'?" % (item_id,)
> exceptions.AssertionError: What kind of ID is 'None'?
>
> Please help


From popiel at wolfskeep.com  Sat Mar  1 07:47:19 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar  1 10:47:25 2003
Subject: [Spambayes] Graph results 
In-Reply-To: Message from jm@jmason.org (Justin Mason) 
   of "Sat, 01 Mar 2003 13:45:52 GMT." <20030301134557.15F1216F16@jmason.org> 
References: <20030301134557.15F1216F16@jmason.org> 
Message-ID: <20030301154719.EDE3A2DEB4@cashew.wolfskeep.com>

In message:  <20030301134557.15F1216F16@jmason.org>
             jm@jmason.org (Justin Mason) writes:
>
>Alexander -- nice work!  Thanks for investigating this...

Heh.  It's just a way to use up even more CPU-hours, in the same
spirit as was prevalent last October... ;-)

>> 2. Spambayes continues to improve for a couple months,
>>    but I'm starting to see an increase in errors after
>>    about 4-5 months.  I don't know why this is; it might
>>    be because spam is mutating, or it might be because
>>    my definition of spam has been mutating.
>
>Spam has definitely been mutating heavily in the last 4 months.

Oh, definitely.  However, since the test runs were training
throughout the data period, one would hope that they'd have
picked up on the mutations without a loss of accuracy.  (Of
course, some of the mutations have been to include features
that SB doesn't recognize at all (s p a c e d  o u t  w o r d s),
which could well be the source of the trouble.)

I'm just worried that having too much information about past
forms of spam may be interfering with recognition of current
spam (through the auspices of spam probability deflation due
to the probabilities being based on fraction of known spams
containing any feature... so as more spams are known with
differing features, the probability for any given feature
decreases).  Hence my interest in aging.
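
The deflation effect described above can be illustrated with a small sketch. It assumes a simplified per-token estimate in the style SB uses (ratio of the per-class fractions, pulled toward a prior); the function name, the prior constants s and x, and all the counts are made up for illustration and are not taken from the test runs.

```python
def token_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Estimate P(spam | token) from per-class message counts.

    s (prior strength) and x (prior probability) are assumed values.
    The token must have been seen at least once in training.
    """
    spamratio = spamcount / nspam   # fraction of trained spam containing the token
    hamratio = hamcount / nham      # fraction of trained ham containing the token
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    # Pull the raw estimate toward the prior x with strength s.
    return (s * x + n * prob) / (s + n)

# Same token counts, but the second case has 10x as many known spams,
# most of which do not contain this token.
recent = token_spamprob(spamcount=20, hamcount=2, nspam=100, nham=1000)
diluted = token_spamprob(spamcount=20, hamcount=2, nspam=1000, nham=1000)
print(f"{recent:.3f} -> {diluted:.3f}")
```

With these invented numbers the token drops from roughly 0.98 to roughly 0.90 purely because the pool of known spam grew, which is the kind of dilution that aging might counteract.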

>> Anyway, the next thing for me to really look at is the effect
>> of aging...
>
>As in expiration of tokens?  I thought SB didn't use that?
>Or do you mean validity of trained results from >3 months ago...

Standard SB doesn't, you're right.  On the other hand, my personal
installation (not what I ran tests with!) expires messages after
120 days.  I'm curious to see if this is actually the boon I
suspect it is.
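
A minimal sketch of what expiring messages after 120 days could look like, assuming training is kept as plain per-class word counts. The record layout, the function name, and the sample messages are hypothetical; this is not how any actual installation stores its data.

```python
from collections import Counter
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=120)  # the expiry window mentioned above

def expire_old(trained, counts, now):
    """Untrain every message older than MAX_AGE and return the ones kept.

    trained: list of (received, words, is_spam) tuples (hypothetical layout)
    counts:  {True: Counter(), False: Counter()} word counts per class
    """
    kept = []
    for received, words, is_spam in trained:
        if now - received > MAX_AGE:
            counts[is_spam].subtract(words)   # remove its contribution
        else:
            kept.append((received, words, is_spam))
    return kept

counts = {True: Counter(), False: Counter()}
trained = [
    (datetime(2002, 10, 1), ["free", "offer"], True),   # older than 120 days
    (datetime(2003, 2, 20), ["free", "pills"], True),   # recent
]
for _, words, is_spam in trained:
    counts[is_spam].update(words)

trained = expire_old(trained, counts, now=datetime(2003, 3, 1))
```

After the call only the February message remains trained, so "offer" no longer contributes to the spam counts.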

- Alex

From popiel at wolfskeep.com  Sat Mar  1 09:12:46 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar  1 12:12:49 2003
Subject: [Spambayes] Graphs on my website
Message-ID: <20030301171246.64E992DEB4@cashew.wolfskeep.com>

Those who want to see my pretty graphs without waiting
for the moderator approval of my .png-laden posting
can go to http://www.wolfskeep.com/~popiel/spambayes/incremental
to see all the pretty pictures (along with a bunch of the
raw and semi-cooked data files).

- Alex

From skip at pobox.com  Sat Mar  1 12:05:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar  1 13:05:13 2003
Subject: [Spambayes] Graphs on my website
In-Reply-To: <20030301171246.64E992DEB4@cashew.wolfskeep.com>
References: <20030301171246.64E992DEB4@cashew.wolfskeep.com>
Message-ID: <15968.63061.877833.567556@montanaro.dyndns.org>

Alex,

After reading your note and looking at the graphs on your website I have a
couple questions:

  1. For the dense among us can you define "perfect" and "corrected"
     training? 

  2. Did you adjust your spam/ham cutoffs from the default?

  3. Do you have any measure of how the unsure stuff broke down between ham
     and spam?

Thx,

Skip


From popiel at wolfskeep.com  Sat Mar  1 10:23:43 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar  1 13:23:47 2003
Subject: [Spambayes] Graphs on my website 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15968.63061.877833.567556@montanaro.dyndns.org> 
References: <20030301171246.64E992DEB4@cashew.wolfskeep.com>
	<15968.63061.877833.567556@montanaro.dyndns.org> 
Message-ID: <20030301182343.C01F22DEB4@cashew.wolfskeep.com>

In message:  <15968.63061.877833.567556@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>After reading your note and looking at the graphs on your website I have a
>couple questions:
>
>  1. For the dense among us can you define "perfect" and "corrected"
>     training? 

Perfect trains immediately after scoring with the _actual_
classification.

Corrected trains immediately after scoring with the _guessed_
classification, then fixes everything to _actual_ at the end of
the day.  (This was interesting to me because it somewhat closely
models my actual usage, given my nightly retrains.)
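
The two regimens can be sketched roughly as follows. ToyClassifier is a hypothetical stand-in that just counts words per class (it is not the SB classifier), and the sample day of mail is invented; only the order of scoring, training, and end-of-day fixing is meant to match the definitions above.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical stand-in for a trainable filter: counts words per class."""
    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def train(self, words, is_spam):
        self.counts[is_spam].update(words)

    def untrain(self, words, is_spam):
        self.counts[is_spam].subtract(words)

    def guess(self, words):
        spam = sum(self.counts[True][w] for w in words)
        ham = sum(self.counts[False][w] for w in words)
        return spam > ham   # True means "spam"

def perfect_day(clf, day):
    """Score each message, then train at once with the actual class."""
    errors = 0
    for words, actual in day:
        errors += (clf.guess(words) != actual)
        clf.train(words, actual)
    return errors

def corrected_day(clf, day):
    """Score, train with the guessed class, and fix mistakes at end of day."""
    errors = 0
    seen = []
    for words, actual in day:
        guessed = clf.guess(words)
        errors += (guessed != actual)
        clf.train(words, guessed)
        seen.append((words, guessed, actual))
    for words, guessed, actual in seen:     # the nightly retrain
        if guessed != actual:
            clf.untrain(words, guessed)
            clf.train(words, actual)
    return errors

day = [
    (["viagra", "cheap"], True),
    (["meeting", "notes"], False),
    (["cheap", "viagra", "now"], True),
]
perfect_clf, corrected_clf = ToyClassifier(), ToyClassifier()
perfect_errors = perfect_day(perfect_clf, day)
corrected_errors = corrected_day(corrected_clf, day)
```

In this toy run the corrected regimen makes one more mistake during the day (the first misfiled spam is temporarily trained as ham), but both classifiers hold the same spam counts once the end-of-day fix has run.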

>  2. Did you adjust your spam/ham cutoffs from the default?

No.

>  3. Do you have any measure of how the unsure stuff broke down between ham
>     and spam?

In the raw output, yes, though I didn't graph it.  Some rough
cumulative averages:

perfect:    42 ham unsure and  290 spam unsure
corrected:  55 ham unsure and  330 spam unsure
fpfnunsure: 80 ham unsure and 1200 spam unsure

- Alex

From bill at parducci.net  Sat Mar  1 11:51:15 2003
From: bill at parducci.net (bill parducci)
Date: Sat Mar  1 14:51:19 2003
Subject: [Spambayes] train on demand
Message-ID: <3E610F33.8090507@parducci.net>

not wanting to leave mail laying around for a day whilst i wait for the daily mboxtrain.py cron job to fire off, i came up with the following scheme for being able to initiate retraining via e-mail:

1. modification for .procmailrc, inserting this above the recipe that initiates hammiefilter.py:


:0
* ^Subject:.*mboxtrain.[MyKeyCode]
{
:0
* ^From.*[MyEmailAddress]
|${HOME}/retrain.sh
}


2. spiff up the shell script (retrain.sh) that calls mboxtrain.py to send back a note telling me that the retraining is done and to output the information to a log file that can be read later (i would have included it in the note, but the way that mboxtrain.py outputs the message counts makes for a very unwieldy message).

#!/bin/sh

mailhome="${HOME}/mail"
user=`basename ${HOME}`
inbox="/var/spool/mail/$user"
xhost=`hostname`
xdomain=`dnsdomainname`

/opt/spambayes/mboxtrain.py -d /home/$user/.hammiedb -s $mailhome/spam -g $inbox -g $mailhome/foo -g $mailhome/bar -g $mailhome/blah -g $mailhome/oink >${HOME}/retrain.out

/usr/sbin/sendmail -f devnull@$xhost.$xdomain $user <<EOF
Subject: Mailbox Training Acknowledgment

Mailbox retraining for $user has completed.

EOF


NOTE: the extra ham folders are those that are auto-filed by procmail (mailing lists that i lurk on) that are processed after hammiefilter.py is invoked.


so far, i am happy with the results (other than not being able to capture usable output from hammiefilter.py in the ack e-mail). when i sit down at my machine i quickly take stock of any misplaced spam/ham, rectify the situation and fire off my training note. a minute or so later i get the ack and i then read/delete/file my mail to my heart's content.

anyway, i figured i would throw this out there in case anyone else wanted to give it a shot.

b

p.s. as a side note, i think that being able to designate a config file from the command line would be helpful in running hammiefilter.py.


From noreply at sourceforge.net  Sat Mar  1 08:48:57 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sat Mar  1 15:03:37 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-695632 ] MySQL Digest Causes Spambayes to Crash
Message-ID: <E18pAAT-0005DM-00@sc8-sf-web2.sourceforge.net>

Bugs item #695632, was opened at 2003-03-01 16:48
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Richard Scott (rich1)
Assigned to: Nobody/Anonymous (nobody)
Summary: MySQL Digest Causes Spambayes to Crash

Initial Comment:
The main mysql e-mail list (digest version) and the
mysql bugs e-mail list (digest version) always cause
Spambayes to crash.  It appears that the error occurs
in Generator.py.
Here is the output:

Training ham (/home/richard/Mail/inbox):
  Reading as MH mailbox
  /home/richard/Mail/inbox/2        
  /home/richard/Mail/inbox/5        
  /home/richard/Mail/inbox/6        
  /home/richard/Mail/inbox/724        
  /home/richard/Mail/inbox/29        
  /home/richard/Mail/inbox/751        
Traceback (most recent call last):
  File "/home/richard/spambayes/mboxtrain.py", line
278, in ?
    main()
  File "/home/richard/spambayes/mboxtrain.py", line
265, in main
    train(h, g, False, force)
  File "/home/richard/spambayes/mboxtrain.py", line
207, in train
    mhdir_train(h, path, is_spam, force)
  File "/home/richard/spambayes/mboxtrain.py", line
190, in mhdir_train
    f.write(msg.as_string())
  File
"/usr/lib/python2.2/site-packages/email/Message.py",
line 107, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 100, in flatten
    self._write(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 128, in _write
    self._dispatch(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 154, in _dispatch
    meth(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 243, in _handle_multipart
    g.flatten(part, unixfrom=False)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 100, in flatten
    self._write(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 128, in _write
    self._dispatch(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 154, in _dispatch
    meth(msg)
  File
"/usr/lib/python2.2/site-packages/email/Generator.py",
line 212, in _handle_text
    raise TypeError, 'string payload expected: %s' %
type(payload)
TypeError: string payload expected: <type 'list'>


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

From tim_one at email.msn.com  Sat Mar  1 15:23:42 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sat Mar  1 15:24:36 2003
Subject: [Spambayes] Graphs on my website
In-Reply-To: <20030301171246.64E992DEB4@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECEDPAB.tim_one@email.msn.com>

[T. Alexander Popiel]
> Those who want to see my pretty graphs without waiting
> for the moderator approval of my .png-laden posting

I approved it around midnight, so anyone who hasn't gotten it yet probably
isn't going to.  It was held for approval merely due to sheer size.  After
approving it, it bounced back from a number of mailing-list recipients
because braindead "virus detection" gimmicks thought it was a virus.  A
typical bounce report complained that you were trying to hide the real
nature of the attachments by giving them two extensions
("whatever.mtv.png").  Software <wink>.


From klassa at nc.rr.com  Sun Mar  2 09:42:45 2003
From: klassa at nc.rr.com (klassa@nc.rr.com)
Date: Sun Mar  2 09:42:35 2003
Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup
Message-ID: <9504.1046616165@qwop.com>


I'm visiting my folks, and am stuck with dialup.  Oddly, the Outlook
plugin doesn't seem to want to filter.  Before I left, while on broadband,
life was good.  Here, every piece of spam gets through untouched.

Did the Outlook plugin notice the crappy connection speed :-) and punt?

Confused,
John

From tim.one at comcast.net  Sun Mar  2 12:40:10 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  2 12:40:40 2003
Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup
In-Reply-To: <9504.1046616165@qwop.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEPJDPAB.tim.one@comcast.net>

[klassa@nc.rr.com]
> I'm visiting my folks, and am stuck with dialup.  Oddly, the Outlook
> plugin doesn't seem to want to filter.  Before I left, while on broadband,
> life was good.  Here, every piece of spam gets through untouched.
>
> Did the Outlook plugin notice the crappy connection speed :-) and punt?

It shouldn't matter, and I routinely run Outlook + spambayes via cable modem
and via dialup on the same machine without trouble.  IOW, I bet that when
you get back to a broadband connection, it still won't work, that somehow
it's turned itself off, or can't get started.  Open PythonWin and then

    Tools -> Trace Collector Debugging Tool

before starting Outlook and see if any interesting msgs appear in
PythonWin's Python Trace Collector window.  No msgs at all count as
"interesting" too <wink>.


From klassa at nc.rr.com  Sun Mar  2 20:26:13 2003
From: klassa at nc.rr.com (klassa@nc.rr.com)
Date: Sun Mar  2 20:26:00 2003
Subject: [Spambayes] Outlook plugin doesn't want to filter while on dialup
	
In-Reply-To: Your message of "Sun, 02 Mar 2003 12:40:10 EST."
             <LNBBLJKPBEHFEDALKOLCGEPJDPAB.tim.one@comcast.net> 
Message-ID: <10310.1046654773@qwop.com>


>>>>> On Sun, 2 Mar 2003, "Tim" == Tim Peters wrote:

  Tim> It shouldn't matter, and I routinely run Outlook + spambayes via
  Tim> cable modem and via dialup on the same machine without trouble.
  Tim> IOW, I bet that when you get back to a broadband connection, it
  Tim> still won't work, that somehow it's turned itself off, or can't get
  Tim> started.  Open PythonWin and then

  Tim> Tools -> Trace Collector Debugging Tool

  Tim> before starting Outlook and see if any interesting msgs appear in
  Tim> PythonWin's Python Trace Collector window.  No msgs at all count as
  Tim> "interesting" too <wink>.

Output enclosed, below.  This was with no mail to process, of course,
but everything looks fine.

What I'm noticing (now that I'm back at home, in the land of broadband...
as God intended it to be :-)) is that I'm suddenly getting more false
negatives.  That is, SB *does* appear to be filtering, but more spam is
getting through than got through just a couple of days ago.  Significantly
more.  I can't imagine that spam changed that much just overnight. :-)

I'll keep an eye on this...

Weird.

Thanks for the reply!

John

Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Loaded bayes database from 'd:\Program Files\SpamBayes\Outlook2000\default_bayes_database.pck'
Loaded message database from 'd:\Program Files\SpamBayes\Outlook2000\default_message_database.pck'
Bayes database initialized with 259 spam and 598 good messages
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Processing 0 missed spam in folder 'Inbox' took 0.600076ms

From Paul.Moore at atosorigin.com  Mon Mar  3 09:01:07 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Mon Mar  3 04:02:31 2003
Subject: [Spambayes] This message crashed Spambayes...
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com>

The attached message caused an error when being processed by the Outlook plugin, which stopped it processing the rest of my inbox. Unfortunately, I've no idea if attaching an email from Outlook will result in something readable from any other mailer (at least, in terms of diagnosing an issue like this!) If it doesn't, let me know what to do to diagnose the problem...

Paul.

PS I didn't raise a SF bug report for now, as when I saved the message as text from Outlook, it *definitely* lost any useful header info :-(

Traceback info:

Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 88, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 68, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 388, in score
    result = self.bayes.spamprob(bayes_tokenize(email), evidence)
  File "C:\Applications\Spambayes\spambayes\classifier.py", line 217, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "C:\Applications\Spambayes\spambayes\classifier.py", line 436, in _getclues
    for word in Set(wordstream):
  File "C:\Applications\Spambayes\spambayes\compatsets.py", line 374, in __init__
    self._update(iterable)
  File "C:\Applications\Spambayes\spambayes\compatsets.py", line 333, in _update 
    for element in it:
  File "C:\Applications\Spambayes\spambayes\tokenizer.py", line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File "C:\Applications\Spambayes\spambayes\tokenizer.py", line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File "C:\Python22\Lib\email\Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "C:\Python22\Lib\email\base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
Error: Incorrect padding

 <<	(??) ???? ??1? ? ??? ??? ???>> 
-------------- next part --------------
An embedded message was scrubbed...
From: <dnfld4krle4@lycos.co.kr>
Subject: (??) ???? ??1? ? ??? ??? ???
Date: Sun, 2 Mar 2003 13:54:19 -0000
Size: 1689
Url: http://mail.python.org/pipermail/spambayes/attachments/20030303/52d154bd/attachment.eml
From mhammond at skippinet.com.au  Mon Mar  3 21:13:43 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 05:14:20 2003
Subject: [Spambayes] editing the project HTML
Message-ID: <LCEPIIGDJPKCOIHOBJEPEENDODAA.mhammond@skippinet.com.au>

The docs at http://spambayes.sourceforge.net/applications.html need a minor
edit.  I am an administrator of the group at sourceforge, but I can't work
out how to edit this page.  All clues gratefully accepted :)

Mark.


From sjoerd at acm.org  Mon Mar  3 11:27:45 2003
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Mon Mar  3 05:27:49 2003
Subject: [Spambayes] This message crashed Spambayes...
In-Reply-To: 
	<16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com> 
References: 
	<16E1010E4581B049ABC51D4975CEDB880113D946@UKDCX001.uk.int.atosorigin.com> 
Message-ID: <20030303102745.3F9F474EB0@indus.ins.cwi.nl>

On Mon, Mar 3 2003 "Moore, Paul" wrote:

> The attached message caused an error when being processed by the Outlook
> plugin, which stopped it processing the rest of my inbox. Unfortunately,
> I've no idea if attaching an email from Outlook will result in something
> readable from any other mailer (at least, in terms of diagnosing an
> issue like this!) If it doesn't, let me know what to do to diagnose the
> problem...
> 
> Paul.
> 
> PS I didn't raise a SF bug report for now, as when I saved the message
> as text from Outlook, it *definitely* lost any useful header info :-(

I did file an SF bug report after I got a similar crash for a message
that I received and after I investigated where it went wrong.
See bug #696458.

-- Sjoerd Mullender <sjoerd@acm.org>

From Paul.Moore at atosorigin.com  Mon Mar  3 10:38:04 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Mon Mar  3 05:39:37 2003
Subject: [Spambayes] This message crashed Spambayes...
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D94A@UKDCX001.uk.int.atosorigin.com>

From: Sjoerd Mullender [mailto:sjoerd@acm.org]
> I did file an SF bug report after I got a similar crash for
> a message that I received and after I investigated where it
> went wrong. See bug #696458.

Ah. Thanks - this looks like it's the same issue as I saw.

Paul.

From anthony at interlink.com.au  Mon Mar  3 23:05:05 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Mar  3 07:05:13 2003
Subject: [Spambayes] editing the project HTML 
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEENDODAA.mhammond@skippinet.com.au> 
Message-ID: <200303031205.h23C56211301@localhost.localdomain>


>>> "Mark Hammond" wrote
> The docs at http://spambayes.sourceforge.net/applications.html need a minor
> edit.  I am an administrator of the group at sourceforge, but I can't work
> out how to edit this page.  All clues gratefully accepted :)

check out the "website" repository.


From wsy at merl.com  Mon Mar  3 07:55:40 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Mon Mar  3 07:55:48 2003
Subject: [Spambayes] editing the project HTML
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEENDODAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPEENDODAA.mhammond@skippinet.com.au>
Message-ID: <200303031255.h23CteX10043@localhost.localdomain>


   From: "Mark Hammond" <mhammond@skippinet.com.au>

   The docs at http://spambayes.sourceforge.net/applications.html need a minor
   edit.  I am an administrator of the group at sourceforge, but I can't work
   out how to edit this page.  All clues gratefully accepted :)

The way I do it on CRM114 is to log directly into sourceforge via
ssh and use an editor on the offending HTML.

In your case, log in like:

   ssh -l hammond spambayes.sourceforge.net

and then cd over to the spambayes HTML directory:

   cd /home/groups/s/sp/spambayes/htdocs

and then invoke the editor of your choice.

	 -Bill Yerazunis

From noreply at sourceforge.net  Mon Mar  3 02:12:50 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 10:22:06 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in
	subject
Message-ID: <E18pmwE-0000Nr-00@sc8-sf-web1.sourceforge.net>

Bugs item #696458, was opened at 2003-03-03 11:12
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sjoerd Mullender (sjoerd)
Assigned to: Nobody/Anonymous (nobody)
Summary: crash in tokenizer due to bad base64 in subject

Initial Comment:
I got a crash in the tokenizer in the line where it does

        x = msg.get('subject', '')
        for x, subjcharset in
email.Header.decode_header(x):

The reason is, the subject of this particular message is

Subject: *****SPAM*****
=?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?=

which gives a binascii.Error: Incorrect padding from
binascii.a2b_base64.

I am running an up-to-date spambayes and python (i.e.
both fresh from CVS).

Here is a (partial) stack trace:

  File
"/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py",
line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File
"/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py",
line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File
"/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py",
line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File
"/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py",
line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding
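
The failure reported above can be reproduced outside the tokenizer with the payload from the Subject header, which is 66 characters long and so is missing two '=' padding characters. The sketch below is only one possible way to tolerate such input, not the fix applied to spambayes or to the email package; lenient_b64decode is a hypothetical helper name.

```python
import binascii

# Encoded-word payload copied from the Subject header in this report.
PAYLOAD = "CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ"

def lenient_b64decode(encoded):
    """Decode base64 text, adding any missing '=' padding first.

    Returns None if the text still cannot be decoded.
    """
    missing = -len(encoded) % 4
    try:
        return binascii.a2b_base64(encoded + "=" * missing)
    except binascii.Error:
        return None

# Decoding the payload as-is fails the same way as in the traceback.
try:
    binascii.a2b_base64(PAYLOAD)
    strict_ok = True
except binascii.Error:
    strict_ok = False

raw = lenient_b64decode(PAYLOAD)
# The header declares EUC-KR; replace anything that does not decode cleanly.
text = raw.decode("euc-kr", errors="replace")
```

A caller that gets None back could fall back to tokenizing the undecoded header text rather than aborting the whole run.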


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 02:39:39 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 10:22:07 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696476 ] Manual filtering in outlook fails
Message-ID: <E18pnMB-000836-00@sc8-sf-web2.sourceforge.net>

Bugs item #696476, was opened at 2003-03-03 11:39
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in outlook fails

Initial Comment:
When I try to run "filter now" from the outlook plugin - I 
get the following trace:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\dialogs\AsyncDialog.py", line 98, in 
OnStart
    self.StartProcess()
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\dialogs\FilterDialog.py", line 365, in 
StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\manager.py", line 156, in 
EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)
AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._AppointmentItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0

OS: windows XP home
Spambayes version: 1.0a2
outlook version: 2000 sp3

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 08:30:26 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 11:23:56 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696671 ] server error attempting to review
Message-ID: <E18pspe-0002cC-00@sc8-sf-web1.sourceforge.net>

Bugs item #696671, was opened at 2003-03-03 16:30
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jeremy Hylton (jhylton)
Assigned to: Nobody/Anonymous (nobody)
Summary: server error attempting to review

Initial Comment:

500 Server error

Traceback (most recent call last):

  File
"/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py",
line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/local/bin/pop3proxy.py", line 930, in onReview
    messageInfo = self._makeMessageInfo(message)

  File "/usr/local/bin/pop3proxy.py", line 825, in
_makeMessageInfo
    messageInfo.bodySummary = self._trimHeader(text, 200)

  File "/usr/local/bin/pop3proxy.py", line 623, in
_trimHeader
    sections = email.Header.decode_header(field)

  File "/usr/local/lib/python2.3/email/Header.py", line
92, in decode_header
    dec = email.base64MIME.decode(encoded)

  File "/usr/local/lib/python2.3/email/base64MIME.py",
line 179, in decode
    dec = a2b_base64(s)

Error: Incorrect padding


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 08:41:04 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 11:50:17 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696671 ] server error attempting to review
Message-ID: <E18pszw-0007WJ-00@sc8-sf-web3.sourceforge.net>

Bugs item #696671, was opened at 2003-03-03 17:30
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
>Resolution: Duplicate
Priority: 5
Submitted By: Jeremy Hylton (jhylton)
Assigned to: Nobody/Anonymous (nobody)
Summary: server error attempting to review

Initial Comment:

500 Server error

Traceback (most recent call last):

  File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/local/bin/pop3proxy.py", line 930, in onReview
    messageInfo = self._makeMessageInfo(message)

  File "/usr/local/bin/pop3proxy.py", line 825, in _makeMessageInfo
    messageInfo.bodySummary = self._trimHeader(text, 200)

  File "/usr/local/bin/pop3proxy.py", line 623, in _trimHeader
    sections = email.Header.decode_header(field)

  File "/usr/local/lib/python2.3/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)

  File "/usr/local/lib/python2.3/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)

Error: Incorrect padding


----------------------------------------------------------------------

>Comment By: Sjoerd Mullender (sjoerd)
Date: 2003-03-03 17:41

Message:
Logged In: YES 
user_id=43607

Closing as duplicate of bug #696458.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696671&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 08:44:15 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 11:50:18 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in
	subject
Message-ID: <E18pt31-0007io-00@sc8-sf-web3.sourceforge.net>

Bugs item #696458, was opened at 2003-03-03 11:12
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sjoerd Mullender (sjoerd)
Assigned to: Nobody/Anonymous (nobody)
Summary: crash in tokenizer due to bad base64 in subject

Initial Comment:
I got a crash in the tokenizer in the line where it does

        x = msg.get('subject', '')
        for x, subjcharset in email.Header.decode_header(x):

The reason is, the subject of this particular message is

Subject: *****SPAM*****
=?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?=

which gives a binascii.Error: Incorrect padding from
binascii.a2b_base64.
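
To make the failure concrete, here is a minimal sketch in present-day
Python 3 (the report itself was against Python 2.3 from CVS).  It feeds the
base64 payload from the Subject header above straight to
binascii.a2b_base64, then retries after padding it to a multiple of four
characters.  The helper name decode_b64_lenient is hypothetical; it is not
part of spambayes or the email package, and whether padding is the right
recovery for a spam header is an assumption.

```python
import binascii

# Base64 payload copied from the Subject header quoted above.
PAYLOAD = ("CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8"
           "+LCjwLsgv7W/+Mj3IQ")


def decode_b64_lenient(payload):
    """Pad payload with '=' up to a multiple of 4 characters, then decode."""
    missing = -len(payload) % 4
    return binascii.a2b_base64(payload + "=" * missing)


# Strict decoding of the unpadded payload fails, as in the report.
try:
    binascii.a2b_base64(PAYLOAD)
    strict_error = None
except binascii.Error as exc:
    strict_error = exc

# With the padding restored, the same payload decodes to raw bytes.
raw = decode_b64_lenient(PAYLOAD)
```

The payload length is not a multiple of four, which is what a2b_base64
complains about; the spammer's mailer simply dropped the trailing "=".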

I am running an up-to-date spambayes and python (i.e.
both fresh from CVS).

Here is a (partial) stack trace:

  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding


----------------------------------------------------------------------

>Comment By: Sjoerd Mullender (sjoerd)
Date: 2003-03-03 17:44

Message:
Logged In: YES 
user_id=43607

It seems to me that all calls to email.Header.decode_header
should be protected with try/except, or decode_header itself
should protect itself with a try/except.  A third
possibility is to add an extra indirection through a
function that does basically:

def decode_header(x):
    try:
        return email.Header.decode_header(x)
    except binascii.Error:
        # keep the list-of-(string, charset) shape that callers unpack
        return [(x, None)]
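
A slightly fuller sketch of that wrapper idea, written against the modern
Python 3 email.header module rather than the email.Header of the time.  The
name safe_decode_header and the decoder parameter are hypothetical (the
parameter only exists so the fallback can be exercised); the point is that
the fallback keeps the list-of-(text, charset) shape, so a caller's
"for x, subjcharset in ..." loop still works on an undecodable header.

```python
import binascii
from email.errors import HeaderParseError
from email.header import decode_header


def safe_decode_header(value, decoder=decode_header):
    """Decode an RFC 2047 header, falling back to the raw text on bad input.

    Always returns a list of (text, charset) pairs.
    """
    try:
        return decoder(value)
    except (binascii.Error, HeaderParseError):
        # Undecodable header: hand back the raw value as a single section.
        return [(value, None)]


def _broken_decoder(value):
    """Stand-in for a decoder that chokes on malformed base64."""
    raise binascii.Error("Incorrect padding")


plain = safe_decode_header("Following up")
fallback = safe_decode_header("=?EUC-KR?B?bad?=", decoder=_broken_decoder)
```

Catching only the decoding errors, rather than using a bare except, keeps
unrelated bugs visible.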


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 09:30:00 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 12:22:57 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in
	subject
Message-ID: <E18ptlI-0000ep-00@sc8-sf-web4.sourceforge.net>

Bugs item #696458, was opened at 2003-03-03 04:12
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sjoerd Mullender (sjoerd)
Assigned to: Nobody/Anonymous (nobody)
Summary: crash in tokenizer due to bad base64 in subject

Initial Comment:
I got a crash in the tokenizer in the line where it does

        x = msg.get('subject', '')
        for x, subjcharset in email.Header.decode_header(x):

The reason is, the subject of this particular message is

Subject: *****SPAM*****
=?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?=

which gives a binascii.Error: Incorrect padding from
binascii.a2b_base64.

I am running an up-to-date spambayes and python (i.e.
both fresh from CVS).

Here is a (partial) stack trace:

  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File "/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py", line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding


----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2003-03-03 11:30

Message:
Logged In: YES 
user_id=44345

Casual observation for anyone reporting spambayes bugs which involve
the email package - You should also check/report such errors on the 
http://mimelib.sourceforge.net/ project, which is where the email
gurus hang out.

----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2003-03-03 10:44

Message:
Logged In: YES 
user_id=43607

It seems to me that all calls to email.Header.decode_header
should be protected with try/except, or decode_header itself
should protect itself with a try/except.  A third
possibility is to add an extra indirection through a
function that does basically:

def decode_header(x):
    try:
        return email.Header.decode_header(x)
    except binascii.Error:
        # keep the list-of-(string, charset) shape that callers unpack
        return [(x, None)]


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From piersh at friskit.com  Mon Mar  3 09:41:41 2003
From: piersh at friskit.com (Piers Haken)
Date: Mon Mar  3 12:40:38 2003
Subject: [Spambayes] Error during outlook plugin startup
Message-ID: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com>

I just updated from CVS and I'm now getting the following error on
startup. Can anyone tell me what's up?

Piers.

Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\universal.py", line 150, in dispatch
    retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
    return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 615, in OnConnection
    self.manager = manager.GetManager(application)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 472, in GetManager
    _mgr = BayesManager(outlook=outlook, verbose=verbose)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 142, in __init__
    self.MigrateDataDirectory()
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 200, in MigrateDataDirectory
    self._MigrateFile("default_bayes_database.pck")
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in _MigrateFile
    shutil.move(src, dest)
exceptions.AttributeError: 'module' object has no attribute 'move'

From mike at plokta.com  Mon Mar  3 20:38:52 2003
From: mike at plokta.com (Mike Scott)
Date: Mon Mar  3 15:38:55 2003
Subject: [Spambayes] Server error when training in POP3proxy
Message-ID: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>

After using it successfully for a couple of weeks, POP3proxy is 
throwing the following error when I try to review emails for training 
in the web browser interface. The rest of the web browser interface, 
and POP3proxy, seems to be working OK. Does anyone who knows more than 
me about POP3proxy have any ideas for how to diagnose or fix it? I've 
just pulled the most recent update from CVS, which hasn't helped. I'm 
on Mac OS X 10.2.4 running Python 2.2.2, in case it's relevant.


500 Server error
Traceback (most recent call last):

   File "spambayes/Dibbler.py", line 398, in found_terminator
     getattr(plugin, name)(**params)

   File "pop3proxy.py", line 1003, in onReview
     messageInfo = self._makeMessageInfo(message)

   File "pop3proxy.py", line 856, in _makeMessageInfo
     messageInfo.bodySummary = self._trimHeader(text, 200)

   File "pop3proxy.py", line 648, in _trimHeader
     sections = email.Header.decode_header(field)

   File "/sw/src/root-python22-2.2.2-2/sw/lib/python2.2/email/Header.py", line 92, in decode_header

   File "/sw/src/root-python22-2.2.2-2/sw/lib/python2.2/email/base64MIME.py", line 179, in decode

Error: Incorrect padding


-- 
Mike Scott
mike@plokta.com


From piersh at friskit.com  Mon Mar  3 13:01:55 2003
From: piersh at friskit.com (Piers Haken)
Date: Mon Mar  3 16:00:50 2003
Subject: [Spambayes] Error during outlook plugin startup
Message-ID: <9891913C5BFE87429D71E37F08210CB92C7505@zeus.sfhq.friskit.com>

Okay, I worked around this problem by deleting my pickles and starting
from scratch (it didn't need to do the migration) but I believe this is
still a problem. I'm using python2.2.2 and win32all-152.

Piers.

> -----Original Message-----
> From: Piers Haken 
> Sent: Monday, March 03, 2003 9:42 AM
> To: Spambayes
> Subject: [Spambayes] Error during outlook plugin startup
> 
> 
> I just updated from CVS and I'm now getting the following 
> error on startup. Can anyone tell me what's up?
> 
> Piers.
> 
> Outlook Spam Addin module loading
> SpamAddin - Connecting to Outlook
> Traceback (most recent call last):
>   File "C:\Python22\lib\site-packages\win32com\universal.py", line 150, in dispatch
>     retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
>     return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
>     return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
>     return apply(func, args)
>   File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 615, in OnConnection
>     self.manager = manager.GetManager(application)
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 472, in GetManager
>     _mgr = BayesManager(outlook=outlook, verbose=verbose)
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 142, in __init__
>     self.MigrateDataDirectory()
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 200, in MigrateDataDirectory
>     self._MigrateFile("default_bayes_database.pck")
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in _MigrateFile
>     shutil.move(src, dest)
> exceptions.AttributeError: 'module' object has no attribute 'move'
> 

From mhammond at skippinet.com.au  Tue Mar  4 09:46:39 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 17:47:19 2003
Subject: [Spambayes] Error during outlook plugin startup
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEPEODAA.mhammond@skippinet.com.au>

>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in _MigrateFile
>     shutil.move(src, dest)
> exceptions.AttributeError: 'module' object has no attribute 'move'

Damn - it seems Python 2.2 doesn't have shutil.move.  I will replace it with
win32api.MoveFileEx().
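
For anyone needing something that runs on more than Windows, a portable
fallback could look like the sketch below.  This is not the fix that went
into CVS (that uses win32api); move_file is a hypothetical helper that uses
shutil.move where the running Python provides it (2.3 and later) and
otherwise falls back to os.rename, then to copy-and-delete.

```python
import os
import shutil


def move_file(src, dest):
    """Move src to dest, even on a Python whose shutil lacks move()."""
    mover = getattr(shutil, "move", None)
    if mover is not None:
        mover(src, dest)
        return
    try:
        # Cheap path: works when src and dest are on the same filesystem.
        os.rename(src, dest)
    except OSError:
        # Different filesystems (or similar): copy the file, then delete it.
        shutil.copy2(src, dest)
        os.remove(src)
```

Probing with getattr avoids hard-coding a version check, so the same code
keeps working when the interpreter is upgraded.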

Mark.


From mhammond at skippinet.com.au  Tue Mar  4 10:22:21 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 18:23:24 2003
Subject: [Spambayes] Error during outlook plugin startup
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7504@zeus.sfhq.friskit.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEPGODAA.mhammond@skippinet.com.au>

I have checked in a fix for this.

Mark.

> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of Piers Haken
> Sent: Tuesday, 4 March 2003 4:42 AM
> To: Spambayes
> Subject: [Spambayes] Error during outlook plugin startup
> 
> 
> I just updated from CVS and I'm now getting the following error on
> startup. Can anyone tell me what's up?
> 
> Piers.
> 
> Outlook Spam Addin module loading
> SpamAddin - Connecting to Outlook
> Traceback (most recent call last):
>   File "C:\Python22\lib\site-packages\win32com\universal.py", line 150, in dispatch
>     retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
>     return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
>     return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
>   File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
>     return apply(func, args)
>   File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 615, in OnConnection
>     self.manager = manager.GetManager(application)
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 472, in GetManager
>     _mgr = BayesManager(outlook=outlook, verbose=verbose)
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 142, in __init__
>     self.MigrateDataDirectory()
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 200, in MigrateDataDirectory
>     self._MigrateFile("default_bayes_database.pck")
>   File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 211, in _MigrateFile
>     shutil.move(src, dest)
> exceptions.AttributeError: 'module' object has no attribute 'move'
> 

From mhammond at skippinet.com.au  Tue Mar  4 10:25:26 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 18:33:43 2003
Subject: [Spambayes] Missing HTML payload
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEPHODAA.mhammond@skippinet.com.au>

The following mail got past SpamBayes.  Looking at the clues, it appears
that spambayes was missing the HTML body of the message (which *does* render
almost correctly in Outlook).

I instrumented the "show clues" feature to show *all* message tokens found
in the body.  As you can see at the very end, the entire body was stripped.

I am guessing that we barf on:
            <td><!--#rotato>
a comment which is never closed.  Outlook actually shows this entire tag
(ie, literally "<!--#rotato>"), then displays the rest of the HTML
correctly - ie, I guess that we treat the comment as unclosed, while Outlook
ignores it.

Any thoughts?  (Please ignore the clues themselves - this is the subject of
a following mail)

Mark.


Spam Score: 0.305365


word                                spamprob         #ham  #spam
'*H*'                               0.698329            -      -
'*S*'                               0.309059            -      -
'subject:133'                       0.0505618           4      0
'x-mailer:microsoft outlook express 5.50.4522.1200' 0.118157          262     79
'subject:: '                        0.143355          836    316
'from:addr:domresgube'              0.155172            1      0
'message-id:@atbsfwo.wvk'           0.155172            1      0
'subject:2120uBwJ9'                 0.155172            1      0
'subject:Following'                 0.155172            1      0
'url:vivapharmacy1'                 0.155172            1      0
'header:Message-ID:1'               0.337892         1501   1732
'url:twelveover'                    0.342142            1      1
'url:home'                          0.605737            6     21
'skip:5 20'                         0.67407             4     19
'url:unsubscribe'                   0.693218            6     31
'header:Received:9'                 0.730547           50    307
'from:no real name:2**0'            0.740359          202   1303
'url:1'                             0.772286           20    154
'from:addr:yahoo.com'               0.836924           33    384
'url:images'                        0.901105           15    311
'url:jpg'                           0.9546              8    385

Message Stream:


Return-Path: <domresgube@yahoo.com>
Received: from mta06ps.bigpond.com ([192.168.115.5]) by
          mailms7a.email.bigpond.com (Netscape Messaging Server 4.15
          mailms7a Apr 29 2002 13:22:02) with ESMTP id HB4SEB02.X2J for
          <mhammond@bigpond.net.au>; Mon, 3 Mar 2003 02:56:35 +1000
Received: from dampier.southern.net.au ([144.135.25.87]) by
          mta06ps.bigpond.com (Netscape Messaging Server 4.15 mta06ps May
          23 2002 23:53:28) with SMTP id HB4SEA00.FMS for
          <mhammond@bigpond.net.au>; Mon, 3 Mar 2003 02:56:34 +1000
Received: from dampier.southern.net.au ([202.182.64.135]) by
	psmam07.bigpond.com(MailRouter V3.2g 119/7122531); 03 Mar 2003 02:56:33
Received: from localhost (localhost [127.0.0.1])
	by dampier.southern.net.au (Postfix) with SMTP id CD92A64280
	for <mhammond@bigpond.net.au>; Mon,  3 Mar 2003 03:56:32 +1100
Received: by dampier.southern.net.au (Postfix, from userid 0)
	id AB9BE5D4FE; Mon,  3 Mar 2003 03:56:32 +1100
Received: from localhost (localhost [127.0.0.1])
	by dampier.southern.net.au (Postfix) with SMTP id 58AC56427F
	for <mhammond@skippinet.com.au>; Mon,  3 Mar 2003 03:56:32 +1100
Received: from eyre.southern.net.au (ip10-0-0-11.unroutable [10.0.0.11])
	by dampier.southern.net.au (Postfix) with ESMTP id 245445D4D4
	for <mhammond@skippinet.com.au>; Mon,  3 Mar 2003 03:56:32 +1100
Received: from localhost (localhost [127.0.0.1])
	by eyre.southern.net.au (Postfix) with SMTP id B39B982C
	for <mhammond@skippinet.com.au>; Mon,  3 Mar 2003 03:56:37 +1100
Received: from yahoo.com (0x50c61e76.hrnxx3.adsl-dhcp.tele.dk
[80.198.30.118])
	by eyre.southern.net.au (Postfix) with SMTP id 94DB5A412B
	for <mhammond@skippinet.com.au>; Mon,  3 Mar 2003 03:56:26 +1100
Message-ID: <000310b0ba33$ddd62445$00384561@atbsfwo.wvk>
From: <domresgube@yahoo.com>
To: <mhammond@skippinet.com.au>
Subject: RE: Following up2120uBwJ9-133-12
Date: Mon, 03 Mar 2003 18:42:37 -1000
MIME-Version: 1.0
Content-Type: multipart/mixed;
	boundary="----=_NextPart_000_00D2_32D55E0B.E7046D05"
X-Priority: 3
X-Mailer: Microsoft Outlook Express 5.50.4522.1200
Importance: Normal

<HTML>5cFGY1ws1bMNedOg853Pv
<div align="center">
  <table width="700" border="0" cellspacing="5" cellpadding="0">
    <tr>
      <td>
        <table width="650" border="0" cellspacing="0" cellpadding="0"
align="center">
          <tr>
            <td><!--#rotato>
              <div align="center"><font face=Arial size=6><font face=Arial
size=6><font size="3" face="Arial, Helvetica, sans-serif"><b>Did
                you know you can&nbsp;get prescription medications
prescribed
                o<font color="#000000">nline?</font></b></font></font><font
color="#ff0000" face="Arial, Helvetica, sans-serif"><br>
                No Prior Prescription Required!!<br>
                <font color=#0033CC size=4>AND GET YOUR ORDER THE NEXT
DAY!!!</font></font></font>
              </div>
              <p align=center><font face="Arial, Helvetica, sans-serif"> <a
href="http://www.twelveover.com/vivapharmacy1/home.asp"><b><font
size="3">Click
                Here Now</font></b></a><font size="3"> for more information
about
                getting your new or existing <br>
                medications prescribed online and shipped to your door
overnight!</font></font></p>
              <p align=center><font face="Arial, Helvetica,
sans-serif"><b><font size="3">TRY
                US NOW</font></b><font size="3"> and <b>JOIN THE
THOUSANDS</b>
                of satisfied Internet Pharmacy customers! </font></font></p>
            </td>
          </tr>
        </table>
        <br>
        <div align="center"><a
href="http://www.twelveover.com/vivapharmacy1/home.asp"><img
src="http://www.twelveover.com/images/1.jpg" width="641" height="239"
border="0"></a>
          <br>
          <br>
          <font face="Arial, Helvetica, sans-serif"><a
href="http://www.twelveover.com/vivapharmacy1/home.asp"><b><font
size="3">Click
          Here Now</font></b></a><font size="3"> for more
information</font></font>
          <br>
          <br>
        </div>
        <div align="center">
<table width="650" border="0" cellspacing="0" cellpadding="0">
            <tr>
              <td>
                <div align="left"><font face="Arial" size="2">One of our US
licensed
                  physicians will write you an FDA approved prescription for
free,
                  and have your order shipped overnight via a US Licensed
pharmacy
                  direct to your doorstep, fast and secure!</font> </div>
              </td>
            </tr>
          </table>
          <br>
          <table width="648" border="0" cellspacing="0" cellpadding="0">
            <tr>
              <td>
                <p>
                <p><font face="Arial, Helvetica, sans-serif" size="2">If you
wish
                  to decline to receive these offers, <a
href="http://www.twelveover.com/vivapharmacy1/unsubscribe.asp">Please
                  Click Here</a></font></p>
              </td>
            </tr>
          </table>
        </div>
      </td>
    </tr>
  </table>
  <br>
  <br>
  <br>
</div>
</BODY>
</HTML>

Rokgtbmmpuwfoitsoahkmtgpocojkuxflsixwwyuaqpkclwiphmmnygit
7692RkAi0-575sNFa4909QyBK4-

Message Tokens:

23 unique tokens

header:Importance:1
subject:Following
from:addr:yahoo.com
message-id:@atbsfwo.wvk
header:From:1
from:addr:domresgube
header:MIME-Version:1
x-mailer:microsoft outlook express 5.50.4522.1200
header:Subject:1
to:2**0
header:Received:9
subject:133
subject:2120uBwJ9
subject:
subject::
header:To:1
subject:-
from:no real name:2**0
content-type:multipart/mixed
header:Return-Path:1
header:Date:1
header:Message-ID:1
subject:
From mhammond at skippinet.com.au  Tue Mar  4 10:38:13 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 18:38:46 2003
Subject: [Spambayes] Confused by tokens
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEPJODAA.mhammond@skippinet.com.au>

*sigh* - I am wallowing in confusion today - reeling from bug-to-bug trying
to keep my eye on the ball as I go.

While looking into the previous "missing HTML payload" problem, I discovered
two issues:

1) Outlook's incremental training is *definitely* broken.  Unfortunately,
not in an obvious way.  It is possible to get hapaxes showing up in the
wrong category, or showing up multiple times.  Eg, I have confirmed that:

'url:vivapharmacy1'                 0.155172            1      0

is a hapax unique to this spam.  However, I have seen this occasionally with
a "2" in the ham column, a "1" in each of "ham" and "spam", and as above "1"
in ham even though the most recent operation was a "train as spam".  Simple
tests show that it works OK, so there is something subtle going on.  I'm
trying to track this down.

2) The point of this mail - I am confused by our tokens.  Again, if we look
at the clues for this message, we see:
'url:vivapharmacy1'                 0.155172            1      0

But the 'all tokens' list consists of:
"""
23 unique tokens

header:Importance:1
subject:Following
from:addr:yahoo.com
message-id:@atbsfwo.wvk
header:From:1
from:addr:domresgube
header:MIME-Version:1
x-mailer:microsoft outlook express 5.50.4522.1200
header:Subject:1
to:2**0
header:Received:9
subject:133
subject:2120uBwJ9
subject:
subject::
header:To:1
subject:-
from:no real name:2**0
content-type:multipart/mixed
header:Return-Path:1
header:Date:1
header:Message-ID:1
subject:
"""

ie, that token is not listed (and strangely 'subject:' is listed twice).
The code to dump the tokens is:

        from spambayes.tokenizer import tokenize
        from spambayes.classifier import Set # whatever classifier uses
        push("<h2>Message Tokens:</h2><br>")
        toks = Set(tokenize(msg))
        push("%d unique tokens<br>" % (len(toks),))
        push("<PRE>")
        for token in toks:
            push(escape(token) + "\n")
        push("</PRE>")

'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
package object.

I am confused where our tokens came from, and why no 'url:' tokens appear in
the list of all tokens, even though they do appear in the clues list.

One-of-those-days ly,

Mark.


From mhammond at skippinet.com.au  Tue Mar  4 11:47:28 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 19:52:00 2003
Subject: [Spambayes] Confused by tokens
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEEPJODAA.mhammond@skippinet.com.au>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIEAAOEAA.mhammond@skippinet.com.au>

> 2) The point of this mail - I am confused by our tokens.  Again,
> if we look
> at the clues for this message, we see:
> 'url:vivapharmacy1'                 0.155172            1      0

Sorry - false alarm - operator error!  The url tokens appear when we don't
perform the funky, Outlook specific MIME-munging that we do.  This may or
may not be related to the incremental training problem - I'll let you know
<wink>.

Mark.


From tim.one at comcast.net  Mon Mar  3 20:11:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar  3 20:11:53 2003
Subject: [Spambayes] Confused by tokens
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEEPJODAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECIEAAB.tim.one@comcast.net>

[Mark Hammond]
> *sigh* - I am wallowing in confusion today - reeling from
> bug-to-bug trying to keep my eye on the ball as I go.
>
> While looking into the previous "missing HTML payload" problem, I
> discovered two issues:
>
> 1) Outlook's incremental training is *definitely* broken.  Unfortunately,
> not in an obvious way.  It is possible to get hapaxes showing up in the
> wrong category, or showing up multiple times.  Eg, I have confirmed that:
>
> 'url:vivapharmacy1'                 0.155172            1      0
>
> is a hapax unique to this spam.  However, I have seen this
> occasionally with a "2" in the ham column, a "1" in each of "ham" and
> "spam", and as above "1" in ham even though the most recent operation
> was a "train as spam".  Simple tests show that it works OK, so there is
> something subtle going on.  I'm trying to track this down.

I haven't seen this, but I haven't updated my spambayes directory in at
least a month (ain't broke, why fix <wink>).

> 2) The point of this mail - I am confused by our tokens.  Again,
> if we look at the clues for this message, we see:
> 'url:vivapharmacy1'                 0.155172            1      0

That clue must have come from the body of the msg.  I note that *all* the
tokens you show next came from the headers:

> But the 'all tokens' list consists of:
> """
> 23 unique tokens
>
> header:Importance:1
> subject:Following
> from:addr:yahoo.com
> message-id:@atbsfwo.wvk
> header:From:1
> from:addr:domresgube
> header:MIME-Version:1
> x-mailer:microsoft outlook express 5.50.4522.1200
> header:Subject:1
> to:2**0
> header:Received:9
> subject:133
> subject:2120uBwJ9
> subject:
> subject::
> header:To:1
> subject:-
> from:no real name:2**0
> content-type:multipart/mixed
> header:Return-Path:1
> header:Date:1
> header:Message-ID:1
> subject:
> """
>
> ie, that token is not listed

There are no body tokens here at all.  I don't expect that to be obvious to
anyone, I just happen to know that all those prefix tags ("header:",
"subject:", etc) come from tokenize_headers() (as opposed to
tokenize_body(), from which "url:"-tagged tokens come).

> (and strangely 'subject:' is listed twice).

Probably not <wink>.  Tokenization of a Subject header is unique in one
respect:

            for w in punctuation_run_re.findall(x):
                yield 'subject:' + w

where

     punctuation_run_re = re.compile(r'\W+')

IOW, runs of (among other things) consecutive whitespace characters count as
tokens in a subject line, but they don't anywhere else.  This made a small
but real improvement in tests at the time, likely because of spam subject
lines of the form

Subject: Get Big Now!                              random_gibberish_here

You probably can't see the difference between:

    subject:

and

    subject:

but they're distinct tokens (the first is a single blank, the second a run
of 30 blanks).
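
A small self-contained sketch of the effect, reusing the regexp quoted
above.  The function name subject_punctuation_tokens and the sample subject
are made up for illustration; the real tokenizer emits other subject tokens
as well.  Printing tokens with repr() is one way to make the
whitespace-only ones visible in a dump.

```python
import re

# Same pattern as quoted above: any run of non-word characters.
punctuation_run_re = re.compile(r'\W+')


def subject_punctuation_tokens(subject):
    """Yield one 'subject:'-tagged token per run of non-word characters."""
    for w in punctuation_run_re.findall(subject):
        yield 'subject:' + w


# Hypothetical spam-style subject: words, a long run of blanks, gibberish.
sample = "Get Big Now" + " " * 30 + "xyz"
tokens = list(subject_punctuation_tokens(sample))
unique = set(tokens)

for tok in sorted(unique):
    # repr() shows the difference between one blank and thirty.
    print(repr(tok))
```

Here the two single blanks collapse into one token in the set, and the run
of 30 blanks stays a separate token.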

> The code to dump the tokens is:
>
>         from spambayes.tokenizer import tokenize
>         from spambayes.classifier import Set # whatever classifier uses
>         push("<h2>Message Tokens:</h2><br>")
>         toks = Set(tokenize(msg))
>         push("%d unique tokens<br>" % (len(toks),))

You could write that

          push("%d unique tokens<br>" % len(toks))

>         push("<PRE>")
>         for token in toks:
>             push(escape(token) + "\n")
>         push("</PRE>")
>
> 'push' is list.append, 'escape' is cgi.escape, and 'msg' is an 'email'
> package object.
>
> I am confused where our tokens came from, and why no 'url:'
> tokens appear in the list of all tokens, even though they do appear
> in the clues list.

I can only guess that msg only contained headers in this case, or that
damaged MIME structure in the body caused the email pkg to give up in a way
the tokenizer didn't recover from.  But then I wonder how we *ever* got a
url: token out of the body.

> One-of-those-days ly,

Indeed it is <wink>.


From mhammond at skippinet.com.au  Tue Mar  4 12:12:35 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 20:13:38 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEPHODAA.mhammond@skippinet.com.au>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEACOEAA.mhammond@skippinet.com.au>

I wrote:

> I instrumented the "show clues" feature to show *all* message tokens found
> in the body.  As you can see at the very end, the entire body was
> stripped.

I finally worked out where my missing "url:" tokens got to.  However, once
that is corrected, the same problem remains - no tokens extracted from the
HTML body, *except* URL tokens, appear.

> I am guessing that we barf on:
>             <td><!--#rotato>
> a comment which is never closed.  Outlook actually shows this entire tag

Digging deeper, this seems to be true.

>>> from spambayes import tokenizer
>>> tokenizer.crack_html_comment("hi <!-- wow --> there")
('hi  there', [])
>>> tokenizer.crack_html_comment("hi <!-- wow> there")
('hi ', [])

Mark.


From mhammond at skippinet.com.au  Tue Mar  4 12:49:40 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 20:50:45 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEACOEAA.mhammond@skippinet.com.au>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEAGOEAA.mhammond@skippinet.com.au>

I'm getting there :)

I've added a bug, with a patch in:
[ 696995 ] Invalid HTML comments are not ignored

Mark.


From tim.one at comcast.net  Mon Mar  3 20:50:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar  3 20:51:30 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEPHODAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCKECMEAAB.tim.one@comcast.net>

[Mark Hammond]
> The following mail got past SpamBayes.  Looking at the clues, it appears
> that spambayes was missing the HTML body of the message (which
> *does* render almost correctly in Outlook).
>
> I instrumented the "show clues" feature to show *all* message tokens
> found in the body.  As you can see at the very end, the entire body was
> stripped.
>
> I am guessing that we barf on:
>             <td><!--#rotato>
> a comment which is never closed.

That would do it!  tokenizer.py's Stripper class eliminates (via subclasses)
various kinds of bracketed structures, and HTML comments are among them.  I
see that the analyze() method will just ignore any text at and after the
last open-bracket match without a matching end-bracket construct.  This was
neither intentional nor unintentional <wink>.  It seems like it would be
better to replace:

            m = self.find_end(text, end)
            if not m:
                break

with:

            m = self.find_end(text, end)
            if not m:
                pushretained(text[start :])  # add this line
                break

Then the unmatched open-bracket construct, and everything following it, will
be retained.  This will apply to unclosed HTML comments, unclosed style
sheets, unclosed uuencoded sections, and unclosed embedded URLs.  I think
I'm fine with retaining all of those.
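
A minimal sketch of that change, written as a standalone function rather
than the real Stripper class.  The name strip_bracketed and the
keep_unmatched flag are made up for illustration; only the idea of retaining
the tail after an unmatched open-bracket comes from the one-liner above.

```python
import re

COMMENT_START = re.compile(r"<!--")
COMMENT_END = re.compile(r"-->")


def strip_bracketed(text,
                    find_start=COMMENT_START.search,
                    find_end=COMMENT_END.search,
                    keep_unmatched=True):
    """Remove start...end constructs from text (illustrative only)."""
    retained = []
    i = 0
    while True:
        m = find_start(text, i)
        if not m:
            # No more open-brackets: keep the rest of the text.
            retained.append(text[i:])
            break
        start, end = m.span()
        retained.append(text[i:start])
        m = find_end(text, end)
        if not m:
            if keep_unmatched:
                # The proposed fix: act as if the open-bracket had not
                # been recognized, and keep it plus everything after it.
                retained.append(text[start:])
            break
        i = m.end()
    return "".join(retained)
```

With keep_unmatched=False the function mimics the old behaviour and drops
everything from the unmatched "<!--" onward.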

> Outlook actually shows this entire tag (ie, literally "<!--#rotato>",
> then displays the rest of the HTML correctly - ie, I guess that we treat
> the comment as unclosed, while Outlook ignores it.

Sounds right.

> Any thoughts?

Nope, not a one <wink>.


From tim.one at comcast.net  Mon Mar  3 20:54:43 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar  3 20:55:14 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEACOEAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECNEAAB.tim.one@comcast.net>

[Mark Hammond]
> I finally worked out where my missing "url:" tokens got to.  However,
> once that is corrected, the same problem remains - no tokens extracted
> from the HTML body, *except* URL tokens, appear.

Sorry this was so painful!  URL extraction occurs after the body has been
lower-cased, and after uuencoded-section removal, but before anything else
is done with the body.  In particular, URL extraction is done before style
sheet and comment removal.   That's why you saw url: tokens despite that the
comment construct was unclosed and comment-removal nuked the body.  The
one-liner change in my last email should repair the problem.
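
To make the ordering concrete, here is a rough sketch.  It is not the real
tokenizer: the regexps, the function names and the whole-URL "url:" tokens
are simplifications assumed for illustration.  URL extraction sees the whole
lower-cased body, and only afterwards does the old comment removal cut the
text off at the unclosed comment.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\"']+")
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)


def old_comment_removal(text):
    """Old behaviour: drop closed comments, then drop everything
    from an unclosed "<!--" to the end of the text."""
    text = COMMENT_RE.sub("", text)
    cut = text.find("<!--")
    return text if cut < 0 else text[:cut]


def tokenize_body(body):
    text = body.lower()
    # Step 1: URLs are pulled out first, so they survive whatever
    # happens to the body later.
    tokens = ["url:" + u for u in URL_RE.findall(text)]
    text = URL_RE.sub(" ", text)
    # Step 2: comment removal; with an unclosed comment this nukes
    # the remainder of the body.
    text = old_comment_removal(text)
    tokens.extend(text.split())
    return tokens
```

Fed a body like 'Buy <!--#rotato> now http://Example.com/x today', this
yields the url: token and "buy", but nothing that follows the comment.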


From noreply at sourceforge.net  Mon Mar  3 17:55:43 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 21:00:08 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696995 ] Invalid HTML comments are not ignored
Message-ID: <E18q1eh-0005yN-00@sc8-sf-web4.sourceforge.net>

Bugs item #696995, was opened at 2003-03-04 12:55
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Nobody/Anonymous (nobody)
Summary: Invalid HTML comments are not ignored

Initial Comment:
Incorrectly terminated HTML comments are ignored by
SpamBayes, but most clients handle this gracefully.

For both of the following:
hi <!-- comment > there
hi <!-- comment > more <!-- comment 2 --> there

IE and Mozilla both render "hi there".  SpamBayes will
miss the "there".  Thus, spambayes can miss most of the
message payload even though the user sees it all.

Attaching a patch which creates a new option,
ignore_unterminated_html_comments: True, which
correctly handles this case.  If set to False, you get
the old behaviour.  If no one can see a reason to keep
the existing behaviour, then this can be dropped as an
option.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

From noreply at sourceforge.net  Mon Mar  3 18:09:17 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 21:00:10 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696995 ] Invalid HTML comments are not ignored
Message-ID: <E18q1rp-0005jX-00@sc8-sf-web2.sourceforge.net>

Bugs item #696995, was opened at 2003-03-03 20:55
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Nobody/Anonymous (nobody)
Summary: Invalid HTML comments are not ignored

Initial Comment:
Incorrectly terminated HTML comments are ignored by
SpamBayes, but most clients handle this gracefully.

For both of the following:
hi <!-- comment > there
hi <!-- comment > more <!-- comment 2 --> there

IE and Mozilla both render "hi there".  SpamBayes will
miss the "there".  Thus, spambayes can miss most of the
message payload even though the user sees it all.

Attaching a patch which creates a new option,
ignore_unterminated_html_comments: True, which
correctly handles this case.  If set to False, you get
the old behaviour.  If no one can see a reason to keep
the existing behaviour, then this can be dropped as an
option.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2003-03-03 21:09

Message:
Logged In: YES 
user_id=31435

I suggest the one-line change to analyze() I posted to the 
mailing list instead -- there's no real value I can see in the 
current behavior of throwing away everything after an 
unmatched open-block construct, and it wasn't intentional 
behavior.  If an open-block construct isn't matched by a 
close-block construct, all in all it's more reasonable to act 
as if the open-block construct hadn't been recognized as 
one at all.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

From mhammond at skippinet.com.au  Tue Mar  4 13:02:16 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar  3 21:02:48 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKECMEAAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEAIOEAA.mhammond@skippinet.com.au>

Thanks for the replies!

> > Outlook actually shows this entire tag (ie, literally "<!--#rotato>",
> > then displays the rest of the HTML correctly - ie, I guess that we treat
> > the comment as unclosed, while Outlook ignores it.
>
> Sounds right.

Interestingly, Outlook shows the text, but IE and Mozilla do not.  All 3
show the text *after* the unmatched comment, but only Outlook shows the
comment itself.  I don't want to think about the implications of that
<wink>.

I made an alternative patch in that bug I pointed to, which completely
strips the invalid comment.  From purely an Outlook POV, your patch is
probably better (as your patch better reflects what we see), but from the
"correctness" POV, maybe mine is (as it better reflects what most HTML
clients see)

It does seem that no option is required whatever way we go.

I-dont-care-either ly,

Mark


From noreply at sourceforge.net  Mon Mar  3 18:17:25 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar  3 21:21:45 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696995 ] Invalid HTML comments are not ignored
Message-ID: <E18q1zh-00062y-00@sc8-sf-web2.sourceforge.net>

Bugs item #696995, was opened at 2003-03-04 12:55
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Nobody/Anonymous (nobody)
Summary: Invalid HTML comments are not ignored

Initial Comment:
Incorrectly terminated HTML comments are ignored by
SpamBayes, but most clients handle this gracefully.

For both of the following:
hi <!-- comment > there
hi <!-- comment > more <!-- comment 2 --> there

IE and Mozilla both render "hi there".  SpamBayes will
miss the "there".  Thus, spambayes can miss most of the
message payload even though the user sees it all.

Attaching a patch which creates a new option,
ignore_unterminated_html_comments: True, which
correctly handles this case.  If set to False, you get
the old behaviour.  If no one can see a reason to keep
the existing behaviour, then this can be dropped as an
option.

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 13:17

Message:
Logged In: YES 
user_id=14198

Tim's fix (plus a couple of comments) checked in.

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-03-04 13:09

Message:
Logged In: YES 
user_id=31435

I suggest the one-line change to analyze() I posted to the 
mailing list instead -- there's no real value I can see in the 
current behavior of throwing away everything after an 
unmatched open-block construct, and it wasn't intentional 
behavior.  If an open-block construct isn't matched by a 
close-block construct, all in all it's more reasonable to act 
as if the open-block construct hadn't been recognized as 
one at all.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696995&group_id=61702

From tim.one at comcast.net  Mon Mar  3 21:21:17 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar  3 21:21:51 2003
Subject: [Spambayes] Missing HTML payload
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPKEAIOEAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEDAEAAB.tim.one@comcast.net>

[Mark Hammond]
> Thanks for the replies!

I guess you didn't get my bill <wink>.

> Interestingly, Outlook shows the text, but IE and Mozilla do not.  All 3
> show the text *after* the unmatched comment, but only Outlook shows the
> comment itself.  I don't want to think about the implications of that
> <wink>.
>
> I made an alternative patch in that bug I pointed to, which completely
> strips the invalid comment.  From purely an Outlook POV, your patch is
> probably better (as your patch better reflects what we see), but from the
> "correctness" POV, maybe mine is (as it better reflects what most HTML
> clients see)

My belief is that non-spam HTML mail moves in the direction of using HTML
correctly, so that damaged HTML is itself a spam indicator.  Unlike Paul
Graham <wink>, I have sisters, and they love sending HTML mail.  It's fun
for them and they do some beautiful stuff with it.  So, all along, I've been
much less willing to penalize HTML than other projects of this ilk (only
computer geeks have bugs up their butts about using HTML in email).

The flip side is that if damaged HTML is a symptom of spam, damaged HTML
should be penalized, and *not* stripping the damaged stuff will create a
mountain of characteristic clues.  Senders of ham can avoid those penalties
by sending well-formed HTML.

> It does seem that no option is required whatever way we go.

I'd agree even if we didn't have too many options.


From skip at pobox.com  Mon Mar  3 22:33:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar  3 23:33:44 2003
Subject: [Spambayes] binascii.Error
Message-ID: <15972.11427.981401.997736@montanaro.dyndns.org>


A couple people recently reported binascii.Error being raised by pop3proxy,
etc.  Sjoerd Mullender filed a bug report on SF as well.  I just checked in
a change to spambayes/tokenizer.py which seems to fix the problem.  Please
give the latest CVS version a try and let me know if you still experience
the problem.

As an added bonus, a new token, "charset:invalid" gets generated when
binascii barfs.  More clues for the guys in the white hats.
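
Roughly, the idea looks like the sketch below.  This is not the checked-in
code: it is written against today's Python 3 email.header module rather than
the email.Header of the time, the subject_tokens name and the "subject:"
token prefix are assumptions, and newer Pythons may report the bad base64 as
HeaderParseError instead of binascii.Error, so both are caught.

```python
import binascii
from email.errors import HeaderParseError
from email.header import decode_header


def subject_tokens(subject, decode=decode_header):
    """Yield tokens for a Subject header (illustrative only)."""
    try:
        parts = decode(subject)
    except (binascii.Error, HeaderParseError):
        # Undecodable header: emit a clue instead of crashing.
        yield "charset:invalid"
        return
    for text, charset in parts:
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", "replace")
            except LookupError:
                # Unknown charset name; treating it the same way is
                # an assumption of this sketch.
                yield "charset:invalid"
                continue
        for word in text.lower().split():
            yield "subject:" + word
```

The decode parameter exists only so a failing decoder can be swapped in for
testing.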

Skip

From noreply at sourceforge.net  Mon Mar  3 20:41:03 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 00:04:48 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696458 ] crash in tokenizer due to bad base64 in
	subject
Message-ID: <E18q4Eh-00082O-00@sc8-sf-web3.sourceforge.net>

Bugs item #696458, was opened at 2003-03-03 04:12
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Sjoerd Mullender (sjoerd)
>Assigned to: Skip Montanaro (montanaro)
Summary: crash in tokenizer due to bad base64 in subject

Initial Comment:
I got a crash in the tokenizer in the line where it does

        x = msg.get('subject', '')
        for x, subjcharset in
email.Header.decode_header(x):

The reason is, the subject of this particular message is

Subject: *****SPAM*****
=?EUC-KR?B?CSixpLDtKSC/7Liuvsax4iC6uLmwMcijIKHaILzSwd/H0SC8+LCjwLsgv7W/+Mj3IQ?=

which gives a binascii.Error: Incorrect padding from
binascii.a2b_base64.

I am running an up-to-date spambayes and python (i.e.
both fresh from CVS).

Here is a (partial) stack trace:

  File
"/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py",
line 1052, in tokenize
    for tok in self.tokenize_headers(msg):
  File
"/ufs/sjoerd/src/spambayes/spambayes/tokenizer.py",
line 1106, in tokenize_headers
    for x, subjcharset in email.Header.decode_header(x):
  File
"/ufs/sjoerd/src/Python/dist/src/Lib/email/Header.py",
line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File
"/ufs/sjoerd/src/Python/dist/src/Lib/email/base64MIME.py",
line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding


----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2003-03-03 22:41

Message:
Logged In: YES 
user_id=44345

Still not clear what the best course of action is at the email package level.
I solved it here by catching the binascii exception and tossing in a
'charset:invalid' token.  It solved the problem here.  Sjoerd, let me know if
it's still a problem for you, but I think this should worm around it.

S

----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-03-03 11:30

Message:
Logged In: YES 
user_id=44345

Casual observation for anyone reporting spambayes bugs which involve
the email package - You should also check/report such errors on the 
http://mimelib.sourceforge.net/ project, which is where the email
gurus hang out.

----------------------------------------------------------------------

Comment By: Sjoerd Mullender (sjoerd)
Date: 2003-03-03 10:44

Message:
Logged In: YES 
user_id=43607

It seems to me that all calls to email.Header.decode_header
should be protected with try/except, or decode_header itself
should protect itself with a try/except.  A third
possibility is to add an extra indirection through a
function that does basically:

def decode_header(x):
    try:
        return email.Header.decode_header(x)
    except:
        return x
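
One caveat with the wrapper as written: callers loop over the result as
(text, charset) pairs, so returning the bare string x would break the
unpacking.  A hedged variant that keeps the return shape, using the Python 3
module names (the safe_decode_header name and the decode parameter are
illustrative, not part of any library):

```python
import binascii
from email.errors import HeaderParseError
from email.header import decode_header


def safe_decode_header(x, decode=decode_header):
    """Like decode_header, but never raises on a malformed header."""
    try:
        return decode(x)
    except (binascii.Error, HeaderParseError):
        # Fall back to the raw header in the same list-of-pairs shape
        # the callers expect, with no charset.
        return [(x, None)]
```

This narrows the bare except to the two errors a bad encoded word is known
to produce, so unrelated bugs still surface.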


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696458&group_id=61702

From spambayes at rodland.no  Tue Mar  4 09:51:21 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Tue Mar  4 03:51:26 2003
Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder
Message-ID: <OLEKJBLGLGDHBDLHGIINAELBCKAA.spambayes@rodland.no>

I am wondering if I'm doing something wrong here.

I just checked out the latest copy from CVS.

You have fixed bugs:
[ 642740 ] "Recover from Spam" wrong folder
and
[ 696476 ] Manual filtering in outlook fails

However, both of these (still) fail.

I've completely deleted my old installations.  I've unregistered, and then
re-registered addin.py.  I've checked that I've got the latest versions of
both addin.py & manager.py:

[Fredrik@FMR_WIN Outlook2000]$ spcvs status addin.py
===================================================================
File: addin.py          Status: Up-to-date

   Working revision:    1.50
   Repository revision: 1.50
/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

[Fredrik@FMR_WIN Outlook2000]$ spcvs status manager.py
===================================================================
File: manager.py        Status: Up-to-date

   Working revision:    1.51
   Repository revision: 1.51
/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

Am I missing something here?

I've already posted a bug similar to 696476 (that is bug #697120).  I'll be
happy to re-post a bug similar to 642740.


F


--
Fredrik Rødland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From spambayes at rodland.no  Tue Mar  4 09:56:59 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Tue Mar  4 03:57:05 2003
Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder
In-Reply-To: <OLEKJBLGLGDHBDLHGIINAELBCKAA.spambayes@rodland.no>
Message-ID: <OLEKJBLGLGDHBDLHGIINOELCCKAA.spambayes@rodland.no>


> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of Fredrik Rodland
> Sent: 4. mars 2003 09:51
> To: Spambayes
> Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder
>
> I've already posted a bug similar to 696476 (that is bug #697120).
> I'll be happy to re-post a bug similar to 642740.

What's the preferred of:

A. re-opening a bug
B. posting a new bug (with a link/comment to the old)

when something (still) does not work after the bug has been closed?


Fredrik


--
Fredrik Rødland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From mhammond at skippinet.com.au  Tue Mar  4 21:26:38 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Mar  4 05:27:41 2003
Subject: [Spambayes] FW: reg [ 642740 ] "Recover from Spam" wrong folder
In-Reply-To: <OLEKJBLGLGDHBDLHGIINOELCCKAA.spambayes@rodland.no>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEDAOEAA.mhammond@skippinet.com.au>

> What's the preferred of:
>
> A. re-opening a bug
> B. posting a new bug (with a link/comment to the old)

My preference is A, assuming that the bug is still "warm", or not actually
fixed as was the case here.  If an identical bug appears in the future as a
regression due to some other change, then it should be a new bug.

Still-wishing-we-had-bugzilla ly,

Mark.


From frodland at aston.no  Tue Mar  4 09:50:52 2003
From: frodland at aston.no (Fredrik Rodland)
Date: Tue Mar  4 09:55:21 2003
Subject: [Spambayes] reg [ 642740 ] "Recover from Spam" wrong folder
Message-ID: <OLEKJBLGLGDHBDLHGIINMELACKAA.frodland@aston.no>

I am wondering if I'm doing something wrong here.

I just checked out the latest copy from CVS.

You have fixed bugs:
[ 642740 ] "Recover from Spam" wrong folder
and
[ 696476 ] Manual filtering in outlook fails

However, both of these (still) fail.

I've completely deleted my old installations.  I've unregistered, and then
re-registered addin.py.  I've checked that I've got the latest versions of
both addin.py & manager.py:

[Fredrik@FMR_WIN Outlook2000]$ spcvs status addin.py
===================================================================
File: addin.py          Status: Up-to-date

   Working revision:    1.50
   Repository revision: 1.50
/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

[Fredrik@FMR_WIN Outlook2000]$ spcvs status manager.py
===================================================================
File: manager.py        Status: Up-to-date

   Working revision:    1.51
   Repository revision: 1.51
/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v
   Sticky Tag:          (none)
   Sticky Date:         (none)
   Sticky Options:      (none)

Am I missing something here?

I've already posted a bug similar to 696476 (that is bug #697120).  I'll be
happy to re-post a bug similar to 642740.


F


--
Fredrik Rødland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From noreply at sourceforge.net  Mon Mar  3 22:15:28 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:34 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-696476 ] Manual filtering in outlook fails
Message-ID: <E18q5i4-0005P8-00@sc8-sf-web4.sourceforge.net>

Bugs item #696476, was opened at 2003-03-03 21:39
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in outlook fails

Initial Comment:
When I try to run "filter now" from the outlook plugin - I 
get the following trace:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\dialogs\AsyncDialog.py", line 98, in 
OnStart
    self.StartProcess()
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\dialogs\FilterDialog.py", line 365, in 
StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  File "c:\Programfiler\_UTIL\spambayes-1.0a2
\Outlook2000\manager.py", line 156, in 
EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)
AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._AppointmentItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0

OS: windows XP home
Spambayes version: 1.0a2
outlook version: 2000 sp3

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 17:15

Message:
Logged In: YES 
user_id=14198

/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v  <-- 
manager.py
new revision: 1.51; previous revision: 1.50

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=696476&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 00:24:30 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:35 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails
Message-ID: <E18q7iw-0001Je-00@sc8-sf-web4.sourceforge.net>

Bugs item #697120, was opened at 2003-03-04 09:24
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in Outlook (still) fails

Initial Comment:
also see bug #696476 which is very similar to this one 
(but has status: closed).

When trying to filter manually in outlook, I get this error.  
I've tried to filter multiple folders, both with and without 
the "include subfolder-checkbox" set, and also ensured 
that there was a message in the folder I tried to filter.


Traceback (most recent call last):
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 290, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)

AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._MailItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 02:11:31 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:37 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails
Message-ID: <E18q9OV-0004zJ-00@sc8-sf-web2.sourceforge.net>

Bugs item #697120, was opened at 2003-03-04 09:24
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in Outlook (still) fails

Initial Comment:
also see bug #696476 which is very similar to this one 
(but has status: closed).

When trying to filter manually in outlook, I get this error.  
I've tried to filter multiple folders, both with and without 
the "include subfolder-checkbox" set, and also ensured 
that there was a message in the folder I tried to filter.


Traceback (most recent call last):
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 290, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)

AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._MailItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0


----------------------------------------------------------------------

>Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 11:11

Message:
Logged In: YES 
user_id=724871

I've tested this some more.  It seems like I was wrong in my 
initial bug-report.  Everything seems to be working fine 
if "include subfolder" is UNCHECKED.  The filtering then 
handles both empty and non-empty folders.

However, if "include subfolder" is CHECKED, the filtering 
fails, even if all filtered folders contain mail.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 02:33:06 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:39 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails
Message-ID: <E18q9jO-0003F6-00@sc8-sf-web1.sourceforge.net>

Bugs item #697120, was opened at 2003-03-04 19:24
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in Outlook (still) fails

Initial Comment:
also see bug #696476 which is very similar to this one 
(but has status: closed).

When trying to filter manually in outlook, I get this error.  
I've tried to filter multiple folders, both with and without 
the "include subfolder-checkbox" set, and also ensured 
that there was a message in the folder I tried to filter.


Traceback (most recent call last):
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 290, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)

AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._MailItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 21:33

Message:
Logged In: YES 
user_id=14198

OK, finally fixed:
/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v  <-- 
manager.py
new revision: 1.52; previous revision: 1.51

I was tricked by the original traceback, which had an
appointment item.  My previous checkin made sure *that*
couldn't happen again <wink>

Note that if you comment in the bug that it still fails, I
will simply re-open the old bug, rather than creating a new
one.  Do that if this fix doesn't work :(
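
The traceback suggests that .Folders was being looked up on a message item
rather than on a folder.  The actual change in manager.py 1.52 is not shown
here, so what follows is only a guess at the general shape, using a plain
Python stand-in instead of the real Outlook COM objects (the Folder class
and walk_folders are hypothetical):

```python
class Folder:
    """Hypothetical stand-in for an Outlook folder object."""

    def __init__(self, name, items=(), folders=()):
        self.Name = name
        self.Items = list(items)      # messages, appointments, ...
        self.Folders = list(folders)  # sub-folders


def walk_folders(folder, include_sub):
    """Yield folder, plus all of its sub-folders if include_sub is set."""
    yield folder
    if include_sub:
        # Ask the *folder* for its sub-folders; items such as
        # MailItem or AppointmentItem have no Folders attribute.
        for child in folder.Folders:
            yield from walk_folders(child, include_sub)
```

The point is only that the recursion for "include subfolder" goes through
the folder objects and never through the items they contain.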

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 21:11

Message:
Logged In: YES 
user_id=724871

I've tested this some more.  It seems like I was wrong in my 
initial bug-report.  Everything seems to be working fine 
if "include subfolder" is UNCHECKED.  The filtering then 
handles both empty and non-empty folders.

However, if "include subfolder" is CHECKED, the filtering 
fails, even if all filtered folders contain mail.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 02:43:29 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:42 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder
Message-ID: <E18q9tR-0003UN-00@sc8-sf-web1.sourceforge.net>

Bugs item #642740, was opened at 2002-11-24 01:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
>Status: Open
>Resolution: Works For Me
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
>Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin:

Selecting "Recover From Spam" recovers the selected
message to the Inbox folder - which is not necessarily
where it came from.  The filterer will need to save the
folder it came from before we can do this.

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 21:43

Message:
Logged In: YES 
user_id=14198

Can you post an example of something that fails?

Note that a remaining potential problem is out of our
control: occasionally the "Inbox" will see a message before
the builtin rules.  In this case, we filter it from the
Inbox, not from where the Outlook rule would have moved it.
Thus, when we recover, we see the Inbox as the source.

Note that I also fixed another bug related to this -
previously, simply scoring a message would store that folder
name as the "source" of the message.  Thus, if you had
previously viewed the clues for a message once in the wrong
folder, the correct source folder would have been lost.  So
please ensure you are testing with mail received since I
said I fixed this.
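
The distinction described above (record the source folder when a message
is moved, never when it is merely scored) can be sketched as follows.
This is only an illustration: the Message class and the three functions
are hypothetical stand-ins, not the add-in's real objects or API.

```python
class Message:
    """Stand-in for a mail item; not the real Outlook or SpamBayes object."""

    def __init__(self, subject, folder):
        self.subject = subject
        self.folder = folder
        self.original_folder = None  # unknown until a filter moves the message


def score(msg, classifier):
    """Score only. Deliberately leaves original_folder untouched."""
    return classifier(msg)


def move_to_spam(msg, spam_folder="Spam"):
    """Remember where the message was at the moment it is moved, then move it."""
    msg.original_folder = msg.folder
    msg.folder = spam_folder


def recover_from_spam(msg, fallback="Inbox"):
    """Return the message to its recorded source, or to the fallback if none."""
    msg.folder = msg.original_folder or fallback
    msg.original_folder = None
```

With this shape, viewing the clues for a message in the wrong folder
cannot overwrite the recorded source, because scoring never writes it.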

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-02-04 17:23

Message:
Logged In: YES 
user_id=14198

/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v  <-- 
addin.py
new revision: 1.48; previous revision: 1.47
/cvsroot/spambayes/spambayes/Outlook2000/filter.py,v  <-- 
filter.py
new revision: 1.16; previous revision: 1.15
/cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v  <--
 msgstore.py
new revision: 1.39; previous revision: 1.38


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 02:45:47 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:50 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails
Message-ID: <E18q9vf-0003a0-00@sc8-sf-web3.sourceforge.net>

Bugs item #697120, was opened at 2003-03-04 09:24
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

Category: Outlook
Group: None
>Status: Open
Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in Outlook (still) fails

Initial Comment:
also see bug #696476 which is very similar to this one 
(but has status: closed).

When trying to filter manually in Outlook, I get this error.
I've tried to filter multiple folders, both with and without
the "include subfolder" checkbox set, and also ensured
that there was a message in the folder I tried to filter.


Traceback (most recent call last):
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 290, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)

AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._MailItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0


----------------------------------------------------------------------

>Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 11:45

Message:
Logged In: YES 
user_id=724871

Well - I still get an error - bug reopened:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 293, in EnsureOutlookFieldsForFolder
    self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 245, in EnsureOutlookFieldsForFolder
    msgstore_folder = self.message_store.GetFolder(folder_id)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\msgstore.py", line 232, in GetFolder
    folder_id = self.NormalizeID(folder_id)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\msgstore.py", line 186, in NormalizeID
    assert False, "We expect fully qualified IDs"
AssertionError: We expect fully qualified IDs
win32ui: Error in Command Message handler for command ID 
1100, Code 0


----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 11:33

Message:
Logged In: YES 
user_id=14198

OK, finally fixed:
/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v  <-- 
manager.py
new revision: 1.52; previous revision: 1.51

I was tricked by the original traceback, which had an
appointment item.  My previous checkin made sure *that*
couldn't happen again <wink>

Note that if you comment in the bug that it still fails, I
will simply re-open the old bug, rather than creating a new
one.  Do that if this fix doesn't work :(

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 11:11

Message:
Logged In: YES 
user_id=724871

I've tested this some more.  It seems I was wrong in my
initial bug report.  Everything seems to work fine
if "include subfolder" is UNCHECKED.  The filtering then
handles both empty and non-empty folders.

However, if "include subfolder" is CHECKED, the filtering
fails, even if all the filtered folders contain mail.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 02:52:49 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:57 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-697120 ] Manual filtering in Outlook (still) fails
Message-ID: <E18qA2T-0007IC-00@sc8-sf-web2.sourceforge.net>

Bugs item #697120, was opened at 2003-03-04 19:24
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

Category: Outlook
Group: None
>Status: Closed
Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering in Outlook (still) fails

Initial Comment:
also see bug #696476 which is very similar to this one 
(but has status: closed).

When trying to filter manually in Outlook, I get this error.
I've tried to filter multiple folders, both with and without
the "include subfolder" checkbox set, and also ensured
that there was a message in the folder I tried to filter.


Traceback (most recent call last):
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  
File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 290, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 402, in 
__getattr__
    if d is not None: return getattr(d, attr)
  File "C:\PROGRA~1\_DEV\Python22\lib\site-
packages\win32com\client\__init__.py", line 368, in 
__getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" 
% (repr(self), attr)

AttributeError: '<win32com.gen_py.Microsoft Outlook 
9.0 Object Library._MailItem>' object has no 
attribute 'Folders'
win32ui: Error in Command Message handler for 
command ID 1100, Code 0


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 21:52

Message:
Logged In: YES 
user_id=14198

OK - dare ya to re-open it again <wink>

/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v  <-- 
manager.py
new revision: 1.53; previous revision: 1.52

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 21:45

Message:
Logged In: YES 
user_id=724871

Well - I still get an error - bug reopened:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\dialogs\FilterDialog.py", line 366, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, 
config.include_sub)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 293, in EnsureOutlookFieldsForFolder
    self.EnsureOutlookFieldsForFolder(folder.EntryID, True)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\manager.py", line 245, in EnsureOutlookFieldsForFolder
    msgstore_folder = self.message_store.GetFolder(folder_id)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\msgstore.py", line 232, in GetFolder
    folder_id = self.NormalizeID(folder_id)
  File "c:\Programfiler\_UTIL\spambayes_cvs\Outlook2000
\msgstore.py", line 186, in NormalizeID
    assert False, "We expect fully qualified IDs"
AssertionError: We expect fully qualified IDs
win32ui: Error in Command Message handler for command ID 
1100, Code 0


----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 21:33

Message:
Logged In: YES 
user_id=14198

OK, finally fixed:
/cvsroot/spambayes/spambayes/Outlook2000/manager.py,v  <-- 
manager.py
new revision: 1.52; previous revision: 1.51

I was tricked by the original traceback, which had an
appointment item.  My previous checkin made sure *that*
couldn't happen again <wink>

Note that if you comment in the bug that it still fails, I
will simply re-open the old bug, rather than creating a new
one.  Do that if this fix doesn't work :(

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 21:11

Message:
Logged In: YES 
user_id=724871

I've tested this some more.  It seems I was wrong in my
initial bug report.  Everything seems to work fine
if "include subfolder" is UNCHECKED.  The filtering then
handles both empty and non-empty folders.

However, if "include subfolder" is CHECKED, the filtering
fails, even if all the filtered folders contain mail.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=697120&group_id=61702

From noreply at sourceforge.net  Tue Mar  4 03:03:34 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 09:55:59 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder
Message-ID: <E18qACs-0007o5-00@sc8-sf-web4.sourceforge.net>

Bugs item #642740, was opened at 2002-11-23 15:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
Status: Open
Resolution: Works For Me
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin:

Selecting "Recover From Spam" recovers the selected
message to the Inbox folder - which is not necessarily
where it came from.  The filterer will need to save the
folder it came from before we can do this.

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 12:03

Message:
Logged In: YES 
user_id=724871

OK - I've tested some more.  This seems to work sometimes,
and sometimes not.  It may be related to the other bug you're
referring to, but I'll try to walk through an example.

- I've got a message in a folder (inbox/maillister/locker).  The
message was filtered by Outlook's rules to this folder this
morning - i.e. I've never viewed either the message or the
clues from any other folder.
- I run a manual filter on this folder (which returns with 1 good 
msg as expected) - WILL THIS FORGET THE FOLDER OF 
THIS MSG?
- I press the "delete as spam" button, and the message 
appears in my SPAM-folder.
- I enter my spam-folder and press the "recover from spam"-
button.
- the message appears in my INBOX

The message was ORIGINALLY (this morning local time) 
filtered using the 1.0.a2 version of spambayes, while I now 
use the latest CVS-version.

the following appears in the trace-collector:
Deleting and spam training message '[Lockergnome Penguin 
Shell]  Network Shutdown' -  trained as spam
Recovering to folder 'Inbox' and ham training 
message '[Lockergnome Penguin Shell]  Network Shutdown' -
  trained as ham

If you add some more debug, I'll be happy to run some tests 
on this msg.  Is there any way to check whether this message 
actually 


----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 11:43

Message:
Logged In: YES 
user_id=14198

Can you post an example of something that fails?

Note that a remaining potential problem is out of our
control: occasionally the "Inbox" will see a message before
the builtin rules.  In this case, we filter it from the
Inbox, not from where the Outlook rule would have moved it.
Thus, when we recover, we see the Inbox as the source.

Note that I also fixed another bug related to this -
previously, simply scoring a message would store that folder
name as the "source" of the message.  Thus, if you had
previously viewed the clues for a message once in the wrong
folder, the correct source folder would have been lost.  So
please ensure you are testing with mail received since I
said I fixed this.

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-02-04 07:23

Message:
Logged In: YES 
user_id=14198

/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v  <-- 
addin.py
new revision: 1.48; previous revision: 1.47
/cvsroot/spambayes/spambayes/Outlook2000/filter.py,v  <-- 
filter.py
new revision: 1.16; previous revision: 1.15
/cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v  <--
 msgstore.py
new revision: 1.39; previous revision: 1.38


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

From jeremy at zope.com  Tue Mar  4 10:29:57 2003
From: jeremy at zope.com (Jeremy Hylton)
Date: Tue Mar  4 10:30:45 2003
Subject: [Spambayes] Server error when training in POP3proxy
In-Reply-To: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>
References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>
Message-ID: <1046791797.1953.6.camel@slothrop.zope.com>

On Mon, 2003-03-03 at 15:38, Mike Scott wrote:
> After using it successfully for a couple of weeks, POP3proxy is 
> throwing the following error when I try to review emails for training 
> in the web browser interface. The rest of the web browser interface, 
> and POP3proxy, seems to be working OK. Does anyone who knows more than 
> me about POP3proxy have any ideas for how to diagnose or fix it? I've 
> just pulled the most recent update from CVS, which hasn't helped. I'm 
> on Mac OS X 10.2.4 running Python 2.2.2, in case it's relevant.

I saw the same problem and filed a spambayes bug report.  The funny
thing is, I wrote a script to scan the unknown cache and found three
messages that caused the problem.  The messages were all generated from
a sourceforge bug report for python.  The bug report was that some MIME
text caused the email package to barf -- and the bug report included an
example of the input that caused the problem.

The bug report was carefully crafted to cause any tool that used the
email package to fail.

I think the right solution is not just to fix the email package, but to
make pop3proxy more robust.  It should expect that the email package may
fail unexpectedly.  In those cases, it should not fail catastrophically.

Jeremy
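
A scan of the kind described above might look roughly like the sketch
below. It is written against the modern Python 3 email package, not the
original script, and the function name, the default check, and the cache
directory layout (one message per file) are all assumptions made for
illustration.

```python
import email
import email.header
import os


def default_check(msg):
    """Touch the pieces of a message that a review page would typically need."""
    email.header.decode_header(msg.get("Subject", ""))
    for part in msg.walk():
        part.get_payload(decode=True)


def find_problem_messages(cache_dir, check=default_check):
    """Return (filename, error) pairs for cached messages that fail the check."""
    problems = []
    for name in sorted(os.listdir(cache_dir)):
        path = os.path.join(cache_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path, "rb") as f:
                msg = email.message_from_binary_file(f)
            check(msg)
        except Exception as exc:
            # Broad on purpose: the goal is to survive *any* parser failure
            # and report it, rather than to stop at the first bad message.
            problems.append((name, repr(exc)))
    return problems
```

Passing a different check callable makes it easy to reproduce whatever
call is failing in the proxy without changing the scanning loop.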


From skip at pobox.com  Tue Mar  4 09:36:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar  4 10:47:29 2003
Subject: [Spambayes] Server error when training in POP3proxy
In-Reply-To: <1046791797.1953.6.camel@slothrop.zope.com>
References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>
        <1046791797.1953.6.camel@slothrop.zope.com>
Message-ID: <15972.51219.834328.821065@montanaro.dyndns.org>

    >> After using it successfully for a couple of weeks, POP3proxy is
    >> throwing the following error when I try to review emails for training
    >> in the web browser interface....

    Jeremy> I think the right solution is not just to fix the email package,
    Jeremy> but to make pop3proxy more robust....

I checked in a change to tokenizer.py yesterday evening which should
robustify things a bit.  Please "cvs up" and give it a whirl.

Skip

From jeremy at zope.com  Tue Mar  4 13:23:00 2003
From: jeremy at zope.com (Jeremy Hylton)
Date: Tue Mar  4 13:23:40 2003
Subject: [Spambayes] Server error when training in POP3proxy
In-Reply-To: <15972.51219.834328.821065@montanaro.dyndns.org>
References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>
	 <1046791797.1953.6.camel@slothrop.zope.com>
	 <15972.51219.834328.821065@montanaro.dyndns.org>
Message-ID: <1046802180.2030.23.camel@slothrop.zope.com>

On Tue, 2003-03-04 at 10:36, Skip Montanaro wrote:
>     >> After using it successfully for a couple of weeks, POP3proxy is
>     >> throwing the following error when I try to review emails for training
>     >> in the web browser interface....
> 
>     Jeremy> I think the right solution is not just to fix the email package,
>     Jeremy> but to make pop3proxy more robust....
> 
> I checked in a change to tokenizer.py yesterday evening which should
> robustify things a bit.  Please "cvs up" and give it a whirl.

I'm looking at the checkin comment for tokenizer, and I think it won't
work.  If you look at the traceback we provided, it shows that the
tokenizer isn't involved.  The proxy is calling
email.Header.decode_header() directly.  On the failure in question, it
isn't even calling it on a header :-).

Jeremy


From piersh at friskit.com  Tue Mar  4 10:54:21 2003
From: piersh at friskit.com (Piers Haken)
Date: Tue Mar  4 13:53:14 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <9891913C5BFE87429D71E37F08210CB9297588@zeus.sfhq.friskit.com>

I'm seeing some weird behavior sometimes when the outlook plugin filters
spam. Sometimes the spam that ends up in my spam folder has a spam field
value of '0%' even though the 'show clues' feature shows the correct
value. Looking through the trace output I'm seeing a bunch of assertion
failures like this:

pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line
275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line
280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None,
None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line
601, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags,
args, kwArgs, serviceProvider)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line
541, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 184, in
OnItemAdd
    msgstore_message = self.manager.message_store.GetMessage(item)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 258,
in GetMessage
    message_id = self.NormalizeID(message_id)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 185,
in NormalizeID
    assert type(item_id) in [type(''), type(u'')], "What kind of ID is
'%r'?" % (item_id,)
exceptions.AssertionError: What kind of ID is
'<win32com.gen_py.None.MailItem>'?

I'm not sure what's going on here, has anyone else seen this before?

Piers.

From skip at pobox.com  Tue Mar  4 14:09:02 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar  4 15:09:25 2003
Subject: [Spambayes] Server error when training in POP3proxy
In-Reply-To: <1046802180.2030.23.camel@slothrop.zope.com>
References: <24FDE878-4DB8-11D7-9B4E-000393DB4B0C@plokta.com>
        <1046791797.1953.6.camel@slothrop.zope.com>
        <15972.51219.834328.821065@montanaro.dyndns.org>
        <1046802180.2030.23.camel@slothrop.zope.com>
Message-ID: <15973.2014.368579.363028@montanaro.dyndns.org>

    >> I checked in a change to tokenizer.py yesterday evening which should
    >> robustify things a bit.  Please "cvs up" and give it a whirl.

    Jeremy> I'm looking at the checkin comment for tokenizer, and I think it
    Jeremy> won't work.  If you look at the traceback we provided, it shows
    Jeremy> that the tokenizer isn't involved.  The proxy is calling
    Jeremy> email.Header.decode_header() directly.  On the failure in
    Jeremy> question, it isn't even calling it on a header :-).

I was working off the traceback I got which wasn't from pop3proxy.  In my
checkin comment I wrote:

    These two may not be the only places requiring a change.  Anywhere
    email.Header.decode_header() is called - particularly when passed a
    subject or email address - should probably be guarded.

I don't regularly run pop3proxy, so couldn't easily check any changes I'd
make to that code.  Still, the try/except structure should be similar.
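
For illustration, a guard of the kind described might look like the
sketch below. It uses the Python 3 module layout (email.header rather
than email.Header), and safe_decode_header is a hypothetical helper
name, not something that exists in tokenizer.py or pop3proxy.py.

```python
from email.errors import HeaderParseError
from email.header import decode_header


def safe_decode_header(raw):
    """Decode an RFC 2047 header value, falling back to the raw text on failure."""
    try:
        chunks = decode_header(raw)
    except (HeaderParseError, ValueError, LookupError, TypeError, AttributeError):
        # Malformed input, or something that is not a header at all.
        return raw if isinstance(raw, str) else repr(raw)
    pieces = []
    for text, charset in chunks:
        if isinstance(text, bytes):
            try:
                text = text.decode(charset or "ascii", "replace")
            except LookupError:
                # Unknown charset name: keep whatever is ASCII-readable.
                text = text.decode("ascii", "replace")
        pieces.append(text)
    return "".join(pieces)
```

The fallback returns a string in every case, so a caller that goes on to
call string methods on the result does not have to special-case failures.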

Skip

From noreply at sourceforge.net  Tue Mar  4 16:39:36 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar  4 20:33:58 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-693423 ] email message generates error in
	pop3proxy.py
Message-ID: <E18qMwa-0005Uo-00@sc8-sf-web4.sourceforge.net>

Bugs item #693423, was opened at 2003-02-25 23:02
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702

Category: pop3proxy
Group: None
>Status: Open
Resolution: None
Priority: 5
Submitted By: David Shaw (dshaw)
Assigned to: Tim Stone (timstone4)
Summary: email message generates error in pop3proxy.py

Initial Comment:
Hi all,
  A friend of mine had a cache file in his "unknown" folder that caused the "review" web page in pop3proxy.py to generate the following traceback:

Traceback (most recent call last):

  File "spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "pop3proxy.py", line 929, in onReview
    judgement = judgement.split(';')[0].strip()

  File "pop3proxy.py", line 815, in _makeMessageInfo
    print type(text)

AttributeError: 'list' object has no attribute 'replace' 

He sent me the offending message, and I replicated the problem:

msg = open("/Users/dshaw/Desktop/crash_spam.txt", "r")
message = mbox.get_message(msg)
part = typed_subpart_iterator(message, 'text', 'plain').next()
text = part.get_payload()
>>> text
[<email.Message.Message instance at 0x275ff0>]


So, instead of text, the payload is a list containing a single email message instance.  Here are the objects' respective payloads:

>>> message._payload
[<email.Message.Message instance at 0x279290>, <email.Message.Message instance at 0x279160>, <email.Message.Message instance at 0x279e00>, <email.Message.Message instance at 0x280b10>, <email.Message.Message instance at 0x281340>, <email.Message.Message instance at 0x2828d0>, <email.Message.Message instance at 0x283300>, <email.Message.Message instance at 0x2b60a0>, <email.Message.Message instance at 0x27f4d0>, <email.Message.Message instance at 0x2b7c70>, <email.Message.Message instance at 0x2b9ac0>, <email.Message.Message instance at 0x2b8c30>, <email.Message.Message instance at 0x2bb770>, <email.Message.Message instance at 0x2bc180>]
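
One way to avoid tripping over a list payload is to skip any part whose
payload is a list of sub-messages before treating it as text. The sketch
below uses the Python 3 email package; first_text_plain is a
hypothetical helper and is not the fix that was checked in.

```python
import email


def first_text_plain(msg):
    """Return the first text/plain body as a string, or '' if there is none."""
    for part in msg.walk():
        if part.is_multipart():
            # The payload is a list of sub-messages, whatever the declared
            # content type says, so there is no text to return here.
            continue
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload is None:
                return ""
            charset = part.get_content_charset() or "ascii"
            try:
                return payload.decode(charset, "replace")
            except LookupError:
                # Unknown charset name: fall back to ASCII with replacement.
                return payload.decode("ascii", "replace")
    return ""
```

Because is_multipart() looks at the payload itself, a part that claims
to be text/plain but actually carries a list is skipped too.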


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-04 18:39

Message:
Logged In: YES 
user_id=645698

I just checked in a fix for this problem.  I have no ability to actually test it, 
though. Please try your test case again and let me know the outcome.

----------------------------------------------------------------------

Comment By: David Shaw (dshaw)
Date: 2003-02-28 10:34

Message:
Logged In: YES 
user_id=244639

Seems to be fixed!  Thanks.

----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-02-27 22:29

Message:
Logged In: YES 
user_id=645698

I just checked in a fix for this problem.  I have no ability to actually test it, 
though. Please try your test case again and let me know the outcome.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702

From niek at haunter.student.utwente.nl  Wed Mar  5 10:04:03 2003
From: niek at haunter.student.utwente.nl (Niek Bergboer)
Date: Wed Mar  5 04:04:08 2003
Subject: [Spambayes] Graphs on my website
Message-ID: <20030305090403.GB30529@haunter.student.utwente.nl>


On Sat, Mar 01, 2003 at 09:12:46AM -0800, T. Alexander Popiel wrote:
> Those who want to see my pretty graphs without waiting
> for the moderator approval of my .png-laden posting
> can go to http://www.wolfskeep.com/~popiel/spambayes/incremental
> to see all the pretty pictures (along with a bunch of the
> raw and semi-cooked data files).

Looks very nice indeed, and the results seem to be good (fn and fp ~10^-2).
For the other examples on your site, for which you use a parameter to
check its effect on the performance (e.g. the ham:spam ratio, or the
training set size), it would be nice to generate a ROC-curve:

In a ROC-curve (Receiver Operating Characteristic curve), you plot the
correct positive rate (y-axis) against the false positive rate (x-axis). The
points on the curve are given by using e.g. different spam:ham
ratios. A ROC-curve doesn't necessarily provide more information, but
it is a rather standard way to present results in (more or less)
binary classification. The term ROC originates from RADAR detection
results, AFAIK.

A problem that needs to be addressed in making ROC-curves for
spambayes is how to handle unsures: disregarding them completely in
the ROC curve seems reasonable, but then one probably also needs a
correct.pos.rate vs. unsures rate curve.
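
As a rough sketch of both ideas, the code below computes ROC points and
a separate unsure rate from lists of scores. It assumes scores in [0, 1]
where higher means spammier, and it sweeps the score cutoff to get the
points (the conventional choice, and an assumption here; the parameter
being varied could equally be the spam:ham ratio). All names are
hypothetical.

```python
def _ratio(num, den):
    """Divide, returning 0.0 for an empty denominator."""
    return num / den if den else 0.0


def roc_points(ham_scores, spam_scores, cutoffs):
    """Return (false_positive_rate, true_positive_rate) for each cutoff.

    A message counts as spam when its score is >= the cutoff.
    """
    points = []
    for cutoff in cutoffs:
        fp = sum(1 for s in ham_scores if s >= cutoff)
        tp = sum(1 for s in spam_scores if s >= cutoff)
        points.append((_ratio(fp, len(ham_scores)), _ratio(tp, len(spam_scores))))
    return points


def rates_with_unsure(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Return (fp_rate, tp_rate, unsure_rate) for a two-cutoff classifier.

    Scores in [ham_cutoff, spam_cutoff) are unsure and are left out of the
    fp and tp rates. Assumes ham_cutoff <= spam_cutoff.
    """
    def split(scores):
        as_spam = sum(1 for s in scores if s >= spam_cutoff)
        as_ham = sum(1 for s in scores if s < ham_cutoff)
        return as_spam, as_ham, len(scores) - as_spam - as_ham

    ham_as_spam, ham_as_ham, ham_unsure = split(ham_scores)
    spam_as_spam, spam_as_ham, spam_unsure = split(spam_scores)
    fp_rate = _ratio(ham_as_spam, ham_as_spam + ham_as_ham)
    tp_rate = _ratio(spam_as_spam, spam_as_spam + spam_as_ham)
    unsure_rate = _ratio(ham_unsure + spam_unsure,
                         len(ham_scores) + len(spam_scores))
    return fp_rate, tp_rate, unsure_rate
```

Plotting tp_rate against unsure_rate over a range of cutoff pairs would
give the second curve suggested above.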

> - Alex

Just my 2 Eurocents...

Niek

-- 
 Max Brod:    "Gibt es denn gar keine Hoffnung?"
 Franz Kafka: "Aber ja! Es gibt unendlich viel Hoffnung.
               Nur nicht fuer uns."

PGP public key at http://www.bergboer.net


From Paul.Moore at atosorigin.com  Wed Mar  5 09:45:54 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Wed Mar  5 04:47:20 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D955@UKDCX001.uk.int.atosorigin.com>

From: Piers Haken [mailto:piersh@friskit.com]
> I'm seeing some weird behavior sometimes when the outlook plugin filters
> spam. Sometimes the spam that ends up in my spam folder has a spam field
> value of '0%' even though the 'show clues' feature shows the correct
> value.

[...]

> I'm not sure what's going on here, has anyone else seen this before?

Yes, I see it fairly often, and it has been reported before (to the list,
but possibly not on SF). IIRC, Mark thought it was a timing issue between
when the message arrived and when the plugin fired. But that's about as
much as I know...

Paul.

From mhammond at skippinet.com.au  Wed Mar  5 21:19:44 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar  5 05:20:18 2003
Subject: [Spambayes] Outlook plugin error
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D955@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEGDOEAA.mhammond@skippinet.com.au>

> From: Piers Haken [mailto:piersh@friskit.com]
> > I'm seeing some weird behavior sometimes when the outlook plugin filters
> > spam. Sometimes the spam that ends up in my spam folder has a spam field
> > value of '0%' even though the 'show clues' feature shows the correct
> > value.
>
> [...]
>
> > I'm not sure what's going on here, has anyone else seen this before?
>
> Yes, I see it fairly often, and it has been reported before (to the list,
> but possibly not on SF). IIRC, Mark thought it was a timing issue between
> when the message arrived and when the plugin fired. But that's about as
> much as I know...

I never see this.  The timing issue I was thinking of would account for a
*blank* spam score, but not a zero score.  A zero implies that the scoring
worked correctly, but did indeed return zero.

If you disable filtering, you should see all new mail arrive with a blank
score, rather than zero.  Please tell me if this is not true.  If it *is*
true, then I guess we can add some additional trace statements to see what
is going on.

Mark.


From Paul.Moore at atosorigin.com  Wed Mar  5 10:31:04 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Wed Mar  5 05:32:26 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D958@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> If you disable filtering, you should see all new mail arrive with a
> blank score, rather than zero.

Yes, that's right.

Paul

From noreply at sourceforge.net  Wed Mar  5 05:09:38 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 08:20:31 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qYeQ-0002i6-00@sc8-sf-web1.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 14:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation
for the -l (i.e. uiPort) flag. I didn't like everybody on our
LAN being able to read my mail using a web browser, so
I wrote the attached patch, which allows -l localhost:8880
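
The attached patch is not reproduced here. As a minimal sketch of the
idea, parsing either "port" or "host:port" into a (host, port) pair
could look like this; parse_address_and_port is a hypothetical name, and
its relation to pop3proxy's own addressAndPort helper is an assumption.

```python
def parse_address_and_port(value, default_host=""):
    """Parse 'port' or 'host:port' into a (host, port) tuple.

    An empty host conventionally means "listen on all interfaces" when the
    tuple is passed to socket.bind().
    """
    host, sep, port = value.rpartition(":")
    if not sep:
        # No colon present: the whole value is the port number.
        host = default_host
    return host, int(port)
```

Binding the web interface to ("localhost", 8880) rather than ("", 8880)
is what keeps it from being reachable from other machines on the LAN.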


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 05:10:48 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 08:20:32 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qYfY-0002km-00@sc8-sf-web1.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 14:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation 
for the -l (i.e. uiPort) flag. I didn't like everybody on our 
LAN being able to read my mail using a web browser, so 
I wrote the attached patch; this allows -l localhost:8880


----------------------------------------------------------------------

>Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 14:10

Message:
Logged In: YES 
user_id=311771

Forgot the checkmark, as usual. arrggg.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From roz at one.net  Wed Mar  5 01:14:24 2003
From: roz at one.net (J. Solomon Kostelnik)
Date: Wed Mar  5 08:21:03 2003
Subject: [Spambayes] Great Job!
Message-ID: <1046844863.4850.5.camel@jsk.one.net>

Just wanted to say "great job" on the software so far.  After training
only about 10-15 emails, it successfully caught ALL spam, and only
accidentally got a few "hams."

With each successive train, it gets better.

I really am impressed.

One suggestion: document (if it exists), or add a run-time flag to run a
certain .ini file on startup of the pop3proxy script.  I'd like to add
pop3proxy.py to my rc.local file, but I need to be able to tell it where
to look for the .ini file.  If this exists, please just point me to the
docs where it says.

Thanks again and keep up the great work!

-- 
Solomon aka JSK333
http://w3.one.net/~roz/

"Come to me, all you who labor and are heavily burdened, and
I will give you rest. Take my yoke upon you, and learn from me, for
I am gentle and lowly in heart; and you will find rest for your souls"
--Jesus Christ, Son of God; Matthew 11:28-29

PGP Public Key Available: http://w3.one.net/~roz/jsk333.asc


From noreply at sourceforge.net  Wed Mar  5 05:36:04 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 08:38:20 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qZ40-00045X-00@sc8-sf-web1.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 14:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation 
for the -l (i.e. uiPort) flag. I didn't like everybody on our 
LAN being able to read my mail using a web browser, so 
I wrote the attached patch; this allows -l localhost:8880


----------------------------------------------------------------------

>Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 14:36

Message:
Logged In: YES 
user_id=311771

ahem. Make that ".. this allows -u localhost:8880", of course

----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 14:10

Message:
Logged In: YES 
user_id=311771

Forgot the checkmark, as usual. arrggg.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From wsy at merl.com  Wed Mar  5 08:52:19 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Wed Mar  5 08:52:23 2003
Subject: [Spambayes] Graphs on my website
In-Reply-To: <20030305090403.GB30529@haunter.student.utwente.nl>
	(niek@haunter.student.utwente.nl)
References: <20030305090403.GB30529@haunter.student.utwente.nl>
Message-ID: <200303051352.h25DqJh20230@localhost.localdomain>


   From: niek@haunter.student.utwente.nl (Niek Bergboer)

   In a ROC-curve (Receiver Operating Characteristic curve), you plot the
   correct positive rate (y-axis) against the false positive rate (x-axis). The
   points on the curve are given by using e.g. different spam:ham
   ratios. A ROC-curve doesn't necessarily provide more information, but
   it is a rather standard way to present results in (more or less)
   binary classification. The term ROC originates from RADAR detection
   results, AFAIK.

   A problem that needs to be addressed in making ROC-curves for
   spambayes is how to handle unsures: disregarding them completely in
   the ROC curve seems reasonable, but then one probably also needs a
   correct.pos.rate vs. unsures rate curve.

The ROC curves I've seen are all plots of correct% v incorrect% with
the parameterization variable being some controllable threshold that's
an input to the system; the closer the "knee" in the curve comes to
the origin, the better the discrimination, and the parameter value(s)
at the point of closest approach are the optimal operating parameters.

In the case of SpamBayes, where there's a distinct "third class",
I'd suggest _three_ curves:

    Ham v. Unsure
    Unsure  v. Spam
    Ham v. Spam

This would plot the confusion on all three axes, and make it clear that
you can drive the third one (ham v. spam) really close to the
origin (which is good) by expanding the size of the Unsure class.
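
The three-curve idea can be sketched roughly as follows. This is a minimal illustration only: it assumes each message has a score in [0, 1] and that two cutoffs split the scores into ham, unsure and spam. The function name, the cutoff values and the sample scores are all hypothetical, not SpamBayes code or real results.

```python
def three_way_rates(ham_scores, spam_scores, ham_cutoff, spam_cutoff):
    """Return error and unsure rates for one pair of cutoffs.

    Scores below ham_cutoff are called ham, scores at or above
    spam_cutoff are called spam, everything in between is unsure.
    """
    def classify(score):
        if score < ham_cutoff:
            return "ham"
        if score >= spam_cutoff:
            return "spam"
        return "unsure"

    ham_calls = [classify(s) for s in ham_scores]
    spam_calls = [classify(s) for s in spam_scores]
    return {
        "false_pos": ham_calls.count("spam") / len(ham_calls),
        "false_neg": spam_calls.count("ham") / len(spam_calls),
        "ham_unsure": ham_calls.count("unsure") / len(ham_calls),
        "spam_unsure": spam_calls.count("unsure") / len(spam_calls),
    }


# Made-up scores, only to show the shape of the data.
ham = [0.01, 0.05, 0.30, 0.70]
spam = [0.99, 0.97, 0.60, 0.40]

# Each pair of cutoffs gives one point on each of the three curves.
narrow = three_way_rates(ham, spam, 0.50, 0.50)  # no unsure band
wide = three_way_rates(ham, spam, 0.20, 0.90)    # wide unsure band
```

With these made-up scores the wide band removes the ham v. spam confusion entirely, at the price of half of each class landing in Unsure, which is the trade-off the three plots would make visible.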

       -Bill Yerazunis ( CRM114 spy :-) )
    

From skip at pobox.com  Wed Mar  5 08:35:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar  5 09:35:45 2003
Subject: [Spambayes] Great Job!
In-Reply-To: <1046844863.4850.5.camel@jsk.one.net>
References: <1046844863.4850.5.camel@jsk.one.net>
Message-ID: <15974.2872.617304.198800@montanaro.dyndns.org>


    Solomon> I really am impressed.

As are we all.  

    Solomon> One suggestion: document (if it exists), or add a run-time flag
    Solomon> to run a certain .ini file on startup of the pop3proxy script.

You can set your BAYESCUSTOMIZE environment variable to (on Unix) a colon
separated list of ini files which will be loaded, in order.
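
As a rough sketch of what "a colon separated list, loaded in order" amounts to: the code below mimics the described behaviour with the standard library and is not the actual SpamBayes option-loading code. The function name is hypothetical.

```python
import configparser
import os


def load_custom_options(environ=os.environ):
    """Load every ini file named in BAYESCUSTOMIZE, in order.

    Later files override earlier ones; files that do not exist are
    skipped silently by ConfigParser.read().
    """
    value = environ.get("BAYESCUSTOMIZE", "")
    # os.pathsep is ":" on Unix and ";" on Windows.
    filenames = [name for name in value.split(os.pathsep) if name]
    parser = configparser.ConfigParser()
    loaded = parser.read(filenames)
    return parser, filenames, loaded
```

For the rc.local case, the idea would be to set BAYESCUSTOMIZE in the environment of the command that starts pop3proxy.py, pointing at wherever the .ini file lives.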

Skip

From noreply at sourceforge.net  Wed Mar  5 06:38:38 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 11:32:15 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qa2Y-0008Lk-00@sc8-sf-web2.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 13:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation 
for the -l (i.e. uiPort) flag. I didn't like everybody on our 
LAN being able to read my mail using a web browser, so 
I wrote the attached patch; this allows -l localhost:8880


----------------------------------------------------------------------

>Comment By: Richie Hindle (richiehindle)
Date: 2003-03-05 14:38

Message:
Logged In: YES 
user_id=85414

Unless I'm misunderstanding something, this is exactly
what the html_ui_allow_remote_connections setting is for...?

Thanks for the patch anyway - there's nothing wrong with
being able to specify the address that way whether
html_ui_allow_remote_connections solves your problem
or not.


----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 13:36

Message:
Logged In: YES 
user_id=311771

ahem. Make that ".. this allows -u localhost:8880", of course

----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 13:10

Message:
Logged In: YES 
user_id=311771

Forgot the checkmark, as usual. arrggg.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 07:25:45 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 11:32:16 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qam9-000267-00@sc8-sf-web2.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 07:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation 
for the -l (i.e. uiPort) flag. I didn't like everybody on our 
LAN being able to read my mail using a web browser, so 
I wrote the attached patch; this allows -l localhost:8880


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-05 09:25

Message:
Logged In: YES 
user_id=645698

You bring up a very good point, Wolfgang.  Your patch plugs one hole, but 
someone can still access your mail via http://<yourip>:8880 (or 
whatever port you happen to be listening on).  This is a problem, and I think 
the solution is to implement http auth...  We can't just reject connections 
that don't originate from localhost, because someone really might want 
to use another computer to access the pop3proxy ui.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-03-05 08:38

Message:
Logged In: YES 
user_id=85414

Unless I'm misunderstanding something, this is exactly
what the html_ui_allow_remote_connections setting is for...?

Thanks for the patch anyway - there's nothing wrong with
being able to specify the address that way whether
html_ui_allow_remote_connections solves your problem
or not.


----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 07:36

Message:
Logged In: YES 
user_id=311771

ahem. Make that ".. this allows -u localhost:8880", of course

----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 07:10

Message:
Logged In: YES 
user_id=311771

Forgot the checkmark, as usual. arrggg.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 07:41:15 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 11:32:18 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-698036 ] pop3proxy security
Message-ID: <E18qb19-0005Ay-00@sc8-sf-web3.sourceforge.net>

Feature Requests item #698036, was opened at 2003-03-05 09:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: Tim Stone (timstone4)
Assigned to: Tim Stone (timstone4)
Summary: pop3proxy security

Initial Comment:
Currently, there is no security on the pop3proxy, so anyone can 
access the user interface from any computer, given a web browser 
and knowledge of the ip address and port.  Even if you didn't know the 
port, figuring it out wouldn't necessarily be difficult.  This allows 
several operations that could be security problems, including 
reading at least the first couple hundred characters of each mail 
body.

It would seem that the correct solution is to 
implement a challenge/authentication on the pop3proxy http 
server.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 08:48:14 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 11:45:09 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-698036 ] pop3proxy security
Message-ID: <E18qc3y-0006IZ-00@sc8-sf-web4.sourceforge.net>

Feature Requests item #698036, was opened at 2003-03-05 09:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: Tim Stone (timstone4)
Assigned to: Tim Stone (timstone4)
Summary: pop3proxy security

Initial Comment:
Currently, there is no security on the pop3proxy, so anyone can 
access the user interface from any computer, given a web browser 
and knowledge of the ip address and port.  Even if you didn't know the 
port, figuring it out wouldn't necessarily be difficult.  This allows 
several operations that could be security problems, including 
reading at least the first couple hundred characters of each mail 
body.

It would seem that the correct solution is to 
implement a challenge/authentication on the pop3proxy http 
server.

----------------------------------------------------------------------

>Comment By: Skip Montanaro (montanaro)
Date: 2003-03-05 10:48

Message:
Logged In: YES 
user_id=44345

I don't think this is a problem.  Just tell the webserver to listen on "localhost"
or "127.0.0.1", or maybe even "".  Connections from remote hosts won't be accepted.
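
A minimal sketch of a loopback-only listener, using plain sockets rather than the pop3proxy code itself (the function name is hypothetical). One caveat about the suggestion above: in Python's socket module an empty host string means "all interfaces" (INADDR_ANY), so binding to "" would not keep remote hosts out; "127.0.0.1" does.

```python
import socket


def make_listener(host="127.0.0.1", port=0):
    """Create a TCP listening socket bound to a single address.

    Port 0 asks the OS for any free port, which keeps this sketch
    from colliding with a real service.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind((host, port))
    sock.listen(5)
    return sock


listener = make_listener()               # loopback only
address, port = listener.getsockname()   # address the OS actually bound
listener.close()
```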

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

From neale at woozle.org  Wed Mar  5 09:36:04 2003
From: neale at woozle.org (Neale Pickett)
Date: Wed Mar  5 12:36:01 2003
Subject: [Spambayes] Adding a message database
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMENOOCAA.mhammond@skippinet.com.au> ("Mark
 Hammond"'s message of "Thu, 27 Feb 2003 09:11:55 +1100")
References: <LCEPIIGDJPKCOIHOBJEPMENOOCAA.mhammond@skippinet.com.au>
Message-ID: <w53wujdn2bf.fsf@woozle.org>

Hi everybody.  I just got my Internet service restored.  Boy howdy is
the phone company ever responsive <2.0 wink>.

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> I simply want a memory of how a specific message was trained, for the
> following reasons:
>
> * Accidental attempt to train the same message, in the same way, multiple
>   times.
> * Accidental attempt to train the same message as ham and spam.

So, this is a rockin' idea and I'd be glad to rewrite
mboxtrain/hammiefilter to use it once it's implemented.
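
To make the two cases above concrete, here is a minimal in-memory sketch. The class and method names are hypothetical, and a real version would need persistent storage and a stable message identifier; this only shows the bookkeeping.

```python
class TrainingMemory:
    """Remember how each message was last trained (hypothetical helper)."""

    def __init__(self):
        self._trained = {}  # message id -> True for spam, False for ham

    def check(self, msg_id, is_spam):
        """Return 'new', 'duplicate' or 'conflict' for a training request."""
        if msg_id not in self._trained:
            return "new"
        if self._trained[msg_id] == is_spam:
            return "duplicate"   # same message, same way, again
        return "conflict"        # same message, trained the other way

    def record(self, msg_id, is_spam):
        """Store the classification the message was trained with."""
        self._trained[msg_id] = is_spam


memory = TrainingMemory()
first = memory.check("<id-1@example.invalid>", True)
memory.record("<id-1@example.invalid>", True)
again = memory.check("<id-1@example.invalid>", True)
flipped = memory.check("<id-1@example.invalid>", False)
```

A trainer could skip on "duplicate" and untrain before retraining on "conflict".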

Neale

From piersh at friskit.com  Wed Mar  5 09:42:13 2003
From: piersh at friskit.com (Piers Haken)
Date: Wed Mar  5 12:41:06 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <9891913C5BFE87429D71E37F08210CB92C7515@zeus.sfhq.friskit.com>

Paul, are you using any of:
1) Outlook XP
2) hotmail plugin for (1)
3) exchange server

?

I'm wondering if the problem has anything to do with the fact that the
spam field is set before the message is moved.

Piers.

-----Original Message-----
From: Moore, Paul [mailto:Paul.Moore@atosorigin.com] 
Sent: Wednesday, March 05, 2003 1:46 AM
To: Piers Haken; Spambayes
Subject: RE: [Spambayes] Outlook plugin error


From: Piers Haken [mailto:piersh@friskit.com]
> I'm seeing some weird behavior sometimes when the outlook plugin 
> filters spam. Sometimes the spam that ends up in my spam folder has a 
> spam field value of '0%' even though the 'show clues' feature shows 
> the correct value.

[...]

> I'm not sure what's going on here, has anyone else seen this before?

Yes, I see it fairly often, and it has been reported before (to the
list, but possibly not on SF). IIRC, Mark thought it was a timing issue
between when the message arrived and when the plugin fired. But that's
about as much as I know...

Paul.

From noreply at sourceforge.net  Wed Mar  5 09:35:33 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 13:13:40 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-698036 ] pop3proxy security
Message-ID: <E18qcnl-0000xW-00@sc8-sf-web2.sourceforge.net>

Feature Requests item #698036, was opened at 2003-03-05 15:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: Tim Stone (timstone4)
Assigned to: Tim Stone (timstone4)
Summary: pop3proxy security

Initial Comment:
Currently, there is no security on the pop3proxy, so anyone can 
access the user interface from any computer, given a web browser 
and knowledge of the ip address and port.  Even if you didn't know the 
port, figuring it out wouldn't necessarily be difficult.  This allows 
several operations that could be security problems, including 
reading at least the first couple hundred characters of each mail 
body.

It would seem that the correct solution is to 
implement a challenge/authentication on the pop3proxy http 
server.

----------------------------------------------------------------------

>Comment By: Richie Hindle (richiehindle)
Date: 2003-03-05 17:35

Message:
Logged In: YES 
user_id=85414

[Tim Stone]
> Currently, there is no security on the pop3proxy

Not true - you can use the html_ui_allow_remote_connections
setting to reject connections from anywhere other than the local
machine.  This is a bit draconian - as you say, we should have
a better solution - but it's not as bad as you make out.


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-03-05 16:48

Message:
Logged In: YES 
user_id=44345

I don't think this is a problem.  Just tell the webserver to listen on "localhost"
or "127.0.0.1", or maybe even "".  Connections from remote hosts won't be accepted.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 09:40:02 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 13:13:41 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-698036 ] pop3proxy security
Message-ID: <E18qcs6-0000Gp-00@sc8-sf-web4.sourceforge.net>

Feature Requests item #698036, was opened at 2003-03-05 09:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Priority: 5
Submitted By: Tim Stone (timstone4)
Assigned to: Tim Stone (timstone4)
Summary: pop3proxy security

Initial Comment:
Currently, there is no security on the pop3proxy, so anyone can 
access the user interface from any computer, given a web browser 
and knowledge of the ip address and port.  Even if you didn't know the 
port, figuring it out wouldn't necessarily be difficult.  This allows 
several operations that could be security problems, including 
reading at least the first couple hundred characters of each mail 
body.

It would seem that the correct solution is to 
implement a challenge/authentication on the pop3proxy http 
server.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-05 11:40

Message:
Logged In: YES 
user_id=645698

Ya, the problem here is that I might want to allow remote connections, but 
I certainly don't want just anybody to be able to connect.  Skip's 
suggestion doesn't help here.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-03-05 11:35

Message:
Logged In: YES 
user_id=85414

[Tim Stone]
> Currently, there is no security on the pop3proxy

Not true - you can use the html_ui_allow_remote_connections
setting to reject connections from anywhere other than the local
machine.  This is a bit draconian - as you say, we should have
a better solution - but it's not as bad as you make out.


----------------------------------------------------------------------

Comment By: Skip Montanaro (montanaro)
Date: 2003-03-05 10:48

Message:
Logged In: YES 
user_id=44345

I don't think this is a problem.  Just tell the webserver to listen on "localhost"
or "127.0.0.1", or maybe even "".  Connections from remote hosts won't be accepted.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=698036&group_id=61702

From noreply at sourceforge.net  Wed Mar  5 09:40:19 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar  5 13:13:43 2003
Subject: [Spambayes] 
	[ spambayes-Patches-697970 ] pop3proxy didn't use addressAndPort
	for uiPort
Message-ID: <E18qcsN-0000H2-00@sc8-sf-web4.sourceforge.net>

Patches item #697970, was opened at 2003-03-05 14:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Wolfgang Strobl (strobl)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy didn't use addressAndPort for uiPort

Initial Comment:
pop3proxy doesn't accept the hostname:portno notation 
for the -l (i.e. uiPort) flag. I didn't like everybody on our 
LAN being able to read my mail using a web browser, so 
I wrote the attached patch; this allows -l localhost:8880


----------------------------------------------------------------------

>Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 18:40

Message:
Logged In: YES 
user_id=311771

Richie, thanks for the hint, I didn't know about the new 
html_ui_allow_remote_connections option, because I didn't 
read through docs and sources again after doing a new 
checkout. Using the Option parsing helper functions was 
simply done by looking for symmetry. 

Tim: assuming that localhost is resolved locally to 127.0.0.1,  
AFAIK only local processes using the loopback interface can 
connect to the port when something listens on 
localhost:<something>.  That's exactly what I need, when 
everything (mail client, browser, pop3proxy) runs on the very 
same machine.


----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-03-05 16:25

Message:
Logged In: YES 
user_id=645698

You bring up a very good point, Wolfgang.  Your patch plugs one hole, but 
someone can still access your mail via http://<yourip>:8880 (or 
whatever port you happen to be listening on).  This is a problem, and I think 
the solution is to implement http auth...  We can't just reject connections 
that don't originate from localhost, because someone really might want 
to use another computer to access the pop3proxy ui.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-03-05 15:38

Message:
Logged In: YES 
user_id=85414

Unless I'm misunderstanding something, this is exactly
what the html_ui_allow_remote_connections setting is for...?

Thanks for the patch anyway - there's nothing wrong with
being able to specify the address that way whether
html_ui_allow_remote_connections solves your problem
or not.


----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 14:36

Message:
Logged In: YES 
user_id=311771

ahem. Make that ".. this allows -u localhost:8880", of course

----------------------------------------------------------------------

Comment By: Wolfgang Strobl (strobl)
Date: 2003-03-05 14:10

Message:
Logged In: YES 
user_id=311771

Forgot the checkmark, as usual. arrggg.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=697970&group_id=61702

From N7DR at arrisi.com  Wed Mar  5 14:11:02 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Wed Mar  5 16:11:17 2003
Subject: [Spambayes] pop3proxy crashes
Message-ID: <3E660576.15567.1F786E44@localhost>

I made the mistake of rebooting my Linux box....

Following the reboot, pop3proxy.py now dumps the following to the 
screen whenever I try to run it:

Loading database...
Traceback (most recent call last):
  File "./pop3proxy.py", line 1577, in ?
    run()
  File "./pop3proxy.py", line 1551, in run
    state.createWorkers()
  File "./pop3proxy.py", line 1161, in createWorkers
    self.bayes = storage.DBDictClassifier(filename)
  File "./spambayes/storage.py", line 140, in __init__
    self.load()
  File "./spambayes/storage.py", line 152, in load
    t = self.db[self.statekey]
  File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__
    return Unpickler(f).load()
EOFError

It worked fine (for about three weeks) until the reboot. I'm probably 
forgetting to do something obvious (I hope).

  Doc
--------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From dave at nullcube.com  Thu Mar  6 08:06:12 2003
From: dave at nullcube.com (Dave Harrison)
Date: Wed Mar  5 16:11:39 2003
Subject: [Spambayes] encountered error while processing spam folder
Message-ID: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au>

Hey, I've been using spambayes for a few days and at first it worked fine.  But recently I have been getting the following error when I try to train it on my spam folder.

I'm assuming it might have to do with an email with a mangled header.  But I'm having trouble tracking down which exact email it is.  Is there a way I can track down the offending email to forward on to the devel team to help assess this error?

Cheers
Dave

Training spam (/home/dave/.mail/spam):
  Reading as Unix mbox
Traceback (most recent call last):
  File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 278, in ?
    main()
  File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 270, in main
    train(h, s, True, force)
  File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 203, in train
    mbox_train(h, path, is_spam, force)
  File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 139, in mbox_train
    if msg_train(h, msg, is_spam, force):
  File "/home/dave/spambayes-1.0a2/mboxtrain.py", line 71, in msg_train
    h.train(msg, is_spam)
  File "/home/dave/spambayes-1.0a2/hammie.py", line 150, in train
    spambayes.hammiebulk.main()
  File "./spambayes/classifier.py", line 270, in learn
  File "./spambayes/classifier.py", line 391, in _add_msg
  File "./spambayes/compatsets.py", line 374, in __init__
  File "./spambayes/compatsets.py", line 333, in _update
  File "./spambayes/tokenizer.py", line 1052, in tokenize
  File "./spambayes/tokenizer.py", line 1106, in tokenize_headers
  File "/usr/local/lib/python2.2/email/Header.py", line 92, in decode_header
    dec = email.base64MIME.decode(encoded)
  File "/usr/local/lib/python2.2/email/base64MIME.py", line 179, in decode
    dec = a2b_base64(s)
binascii.Error: Incorrect padding
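
One way to hunt for the offending message is to try decoding every header of every message and report the ones that fail. The sketch below uses the modern standard-library email package, not the Python 2.2 modules in the traceback, and the function name and sample messages are hypothetical. Newer versions of decode_header pad short base64 text themselves, so fewer headers fail than in 2003.

```python
import binascii
import email
from email.errors import HeaderParseError
from email.header import decode_header


def find_undecodable(messages):
    """Return (index, header name) for every header that fails to decode."""
    bad = []
    for index, msg in enumerate(messages):
        for name, value in msg.items():
            try:
                decode_header(value)
            except (binascii.Error, HeaderParseError):
                bad.append((index, name))
    return bad


# Two tiny hand-made messages: the second has a base64 encoded-word
# that cannot be decoded even after padding is added.
good = email.message_from_string("Subject: =?utf-8?b?aGVsbG8=?=\n\nbody\n")
broken = email.message_from_string("Subject: =?utf-8?b?a?=\n\nbody\n")
suspects = find_undecodable([good, broken])
```

For a real folder, passing mailbox.mbox(path) as `messages` should work the same way, with the index giving the position of the message in the mbox file.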


From skip at pobox.com  Wed Mar  5 15:21:31 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar  5 16:21:40 2003
Subject: [Spambayes] encountered error while processing spam folder
In-Reply-To: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au>
References: <20030305210612.GA5950@dave@alana.ucc.usyd.edu.au>
Message-ID: <15974.27227.241247.403310@montanaro.dyndns.org>


    Dave> Hey, I've been using spambayes for a few days and at first it
    Dave> worked fine.  But recently I have been getting the following error
    Dave> when I try to train it on my spam folder....

Fixed in CVS. ;-)

Skip

From mhammond at skippinet.com.au  Thu Mar  6 08:35:32 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar  5 16:36:34 2003
Subject: [Spambayes] locale and ConfigParser
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEIBOEAA.mhammond@skippinet.com.au>

I recently received a mail regarding SpamBayes refusing to work:

> Possible reasons:
>
> Outlook 2002, Dutch version.
...
> File "C:\Python22\lib\ConfigParser.py", line 306, in getfloat
>   return self.__get(section, float, option)
> File "C:\Python22\lib\ConfigParser.py", line 300, in __get
>   return conv(self.get(section, option))
> exceptions.ValueError: invalid literal for float(): 0.20

Adding the following anywhere before the file is parsed:

> import locale
> locale.setlocale(locale.LC_NUMERIC, "en")

Corrects the problem.  However, it is unclear to me what the ramifications
of this would be.

Anyone have a clue what we should do about this?
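
One option that limits the ramifications is to switch LC_NUMERIC to the "C" locale only while the number is parsed, then restore whatever was set before. The sketch below is hedged: the helper name is hypothetical, current Python versions parse float() independently of the locale (so this mirrors the 2003 problem rather than a present-day one), and setlocale() is process-wide, so it is not safe if other threads depend on the locale at the same moment.

```python
import locale


def parse_float_c_locale(text):
    """Parse text as a float with LC_NUMERIC temporarily set to "C"."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, "C")
        return float(text)
    finally:
        locale.setlocale(locale.LC_NUMERIC, saved)  # restore it


threshold = parse_float_c_locale("0.20")
```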

Those-bloody-dutch ly,

Mark.


From mhammond at skippinet.com.au  Thu Mar  6 09:33:08 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar  5 17:33:51 2003
Subject: [Spambayes] Outlook plugin error
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92C7515@zeus.sfhq.friskit.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIEIEOEAA.mhammond@skippinet.com.au>

> Paul, are you using any of:
> 1) Outlook XP
> 2) hotmail plugin for (1)
> 3) exchange server
>
> ?
>
> I'm wondering if the problem has anything to do with the fact that the
> spam field is set before the message is moved.

Further, when you see this behaviour, can you immediately check the
Pythonwin debug window for a message?  Each message processed should have a
message that indicates its spam disposition - the first thing I need to know
is if such mails fire this debug trace.

Mark.


From roz at one.net  Wed Mar  5 17:39:29 2003
From: roz at one.net (J. Solomon Kostelnik)
Date: Wed Mar  5 17:40:54 2003
Subject: [Spambayes] POP3proxy.py error
Message-ID: <1046903969.3119.1.camel@jsk.one.net>

This script had been working fine for the last several days.  I have
changed nothing in the setup.

Today when I attempt to load pop3proxy.py from my spambayes directory, I
get the following output:

Loading database...
Traceback (most recent call last):
  File "/usr/bin/pop3proxy.py", line 1651, in ?
    run()
  File "/usr/bin/pop3proxy.py", line 1619, in run
    state.createWorkers()
  File "/usr/bin/pop3proxy.py", line 1307, in createWorkers
    self.bayes = storage.DBDictClassifier(self.databaseFilename)
  File "/usr/lib/python2.2/site-packages/spambayes/storage.py", line
140, in __init__
    self.load()
  File "/usr/lib/python2.2/site-packages/spambayes/storage.py", line
148, in load
    self.dbm = dbmstorage.open(self.db_name, self.mode)
  File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line
54, in open
    return f(*args)
  File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line
36, in open_best
    return f(*args)
  File "/usr/lib/python2.2/site-packages/spambayes/dbmstorage.py", line
22, in open_gdbm
    return gdbm.open(*args)
gdbm.error: (11, 'Resource temporarily unavailable')

-----

What is happening here?

Solomon


From bill at parducci.net  Wed Mar  5 15:47:49 2003
From: bill at parducci.net (bill parducci)
Date: Wed Mar  5 18:47:53 2003
Subject: [Spambayes] statistical comparison of enviroment?
Message-ID: <3E668CA5.3050203@parducci.net>

first off, FWIW i am really amazed at the level of work that has gone into just the consideration of tokenization strategies. having struggled against the spam onslaught for the last 2 years armed solely with procmail i can really appreciate the work that has been done here! (after 200+ recipes i asked myself if there wasn't a better way... and found you guys... now i *know* there is.) kudos to the group, this is some great work!

obeisance complete, off to the topic at hand :o)

i have been reading through the code/documentation looking at not just the token process, but considering the data that is subject to statistical analysis as well. i might have missed this, but has anyone considered including environmental factors into the spam vs. ham analysis? a couple of things come to mind right off the bat, but i am sure more could be found:

1. time of day (would require some real granularity tweaking)

2. size of header / size message / header:message ratio

3. attachment count (MIME count) / MIME count:message size ratio

4. [space|tab|\n]:[visible char] ratio

etc...

i think that if it hasn't already been done, it would be interesting to see if statistically comparing the *physical* attributes of the messages would have an effect on the accuracy of the decision. currently--and i freely admit to being a lamer in undergrad stats--i think that this information is only considered implicitly.

b


From mhammond at skippinet.com.au  Thu Mar  6 12:35:55 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar  5 20:36:56 2003
Subject: [Spambayes] Adding a message database
In-Reply-To: <w53wujdn2bf.fsf@woozle.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOEIPOEAA.mhammond@skippinet.com.au>

> > I simply want a memory of how a specific message was trained, for the
> > following reasons:
> >
> > * Accidental attempt to train the same message, in the same
> way, multiple
> >   times.
> > * Accidental attempt to train the same message as ham and spam.
>
> So, this is a rockin' idea and I'd be glad to rewrite
> mboxtrain/hammiefilter to use it once it's implemented.

OK - while I am here... ;)

It seems to me that sub-classing classifier to change storage semantics is
wrong.  IMO, this should use delegation.  sub-classing of classifier should
be used should the classification scheme want overriding, not the storage
requirements.

This wouldn't be too hard to do - _setwordinfo() etc just delegate to a
self.storage - and would make some sense to do as part of a "message
database".

Is there a compelling reason for it being the way it is?

Mark.


From popiel at wolfskeep.com  Wed Mar  5 17:59:16 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Mar  5 20:59:21 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
   of "Wed, 05 Mar 2003 15:47:49 PST." <3E668CA5.3050203@parducci.net> 
References: <3E668CA5.3050203@parducci.net> 
Message-ID: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>

In message:  <3E668CA5.3050203@parducci.net>
             bill parducci <bill@parducci.net> writes:
>
>i might have missed this, but has anyone considered including
>environmental factors into the spam vs. ham analysis? a couple
>of things come to mind right off the bat, but i am sure more
>could be found:
>
>1. time of day (would require some real granularity tweaking)

This was tried, with 10 minute intervals; testing on two
separate corpora (that of the guy who came up with the
patch and my own) showed that the effect was inconsequential.
The largest result was the observation that both ham and spam
tend to slacken a bit in the middle of the night.

>2. size of header / size message / header:message ratio
>
>3. attachment count (MIME count) / MIME count:message size ratio
>
>4. [space|tab|\n]:[visible char] ratio

All of these have been mentioned in the past, but no one to my
knowledge has actually tested them.

Please feel free to code up something to turn these ideas into
tokens... then they can be tested, and if they're useful then
they'll likely be incorporated.
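
A minimal sketch of what "turning these ideas into tokens" might look like, assuming LF line endings and using only the standard email package. The function names and the "phys:" token prefix are invented for illustration; nothing here is existing spambayes code, and the bucket sizes are arbitrary guesses that would need testing.

```python
import email
import math


def log2_bucket(value):
    """Collapse a count into a coarse power-of-two bucket (0 for values < 1)."""
    return int(math.log2(value)) if value >= 1 else 0


def physical_tokens(raw_text):
    """Yield hypothetical synthetic tokens describing a message's shape."""
    # Headers end at the first blank line (assumes "\n" line endings).
    header_text, _, body_text = raw_text.partition("\n\n")
    msg = email.message_from_string(raw_text)

    # 2. size of header / size of message, bucketed by powers of two
    yield "phys:header-size:2**%d" % log2_bucket(len(header_text))
    yield "phys:body-size:2**%d" % log2_bucket(len(body_text))

    # header:body ratio in tenths, capped so extreme ratios share one token
    ratio = len(header_text) * 10 // max(len(body_text), 1)
    yield "phys:header-body-ratio:%d" % min(ratio, 50)

    # 3. number of leaf MIME parts (1 for a plain, non-multipart message)
    parts = sum(1 for part in msg.walk() if not part.is_multipart())
    yield "phys:mime-parts:%d" % parts

    # 4. [space|tab|\n] : [visible char] ratio of the raw body, in tenths
    blank = sum(1 for ch in body_text if ch in " \t\n")
    visible = max(len(body_text) - blank, 1)
    yield "phys:blank-visible-ratio:%d" % min(blank * 10 // visible, 50)
```

Bucketing matters here: a token per exact byte count would almost never repeat between messages, so the classifier could learn nothing from it.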

Testing of new tokens like this has dropped off since about
last October... spambayes is already good enough for just
about everyone to be happy.  My recent tests on training
methods seem to show that accuracy has been dropping off for
the last two months, though, so it may be time to revisit
this problem...

- Alex

From bill at parducci.net  Wed Mar  5 18:26:30 2003
From: bill at parducci.net (bill parducci)
Date: Wed Mar  5 21:26:33 2003
Subject: [Spambayes] statistical comparison of enviroment?
In-Reply-To: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
Message-ID: <3E66B1D6.90308@parducci.net>

> Please feel free to code up something to turn these ideas into
> tokens... then they can be tested, and if they're useful then
> they'll likely be incorporated.

ok. in the interest of time saving (i've not programmed in python before), how about i [tabular] list what i find and let the statistas in the group decide if there is significance? i have a pile of spam and ham that i can wade through (unless there is a standardized sample that is preferable).

b


From skip at pobox.com  Wed Mar  5 20:39:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar  5 21:39:28 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
References: <3E668CA5.3050203@parducci.net>
        <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
Message-ID: <15974.46301.300535.819582@montanaro.dyndns.org>

    >> 1. time of day (would require some real granularity tweaking)
    >> 2. size of header / size message / header:message ratio
    >> 3. attachment count (MIME count) / MIME count:message size ratio
    >> 4. [space|tab|\n]:[visible char] ratio

(Just thinking out loud.)

One of the problems we have generating new improvements is the system is so
good now that improvements of any kind tend to be microscopic, and thus
extremely hard to measure.  Still, the more ways you can get the tool to
tell you "this smells like spam", the harder it will be for spammers to
defeat it.

Accordingly, when considering potential improvements (improved tokenizing
tricks, for example), perhaps what we should be doing is disabling much of
the current capability and then testing a new change against such a
"crippled" system.  Making it more concrete, suppose we split tokenizing
into two groups, "natural" tokens and "synthetic" tokens.  Natural tokens
would be what you get with basic whitespace splitting, nothing more.
Synthetic tokens would be stuff like tokenizing this subject and generating

    subject:[Spambayes]
    subject:statistical
    subject:comparison
    subject:of
    subject:environment
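
A rough sketch of the split described here, assuming "natural" means plain whitespace splitting and "synthetic" means the same words tagged with where they came from. Both function names are hypothetical, and the real spambayes tokenizer does considerably more than this.

```python
def natural_tokens(text):
    """Basic whitespace splitting, nothing more."""
    return text.split()


def subject_tokens(subject):
    """Synthetic tokens: each Subject word tagged with its header of origin."""
    return ["subject:" + word for word in natural_tokens(subject)]


if __name__ == "__main__":
    for token in subject_tokens("[Spambayes] statistical comparison"):
        print(token)
```

A "crippled" test configuration would then feed the classifier only the output of natural_tokens(), and each synthetic generator could be switched on one at a time.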

By reducing the effectiveness of the system for testing, I think we'd have a
better idea how effective a new idea might be.  What I don't know is how to
measure the independence of two different "improvements".  (The more
independent two improvements are, the harder it seems it would be for a
spammer to hit two birds with one stone when trying to defeat spambayes.)
Suppose for the sake of argument that this base system I talk about is 80%
effective at properly distinguishing ham from spam.  Suppose improvement A
takes that to 83% and applied independently to the base system, improvement
B takes that to 85%.  How do you tell how independent A and B are from one
another?

Skip


From bill at parducci.net  Wed Mar  5 19:23:27 2003
From: bill at parducci.net (bill parducci)
Date: Wed Mar  5 22:23:31 2003
Subject: [Spambayes] statistical comparison of enviroment?
In-Reply-To: <15974.46301.300535.819582@montanaro.dyndns.org>
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
	<15974.46301.300535.819582@montanaro.dyndns.org>
Message-ID: <3E66BF2F.6080200@parducci.net>


Skip Montanaro wrote:
> By reducing the effectiveness of the system for testing, I think we'd have a
> better idea how effective a new idea might be.  What I don't know is how to
> measure the independence of two different "improvements".  (The more
> independent two improvements are, the harder it seems it would be for a
> spammer to hit two birds with one stone when trying to defeat spambayes.)
> Suppose for the sake of argument that this base system I talk about is 80%
> effective at properly distinguishing ham from spam.  Suppose improvement A
> takes that to 83% and applied independently to the base system, improvement
> B takes that to 85%.  How do you tell how independent A and B are from one
> another?

how about you measure each of the methodologies individually (at least those that have relevance; it seems that time is not one such approach), then look for those that are most complementary? for example, suppose you had a simple matrix with message_id along the vertical axis and methodology across the horizontal axis (plus one entry for 'true nature' of message) and then checked to see which combination of methodologies was the most accurate?

of course, there may be some level of combinatorial explosion in doing it this way, but it would speak to the independence issue wouldn't it? 
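
One cheap way to read such a matrix without trying every combination is to count how many mistakes each pair of methodologies shares. The sketch below assumes the matrix is held as one column of True/False (spam/ham) verdicts per methodology plus a "true nature" column; the names are made up for illustration.

```python
from itertools import combinations


def error_sets(verdicts, truth):
    """Map each methodology to the set of message indexes it got wrong."""
    return {
        name: {i for i, (guess, actual) in enumerate(zip(column, truth))
               if guess != actual}
        for name, column in verdicts.items()
    }


def shared_errors(verdicts, truth):
    """Count, for every pair of methodologies, the mistakes they have in common."""
    errors = error_sets(verdicts, truth)
    return {
        (a, b): len(errors[a] & errors[b])
        for a, b in combinations(sorted(errors), 2)
    }
```

Pairs that share few errors are the complementary ones; pairs that fail on the same messages add little to each other. This only looks at pairs, so it sidesteps the combinatorial explosion at the cost of missing three-way effects.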

b


From popiel at wolfskeep.com  Wed Mar  5 20:03:36 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Mar  5 23:03:40 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
   of "Wed, 05 Mar 2003 18:26:30 PST." <3E66B1D6.90308@parducci.net> 
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
	<3E66B1D6.90308@parducci.net> 
Message-ID: <20030306040336.77E4E2DEA4@cashew.wolfskeep.com>

In message:  <3E66B1D6.90308@parducci.net>
             bill parducci <bill@parducci.net> writes:
>> Please feel free to code up something to turn these ideas into
>> tokens... then they can be tested, and if they're useful then
>> they'll likely be incorporated.
>
>ok. in the interest of time saving (i've not programmed in python
>before), how about i [tabular] list what i find and let the statistas
>in the group decide if there is significance? i have a pile of spam
>and ham that i can wade through (unless there is a standardized sample
>that is preferable).

We've actually got a pretty good testing infrastructure set up;
for tokenization tests, I personally use timcv.py with each of the
tokenization options and then feed the output of the runs into
table.py.  This produces some nice tabularizations that you may
notice in the mailing list archives.

Using your own ham and spam is standard procedure here; most people
are touchy about giving their ham away due to privacy concerns.
If some new option looks good, then multiple people try it out on
their different corpora, and if it still looks good after that,
then it gets included.

Don't worry about not having coded in python before.  I hadn't
done much in python before this project either, and people haven't
been screaming about how ugly my code is, yet...

- Alex

From popiel at wolfskeep.com  Wed Mar  5 20:22:00 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Mar  5 23:22:02 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15974.46301.300535.819582@montanaro.dyndns.org> 
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
	<15974.46301.300535.819582@montanaro.dyndns.org> 
Message-ID: <20030306042200.08E482DEA4@cashew.wolfskeep.com>

In message:  <15974.46301.300535.819582@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>Accordingly, when considering potential improvements (improved tokenizing
>tricks, for example), perhaps what we should be doing is disabling much of
>the current capability and then testing a new change against such a
>"crippled" system.

This seems like a reasonable strategy.  There are already options to
control some of the header parsing; I suspect more options could be
put in to disable various other aspects of the tokenizer.  I'm not
sure how much the folks who are just trying to use the system will
like all the extra options, though...

>What I don't know is how to measure the independence of two different
>"improvements".

The simple solution for that seems to me to be doing four runs, with
each combination of the two options on and off.  If the two are
independent, then the run with both on should be better than the
run with either on, and the run with neither on should be worse
than both.  If it's really independent, then there should be a
nice mathematical relation between the improvements from none to
either and from either to both... but I'm forgetting what that
math is at the moment, and I doubt that anything is perfectly
independent anyway.
 
>Suppose for the sake of argument that this base system I talk about is 80%
>effective at properly distinguishing ham from spam.  Suppose improvement A
>takes that to 83% and applied independently to the base system, improvement
>B takes that to 85%.  How do you tell how independent A and B are from one
>another?

By doing a run with both A and B, and seeing if it was at about 87%.
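
One way to arrive at a figure of about 87%, assuming "independent" means each improvement removes the same fraction of the remaining errors no matter what else is switched on (an assumption, not something established in this thread):

```python
def combined_accuracy(base, with_a, with_b):
    """Accuracy expected with both improvements, if their error reductions multiply."""
    base_error = 1.0 - base
    factor_a = (1.0 - with_a) / base_error   # A leaves this fraction of errors
    factor_b = (1.0 - with_b) / base_error   # B leaves this fraction of errors
    return 1.0 - base_error * factor_a * factor_b


if __name__ == "__main__":
    # 20% errors * (17/20) * (15/20) = 12.75% errors
    print(round(combined_accuracy(0.80, 0.83, 0.85), 4))
```

Under that assumption the base error of 20% shrinks to 12.75%, i.e. 87.25% accuracy. A measured result well below that would suggest A and B overlap; a result above it would suggest they reinforce each other.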

>(The more independent two improvements are, the harder it seems it would
>be for a spammer to hit two birds with one stone when trying to defeat
>spambayes.)

Aye.  The problem, of course, is that we could start making spambayes
so tricked-out that it'd be as slow as SpamAssassin. ;-)

- Alex

From tim at fourstonesExpressions.com  Wed Mar  5 23:08:22 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar  6 00:11:06 2003
Subject: [Spambayes] Adding a message database
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOEIPOEAA.mhammond@skippinet.com.au>
Message-ID: <YSGPNHG75VTVLHLJFBBA844XUPZTPM.3e66d7c6@myst>

3/5/2003 7:35:55 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>It seems to me that sub-classing classifier to change storage semantics is
>wrong.  IMO, this should use delegation.  sub-classing of classifier should
>be used should the classification scheme want overriding, not the storage
>requirements.

Yes, I agree with this.  I think the same kind of argument applies to the 
message id database thing.  Assuming that there is a classifier subclass to 
manage message ids seems wrong.  And while I am here... ;)  assuming that 
classifier will be subclassed as some kind of persistent classifier seems 
wrong to me, too.

>
>This wouldn't be too hard to do - _setwordinfo() etc just delegate to a
>self.storage - and would make some sense to do as part of a "message
>database".

I wonder if delegate is the right pattern here.  Perhaps observer?

>
>Is there a compelling reason for it being the way it is?

Nope.

So... let's consider a strawman like this:


import time
import email.message  # (email.Message on the older Pythons of this thread)


class NeverTrainedError(Exception):
    """ Raised by MessageIdDb.isSpam() for an id that was never trained """

class Classifier:

    def __init__(self, wi):

        self.wordinfo = wi()

class WordInfo:
    """ In memory wordinfo class """

class PersistentWordInfo(WordInfo):
    """ Implements persistence as dbdict, let's forget pickles."""

class Message:
    """ Message abstraction """

    def __init__(self, id=None):
        """ All messages have an id """

        if id is None:
            self.id = time.time()  # make up some arbitrary id
        else:
            self.id = id

    def setPayload(self, payload):
        """ payload is delivered to an email Message object """

        self.msg = email.message.Message()
        self.msg.set_payload(payload)

    # have appropriate delegators to the Message object

class FileMessage(Message):
    """ Message stored in a file system """

class MboxMessage(Message):
    """ Message stored in an mbox """

""" Perhaps other Message classes for various mechanisms, like Outlook,
Lotus, etc."""

class MessageSet:
    """ Iterable set of Message objects """

class FileMessageSet(MessageSet):
    """ Set of Messages in the file system """

class MboxMessageSet(MessageSet):
    """ Set of Messages in an mbox """

""" Perhaps other MessageSet classes for various mechanisms, like Outlook,
Lotus, etc. """

class Trainer:
    def __init__(self, wordinfo, idDb):
        """ Trains.  Some methods in this class will come from current
        classifier class. """

        self.wordinfo = wordinfo
        self.idDb = idDb

    def learn(self, msg, isSpam):
        """ unlearns if need be, then learns a message. """

        try:
            mstat = self.idDb.isSpam(msg.id)
        except NeverTrainedError:
            pass
        else:
            if isSpam != mstat:
                # undo the earlier training under the class it was trained as
                self.unlearn(msg, mstat)

        self.wordinfo.learn(msg, isSpam)  # you get the idea

    def unlearn(self, msg, isSpam):
        """ unlearn previous training """

        self.wordinfo.unlearn(msg, isSpam)

class MessageIdDb:
    """ Maintains a persistent set of message ids and how they've been
    trained"""

    def __init__(self, dbname):
        """ Assumes a particular persistence mechanism (pickle, bsddb,
        whatever)"""

        self.dbname = dbname
        # do something to load

    def rememberSpam(self, id):
        raise NotImplementedError  # strawman: body still to be written

    def rememberHam(self, id):
        raise NotImplementedError  # strawman: body still to be written

    def isSpam(self, id):
        """ Raises NeverTrainedError for an id that was never trained """
        raise NotImplementedError  # strawman: body still to be written

    # Iterable?

Rip away, dudes... :)

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim_one at email.msn.com  Thu Mar  6 01:01:12 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Thu Mar  6 01:01:50 2003
Subject: [Spambayes] statistical comparison of enviroment?
In-Reply-To: <15974.46301.300535.819582@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEHIDPAB.tim_one@email.msn.com>

[Skip Montanaro]
> ...
> Suppose improvement A takes that to 83% and applied independently to the
> base system, improvement B takes that to 85%.  How do you tell how
> independent A and B are from one another?

It's a well studied area, and any std work on experimental design will cover
it.  Picture an analogy:  spam == disease, and various kinds of clues are
various drugs claimed to cure the disease (or test procedure claimed to
identify the disease).  A proper experimental design can quantify which
drugs work and how well, which combinations are better than the sum of their
parts, and which worse.  This is a messy combinatorial problem, though, and
real-life experiments rarely try to tackle more than a few drugs at a time.
Then again, despite the howling of the perturbed, few people actually die
from a spam that leaks thru <wink>.

If I had time, I'd rather investigate Adaboost (mentioned several times here
long ago) as a means to combine various kinds of clues as if they were each
classifiers on their own.  Adaboost is a general approach to combining
multiple classifiers so that the combined classifier is better than any of
its parts, provided only (roughly speaking) that each classifier going into
it does better than chance.  For example, we've seen here that a header-only
classifier can do very well, and so can a classifier that looks only at msg
bodies.  The *best* way to combine those two may very well not be simply
lumping them together as equals.  I ran experiments on a classifier that
looked only at Subject lines, and reported here that it had error rates down
around 5% all by itself.  Etc:  there are lots of little classifiers you
*could* build out of our code base.
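
A minimal sketch of the AdaBoost idea, assuming each little classifier has already cast a +1 (spam) or -1 (ham) vote on every training message. The classifier names and function names are invented for illustration; this is textbook discrete AdaBoost over fixed weak classifiers, not anything in the spambayes code base.

```python
import math


def weighted_error(votes, labels, weights):
    """Total weight of the messages this classifier gets wrong."""
    return sum(w for v, y, w in zip(votes, labels, weights) if v != y)


def adaboost(predictions, labels, rounds):
    """Return [(classifier_name, alpha), ...] learned from the votes.

    predictions maps a classifier name to its +1/-1 votes, one per message;
    labels holds the true +1/-1 classes.
    """
    n = len(labels)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        # choose the classifier with the smallest weighted error right now
        name = min(sorted(predictions),
                   key=lambda k: weighted_error(predictions[k], labels, weights))
        err = weighted_error(predictions[name], labels, weights)
        if err >= 0.5:          # no better than chance: nothing left to gain
            break
        err = max(err, 1e-10)   # avoid log(0) for a perfect classifier
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((name, alpha))
        # raise the weight of misclassified messages, lower the others
        weights = [w * math.exp(-alpha * v * y)
                   for w, v, y in zip(weights, predictions[name], labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def classify(ensemble, votes):
    """Combine one message's votes (name -> +1/-1) using the learned alphas."""
    score = sum(alpha * votes[name] for name, alpha in ensemble)
    return 1 if score >= 0 else -1
```

The alphas are the data-derived weights: a classifier that is right more often on the messages the others get wrong earns a larger say in the final vote.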

Chi-combining gives each kind of clue (token) equal weight, and there's no
reason to believe that's optimal.  Gary Robinson once suggested a variant on
the geometric-mean approaches that weighted tokens differently by giving
each an exponent derived from its spamprob (instead of giving each one
exponent 1/n, where n is the # of tokens).  I couldn't make time to pursue
that then.  In a sense, Adaboost is a way of weighting a collection of
classifiers where the data *tells* you good weights to use, instead of
dreaming up an a priori weighting scheme.  Lots of "learning" algorithms do
a similar thing, but Adaboost enjoys a long list of provably good
performance and convergence properties.
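
To make the exponent idea concrete: a plain geometric mean gives every token exponent 1/n, while a weighted one gives token i the exponent w_i / sum(w). The sketch below assumes probabilities strictly greater than 0. The rule used to derive a weight from a spamprob (distance from 0.5) is only a placeholder, since the actual rule Gary suggested isn't given here.

```python
import math


def weighted_geometric_mean(probs, weights=None):
    """Geometric mean of probs; equal weights reproduce the 1/n exponent."""
    if weights is None:
        weights = [1.0] * len(probs)
    total = sum(weights)
    if total == 0:
        return 0.5  # no token carries any weight: stay neutral
    log_sum = sum(w * math.log(p) for p, w in zip(probs, weights))
    return math.exp(log_sum / total)


def distance_weights(probs):
    """Placeholder weighting: tokens far from 0.5 count for more."""
    return [abs(p - 0.5) for p in probs]
```

With equal weights, [0.25, 1.0] combines to 0.5; with the placeholder weights, a neutral 0.5 token drops out of the mean entirely.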

OTOH, if you come up with a better scheme, my original 35K collection of
test msgs can't demonstrate it (spambayes already does a
perfect-as-it-can-be job on it).  OTOH, lots of marginal decisions were
based on that specific collection, and I'm sure some of them would have been
decided differently if anyone else had spent 20 hours a day for two months
dreaming up tests on their test corpus <wink>.


From jean-marc.valin at hermes.usherb.ca  Thu Mar  6 00:44:31 2003
From: jean-marc.valin at hermes.usherb.ca (Jean-Marc Valin)
Date: Thu Mar  6 01:04:15 2003
Subject: [Spambayes] mboxtrain.py crashes
Message-ID: <1046929470.1829.20.camel@idefix.homelinux.org>

Skipped content of type multipart/mixed

From skip at pobox.com  Thu Mar  6 00:42:15 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar  6 01:42:19 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: <20030306042200.08E482DEA4@cashew.wolfskeep.com>
References: <3E668CA5.3050203@parducci.net>
        <20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
        <15974.46301.300535.819582@montanaro.dyndns.org>
        <20030306042200.08E482DEA4@cashew.wolfskeep.com>
Message-ID: <15974.60871.524835.326773@montanaro.dyndns.org>


    Alex> I suspect more options could be put in to disable various other
    Alex> aspects of the tokenizer.  I'm not sure how much the folks who are
    Alex> just trying to use the system will like all the extra options,
    Alex> though...

I was thinking along the lines of one extra option which could collectively
disable all but the most basic features.  It would default to False so
normal users would have to explicitly enable it (and might even get a
warning displayed if it was enabled).

    >> (The more independent two improvements are, the harder it seems it
    >> would be for a spammer to hit two birds with one stone when trying to
    >> defeat spambayes.)

    Alex> Aye.  The problem, of course, is that we could start making
    Alex> spambayes so tricked-out that it'd be as slow as SpamAssassin. ;-)

Not necessarily.  If A and B prove to not be independent, we dump one and
keep the other.  In some situations, spambayes may actually perform fewer
tricks, thus speeding it up.

Skip


From Paul.Moore at atosorigin.com  Thu Mar  6 09:05:47 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Thu Mar  6 04:07:14 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D959@UKDCX001.uk.int.atosorigin.com>

From: Piers Haken [mailto:piersh@friskit.com]

> Paul, are you using any of:
> 1) Outlook XP
> 2) hotmail plugin for (1)
> 3) exchange server

Yes, Exchange Server

> I'm wondering if the problem has anything to do with the fact that the
> spam field is set before the message is moved.

I'm not sure I see how, but I've no reason to think you're wrong, either.

I always assumed that it was somehow related to the fact that mails arrive
asynchronously, and could therefore arrive when the plugin "wasn't ready"
somehow. That implies (a) that some form of locking or queueing mechanism
is needed, and (b) that it's going to be bloody hard to diagnose or test :-)

But this is pure speculation on my part...

Paul.

From rob at hooft.net  Thu Mar  6 11:13:04 2003
From: rob at hooft.net (Rob W. W. Hooft)
Date: Thu Mar  6 05:13:09 2003
Subject: [Spambayes] statistical comparison of enviroment?
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com>
	<15974.46301.300535.819582@montanaro.dyndns.org>
Message-ID: <3E671F30.9080101@hooft.net>

Skip Montanaro wrote:
>  Suppose improvement A
> takes that to 83% and applied independently to the base system, improvement
> B takes that to 85%.  How do you tell how independent A and B are from one
> another?

Separate from all the good suggestions already made to help this, I would say that a little information entropy would do wonders.

Say we have one token that occurs in 25 out of 100 messages, regardless of whether they are ham or spam. And another one that does also hit 25 out of the same 100 messages.

          present absent
token1      25     75
token2      25     75

In this case, both tokens have an information entropy (S) of:

  S = 0.25*log_e(1/0.25)+0.75*log_e(1/0.75) = 0.56 nat

(Natural logarithms give nats; with log_2 the same quantity is 0.81 bit.)

Combining the two tokens can give different possibilities, among which:

         token1
token2  present   absent
present   25       0              S = 0.56 nat
absent     0       75

         token1
token2  present   absent
present    9       16             S = 1.11 nat
absent    16       59


         token1
token2  present   absent
present    0       25             S = 1.04 nat
absent    25       50

This way it is possible to see how many "bits" of information are obtained from one token individually, or by combining tokens. In general, combining tokens will give less than the sum of their individual contributions. How much less is a quantitative measure of the correlation of the tokens. Of course this does not make any prediction as to the suitability of each token to characterize a message as spam. Someone with better background in information theory can probably combine the information entropy with the suitability in a proper way. In any case, if the two tokens under study are correlated as in the first combination (25/0/0/75), they are equally suited for spam classification.
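
The figures above can be checked with a few lines of Python. This sketch uses natural logarithms as in the formula for S, so the results are in nats. The "how much less than the sum" quantity is computed as S(token1) + S(token2) - S(combined), which is the mutual information of the two tokens; the function names are just for illustration.

```python
import math


def entropy(counts):
    """Shannon entropy in nats of a list of counts (zero cells add nothing)."""
    total = sum(counts)
    return sum((c / total) * math.log(total / c) for c in counts if c)


def mutual_information(table):
    """Shared information of a 2x2 table [[both, only2], [only1, neither]]."""
    rows = [sum(row) for row in table]
    cols = [sum(col) for col in zip(*table)]
    cells = [cell for row in table for cell in row]
    return entropy(rows) + entropy(cols) - entropy(cells)


if __name__ == "__main__":
    for table in ([[25, 0], [0, 75]], [[9, 16], [16, 59]], [[0, 25], [25, 50]]):
        cells = [cell for row in table for cell in row]
        print(table, round(entropy(cells), 2), round(mutual_information(table), 3))
```

For the fully correlated 25/0/0/75 table the shared information equals the whole entropy of one token (the second token adds nothing), while for 9/16/16/59 it is close to zero, meaning the tokens are nearly independent.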

Regards,

Rob 

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim at fourstonesExpressions.com  Thu Mar  6 06:29:18 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar  6 07:29:23 2003
Subject: [Spambayes] mboxtrain.py crashes
In-Reply-To: <1046929470.1829.20.camel@idefix.homelinux.org>
Message-ID: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst>

Jean-Marc, please report this as a bug so we can track it.  You can do that at 
http://sourceforge.net/projects/spambayes/  Otherwise, your report will get 
lost in the mailing list noise.  Thanks.

3/5/2003 11:44:31 PM, Jean-Marc Valin <jean-marc.valin@hermes.usherb.ca> 
wrote:

>Hi,
>
>I'm trying to train a spam database and I'm experiencing crashes with
>mboxtrain.py. I'm attaching three mbox's (simplified to their offending
>e-mail) that produce the crash. This happens with both CVS and the last
>nightly build (tried both python 2.2 and 2.3a2). The message printed is:
>
>Traceback (most recent call last):
>  File "mboxtrain.py", line 284, in ?
>    main()
>  File "mboxtrain.py", line 271, in main
>    train(h, g, False, force)
>  File "mboxtrain.py", line 209, in train
>    mbox_train(h, path, is_spam, force)
>  File "mboxtrain.py", line 140, in mbox_train
>    for msg in mbox:
>  File "/opt//lib/python2.3/mailbox.py", line 35, in next
>    return self.factory(_Subfile(self.fp, start, stop))
>  File "/software/spambayes/spambayes/mboxutils.py", line 116, in
>get_message
>    msg = email.message_from_string(obj)
>  File "/opt//lib/python2.3/email/__init__.py", line 52, in
>message_from_string
>    return Parser(_class, strict=strict).parsestr(s)
>  File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr
>    return self.parse(StringIO(text), headersonly=headersonly)
>  File "/opt//lib/python2.3/email/Parser.py", line 64, in parse
>    self._parsebody(root, fp, firstbodyline)
>  File "/opt//lib/python2.3/email/Parser.py", line 239, in _parsebody
>    msgobj = self.parsestr(part)
>  File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr
>    return self.parse(StringIO(text), headersonly=headersonly)
>  File "/opt//lib/python2.3/email/Parser.py", line 64, in parse
>    self._parsebody(root, fp, firstbodyline)
>  File "/opt//lib/python2.3/email/Parser.py", line 146, in _parsebody
>    boundary = container.get_boundary()
>  File "/opt//lib/python2.3/email/Message.py", line 701, in get_boundary
>    boundary = self.get_param('boundary', missing)
>  File "/opt//lib/python2.3/email/Message.py", line 566, in get_param
>    for k, v in self._get_params_preserve(failobj, header):
>  File "/opt//lib/python2.3/email/Message.py", line 516, in
>_get_params_preserve    params = Utils.decode_params(params)
>  File "/opt//lib/python2.3/email/Utils.py", line 337, in decode_params
>    charset, language, value = decode_rfc2231(EMPTYSTRING.join(value))
>  File "/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231
>    charset, language, s = s.split("'", 2)
>ValueError: unpack list of wrong size
>
>	Jean-Marc
>
>-- 
>Jean-Marc Valin, M.Sc.A.
>LABORIUS (http://www.gel.usherb.ca/laborius)
>Université de Sherbrooke, Québec, Canada
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From jm at jmason.org  Thu Mar  6 12:38:13 2003
From: jm at jmason.org (Justin Mason)
Date: Thu Mar  6 08:32:30 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	<20030306042200.08E482DEA4@cashew.wolfskeep.com> 
Message-ID: <20030306123818.73B6016F1B@jmason.org>


T. Alexander Popiel said:

> Aye.  The problem, of course, is that we could start making spambayes
> so tricked-out that it'd be as slow as SpamAssassin. ;-)

Hey! ;)

--j.

From noreply at sourceforge.net  Thu Mar  6 06:24:36 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 09:56:44 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-690928 ] turn off saving messages in
	popproxy
Message-ID: <E18qwIW-0000Yo-00@sc8-sf-web2.sourceforge.net>

Feature Requests item #690928, was opened at 2003-02-21 16:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=690928&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
Priority: 5
Submitted By: Carl Nygard (cnygard)
Assigned to: Tim Stone (timstone4)
Summary: turn off saving messages in popproxy

Initial Comment:

It would be nice to be able to turn off saving messages
for training, and just let the settings chug.  I'm
guessing that the messages will just pile up if I don't
go in and at least discard the messages every day.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-06 08:24

Message:
Logged In: YES 
user_id=645698

Option has been added.

----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-02-21 21:28

Message:
Logged In: YES 
user_id=645698

Messages are auto-deleted after 7 days, by default.  This is not well 
documented, however.

----------------------------------------------------------------------


From N7DR at arrisi.com  Thu Mar  6 07:47:26 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Thu Mar  6 10:02:20 2003
Subject: [Spambayes] mboxtrain.py crashes
In-Reply-To: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst>
References: <1046929470.1829.20.camel@idefix.homelinux.org>
Message-ID: <3E66FD0E.5572.233F95DE@localhost>

On 6 Mar 2003 at 6:29, Tim Stone - Four Stones Expre wrote:

> Jean-Marc, please report this as a bug so we can track it.  You can do
> that at http://sourceforge.net/projects/spambayes/  Otherwise, your
> report will get lost in the mailing list noise.  Thanks.
> 

So I assume that I should do the same with my notice yesterday about 
pop3proxy.py crashes.

I'll file a bug report later today. I already miss my spambayes :-)

  Doc
--------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From MMARTINEZ at intranet.reeusda.gov  Thu Mar  6 10:33:35 2003
From: MMARTINEZ at intranet.reeusda.gov (Martinez, Michael - CSREES/ISTM)
Date: Thu Mar  6 10:32:51 2003
Subject: [Spambayes] Integration with qmail?
Message-ID: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>

I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
would be appreciated.

Thanks, 

Michael Martinez
CSREES/ISTM/USDA


From tim at fourstonesExpressions.com  Thu Mar  6 09:50:07 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar  6 10:50:13 2003
Subject: [Spambayes] pop3proxy crashes
In-Reply-To: <3E660576.15567.1F786E44@localhost>
Message-ID: <OKHAJDE7WVB71WOJHESN082VZUC8.3e676e2f@myst>

Nearly as I can tell, your training database has been corrupted.  I'm not 
quite sure how this happened, but from what I see in the code, there is likely 
no recovery at this point.  When you submit a bug report, go ahead and attach 
your training database.
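
For anyone who wants to check a database before starting the proxy, here is a
minimal sketch of such a test.  It assumes the classifier state sits under a
single key in a shelve-style mapping; state_is_readable and the key you pass
in are hypothetical names for illustration, not the actual storage.py code,
and it is written for a current Python.

```python
import pickle


def state_is_readable(db, key):
    """Return True if db[key] can be loaded, False if it is missing or damaged.

    db is any shelve-like mapping; key is the (hypothetical) state key.
    """
    try:
        db[key]
    except (EOFError, pickle.UnpicklingError, KeyError):
        # An empty or truncated pickle can raise EOFError on load,
        # which is the error shown in the traceback below.
        return False
    return True
```

With a real database, db would be the object returned by shelve.open(filename).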

3/5/2003 3:11:02 PM, "D. R. Evans" <N7DR@arrisi.com> wrote:

>I made the mistake of rebooting my Linux box....
>
>Following the reboot, pop3proxy.py now dumps the following to the 
>screen whenever I try to run it:
>
>Loading database...
>Traceback (most recent call last):
>  File "./pop3proxy.py", line 1577, in ?
>    run()
>  File "./pop3proxy.py", line 1551, in run
>    state.createWorkers()
>  File "./pop3proxy.py", line 1161, in createWorkers
>    self.bayes = storage.DBDictClassifier(filename)
>  File "./spambayes/storage.py", line 140, in __init__
>    self.load()
>  File "./spambayes/storage.py", line 152, in load
>    t = self.db[self.statekey]
>  File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__
>    return Unpickler(f).load()
>EOFError
>
>It worked fine (for about three weeks) until the reboot. I'm probably 
>forgetting to do something obvious (I hope).
>
>  Doc
>--------------------------------------------------------------
>Phone:  +1 303 494 0394
>Mobile: +1 720 839 8462
>Fax:    +1 781 240 0527
>--------------------------------------------------------------
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From noreply at sourceforge.net  Thu Mar  6 08:09:03 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 11:10:02 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-698796 ] mboxtrain.py crashes on some mbox data
Message-ID: <E18qxvb-0002do-00@sc8-sf-web4.sourceforge.net>

Bugs item #698796, was opened at 2003-03-06 11:09
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698796&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jean-Marc Valin (jmvalin)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain.py crashes on some mbox data

Initial Comment:
I'm trying to train a spam database and I'm
experiencing crashes with mboxtrain.py. I'm attaching
three mbox's (simplified to their offending e-mail)
that produce the crash. This happens with both CVS and
the last nightly build (tried both python 2.2 and
2.3a2). The message printed is:

Traceback (most recent call last):
  File "mboxtrain.py", line 284, in ?
    main()
  File "mboxtrain.py", line 271, in main
    train(h, g, False, force)
  File "mboxtrain.py", line 209, in train
    mbox_train(h, path, is_spam, force)
  File "mboxtrain.py", line 140, in mbox_train
    for msg in mbox:
  File "/opt//lib/python2.3/mailbox.py", line 35, in next
    return self.factory(_Subfile(self.fp, start, stop))
  File "/software/spambayes/spambayes/mboxutils.py", line 116, in get_message
    msg = email.message_from_string(obj)
  File "/opt//lib/python2.3/email/__init__.py", line 52, in message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/opt//lib/python2.3/email/Parser.py", line 64, in parse
    self._parsebody(root, fp, firstbodyline)
  File "/opt//lib/python2.3/email/Parser.py", line 239, in _parsebody
    msgobj = self.parsestr(part)
  File "/opt//lib/python2.3/email/Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "/opt//lib/python2.3/email/Parser.py", line 64, in parse
    self._parsebody(root, fp, firstbodyline)
  File "/opt//lib/python2.3/email/Parser.py", line 146, in _parsebody
    boundary = container.get_boundary()
  File "/opt//lib/python2.3/email/Message.py", line 701, in get_boundary
    boundary = self.get_param('boundary', missing)
  File "/opt//lib/python2.3/email/Message.py", line 566, in get_param
    for k, v in self._get_params_preserve(failobj, header):
  File "/opt//lib/python2.3/email/Message.py", line 516, in _get_params_preserve
    params = Utils.decode_params(params)
  File "/opt//lib/python2.3/email/Utils.py", line 337, in decode_params
    charset, language, value = decode_rfc2231(EMPTYSTRING.join(value))
  File "/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231
    charset, language, s = s.split("'", 2)
ValueError: unpack list of wrong size
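
The last frame fails because the three-name unpack needs the value to contain
two apostrophes (charset'language'text).  A minimal sketch of one tolerant
alternative follows; split_rfc2231 is a hypothetical helper written for
illustration, not the email package's own code or its eventual fix, and it is
written for a current Python.

```python
def split_rfc2231(s):
    """Split an RFC 2231 value of the form charset'language'text.

    Returns (charset, language, text).  charset and language are None
    when the value does not contain the two expected apostrophes.
    """
    parts = s.split("'", 2)
    if len(parts) != 3:
        # Malformed value: keep the raw text instead of raising.
        return None, None, s
    return tuple(parts)
```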

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698796&group_id=61702

From bill at parducci.net  Thu Mar  6 08:02:10 2003
From: bill at parducci.net (bill parducci)
Date: Thu Mar  6 11:12:22 2003
Subject: [Spambayes] Integration with qmail?
In-Reply-To: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
References: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
Message-ID: <3E677102.7000607@parducci.net>

once you have procmail set up to work with qmail, HAMMIE.txt (in the tarball) will walk you through the install process. if you don't have procmail set up, here are a couple of places you may want to start:

http://www.flounder.net/qmail/qmail-howto.html (#10)
http://www.ornl.gov/cts/archives/mailing-lists/qmail/1998/07/msg00350.html

b

Martinez, Michael - CSREES/ISTM wrote:
> I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
> would be appreciated.
> 
> Thanks, 
> 
> Michael Martinez
> CSREES/ISTM/USDA


From skip at pobox.com  Thu Mar  6 10:18:33 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar  6 11:18:47 2003
Subject: [Spambayes] mboxtrain.py crashes
In-Reply-To: <3E66FD0E.5572.233F95DE@localhost>
References: <1046929470.1829.20.camel@idefix.homelinux.org>
        <3E66FD0E.5572.233F95DE@localhost>
Message-ID: <15975.29913.77912.36528@montanaro.dyndns.org>

    Doc> So I assume that I should do the same with my notice yesterday
    Doc> about pop3proxy.py crashes.

Yes.  If it's the header parsing problem which Jeremy recently fixed, I'll
close it right out, but if not, it helps to have a chit in the system so it
doesn't get lost.

Skip


From piersh at friskit.com  Thu Mar  6 09:02:58 2003
From: piersh at friskit.com (Piers Haken)
Date: Thu Mar  6 12:01:45 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <9891913C5BFE87429D71E37F08210CB92C7519@zeus.sfhq.friskit.com>

I can't find any correlation between the assert and the incorrect field
setting. They may well be unrelated.

Do you know what is a 'win32com.gen_py.None.MailItem'?

Piers.

> -----Original Message-----
> From: Mark Hammond [mailto:mhammond@skippinet.com.au] 
> Sent: Wednesday, March 05, 2003 2:33 PM
> To: Piers Haken; Moore, Paul; Spambayes
> Subject: RE: [Spambayes] Outlook plugin error
> 
> 
> > Paul, are you using any of:
> > 1) Outlook XP
> > 2) hotmail plugin for (1)
> > 3) exchange server
> >
> > ?
> >
> > I'm wondering if the problem has anything to do with the 
> fact that the 
> > spam field is set before the message is moved.
> 
> Further, when you see this behaviour, can you immediately 
> check the Pythonwin debug window for a message?  Each message 
> processed should have a message that indicates its spam 
> disposition - the first thing I need to know is if such mails 
> fire this debug trace.
> 
> Mark.
> 
> 

From piersh at friskit.com  Thu Mar  6 09:19:12 2003
From: piersh at friskit.com (Piers Haken)
Date: Thu Mar  6 12:38:44 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <9891913C5BFE87429D71E37F08210CB92C751A@zeus.sfhq.friskit.com>

Okay, I'm wondering: under what circumstances would a message NOT have
an "EntryID"?

Piers.

> -----Original Message-----
> From: Piers Haken 
> Sent: Thursday, March 06, 2003 9:03 AM
> To: Mark Hammond; Moore, Paul; Spambayes
> Subject: RE: [Spambayes] Outlook plugin error
> 
> 
> I can't find any correlation between the assert and the 
> incorrect field setting. They may well be unrelated.
> 
> Do you know what is a 'win32com.gen_py.None.MailItem'?
> 
> Piers.
> 
> > -----Original Message-----
> > From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> > Sent: Wednesday, March 05, 2003 2:33 PM
> > To: Piers Haken; Moore, Paul; Spambayes
> > Subject: RE: [Spambayes] Outlook plugin error
> > 
> > 
> > > Paul, are you using any of:
> > > 1) Outlook XP
> > > 2) hotmail plugin for (1)
> > > 3) exchange server
> > >
> > > ?
> > >
> > > I'm wondering if the problem has anything to do with the
> > fact that the
> > > spam field is set before the message is moved.
> > 
> > Further, when you see this behaviour, can you immediately
> > check the Pythonwin debug window for a message?  Each message 
> > processed should have a message that indicates its spam 
> > disposition - the first thing I need to know is if such mails 
> > fire this debug trace.
> > 
> > Mark.
> > 
> > 

From neale at woozle.org  Thu Mar  6 09:48:40 2003
From: neale at woozle.org (Neale Pickett)
Date: Thu Mar  6 12:48:48 2003
Subject: [Spambayes] Integration with qmail?
In-Reply-To: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER> ("Martinez,
 Michael - CSREES/ISTM"'s message of "Thu, 6 Mar 2003 10:33:35 -0500")
References: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
Message-ID: <w53heagmlmv.fsf@woozle.org>

"Martinez, Michael - CSREES/ISTM" <MMARTINEZ@intranet.reeusda.gov> writes:

> I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
> would be appreciated.

Wow, déjà vu!

I would refer you to
http://mail.python.org/pipermail/spambayes/2003-February/003322.html and
the messages following it, for starters.  I'm still looking at ways to
do this, but not at a staggering pace.  Any ideas are still appreciated
:)

Neale


From neale at woozle.org  Thu Mar  6 09:52:33 2003
From: neale at woozle.org (Neale Pickett)
Date: Thu Mar  6 12:52:37 2003
Subject: [Spambayes] mboxtrain.py crashes
In-Reply-To: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst> (Tim Stone -
 Four Stones Expressions's message of "Thu, 06 Mar 2003 06:29:18 -0600")
References: <3WYVYXKH2XJFDC86FEXTNKC071F0SR5.3e673f1e@myst>
Message-ID: <w53el5kmlge.fsf@woozle.org>

Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

> Jean-Marc Valin <jean-marc.valin@hermes.usherb.ca> wrote:
>
>>   File "/opt//lib/python2.3/email/Utils.py", line 283, in decode_rfc2231
>>     charset, language, s = s.split("'", 2)
>> ValueError: unpack list of wrong size
>
> Jean-Marc, please report this as a bug so we can track it.  You can do
> that at http://sourceforge.net/projects/spambayes/ Otherwise, your
> report will get lost in the mailing list noise.  Thanks.

Right.  But just for the record, it looks an awful lot like another
instance of the email package not handling really fouled-up messages
gracefully.  So the fix may be a long time coming.  In the meantime,
since that message is probably spam, you can most likely just delete it
and mboxtrain will continue to work.

Actually, I guess mboxtrain could be a little more error-resistant.
I'll add that to the todo list.
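
Roughly what I have in mind, as a minimal sketch: wrap the parse step so one
bad message is skipped and counted instead of aborting the run.  The function
name and the parse/train callables are hypothetical stand-ins for
illustration, not the current mboxtrain code.

```python
def train_tolerantly(raw_messages, parse, train):
    """Parse and train on each raw message, skipping ones that fail to parse.

    parse and train are caller-supplied callables (hypothetical stand-ins
    for the email parser and the classifier).  Returns the skipped count.
    """
    skipped = 0
    for raw in raw_messages:
        try:
            msg = parse(raw)
        except Exception:
            # A fouled-up message should not abort the whole run.
            skipped += 1
            continue
        train(msg)
    return skipped
```

Catching every Exception is deliberately broad here, since the parser can
fail in ways nobody anticipated; a real version would probably want to log
which message was dropped.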

Neale

From noreply at sourceforge.net  Thu Mar  6 09:26:28 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 13:12:03 2003
Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages
Message-ID: <E18qz8W-0000yp-00@sc8-sf-web1.sourceforge.net>

Bugs item #698852, was opened at 2003-03-06 17:26
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jeremy Hylton (jhylton)
Assigned to: Nobody/Anonymous (nobody)
Summary: can't classify messages

Initial Comment:
Traceback (most recent call last):

  File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify
    for word, wordProb in clues:

NameError: global name 'clues' is not defined

I don't know when the code broke, but it's been like
this for a long time.  There is no binding for clues
anywhere.
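
A bug like this can sit unnoticed because Python looks up a global name only
when the line actually executes.  A minimal sketch of the effect (on_classify
here is a made-up stand-in for illustration, not the pop3proxy handler):

```python
def on_classify():
    # 'clues' is never bound anywhere, yet defining the function succeeds.
    for word, word_prob in clues:
        print(word, word_prob)


try:
    on_classify()
except NameError as exc:
    # The error only appears once the function is actually called.
    error = str(exc)
```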


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

From nas at python.ca  Thu Mar  6 10:27:43 2003
From: nas at python.ca (Neil Schemenauer)
Date: Thu Mar  6 13:18:13 2003
Subject: [Spambayes] Integration with qmail?
In-Reply-To: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
References: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
Message-ID: <20030306182743.GA10575@glacier.arctrix.com>

Martinez, Michael - CSREES/ISTM wrote:
> I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
> would be appreciated.

I've got some code to do this.  I just need to make it available.
Perhaps this weekend (if MGS2 doesn't get the best of me :-).

  Neil

From noreply at sourceforge.net  Thu Mar  6 10:34:32 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 13:41:22 2003
Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages
Message-ID: <E18r0CO-00018p-00@sc8-sf-web4.sourceforge.net>

Bugs item #698852, was opened at 2003-03-06 11:26
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jeremy Hylton (jhylton)
>Assigned to: Tim Stone (timstone4)
Summary: can't classify messages

Initial Comment:
Traceback (most recent call last):

  File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify
    for word, wordProb in clues:

NameError: global name 'clues' is not defined

I don't know when the code broke, but it's been like
this for a long time.  There is no binding for clues
anywhere.


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-06 12:34

Message:
Logged In: YES 
user_id=645698

Wow.  You're right about the long time thing.  Apparently this isn't 
something that anybody does on a regular basis... There's no 
classification code anywhere in the function!

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

From noreply at sourceforge.net  Thu Mar  6 10:54:02 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 13:44:44 2003
Subject: [Spambayes] [ spambayes-Bugs-698852 ] can't classify messages
Message-ID: <E18r0VG-0002BO-00@sc8-sf-web4.sourceforge.net>

Bugs item #698852, was opened at 2003-03-06 11:26
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Jeremy Hylton (jhylton)
Assigned to: Tim Stone (timstone4)
Summary: can't classify messages

Initial Comment:
Traceback (most recent call last):

  File "/usr/local/lib/python2.3/site-packages/spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/local/bin/pop3proxy.py", line 1064, in onClassify
    for word, wordProb in clues:

NameError: global name 'clues' is not defined

I don't know when the code broke, but it's been like
this for a long time.  There is no binding for clues
anywhere.


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-06 12:54

Message:
Logged In: YES 
user_id=645698

Fixed

----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-03-06 12:34

Message:
Logged In: YES 
user_id=645698

Wow.  You're right about the long time thing.  Apparently this isn't 
something that anybody does on a regular basis... There's no 
classification code anywhere in the function!

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=698852&group_id=61702

From grobinson at transpose.com  Thu Mar  6 14:21:33 2003
From: grobinson at transpose.com (Gary Robinson)
Date: Thu Mar  6 14:21:29 2003
Subject: [Spambayes] Best tweak values
Message-ID: <BA8D09ED.1FE33%grobinson@transpose.com>

Hi,

On the wiki that is pointed to in my LJ article
(http://spamland.org/jsp/Wiki?GarySpamArticle), I would like to mention the
parameters that have worked best in spambayes.

s and x?
f(w) values associated with the middle excluded words?
optimal spam/ham cutoff?
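
For reference, a minimal sketch of where s and x enter, using the formula from
the article: f(w) = (s*x + n*p(w)) / (s + n), with n the number of messages
containing w.  The default numbers below are placeholders for illustration,
not the tuned spambayes values, and is_excluded is a hypothetical name for the
middle-exclusion test.

```python
def f_w(p_w, n, s=1.0, x=0.5):
    """Adjusted word probability f(w).

    p_w: raw spam probability for the word; n: number of messages seen
    containing it; s: strength given to the prior; x: assumed probability
    for a word never seen before.
    """
    return (s * x + n * p_w) / (s + n)


def is_excluded(fw, min_distance=0.1):
    """True if f(w) is too close to 0.5 to be used as a clue."""
    return abs(fw - 0.5) < min_distance
```

With n = 0 the result is just x, and as n grows f(w) moves toward p(w).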

Thanks to anyone who can help--

--Gary

-- 
[http://ThisURLEnablesEmailToGetThroughOverzealousSpamFilters.org]

Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.transpose.com
http://radio.weblogs.com/0101454


From N7DR at arrisi.com  Thu Mar  6 14:12:43 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Thu Mar  6 16:12:57 2003
Subject: [Spambayes] pop3proxy crashes
In-Reply-To: <OKHAJDE7WVB71WOJHESN082VZUC8.3e676e2f@myst>
References: <3E660576.15567.1F786E44@localhost>
Message-ID: <3E67575B.3086.24A05058@localhost>

On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote:

> Nearly as I can tell, your training database has been corrupted.  I'm
> not quite sure how this happened, but from what I see in the code, there
> is likely no recovery at this point.  When you submit a bug report, go
> ahead and attach your training database.
> 

Which file is that? (he asks, hoping that it's not the 45MB 
hammie.db.dat file...)

  Doc
--------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From skip at pobox.com  Thu Mar  6 15:21:38 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar  6 16:21:50 2003
Subject: [Spambayes] pop3proxy crashes
In-Reply-To: <3E67575B.3086.24A05058@localhost>
References: <3E660576.15567.1F786E44@localhost>
        <3E67575B.3086.24A05058@localhost>
Message-ID: <15975.48098.666495.579958@montanaro.dyndns.org>

    Doc> Which file is that? (he asks, hoping that it's not the 45MB
    Doc> hammie.db.dat file...)

yeah, hammie.db.*.  Just zip them up (there should be .dir and maybe .bak
files as well) and attach them.  They'll probably compress pretty well.
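
If you'd rather do it from Python than from a zip tool, here is a minimal
sketch.  It assumes the database files all start with "hammie.db" and live in
one directory; zip_database and its parameter names are made up for
illustration, and it is written for a current Python.

```python
import glob
import os
import zipfile


def zip_database(directory, archive_path, pattern="hammie.db*"):
    """Zip every file in directory matching pattern; return the names added."""
    archive_abs = os.path.abspath(archive_path)
    # Collect the paths first so the archive never tries to include itself.
    paths = [
        p for p in sorted(glob.glob(os.path.join(directory, pattern)))
        if os.path.abspath(p) != archive_abs
    ]
    names = []
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in paths:
            name = os.path.basename(path)
            zf.write(path, arcname=name)
            names.append(name)
    return names
```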

Skip

From T.A.Meyer at massey.ac.nz  Fri Mar  7 11:02:36 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar  6 17:03:15 2003
Subject: [Spambayes] statistical comparison of enviroment? 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D561@its-xchg4.massey.ac.nz>

> Alex> Aye.  The problem, of course, is that we could start making
> Alex> spambayes so tricked-out that it'd be as slow as SpamAssassin. ;-)
> 
> Not necessarily.  If A and B prove to not be independent, we 
> dump one and
> keep the other.  In some situations, spambayes may actually 
> perform fewer tricks, thus speeding it up.

I must say that this is one of the things that I think spambayes has really got right.  TimP's insistence on only including the best option (via deathmatches :), and on not including anything unless testing proved that it helped, has, IMO, kept spambayes nice and neat.

(Which is not to say that more options shouldn't be examined - at least if they're in the archives, then if they are ever needed, the work is already done).

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Fri Mar  7 11:06:53 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar  6 17:07:30 2003
Subject: [Spambayes] statistical comparison of enviroment? 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D562@its-xchg4.massey.ac.nz>

> Testing of new tokens like this has dropped off since about
> last October... spambayes is already good enough for just
> about everyone to be happy.  My recent tests on training
> methods seem to show that accuracy has been dropping off for
> the last two months, though, so it may be time to revisit
> this problem...

I'm (slowly) wading through the archives (interesting reading, but *long*), and have reached about this point.  It does seem that the majority of the testing was done on certain collections of spam (along with lots of different ham).  I wonder whether things got tuned a little too closely to that, and now that the spam is a little different, some options might need to be relooked at (rather than just retraining).

Once I'm done with the archives (and then the options stuff), I'll try and set up a testing system so that I can work on that.  I'm personally most interested in the effects of aging, the ham:spam ratio (with the current code), and how long spambayes takes to become effective, so I'll concentrate on those.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Fri Mar  7 11:08:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar  6 17:09:18 2003
Subject: [Spambayes] statistical comparison of enviroment?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318D563@its-xchg4.massey.ac.nz>

> ok. in the interest of time saving (i've not programmed in 
> python before), how about i [tabular] list what i find and 
> let the statisticians in the group decide if there is 
> significance?
If you want anything in particular coded, feel free to post a feature request on SF and if no-one else gets to it, I'll give it a go (the implementation; I'd probably leave most of the testing to you/others).

> (unless there is a standardized sample that is preferable).
Personally, I think the more standardised samples are avoided, the better.  Otherwise, we're just building a spam filter that recognises a particular collection of spam.

=Tony Meyer

From noreply at sourceforge.net  Thu Mar  6 14:21:05 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 17:55:22 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-693423 ] email message generates error in
	pop3proxy.py
Message-ID: <E18r3jd-00086p-00@sc8-sf-web2.sourceforge.net>

Bugs item #693423, was opened at 2003-02-25 23:02
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: David Shaw (dshaw)
Assigned to: Tim Stone (timstone4)
Summary: email message generates error in pop3proxy.py

Initial Comment:
Hi all,
  A friend of mine had a cache file in his "unknown" folder that caused the "review" web page in pop3proxy.py to generate the following traceback:

Traceback (most recent call last):

  File "spambayes/Dibbler.py", line 398, in found_terminator
    getattr(plugin, name)(**params)

  File "pop3proxy.py", line 929, in onReview
    judgement = judgement.split(';')[0].strip()

  File "pop3proxy.py", line 815, in _makeMessageInfo
    print type(text)

AttributeError: 'list' object has no attribute 'replace' 

He sent me the offending message, and I replicated the problem:

msg = open("/Users/dshaw/Desktop/crash_spam.txt", "r")
message = mbox.get_message(msg)
part = typed_subpart_iterator(message, 'text', 'plain').next()
text = part.get_payload()
>>> text
[<email.Message.Message instance at 0x275ff0>]


So, instead of text, the payload is a list containing a single email message instance.  Here are the objects' respective payloads:

>>> message._payload
[<email.Message.Message instance at 0x279290>, <email.Message.Message instance at 0x279160>, <email.Message.Message instance at 0x279e00>, <email.Message.Message instance at 0x280b10>, <email.Message.Message instance at 0x281340>, <email.Message.Message instance at 0x2828d0>, <email.Message.Message instance at 0x283300>, <email.Message.Message instance at 0x2b60a0>, <email.Message.Message instance at 0x27f4d0>, <email.Message.Message instance at 0x2b7c70>, <email.Message.Message instance at 0x2b9ac0>, <email.Message.Message instance at 0x2b8c30>, <email.Message.Message instance at 0x2bb770>, <email.Message.Message instance at 0x2bc180>]
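
One way to guard against this, as a minimal sketch: walk the message and only
accept a payload that really is a string, whatever the part's declared type
says.  first_text is a hypothetical helper written for illustration (not the
fix that went into pop3proxy.py), the sample message is made up, and the code
is written for a current Python.

```python
import email

RAW = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XX"\n'
    "\n"
    "--XX\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--XX--\n"
)


def first_text(msg):
    """Return the payload of the first text/* part that holds a string, or ''."""
    for part in msg.walk():
        if part.is_multipart():
            continue  # payload is a list of sub-messages, not text
        if part.get_content_maintype() == "text":
            payload = part.get_payload()
            if isinstance(payload, str):
                return payload
    return ""


message = email.message_from_string(RAW)
```

Here message.get_payload() is a list, as in the report, while
first_text(message) returns the text of the inner part.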


----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-03-04 18:39

Message:
Logged In: YES 
user_id=645698

I just checked in a fix for this problem.  I have no ability to actually test it, 
though. Please try your test case again and let me know the outcome.

----------------------------------------------------------------------

Comment By: David Shaw (dshaw)
Date: 2003-02-28 10:34

Message:
Logged In: YES 
user_id=244639

Seems to be fixed!  Thanks.

----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-02-27 22:29

Message:
Logged In: YES 
user_id=645698

I just checked in a fix for this problem.  I have no ability to actually test it, 
though. Please try your test case again and let me know the outcome.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=693423&group_id=61702

From mhammond at skippinet.com.au  Fri Mar  7 10:01:21 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Mar  6 18:02:02 2003
Subject: [Spambayes] Outlook plugin error
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D959@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEMEOEAA.mhammond@skippinet.com.au>

> I always assumed that it was somehow related to the fact that mails arrive
> asynchronously, and could therefore arrive when the plugin "wasn't ready"
> somehow.

I have no idea how Outlook sends its events - but our plugin is called with
one event for each message that arrives.  We process this event
synchronously - ie, the event handler does not return until the message has
been processed by us.

Thus, from our POV, we are always ready.  I have reason to suspect that
Outlook delivers these events synchronously on the main Outlook GUI thread,
but have no proof or documentary evidence.  Occasionally, I have reason to
believe they do come on different threads.  Occasionally, I have reason to
believe I should check <wink>

But I see no evidence that there is conflict.  If a message is moved
underneath us, we get a MAPI_E_NOT_FOUND error (as the entryid changes).  If
something else changes the object underneath us, we get a
MAPI_E_OBJECT_CHANGED error which we can handle and retry.  We currently
*don't* have retry code in place, but we have never seen
MAPI_E_OBJECT_CHANGED (that would currently dump an exception to the debug
window, and leave the message unscored rather than zero)
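
If we ever do need it, the retry could look something like this minimal
sketch.  ObjectChangedError is a hypothetical stand-in for however the
MAPI_E_OBJECT_CHANGED failure surfaces in Python, and save_with_retry is a
made-up name; none of this is in the plugin today.

```python
class ObjectChangedError(Exception):
    """Hypothetical stand-in for a MAPI_E_OBJECT_CHANGED failure."""


def save_with_retry(operation, attempts=3):
    """Call operation(); retry if the object changed underneath us.

    Re-raises the last ObjectChangedError if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ObjectChangedError:
            if attempt == attempts - 1:
                raise
```

In practice the operation would have to re-open the message before each
attempt, otherwise the retry would just hit the same stale object.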

The most-important-by-far thing I need to know is if a trace message, such
as:

> Message 'RE: It was nice to see at Amazon today...' had a Spam classification of 'No'

appears for these messages with a spam score of zero which "show clues"
shows as non-zero.  Just don't forget that "show clues" reporting
5.38458e-015 is really reporting zero <wink>

Mark.


From N7DR at arrisi.com  Thu Mar  6 16:06:07 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Thu Mar  6 18:06:14 2003
Subject: [Spambayes] pop3proxy crashes
In-Reply-To: <15975.48098.666495.579958@montanaro.dyndns.org>
References: <3E67575B.3086.24A05058@localhost>
Message-ID: <3E6771EF.7385.3183D8@localhost>

On 6 Mar 2003 at 15:21, Skip Montanaro wrote:

>     Doc> Which file is that? (he asks, hoping that it's not the 45MB
>     Doc> hammie.db.dat file...)
> 
> yeah, hammie.db.*.  Just zip them up (there should be .dir and maybe
> .bak files as well) and attach them.  They'll probably compress pretty
> well.
> 

I get the message from sourceforge:
  Could Not Attach File to Item: ArtifactFile: File must be > 20 bytes 
and < 256000 bytes in length Item Successfully Created 

which sort-of-suggests that it made an entry in the bug database but 
would not include the ZIPped database file (which ended up being about 
2MB after a maximum-compression ZIP).

  Doc
--------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From noreply at sourceforge.net  Thu Mar  6 15:11:03 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 18:19:58 2003
Subject: [Spambayes] [ spambayes-Bugs-699063 ] pop3proxy.py crashes
Message-ID: <E18r4Vz-0007QY-00@sc8-sf-web1.sourceforge.net>

Bugs item #699063, was opened at 2003-03-06 16:11
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: D. R. Evans (n7dr)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy.py crashes

Initial Comment:
pop3proxy.py worked fine for a couple of weeks.

I then rebooted my Linux box (Mandrake 8.1), and since then pop3proxy.py produces the following output 
on the console:

Loading database...
Traceback (most recent call last):
  File "./pop3proxy.py", line 1577, in ?
    run()
  File "./pop3proxy.py", line 1551, in run
    state.createWorkers()
  File "./pop3proxy.py", line 1161, in createWorkers
    self.bayes = storage.DBDictClassifier(filename)
  File "./spambayes/storage.py", line 140, in __init__
    self.load()
  File "./spambayes/storage.py", line 152, in load
    t = self.db[self.statekey]
  File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__
    return Unpickler(f).load()
EOFError

The database files are attached.

  Doc


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702

From tim at fourstonesExpressions.com  Thu Mar  6 17:22:18 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar  6 18:22:25 2003
Subject: [Spambayes] pop3proxy crashes
In-Reply-To: <3E6771EF.7385.3183D8@localhost>
Message-ID: <76PJRL87221SRPMKITR054ZVP98C9MK.3e67d82a@myst>

Go ahead and reply to this mail with the file attached, then...

3/6/2003 5:06:07 PM, "D. R. Evans" <N7DR@arrisi.com> wrote:

>On 6 Mar 2003 at 15:21, Skip Montanaro wrote:
>
>>     Doc> Which file is that? (he asks, hoping that it's not the 45MB
>>     Doc> hammie.db.dat file...)
>> 
>> yeah, hammie.db.*.  Just zip them up (there should be .dir and maybe
>> .bak files as well) and attach them.  They'll probably compress pretty
>> well.
>> 
>
>I get the message from sourceforge:
>  Could Not Attach File to Item: ArtifactFile: File must be > 20 bytes 
>and < 256000 bytes in length Item Successfully Created 
>
>which sort-of-suggests that it made an entry in the bug database but 
>would not include the ZIPped database file (which ended up being about 
>2MB after a maximum-compression ZIP).
>
>  Doc
>--------------------------------------------------------------
>Phone:  +1 303 494 0394
>Mobile: +1 720 839 8462
>Fax:    +1 781 240 0527
>--------------------------------------------------------------
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From noreply at sourceforge.net  Thu Mar  6 15:51:13 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 18:43:51 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-695142 ] Email does not render subject in the
	"Review" Page
Message-ID: <E18r58r-0003gT-00@sc8-sf-web2.sourceforge.net>

Bugs item #695142, was opened at 2003-02-28 10:40
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David Shaw (dshaw)
Assigned to: Tim Stone (timstone4)
>Summary: Email does not render subject in the "Review" Page

Initial Comment:
I received the attached email.  When I go to the "review" web page of pop3proxy.py, all it shows is:

Messages classified as Unsure:       From:  
(none)                                            (none)

It acts as though the message has no "from" or "subject", even though they exist.  The user is not given any way to classify this message other than to click on the first "(none)" and read the raw message to determine its contents.  I will attach the message below.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-06 17:51

Message:
Logged In: YES 
user_id=645698

This is another email package parsing 'error' caused by a malformed 
header in the attached email.  The content-type header has an embedded 
\r\n, which causes the email package to barf and discard all the 
headers.

IMO, the email package is being used in Spambayes in 
ways that it was never intended for.  Malformed mail is gonna be the death 
of us, and the email package just doesn't seem to handle it very 
well.

I'm gonna leave this bug open, but there's virtually nothing 
that can be done to make things better, at least not AFAIK.
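
One way to soften the symptom on the review page (it does nothing for the parser itself, and it is not something spambayes is known to do) would be to fall back to scanning the raw text when the parsed message comes back with no From or Subject. A minimal sketch; raw_header is a made-up helper name:

```python
import re


def raw_header(raw_text, name):
    """Return the value of the first 'Name: value' line in raw_text, or None.

    Hypothetical display-only fallback: it scans the whole raw message line
    by line, so it can also match a header-like line inside the body.
    """
    pattern = r"^%s:[ \t]*(.*)$" % re.escape(name)
    match = re.search(pattern, raw_text, re.IGNORECASE | re.MULTILINE)
    return match.group(1).strip() if match else None


# A header block broken by a stray line, roughly as described above.
raw = ("Content-Type: text/plain;\r\n"
       "this line breaks the header block\r\n"
       "Subject: Cheap diplomas\r\n"
       "From: someone@example.com\r\n"
       "\r\n"
       "body text\r\n")
print(raw_header(raw, "subject"))
print(raw_header(raw, "from"))
```

Because it can pick up text from the body, a fallback like this is only suitable for showing the user something better than "(none)", not for tokenizing.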

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702

From noreply at sourceforge.net  Thu Mar  6 15:51:54 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 18:43:53 2003
Subject: [Spambayes] [ spambayes-Bugs-673388 ] pop3proxy storage
Message-ID: <E18r59W-0003iN-00@sc8-sf-web2.sourceforge.net>

Bugs item #673388, was opened at 2003-01-23 16:02
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=673388&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: François Granger (fgranger)
Assigned to: Nobody/Anonymous (nobody)
Summary: pop3proxy storage

Initial Comment:
I had a look in the pop3proxy folders, and I found these strange files. They are missing headers and maybe part of the message.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-06 17:51

Message:
Logged In: YES 
user_id=645698

Cannot recreate.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=673388&group_id=61702

From tim at fourstonesExpressions.com  Thu Mar  6 21:07:26 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar  6 22:07:32 2003
Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes
In-Reply-To: <3E67575B.3086.24A05058@localhost>
Message-ID: <TSPJNIUSD8NHPOBGDYXHDWQQNTQ43B9.3e680cee@myst>

3/6/2003 3:12:43 PM, "D. R. Evans" <N7DR@arrisi.com> wrote:

>On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote:
>
>> Nearly as I can tell, your training database has been corrupted.  I'm
>> not quite sure how this happened, but from what I see in the code, there
>> is likely no recovery at this point.  When you submit a bug report, go
>> ahead and attach your training database.

The database is definitely corrupted.  This is the first time I've seen this.  
The 'saved state' key in the database (where spamcount and hamcount are 
maintained) has a corrupt value, that kills the unpickler.

There are >88,000 words in this database, and apparently the machine was 
rebooted without a proper shutdown.  This is bad.

D.R. I need you to do a couple things:

If you have the spam and ham saved in an mbox or something, then you can 
simply delete the database files and retrain from scratch.  This would be the 
best alternative.  If this isn't the case, if you can remember, or figure out 
some way, how many spams and hams were trained into this database, I can 
recover it for you.  Even a rough estimate will likely do. 
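
For the record, the kind of repair meant here could look roughly like the sketch below: try to read the state record, and if it will not unpickle, overwrite just that record with an estimate. The key name "saved state" is taken from the description above; the layout of the record (a plain tuple of counts) and the name load_state are assumptions for illustration, not the actual storage.py code.

```python
import pickle
import shelve

STATE_KEY = "saved state"  # assumed key name, taken from the description above


def load_state(shelf, estimate=None):
    """Return the stored state; if it will not unpickle, store `estimate` instead.

    `estimate` stands in for whatever the record should hold, e.g. a
    hypothetical (spam_count, ham_count) tuple.
    """
    try:
        return shelf[STATE_KEY]
    except (EOFError, pickle.UnpicklingError):
        if estimate is None:
            raise  # nothing to fall back on
        shelf[STATE_KEY] = estimate  # overwrite only the damaged record
        return estimate


# Demo on an in-memory shelf whose state record is an empty (truncated) pickle.
# The counts are made up.
demo = shelve.Shelf({})
demo.dict[STATE_KEY.encode("utf-8")] = b""
print(load_state(demo, estimate=(1200, 3400)))
```

Only the one damaged record is touched; the word entries stay as they are, which is why a rough estimate of the two counts is enough to get going again.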

And... can you tell me, if you know, what dbm module is in use?  Maybe someone 
can give us a few lines of python you can run that will tell us that info.  
It's too late for me to bring it to mind...
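
A few lines along these lines should answer that. This is a sketch for a current Python, where the check lives in dbm.whichdb; on the Python 2.2 shown in the traceback the same function was in the standalone whichdb module (import whichdb; whichdb.whichdb("hammie.db")). The wrapper name identify_dbm is made up.

```python
import dbm


def identify_dbm(path):
    """Return the name of the dbm module that appears to have written `path`.

    dbm.whichdb returns None if the file is missing or unreadable, and an
    empty string if it exists but the format cannot be guessed.
    """
    kind = dbm.whichdb(path)
    if kind is None:
        return "missing or unreadable"
    return kind or "unknown format"


# Pass the base name without the .dat/.dir/.bak extension.
print(identify_dbm("hammie.db"))
```

Run it from the directory that holds the database files. A set of .dat, .dir and .bak files is what the "dumb" dbm implementation writes, so that is the likely answer here.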

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From noreply at sourceforge.net  Thu Mar  6 19:56:14 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar  6 22:52:15 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-699174 ] mboxtrain only trains on cur in maildir
Message-ID: <E18r8xy-0005RU-00@sc8-sf-web4.sourceforge.net>

Bugs item #699174, was opened at 2003-03-06 21:56
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Matthew Cowles (mdcowles)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain only trains on cur in maildir

Initial Comment:
When training on a maildir, mboxtrain trains only on
the messages in the subdirectory cur. It ignores
messages in the subdirectory new. Since new is for
messages that haven't been seen, I think it's worth
looking there since at least some spam will have been
filed unseen.

I'll upload a patch that makes it train on both.
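
The idea, sketched independently of the uploaded patch (this is not the patch itself, and maildir_messages is a made-up name): walk both cur and new, and leave tmp alone since it holds deliveries still in progress.

```python
import os


def maildir_messages(maildir, subdirs=("cur", "new")):
    """Yield paths of message files under the given maildir subdirectories."""
    for sub in subdirs:
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue  # tolerate a maildir with a missing subdirectory
        for name in sorted(os.listdir(folder)):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                yield path
```

A trainer would then loop over maildir_messages(path_to_maildir) and feed each file to the classifier, whatever the training call happens to be.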

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702

From mhammond at skippinet.com.au  Fri Mar  7 21:58:56 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Mar  7 06:00:00 2003
Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software
Message-ID: <002101c2e498$8cd2eab0$530f8490@eden>

I had to share this irony :)

I received this spam, selling anti-spam software!  I was a little
disappointed that spambayes scored it as only a "maybe".  So I checked the
clues - the top 6 ham clues were:

word                                spamprob         #ham  #spam
'*H*'                               0.0438937           -      -
'*S*'                               0.78226             -      -
'manually'                          0.0184302          35      1
'mapi'                              0.0266272           8      0
'keyword'                           0.0302013           7      0
'source,'                           0.0302013           7      0
'inbox'                             0.0401784          26      2
'algorithm'                         0.0652174           3      0

So sadly, a cruel irony is that spambayes let me down here - by knowing that
I work on anti-spam software, it scored this anti-spam spam as ham.

Even-funnier-is-that-I-am-slammed <wink/hic> ly,

Mark.

-----Original Message-----
From: vgarner6570@winning.com [mailto:vgarner6570@winning.com]On Behalf Of
eagleclaw3449@lawyer.com
Sent: None
To: Mhammond
Subject: Mhammond, Intelligent antispam IER software


TheVeryBest - Software Downloads
 Top-Rank Software Download Site on the Internet  
Internet->Email->Spam Remedy v1.5 PRO

Spam Remedy        (3.17MB)


Description: 

The powerful, effective and intelligent anti-spam tool.
It automatically cleans spam messages out of your mailbox before you receive
or read them. 

Features:

Automatically Blocking Spam
Spam Remedy automatically checks your mail boxes and filters unwanted,
dangerous, or offensive mail messages to save your time from manually
detecting and organizing mail messages. 
Effectively Spam Detecting
A complex Aritificial Intelligence algorithm has been used in Spam Remedy
product to detecting legitimate mail messages and spam messages,the
technique has more precision than other filter-based and keyword-based
anti-spam technologies. 
Be Sure You Get Your Right Mail Messages
Spam Remedy doesn't confirm a spam message by a single keyword in mail
content. It examines the entire message - source, headers and mail content
to confirm whether it is a spam message. 
Supports Multiple Email Types and Almost All Email Clients 
Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and MAPI email
accounts,Directly works with almost all email clients(Outlook Express, Becky
Mail,Foxmail,Outlook, The bat!, Eudora etc.), espacially includes support
for web-based Hotmail/MSN email clients. Nothing you need to change to your
email clients! 
Easy to use  - You don't need to set any complex filter rules, just add your
email accounts to Spam Remedy and then it works. 
Friends List and Rejecting List
With Friends List and Rejecting List,you have the chance to decide who are
never blocked or directly treat their mail messages as spam. 
Keep your inbox clean
Spam Remedy places all intercepted spam messages to its interval mail
database so that your inbox remains uncluttered and free of spam.If for some
reason a legitimate email is flagged as spam, you can easily recover in
multiple ways. 

Editor's Rating: 


Copyright ©2002-2003 DarkSoft Group  All Rights Reserved. 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 1276 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/53ad725e/winmail.bin
From tim at fourstonesExpressions.com  Fri Mar  7 07:05:33 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 08:05:38 2003
Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software
In-Reply-To: <002101c2e498$8cd2eab0$530f8490@eden>
Message-ID: <IERLLTXTEALHSNVRHF711UMGC7ZW3Y.3e68991d@myst>

Gary Robinson and I just yesterday had a conversation about sending spam to 
advertise wecanstopspam.org... LOL!!! We decided that it was just too low a 
stoop.

3/7/2003 4:58:56 AM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>I had to share this irony :)
>
>I received this spam, selling anti-spam software!  I was a little
>disappointed that spambayes scored it as only a "maybe".  So I checked the
>clues - the top 6 ham clues were:
>
>word                                spamprob         #ham  #spam
>'*H*'                               0.0438937           -      -
>'*S*'                               0.78226             -      -
>'manually'                          0.0184302          35      1
>'mapi'                              0.0266272           8      0
>'keyword'                           0.0302013           7      0
>'source,'                           0.0302013           7      0
>'inbox'                             0.0401784          26      2
>'algorithm'                         0.0652174           3      0
>
>So sadly, a cruel irony is that spambayes let me down here - by knowing that
>I work on anti-spam software, it scored this anti-spam spam as ham.
>
>Even-funnier-is-that-I-am-slammed <wink/hic> ly,
>
>Mark.
>
>-----Original Message-----
>From: vgarner6570@winning.com [mailto:vgarner6570@winning.com]On Behalf Of
>eagleclaw3449@lawyer.com
>Sent: None
>To: Mhammond
>Subject: Mhammond, Intelligent antispam IER software
>
>
>TheVeryBest - Software Downloads
> Top-Rank Software Download Site on the Internet  
>Internet->Email->Spam Remedy v1.5 PRO
>
>Spam Remedy        (3.17MB)
>
>
>
>Description: 
>
>The powerful, effective and intelligent anti-spam tool.
>It automatically cleans spam messages out of your mailbox before you receive
>or read them. 
>
>Features:
>
>Automatically Blocking Spam
>Spam Remedy automatically checks your mail boxes and filters unwanted,
>dangerous, or offensive mail messages to save your time from manually
>detecting and organizing mail messages. 
>Effectively Spam Detecting
>A complex Aritificial Intelligence algorithm has been used in Spam Remedy
>product to detecting legitimate mail messages and spam messages,the
>technique has more precision than other filter-based and keyword-based
>anti-spam technologies. 
>Be Sure You Get Your Right Mail Messages
>Spam Remedy doesn't confirm a spam message by a single keyword in mail
>content. It examines the entire message - source, headers and mail content
>to confirm whether it is a spam message. 
>Supports Multiple Email Types and Almost All Email Clients 
>Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and MAPI email
>accounts,Directly works with almost all email clients(Outlook Express, Becky
>Mail,Foxmail,Outlook, The bat!, Eudora etc.), espacially includes support
>for web-based Hotmail/MSN email clients. Nothing you need to change to your
>email clients! 
>Easy to use  - You don't need to set any complex filter rules, just add your
>email accounts to Spam Remedy and then it works. 
>Friends List and Rejecting List
>With Friends List and Rejecting List,you have the chance to decide who are
>never blocked or directly treat their mail messages as spam. 
>Keep your inbox clean
>Spam Remedy places all intercepted spam messages to its interval mail
>database so that your inbox remains uncluttered and free of spam.If for some
>reason a legitimate email is flagged as spam, you can easily recover in
>multiple ways. 
>
>Editor's Rating: 
>
>
>Copyright ©2002-2003 DarkSoft Group  All Rights Reserved. 
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Fri Mar  7 10:59:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 11:59:13 2003
Subject: [Spambayes] full o' spaces
Message-ID: <15976.53209.395058.683195@montanaro.dyndns.org>

I just received a message (attached) in which every word in the body was
space-separated.  There were thus no clues at all in the body and the clues
in the header weren't enough to pull it out of the unsure classification.
I'm working on a tokenizer patch.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: diploma.msg
Type: application/octet-stream
Size: 2365 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/685a6654/diploma.obj
From tim at fourstonesExpressions.com  Fri Mar  7 11:01:28 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 12:01:34 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org>
Message-ID: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>

Ya, I noticed that same thing yesterday.  Maybe an "excessive whitespace" 
clue, or "many single character words" clue, or something like that?

3/7/2003 10:59:05 AM, Skip Montanaro <skip@pobox.com> wrote:

>I just received a message (attached) in which every word in the body was
>space-separated.  There were thus no clues at all in the body and the clues
>in the header weren't enough to pull it out of the unsure classification.
>I'm working on a tokenizer patch.
>
>Skip
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From python-spambayes at discworld.dyndns.org  Fri Mar  7 11:19:02 2003
From: python-spambayes at discworld.dyndns.org (Charles Cazabon)
Date: Fri Mar  7 12:16:37 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>;
	from tim@fourstonesExpressions.com on Fri, Mar 07, 2003 at 11:01:28AM -0600
References: <15976.53209.395058.683195@montanaro.dyndns.org>
	<7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
Message-ID: <20030307111902.A12956@discworld.dyndns.org>

Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> wrote:
> Ya, I noticed that same thing yesterday.  Maybe an "excessive whitespace" 
> clue, or "many single character words" clue, or something like that?

Ratio of number of spaces to number of non-spaces in the body, perhaps?  Add a
metatoken if this exceeds 0.25 or something like that.
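
Roughly what I have in mind, as an untested-on-real-mail sketch. The token name and function names are invented, "non-spaces" is read here as non-whitespace characters, and 0.25 is only the figure floated above, so it would need tuning:

```python
def space_ratio(body):
    """Ratio of space characters to non-whitespace characters in body."""
    spaces = body.count(" ")
    others = sum(1 for ch in body if not ch.isspace())
    return spaces / others if others else 0.0


def whitespace_tokens(body, threshold=0.25):
    """Yield a metatoken when the space ratio exceeds the threshold."""
    if space_ratio(body) > threshold:
        yield "meta:excess-spaces"


print(list(whitespace_tokens("G e t  y o u r  d i p l o m a  t o d a y")))
```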

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From piersh at friskit.com  Fri Mar  7 09:51:49 2003
From: piersh at friskit.com (Piers Haken)
Date: Fri Mar  7 12:50:32 2003
Subject: [Spambayes] Improved comparison of classifier changes?
Message-ID: <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com>

(This came to me in a dream. No, really...)

When comparing two different classifier/tokenizer strategies, instead of
just comparing the numbers of false  negatives and positives, how about
comparing some function (product, sum, average,
some-more-appropriate-statistical-function?) of the spam probability of
all messages in each classification (spam, ham, false-positive,
false-negative)? This might give a slightly better indication of not
just the numbers of messages that were classified correctly/incorrectly,
but of how sure the classifier was when it made those decisions.

.. or was I just dreaming...?

Piers.

From tim at fourstonesExpressions.com  Fri Mar  7 11:59:19 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 12:59:23 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307111902.A12956@discworld.dyndns.org>
Message-ID: <L4KIYS532ZPOPNQPGBCYCANJH52.3e68ddf7@myst>

3/7/2003 11:19:02 AM, Charles Cazabon <python-spambayes@discworld.dyndns.org> 
wrote:

>Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> wrote:
>> Ya, I noticed that same thing yesterday.  Maybe an "excessive whitespace" 
>> clue, or "many single character words" clue, or something like that?
>
>Ratio of number of spaces to number of non-spaces in the body, perhaps?  Add a
>metatoken if this exceeds 0.25 or something like that.

Any threshold we use for anything like this has to be configurable.  Otherwise 
the spammers will simply make sure they don't exceed the threshold...

In normal (english) language usage, there is probably a relatively well 
understood distribution of unigrams, bigrams, trigrams, and longer words.  Any 
'severe' departure from this distribution could be a very good spam clue.  For 
example, I could use the following to defeat a whitespace and unigram counting 
scheme:

Bu y  m ore  st u ff  t h an  yo u  EVE R  tho ug ht  you  c ou l d  h and le.

It's a bit harder to read than regular text, but the human brain is amazingly 
adaptive to stuff like this.  This kind of trickery is likely to be one avenue 
that spammers try to heavily use to defeat us.  (the other being malformation 
of mail, imo).
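
To put a number on "departure from the distribution", here is a small sketch that tallies word lengths for the example above and for its unmangled form. It only measures; what counts as a severe departure would have to come from real ham, which is not attempted here. The function names and the word regex are choices made for this illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"[A-Za-z0-9']+")


def word_lengths(text):
    """Count how many words of each length appear in text."""
    return Counter(len(word) for word in WORD.findall(text))


def mean_word_length(text):
    """Average word length in characters, or 0.0 for text with no words."""
    lengths = word_lengths(text)
    total = sum(lengths.values())
    if not total:
        return 0.0
    return sum(n * count for n, count in lengths.items()) / total


plain = "Buy more stuff than you EVER thought you could handle."
mangled = ("Bu y  m ore  st u ff  t h an  yo u  EVE R  tho ug ht  you  "
           "c ou l d  h and le.")
print(mean_word_length(plain), mean_word_length(mangled))
print(sorted(word_lengths(mangled).items()))
```

The mangled line has no "word" longer than three characters, and its mean word length drops from 4.4 to 1.76, even though a pure whitespace or single-character count might not flag it.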

Oh, and btw, don't believe for a second that spammers don't subscribe to this 
list :)


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From nas at python.ca  Fri Mar  7 10:14:02 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar  7 13:04:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org>
References: <15976.53209.395058.683195@montanaro.dyndns.org>
Message-ID: <20030307181402.GA13499@glacier.arctrix.com>

Skip Montanaro wrote:
> I just received a message (attached) in which every word in the body was
> space-separated.

I wouldn't worry about it too much.  It doesn't look like an effective
spam to me.  I gave up reading it after the first line.  I don't think
the bozos who respond to spam would make any more of an effort to read
it.

> I'm working on a tokenizer patch.

Perhaps we should be careful about adding stuff unless we can show a
statistically significant improvement in error rates given real test
data.

That said, it seems logical that it would be better if short words were
not completely discarded by the tokenizer.  Perhaps it would be enough
to remember the ratio of dropped words to generated tokens.  Something
like:

    'shortratio:2**%d' % log2(nshort / ntokens) 

As you can tell, I love logarithms (as any true engineer should). :-)
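
Spelled out as something runnable, the one-liner might look like the sketch below. It fills in details the line above leaves open, and those are guesses: true division, rounding the exponent down to an integer, and emitting nothing when either count is zero. The function name is invented.

```python
import math


def short_ratio_token(nshort, ntokens):
    """Return a 'shortratio:2**N' metatoken, or None if the ratio is undefined."""
    if nshort <= 0 or ntokens <= 0:
        return None  # log2 of zero or a division by zero; emit nothing
    exponent = math.floor(math.log2(nshort / ntokens))
    return "shortratio:2**%d" % exponent


print(short_ratio_token(8, 2))
print(short_ratio_token(1, 4))
```

Bucketing by powers of two keeps the number of distinct tokens small, so each bucket sees enough training messages to get a meaningful spamprob.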

Alternatively, perhaps we could just drop the lower limit on token
length.

  Neil

From tim at fourstonesExpressions.com  Fri Mar  7 12:13:22 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 13:13:27 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307181402.GA13499@glacier.arctrix.com>
Message-ID: <C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>

3/7/2003 12:14:02 PM, Neil Schemenauer <nas@python.ca> wrote:

>Skip Montanaro wrote:
>> I just received a message (attached) in which every word in the body was
>> space-separated.
>
>I wouldn't worry about it too much.  It doesn't look like an effective
>spam to me.  I gave up reading it after the first line.  I don't think
>the bozos who respond to spam would make any more of an effort to read
>it.

The fallacy here is that you're assuming that spammers will simply give up.  
They won't.  And a set of eyeballs looking at a mail, even if they stop 
reading after the first line, is better than no eyeballs.  So they'll keep 
trying things to defeat the algorithms, especially if their response rates are 
dropping.  

>
>> I'm working on a tokenizer patch.
>
>Perhaps we should be careful about adding stuff unless we can show a
>statistically significant improvement in error rates given real test
>data.

This strategy, which has been employed by the spambayes team up to this point, 
is very useful for research, but is quite reactive.  We're exiting the 
research phase of this project, and entering a product phase.  Reactive 
strategy is not appropriate for products (e.g. Microsoft security).  We must 
be proactive, and kill ideas before they become widespread in the spammer 
community.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From popiel at wolfskeep.com  Fri Mar  7 10:29:04 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar  7 13:29:09 2003
Subject: [Spambayes] Improved comparison of classifier changes? 
In-Reply-To: Message from "Piers Haken" <piersh@friskit.com> 
	<9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> 
References: <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com> 
Message-ID: <20030307182904.B81A82DDC7@cashew.wolfskeep.com>

In message:  <9891913C5BFE87429D71E37F08210CB9297597@zeus.sfhq.friskit.com>
             "Piers Haken" <piersh@friskit.com> writes:
>(This came to me in a dream. No, really...)
>
>When comparing two different classifier/tokenizer strategies, instead of
>just comparing the numbers of false  negatives and positives, how about
>comparing some function (product, sum, average,
>some-more-appropriate-statistical-function?) of the spam probability of
>all messages in each classification (spam, ham, false-positive,
>false-negative)? This might give a slightly better indication of not
>just the numbers of messages that were classified correctly/incorrectly,
>but of how sure the classifier was when it made those decisions.
>
>.. or was I just dreaming...?

Here's sample output from table.py:

filename:      rcb     rcB     rCb     rCB     Rcb     RcB     RCb     RCB
ham:spam:  2000:2000       2000:2000       2000:2000       2000:2000
                   2000:2000       2000:2000       2000:2000       2000:2000
fp total:        3       3       3       3       3       3       3       3
fp %:         0.15    0.15    0.15    0.15    0.15    0.15    0.15    0.15
fn total:       12      14      16      14      12      12      12      12
fn %:         0.60    0.70    0.80    0.70    0.60    0.60    0.60    0.60
unsure t:       53      37      50      39      40      31      37      32
unsure %:     1.32    0.93    1.25    0.97    1.00    0.78    0.93    0.80
real cost:  $52.60  $51.40  $56.00  $51.80  $50.00  $48.20  $49.40  $48.40
best cost:  $48.20  $45.20  $49.20  $45.60  $37.20  $38.80  $40.60  $38.60
h mean:       0.40    0.32    0.35    0.32    0.31    0.30    0.29    0.29
h sdev:       5.39    4.71    5.12    4.68    4.55    4.47    4.47    4.43
s mean:      98.45   98.68   98.35   98.68   98.75   98.85   98.72   98.85
s sdev:       9.76    9.57   10.46    9.58    9.08    9.06    9.37    9.11
mean diff:   98.05   98.36   98.00   98.36   98.44   98.55   98.43   98.56
k:            6.47    6.89    6.29    6.90    7.22    7.28    7.11    7.28

So yes, when using the test harness and associated tools, we do
compare more than just the fp and fn counts.  We also look at
percentages, a weighted cost function, the best possible cost
achievable just by moving the ham and spam cutoffs, and the
mean scores, their separation, and their standard deviations.
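
For anyone reading the table cold, the last rows can be reproduced from the numbers in it. The figures are consistent with a cost of $10 per false positive, $1 per false negative and $0.20 per unsure (3*10 + 12*1 + 53*0.20 = 52.60 for the rcb column), and with k being the mean difference divided by the sum of the two standard deviations (98.05 / (5.39 + 9.76) = 6.47). That is inferred from the output above rather than read out of table.py, so take the sketch below as an approximation; the function names are made up and the use of the population standard deviation is a guess.

```python
from statistics import mean, pstdev


def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a run; the default weights are inferred from the table."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


def separation(ham_scores, spam_scores):
    """Summarise how far apart the ham and spam score distributions sit."""
    h_mean, s_mean = mean(ham_scores), mean(spam_scores)
    h_sdev, s_sdev = pstdev(ham_scores), pstdev(spam_scores)
    spread = h_sdev + s_sdev
    return {
        "h mean": h_mean, "h sdev": h_sdev,
        "s mean": s_mean, "s sdev": s_sdev,
        "mean diff": s_mean - h_mean,
        # Larger k means the two populations overlap less.
        "k": (s_mean - h_mean) / spread if spread else float("inf"),
    }


print("%.2f" % total_cost(fp=3, fn=12, unsure=53))
print(separation([0, 2], [98, 100])["k"])
```

The scores are taken to be percentages from 0 to 100, as in the table.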

We just haven't done much tokenizer testing lately, so these
reports aren't obvious in the recent archives.

- Alex

From bill at parducci.net  Fri Mar  7 11:21:06 2003
From: bill at parducci.net (bill parducci)
Date: Fri Mar  7 14:21:11 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.53209.395058.683195@montanaro.dyndns.org>
References: <15976.53209.395058.683195@montanaro.dyndns.org>
Message-ID: <3E68F122.5040502@parducci.net>

welcome to MEME mail! :o)

i have been working on some ideas on how to attack this off and on for the last few months, but it is very difficult because [the]{mind}(is)quite|g00d`at~separating+the\message'fr0m_the^TEXT. it is this work that prompted my initial query into what is being done with tokenization on this list.

if it would help, i can send/post a few sample messages that i have been using to test my work. i have also come up with a crude mechanism for trying to work around it. hasn't been tested and needs a lot of work (it is written in <blush/> vb). anyway, if anyone is interested i can show what i have come up with so far.

b

Skip Montanaro wrote:
> I just received a message (attached) in which every word in the body was
> space-separated.  There were thus no clues at all in the body and the clues
> in the header weren't enough to pull it out of the unsure classification.
> I'm working on a tokenizer patch.
> 
> Skip


From tim.one at comcast.net  Fri Mar  7 14:22:51 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Mar  7 14:23:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307181402.GA13499@glacier.arctrix.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHIEKJFBAA.tim.one@comcast.net>

[Neil Schemenauer]
> ...
> That said, it seems logical that it would be better if short words were
> not completely discarded by the tokenizer.  Perhaps it would be enough
> to remember the ratio of dropped words to generated tokens.  Something
> like:
>
>     'shortratio:2**%d' % log2(nshort / ntokens)
>
> As you can tell, I love logarithms (as any true engineer should). :-)

I've mentioned before that the metatoken

    (number of bytes)/(number of words)

was a very strong indicator in early tests.  An unusually high ratio of
bytes to words was a very strong spam indicator; spam with the interspersed
whitespace gimmick would have an unusually low ratio.  I didn't check in the
code, though, because it made no difference in error rates at the time.
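
For reference, a sketch of what such a metatoken could look like. This is a reconstruction from the description, not the code from those early tests; the token name, the UTF-8 byte count, and the truncation to whole bytes per word are all assumptions.

```python
def bytes_per_word(body):
    """Average number of bytes per whitespace-separated word in body."""
    words = body.split()
    return len(body.encode("utf-8")) / len(words) if words else 0.0


def density_token(body):
    """Bucket the bytes-per-word ratio into an integer-valued metatoken."""
    return "bytes-per-word:%d" % int(bytes_per_word(body))


print(density_token("Buy more stuff"))
print(density_token("B u y m o r e"))
```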

But a single token doesn't carry much weight, and any gimmick that reduces
response rate (including those that make text harder to read) probably won't
last long.

> Alternatively, perhaps we could just drop the lower limit on token
> length.

Experiments were run on that, and they hurt.  See "How big should 'a word'
be?" in tokenizer.py.

Note that we have a configurable limit for the upper end of how big a word
can be.  The evidence in favor of adding it was (at best) weak.


From bill at parducci.net  Fri Mar  7 11:24:00 2003
From: bill at parducci.net (bill parducci)
Date: Fri Mar  7 14:24:04 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
References: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
Message-ID: <3E68F1D0.7060806@parducci.net>

or... ratio of non 'a-z|A-Z|0-9' vs. 'a-z|A-Z|0-9'?

he says (with physical attribute analysis on the brain :o)

b

Tim Stone - Four Stones Expressions wrote:
> Ya, I noticed that same thing yesterday.  Maybe an "excessive whitespace" 
> clue, or "many single character words" clue, or something like that?
> 
> 3/7/2003 10:59:05 AM, Skip Montanaro <skip@pobox.com> wrote:


From nas at python.ca  Fri Mar  7 11:42:34 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar  7 14:33:02 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
References: <20030307181402.GA13499@glacier.arctrix.com>
	<C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
Message-ID: <20030307194234.GA13770@glacier.arctrix.com>

Tim Stone - Four Stones Expressions wrote:
> The fallacy here is that you're assuming that spammers will simply give up.  
> They won't.  And a set of eyeballs looking at a mail, even if they stop 
> reading after the first line, is better than no eyeballs.

I have to respectfully disagree.  Spammers _need_ people to respond to
their spam.  If a filter avoidance trick kills the response rate they
will stop using it.  There is no point in bloating spambayes with every
failed trick they try.  That's why I suggested testing with a real
corpus.  If a trick is common enough that detecting it significantly
affects the error rate then fine, add code for it.  Otherwise, forget
about it and keep spambayes lean and mean.

> So they'll keep trying things to defeat the algorithms, especially if
> their response rates are dropping.  

Sure.  However, they will only continue using a trick if it defeats
filters _and_ gets an acceptable response rate.

> This strategy, which has been employed by the spambayes team up to this point, 
> is very useful for research, but is quite reactive.  We're exiting the 
> research phase of this project, and entering a product phase.  Reactive 
> strategy is not appropriate for products (e.g. Microsoft security).

I disagree.  We should not abandon the rigorous, testing based strategy
that got SB to its current state.  Adding more code every time a spammer
comes up with a new trick is completely reactionary and will eventually
destroy the code base.  I'm mystified as to how you can call such an
approach proactive.

> We must be proactive, and kill ideas before they become widespread in
> the spammer community.

We don't need to worry about spammers' ideas that will be killed by
other forces.  Perhaps it comes down to a question of objectives.  If
your objective is to keep spam out of your mailbox then trying to detect
all spam, effective or not, makes sense.  My objective is to destroy the
spam business.  One way to do that is to have a widely deployable filter
that blocks spam that would make spammers money.  Honestly, for me to
hit delete for a few spam messages in my inbox is not a big deal.  It is
the fact that these people are wasting millions of people's time.

  Neil

From tim at fourstonesExpressions.com  Fri Mar  7 13:49:13 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 14:49:19 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com>
Message-ID: <E0HZXA01U75SP797IPLGBVLHPKRM.3e68f7b9@myst>

This is a great discussion.  We really should hash this out to everyone's 
satisfaction.

3/7/2003 1:42:34 PM, Neil Schemenauer <nas@python.ca> wrote:

>Tim Stone - Four Stones Expressions wrote:
>> The fallacy here is that you're assuming that spammers will simply give up.  
>> They won't.  And a set of eyeballs looking at a mail, even if they stop 
>> reading after the first line, is better than no eyeballs.
>
>I have to respectfully disagree.  Spammers _need_ people to respond to
>their spam.  If a filter avoidance trick kills the response rate they
>will stop using it.  There is no point in bloating spambayes with every
>failed trick they try.

This really wasn't what I was suggesting.  Rather, when we find a significant 
hole through which effective spam can squirt, we should plug it, rather than 
wait to see if any spammers find that same hole.

>  That's why I suggested testing with a real
>corpus.  If a trick is common enough that detecting it significantly
>affects the error rate then fine, add code for it.  Otherwise, forget
>about it and keep spambayes lean and mean.
>
>> So they'll keep trying things to defeat the algorithms, especially if
>> their response rates are dropping.  
>
>Sure.  However, they will only continue using a trick if it defeats
>filters _and_ gets an acceptable response rate.

If it defeats the filters then the response rate, however dismal, will be 
better than for spam that doesn't defeat the filters.

>
>> This strategy, which has been employed by the spambayes team up to this
>> point,
>> is very useful for research, but is quite reactive.  We're exiting the 
>> research phase of this project, and entering a product phase.  Reactive 
>> strategy is not appropriate for products (e.g. Microsoft security).
>
>I disagree.  We should not abandon the rigorous, testing based strategy
>that got SB to its current state.

Absolutely.  Rigorous testing is not the issue at all, in my mind.

>  Adding more code every time a spammer
>comes up with a new trick is completely reactionary and will eventually
>destroy the code base.  I'm mystified as to how you can call such an
>approach proactive.

Again, I was suggesting that we find the holes before they do.  I think that 
we should begin to think like spammers, not like people trying to defeat 
spammers.  If we were on the other side, what would we do?  Gosh, I can think 
of things, simple things.  And if I can find something that actually crashes 
the tokenizer, all the better.  I'll look at the code, more closely than most 
on this team ever will.  I'll find the holes, and blast away.  My goal?  Not 
to get spam into mailboxes, but to destroy the anti-spam community.  Make 
people give up hope that this problem really is/can be solved.  That's the way 
to make you and me go away.  Simply make it so people don't believe in us.

>
>> We must be proactive, and kill ideas before they become widespread in
>> the spammer community.
>
>We don't need to worry about spammers' ideas that will be killed by
>other forces.  Perhaps it comes down to a question of objectives.  If
>your objective is to keep spam out of your mailbox then trying to detect
>all spam, effective or not, makes sense.  My objective is to destroy the
>spam business.

The two objectives are identical.

>  One way to do that is to have a widely deployable filter
>that blocks spam that would make spammers money.  Honestly, for me to
>hit delete for a few spam messages in my inbox is not a big deal.  It is
>the fact that these people are wasting millions of people's time.
>
>  Neil
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From python-spambayes at discworld.dyndns.org  Fri Mar  7 14:02:47 2003
From: python-spambayes at discworld.dyndns.org (Charles Cazabon)
Date: Fri Mar  7 15:00:23 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com>;
	from nas@python.ca on Fri, Mar 07, 2003 at 11:42:34AM -0800
References: <20030307181402.GA13499@glacier.arctrix.com>
	<C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
	<20030307194234.GA13770@glacier.arctrix.com>
Message-ID: <20030307140247.A16563@discworld.dyndns.org>

Neil Schemenauer <nas@python.ca> wrote:

> > This strategy, which has been employed by the spambayes team up to this
> > point, is very useful for research, but is quite reactive.  We're exiting
> > the research phase of this project, and entering a product phase.
> > Reactive strategy is not appropriate for products (e.g. Microsoft
> > security).
> 
> I disagree.  We should not abandon the rigorous, testing based strategy that
> got SB to its current state.  Adding more code every time a spammer comes up
> with a new trick is completely reactionary and will eventually destroy the
> code base.  I'm mystified as to how you can call such an approach proactive.

Hear, hear.  Don't turn SpamBayes into a convoluted, hocus-pocus collection of
ad-hoc rules a la SpamAssassin.  Keep testing; if a technique doesn't
measurably improve the result, toss it.

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From skip at pobox.com  Fri Mar  7 14:06:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 15:06:27 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
References: <15976.53209.395058.683195@montanaro.dyndns.org>
        <7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
Message-ID: <15976.64431.573736.976718@montanaro.dyndns.org>


    Tim> Ya, I noticed that same thing yesterday.  Maybe an "excessive
    Tim> whitespace" clue, or "many single character words" clue, or
    Tim> something like that?

I tried the ratio of spaces to the total number of characters in the message
body, but that is inconclusive:

    >>> db = shelve.open("../hammie.db", "r")
    >>> for k in db.keys():
    ...   if k.startswith("space ratio"):
    ...     print k, db[k]
    ... 
    space ratio: 0.0 (1240, 399)
    space ratio: 0.1 (3950, 6603)
    space ratio: 0.2 (1405, 4562)
    space ratio: 0.3 (289, 231)
    space ratio: 0.4 (85, 51)
    space ratio: 0.5 (15, 16)
    space ratio: 0.6 (2, 2)
    space ratio: 0.8 (3, 0)

(Maybe I should be ignoring whitespace at the beginning of lines?)
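
One rough way to read the counts above: assuming each tuple is
(spam count, ham count), the first column sums to 6989 and the second to
11864, which lines up with the 7k+ spam / 11k+ ham corpus mentioned in this
thread.  A Graham-style estimate then normalizes each count by its corpus
size.  The tuple order and the names below are assumptions, and this is not
the exact probability spambayes computes.

```python
# Counts copied from the listing above; the (spam, ham) order is an assumption.
counts = {
    0.0: (1240, 399),
    0.1: (3950, 6603),
    0.2: (1405, 4562),
    0.3: (289, 231),
    0.4: (85, 51),
    0.5: (15, 16),
    0.6: (2, 2),
    0.8: (3, 0),
}

# Use the column totals as stand-ins for the corpus sizes.
nspam = sum(s for s, h in counts.values())
nham = sum(h for s, h in counts.values())


def spam_estimate(spam, ham):
    """Graham-style estimate: spam rate / (spam rate + ham rate)."""
    s = spam / nspam
    h = ham / nham
    return s / (s + h)


for ratio in sorted(counts):
    print("space ratio: %.1f  %.2f" % (ratio, spam_estimate(*counts[ratio])))
```

Under those assumptions the 0.5 bucket comes out at roughly 0.6, only mildly
spammy and backed by about 30 messages, which fits the "inconclusive"
reading.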

The diploma message has a space ration of right around 0.5.  I haven't
looked at other messages yet to see what the other messages with similar
ratios looked like.  Maybe the ratio of single-character words to the total
number of words would be better.
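
A minimal sketch of the two ratios discussed here, not the actual tokenizer
change.  The function names, the one-decimal bucketing, and the
"single-char ratio" token name are illustrative assumptions; only the
"space ratio: 0.5" token format is taken from the listing above.

```python
def space_ratio(body, strip_lines=False):
    """Fraction of characters in body that are spaces (0.0 if empty)."""
    if strip_lines:
        # Optionally ignore leading and trailing whitespace on each line.
        body = "\n".join(line.strip() for line in body.splitlines())
    if not body:
        return 0.0
    return body.count(" ") / len(body)


def single_char_word_ratio(body):
    """Fraction of whitespace-separated words that are one character long."""
    words = body.split()
    if not words:
        return 0.0
    return sum(1 for w in words if len(w) == 1) / len(words)


def ratio_token(name, value):
    """Bucket a ratio to one decimal place, e.g. 'space ratio: 0.5'."""
    return "%s: %.1f" % (name, value)


sample = "d i p l o m a"
print(ratio_token("space ratio", space_ratio(sample)))
print(ratio_token("single-char ratio", single_char_word_ratio(sample)))
```

For text spaced out one letter at a time, the space ratio sits near 0.5 by
construction, while the single-character word ratio goes all the way to 1.0,
so the second measure may separate such messages more sharply.  Whether it
helps would still have to be tested against a real corpus.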

Skip

From tim at fourstonesExpressions.com  Fri Mar  7 14:17:22 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 15:17:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.64431.573736.976718@montanaro.dyndns.org>
Message-ID: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst>

3/7/2003 2:06:07 PM, Skip Montanaro <skip@pobox.com> wrote:

>The diploma message has a space ration

I see you suffer from the same spelling disorder as I... I always write ration 
instead of ratio... lol

> of right around 0.5.  I haven't
>looked at other messages yet to see what the other messages with similar
>ratios looked like.  Maybe the ratio of single-character words to the total
>number of words would be better.

Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?

>
>Skip
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Fri Mar  7 14:25:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 15:25:20 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307194234.GA13770@glacier.arctrix.com>
References: <20030307181402.GA13499@glacier.arctrix.com>
        <C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
        <20030307194234.GA13770@glacier.arctrix.com>
Message-ID: <15977.34.990208.430806@montanaro.dyndns.org>


    Neil> Tim Stone - Four Stones Expressions wrote:
    >> The fallacy here is that you're assuming that spammers will simply
    >> give up.  They won't.  And a set of eyeballs looking at a mail, even
    >> if they stop reading after the first line, is better than no
    >> eyeballs.

    Neil> I have to respectfully disagree.  Spammers _need_ people to
    Neil> respond to their spam.  If a filter avoidance trick kills the
    Neil> response rate they will stop using it.  There is no point in
    Neil> bloating spambayes with every failed trick they try.  That's why I
    Neil> suggested testing with a real corpus.  If a trick is common enough

Yes, my corpus is currently 11,000+ hams and 7,000+ spams.  My first try
failed, but I think I know why.  In addition several people have suggested
some other things to try.

    >> We must be proactive, and kill ideas before they become widespread in
    >> the spammer community.

    Neil> We don't need to worry about spammers' ideas that will be killed
    Neil> by other forces.

Precisely.  This particular message landed right in the middle of the
unsure.  Training on it didn't affect its later classification much.  That
suggests that to swing that message into the spam region, one or more new
techniques need to be developed which highlight an attribute of that
message.

    Neil> My objective is to destroy the spam business.  One way to do that
    Neil> is to have a widely deployable filter that blocks spam that would
    Neil> make spammers money.  Honestly, for me to hit delete for a few
    Neil> spam messages in my inbox is not a big deal.  It is the fact that
    Neil> these people are wasting millions of people's time.

Correct, but as we all know, the spammers learn and we have no way of
directly measuring our effectiveness at destroying their business.  All we
can measure directly is how effective we are at segregating their messages
into spam folders.  It appears that this simple technique is sufficient to
move most spams into the unsure category (and thus viewed).

Skip

From tim at fourstonesExpressions.com  Fri Mar  7 14:29:33 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 15:29:38 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst>
Message-ID: <KJKG3Z2XF073KE831ZE9XRB6KJ2W95TO.3e69012d@myst>

>Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?

I'm thinking that, for English anyway, nu < nb < nt < nn is the rule.  If this 
rule is violated, then that's a spam indicator.  I sure don't know if that's 
the case with other languages, though...
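
One possible reading of that rule, as a sketch: take nu, nb, nt and nn to be
the number of one-letter, two-letter, three-letter and longer words in a
message.  That interpretation, the word regex, and the function names are
all assumptions made for illustration.

```python
import re


def length_class_counts(text):
    """Count words of length 1, 2, 3 and 4+ and return (nu, nb, nt, nn)."""
    nu = nb = nt = nn = 0
    for word in re.findall(r"[A-Za-z']+", text):
        n = len(word)
        if n == 1:
            nu += 1
        elif n == 2:
            nb += 1
        elif n == 3:
            nt += 1
        else:
            nn += 1
    return nu, nb, nt, nn


def violates_rule(text):
    """True when the counts are not strictly increasing (nu < nb < nt < nn)."""
    nu, nb, nt, nn = length_class_counts(text)
    return not (nu < nb < nt < nn)


print(violates_rule("d i p l o m a"))
```

A strict ordering like this is likely to be violated by many short,
legitimate messages too, so it would need checking against real ham before
being used as a clue.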

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From nas at python.ca  Fri Mar  7 12:40:02 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar  7 15:30:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <E0HZXA01U75SP797IPLGBVLHPKRM.3e68f7b9@myst>
References: <20030307194234.GA13770@glacier.arctrix.com>
	<E0HZXA01U75SP797IPLGBVLHPKRM.3e68f7b9@myst>
Message-ID: <20030307204002.GD13770@glacier.arctrix.com>

Tim Stone - Four Stones Expressions wrote:
> Rather, when we find a significant hole through which effective spam
> can squirt, we should plug it, rather than wait to see if any spammers
> find that same hole.

I agree (with emphasis on the word "effective").  If spammers don't care
about effectiveness then it will be extremely difficult to block their
messages.

> If it defeats the filters then the response rate, however dismal, will be 
> better than for spam that doesn't defeat the filters.

Nope.  If it costs them more money to send than what they make back it
will not be better.  Sending spam, however cheap, costs money.
Therefore, at some non-zero response rate it becomes unprofitable to
send it.
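
To put that argument in numbers, here is a small sketch.  The cost and
revenue figures are made up for illustration; only the break-even
relationship itself is the point.

```python
def break_even_response_rate(cost_per_message, revenue_per_response):
    """Response rate below which a mailing loses money."""
    return cost_per_message / revenue_per_response


def profit(messages, response_rate, cost_per_message, revenue_per_response):
    """Net profit of a mailing under a simple linear cost model."""
    return messages * (response_rate * revenue_per_response - cost_per_message)


# Hypothetical figures: $0.0005 to send one message, $25 earned per response.
rate = break_even_response_rate(0.0005, 25.0)
print("break-even response rate: %.6f" % rate)
print("profit at half that rate: %.2f" % profit(1000000, rate / 2, 0.0005, 25.0))
```

Any trick that pushes the response rate below the break-even point costs
more than it earns, however many filters it gets past.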

> Again, I was suggesting that we find the holes before they do.

Why?

> And if I can find something that actually crashes the tokenizer, all
> the better.

That's a different kettle of fish, I think.  Whatever the filter does,
it should not crash or lose email, no matter what the spammer does.  I'm
all for that kind of improvement.

> Not to get spam into mailboxes, but to destroy the anti-spam community.

Yikes, don't hurt me.  I think you meant the anti-anti-spam
community. :-)  Personally, I'm content with letting the anti-anti-spam
community do what they will.  If they come up with something the spam
community adopts then I think we can deal with it.

For example, the "HTML comments inside words" trick must be effective
since I'm seeing it fairly often now.  It's really a no-brainer, since
if the MTA understands HTML there is no visible difference in the
message.  Luckily SB already deals with this trick in a more general
way.
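
For readers unfamiliar with the trick, a minimal sketch of the narrow
countermeasure: strip HTML comments before tokenizing, so a word split by a
comment is rejoined.  This is only an illustration and not how SB handles
it (as noted above, SB deals with the trick in a more general way); the
names are illustrative.

```python
import re

# Non-greedy match so each comment is removed separately; DOTALL lets a
# comment span several lines.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def strip_html_comments(text):
    """Remove HTML comments so words split by them are rejoined."""
    return HTML_COMMENT.sub("", text)


print(strip_html_comments("Low mort<!-- random junk -->gage rates"))
```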

> Make people give up hope that this problem really is/can be solved.
> That's the way to make you and me go away.  Simply make it so people
> don't believe in us.

I'm having a little trouble parsing that.  I think you are saying that
if the filter doesn't achieve the objective of keeping spam out of
people's mailboxes then people will not adopt it.  That's true, but I
think the average person is fairly tolerant of FNs, as long as the FP
rate is very low.  I think FNs annoy spam filter hackers more than
regular people.

> >We don't need to worry about spammers' ideas that will be killed by
> >other forces.  Perhaps it comes down to a question of objectives.  If
> >your objective is to keep spam out of your mailbox then trying to detect
> >all spam, effective or not, makes sense.  My objective is to destroy the
> >spam business.
> 
> The two objectives are identical.

Nope.  Blocking all spam achieves both objectives while blocking only
effective spam achieves only the second.  Since effective spam is a
subset of all spam it could be easier to block.


  Neil

From nas at python.ca  Fri Mar  7 12:44:24 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar  7 15:34:48 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15976.64431.573736.976718@montanaro.dyndns.org>
References: <15976.53209.395058.683195@montanaro.dyndns.org>
	<7WT08HD1U1X52KHFD5YHCPM6Z3YTOLJ.3e68d068@myst>
	<15976.64431.573736.976718@montanaro.dyndns.org>
Message-ID: <20030307204424.GE13770@glacier.arctrix.com>

Skip Montanaro wrote:
> Maybe the ratio of single-character words to the total number of words
> would be better.

I like Tim's suggestion of bytes/tokens.  Could you give that a try?

  Neil

From python-spambayes at discworld.dyndns.org  Fri Mar  7 14:39:02 2003
From: python-spambayes at discworld.dyndns.org (Charles Cazabon)
Date: Fri Mar  7 15:36:37 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15977.34.990208.430806@montanaro.dyndns.org>;
	from skip@pobox.com on Fri, Mar 07, 2003 at 02:25:06PM -0600
References: <20030307181402.GA13499@glacier.arctrix.com>
	<C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
	<20030307194234.GA13770@glacier.arctrix.com>
	<15977.34.990208.430806@montanaro.dyndns.org>
Message-ID: <20030307143902.A16967@discworld.dyndns.org>

Skip Montanaro <skip@pobox.com> wrote:
> 
> > We don't need to worry about spammers' ideas that will be killed by other
> > forces.
> 
> Precisely.  This particular message landed right in the middle of the
> unsure.  Training on it didn't affect its later classification much.  That
> suggests that to swing that message into the spam region, one or more new
> techniques need to be developed which highlight an attribute of that
> message.

As more spammers use the technique, it automatically becomes a better
indicator of spamminess.  You don't really need to manually twiddle knobs.  At
most, adding a metatoken might help, but as Tim has pushed for all along, if
that doesn't make a measurable difference don't do it.

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From nas at python.ca  Fri Mar  7 12:55:40 2003
From: nas at python.ca (Neil Schemenauer)
Date: Fri Mar  7 15:46:05 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15977.34.990208.430806@montanaro.dyndns.org>
References: <20030307181402.GA13499@glacier.arctrix.com>
	<C82Z1XGBVYVNHRL43BK74YTMHIHC.3e68e142@myst>
	<20030307194234.GA13770@glacier.arctrix.com>
	<15977.34.990208.430806@montanaro.dyndns.org>
Message-ID: <20030307205540.GF13770@glacier.arctrix.com>

Skip Montanaro wrote:
> Correct, but as we all know, the spammers learn and we have no way of
> directly measuring our effectiveness at destroying their business.  All we
> can measure directly is how effective we are at segregating their messages
> into spam folders.

I think we can indirectly determine that by what techniques become
popular.  I suppose a quickly changing set of techniques could be
interpreted as a sign of effective filters.  Based on that, I would say
we are starting to get somewhere but the war is not over by a long shot.

  Neil

From pje at telecommunity.com  Fri Mar  7 16:01:18 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Fri Mar  7 16:01:06 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <KJKG3Z2XF073KE831ZE9XRB6KJ2W95TO.3e69012d@myst>
References: <07IG71RP63WR1YTNYSZW828NB8WKH.3e68fe52@myst>
Message-ID: <5.1.1.6.0.20030307155256.00ab5020@telecommunity.com>

At 02:29 PM 3/7/03 -0600, Tim Stone - Four Stones Expressions wrote:
> >Can you look at percentage of unigrams, bigrams, trigrams, and ngrams?
>
>I'm thinking that, for English anyway, nu < nb < nt < nn is the rule.  If this
>rule is violated, then that's a spam indicator.  I sure don't know if that's
>the case with other languages, though...

There may be a simple way to deal with the entire range of possible 
"character noise" techniques, be it whitespace, letter->number 
substitution, etc.  What if we simply create a meta-token which is driven 
by the ratio of recognized to unrecognized (non-meta) tokens?  In this way, 
the more noise a spammer adds to their message, the greater the probability 
that the message will be considered "noisy spam".  Repeats of the same 
message after training would result in the message being "recognized spam", 
repeats before training would be spotted by their being "noisy".
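
A rough sketch of such a meta-token, assuming the tokenizer's non-meta
tokens and the set of tokens already seen in training are both available.
The function name, the "noise:" prefix, and the ten-way bucketing are
illustrative assumptions, not existing spambayes features.

```python
def noise_token(tokens, known_tokens):
    """Return a meta-token describing how many tokens are unrecognized.

    tokens       -- non-meta tokens generated for one message
    known_tokens -- set of tokens already present in the training database
    """
    tokens = list(tokens)
    if not tokens:
        return "noise: none"
    unknown = sum(1 for t in tokens if t not in known_tokens)
    # Integer bucket from 0 (everything recognized) to 10 (nothing is).
    bucket = (10 * unknown) // len(tokens)
    return "noise: %d" % bucket


known = {"free", "diploma", "university"}
print(noise_token(["free", "d1pl0ma", "un1vers1ty"], known))
```

Because the result is just another token, the classifier would learn from
training how spammy each noise level is, rather than having a threshold
hard-coded.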

The natural spammer countermove to this is that they'll have to add lots of 
boilerplate "hammy" English text to bump themselves back into the "unsure" 
range, and/or begin adding noise only to highly spammy words.  I already 
get tons of spam about "seks" and "r4pe" and similar things.  I'm not sure 
what to do about these countermoves, but at least it puts us back on level 
ground with the spammers again.  I'm afraid that adding "bulk noise" like 
whitespace and punctuation to messages would be a too-easily automated 
anti-bayes move for spammers to adopt in general.


From skip at pobox.com  Fri Mar  7 15:11:56 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 16:12:04 2003
Subject: [Spambayes] ok, i'm confused
Message-ID: <15977.2844.223581.728734@montanaro.dyndns.org>

Here are the original X-Spambayes headers for the full-o'-spaces message:

  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62;
          'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90
  X-Spambayes-Classification: unsure; 0.46

After my latest tweak to the tokenizer (ratio of spaces to total number of
characters, after deleting leading and trailing whitespace on each line) and
complete retraining (11k+ ham 7k+ spam), I get:

  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35;
          'cc:2**2': 0.62;        'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77;
          'header:Received:3': 0.90
  X-Spambayes-Classification: spam; 0.95

I've done nothing to adjust the values displayed in the X-Spambayes-Debug
header, so all generated tokens should be displayed, and as you can see, all
displayed tokens are the same, before and after.  My space ratio token isn't
displayed (if I insert a print before the relevant yield statement I see it
has a value of 'space ratio: 0.9').  Why is the message now classified as
spam when before it was solidly in the middle of unsure?

Skip


From tim at fourstonesExpressions.com  Fri Mar  7 15:46:40 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 16:46:46 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307204002.GD13770@glacier.arctrix.com>
Message-ID: <B9ZTB0SOSMNJROQNMHE72MLTQRQ3ZX.3e691340@myst>

3/7/2003 2:40:02 PM, Neil Schemenauer <nas@python.ca> wrote:

>Tim Stone - Four Stones Expressions wrote:
>> Rather, when we find a significant hole through which effective spam
>> can squirt, we should plug it, rather than wait to see if any spammers
>> find that same hole.
>
>I agree (with emphasis on the word "effective").  If spammers don't care
>about effectiveness then it will be extremely difficult to block their
>messages.
>
>> If it defeats the filters then the response rate, however dismal, will be 
>> better than for spam that doesn't defeat the filters.
>
>Nope.  If it costs them more money to send than what they make back it
>will not be better.  Sending spam, however cheap, costs money.
>Therefore, at some non-zero response rate it becomes unprofitable to
>send it.

The above statement has nothing to do with the statement above it.

>
>> Again, I was suggesting that we find the holes before they do.
>
>Why?

I suppose you're satisfied with Microsoft's approach to security.  Let's just 
wait until some flood of spam makes it through our users' filters.  We'll then 
make a patch and post it.  Very few will install it.  In the meantime, users 
will conclude that our stuff doesn't work very well, and we've lost.

>> Not to get spam into mailboxes, but to destroy the anti-spam community.
>
>Yikes, don't hurt me.  I think you meant the anti-anti-spam
>community. :-)

Ya... heh  Reminds me of a political cartoon during the days the ABM treaty 
was being negotiated.  There were Ballistic Missiles, Anti-Ballistic Missiles, 
AABMs, AAABMs, etc. etc... <wink>

>  Personally, I'm content with letting the anti-anti-spam
>community do what they will.  If they come up with something the spam
>community adopts then I think we can deal with it.
>
>For example, the "HTML comments inside words" trick must be effective
>since I'm seeing it fairly often now.  It's really a no-brainer, since
>if the MTA understands HTML there is no visible difference in the
>message.  Luckily SB already deals with this trick in a more general
>way.

My point exactly.  Thank you for your tacit, though obviously accidental, 
agreement!

>
>> Make people give up hope that this problem really is/can be solved.
>> That's the way to make you and me go away.  Simply make it so people
>> don't believe in us.
>
>I'm having a little trouble parsing that.  I think you are saying that
>if the filter doesn't achieve the objective of keeping spam out of
>people's mailboxes then people will not adopt it.  That's true, but I
>think the average person is fairly tolerant of FNs, as long as the FP
>rate is very low.  I think FNs annoy spam filter hackers more than
>regular people.

You parsed it correctly.  Tolerant of FN is one thing, tolerant of a LOT of FN 
is quite another.  Essentially, what we have with no filtering is all FN.  I'm 
surprised at how annoyed I get when it misses only one.  Especially if that 
one is particularly offensive and I think it *should* have caught it.  But I'm 
working on this stuff, and so my tolerance is much higher than most.  It takes 
very little to convince the teeming masses that something is not worth the 
trouble it takes to install it and keep it going, and that trouble is 
considerable for spambayes.  Thus, at some (surprisingly low) threshold of FN, 
users will conclude that this stuff isn't worth the bother.  Maybe filtering 
technology really can't evolve as quickly as spam can.  I hope that's not the 
case.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Fri Mar  7 16:07:51 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 17:07:57 2003
Subject: [Spambayes] ok, i'm confused
In-Reply-To: <15977.2844.223581.728734@montanaro.dyndns.org>
Message-ID: <6Z083XDA83431VFB5YA0RN1V5ZKJ6.3e691837@myst>

Doesn't spambayes use the top 20 clues or so?  Debug doesn't print all the 
clues, and the combiner doesn't use them all, either, IIRC.  Maybe debug just 
isn't printing out everything that's being used?  Strange.  On the other hand, 
maybe this explains some of the FP and FN rate increases that have been 
reported of late... <wink>

3/7/2003 3:11:56 PM, Skip Montanaro <skip@pobox.com> wrote:

>Here are the original X-Spambayes headers for the full-o'-spaces message:
>
>  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
>          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62;
>          'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90
>  X-Spambayes-Classification: unsure; 0.46
>
>After my latest tweak to the tokenizer (ratio of spaces to total number of
>characters, after deleting leading and trailing whitespace on each line) and
>complete retraining (11k+ ham 7k+ spam), I get:
>
>  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
>          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35;
>          'cc:2**2': 0.62;        'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77;
>          'header:Received:3': 0.90
>  X-Spambayes-Classification: spam; 0.95
>
>I've done nothing to adjust the values displayed in the X-Spambayes-Debug
>header, so all generated tokens should be displayed, and as you can see, all
>displayed tokens are the same, before and after.  My space ratio token isn't
>displayed (if I insert a print before the relevant yield statement I see it
>has a value of 'space ratio: 0.9').  Why is the message now classified as
>spam when before it was solidly in the middle of unsure?
>
>Skip
>
>
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Fri Mar  7 17:09:15 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Mar  7 17:09:52 2003
Subject: [Spambayes] ok, i'm confused
In-Reply-To: <15977.2844.223581.728734@montanaro.dyndns.org>
Message-ID: <BIEJKCLHCIOIHAGOKOLHKEMEFBAA.tim.one@comcast.net>

[Skip Montanaro]
> Here are the original X-Spambayes headers for the full-o'-spaces message:
>
>   X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
> ...
>   X-Spambayes-Classification: unsure; 0.46
>
> After my latest tweak to the tokenizer (ratio of spaces to total number of
> characters, after deleting leading and trailing whitespace on
> each line) and complete retraining (11k+ ham 7k+ spam), I get:
>
>   X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
> ...
>   X-Spambayes-Classification: spam; 0.95
>
> I've done nothing to adjust the values displayed in the X-Spambayes-Debug
> header, so all generated tokens should be displayed, and as you
> can see, all displayed tokens are the same, before and after.

I removed that part, in order to make an internal inconsistency clearer:
the overall score is

            prob = (S-H + 1.0) / 2.0

and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~= 0.47.

> ...
> Why is the message now classified as spam when before it was solidly in
> the middle of unsure?

A sharper question is how (0.47-0.56 + 1.0) / 2.0 came out to be 0.95.
Answer that, and you'll know everything <wink>.
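
Working that arithmetic through, as a small sketch (the function names are
illustrative; the formula is the one quoted above):

```python
def overall_score(S, H):
    """Combine the spam (S) and ham (H) indicators into a single score."""
    return (S - H + 1.0) / 2.0


def required_gap(prob):
    """Inverse of the formula: the S - H gap needed to reach a given score."""
    return 2.0 * prob - 1.0


print(overall_score(0.47, 0.56))   # score implied by the displayed clues
print(required_gap(0.95))          # S - H needed for a 0.95 score
```

With the displayed values the score is about 0.455, which agrees with the
original "unsure; 0.46" header once rounding of H and S is allowed for.  A
score of 0.95 would need S to exceed H by about 0.9, so the debug header and
the classification header cannot both come from the same run.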


From tim at fourstonesExpressions.com  Fri Mar  7 17:00:47 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 18:00:54 2003
Subject: [Spambayes] full o' spaces
Message-ID: <ZUQKMKXSZUC9V09FCTOD8E0975JFWV.3e69249f@myst>

3/7/2003 4:54:23 PM, Francois Granger <francois.granger@free.fr> wrote:

>Word length seems to be a parameter with some "bracketed" values for 
>Western European languages. Some food for thought here (four-page 
>PDF document):
>
>http://arxiv.org/pdf/cs.CL/0102026

Very interesting.  Perhaps we should employ a variation of this algorithm... 
perhaps a simple average of word length, with high/low thresholds beyond which 
spam is indicated...
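
As a sketch of what that simple variation might look like: the thresholds
of 3.0 and 8.0 below are placeholders rather than values from the paper or
from any test run, and the token names are illustrative.

```python
def mean_word_length(text):
    """Average length of whitespace-separated words (0.0 if there are none)."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


def word_length_token(text, low=3.0, high=8.0):
    """Map the mean word length onto a coarse token for the classifier."""
    mean = mean_word_length(text)
    if mean == 0.0:
        return "wordlen: none"
    if mean < low:
        return "wordlen: short"
    if mean > high:
        return "wordlen: long"
    return "wordlen: normal"


print(word_length_token("d i p l o m a"))
print(word_length_token("please review the attached report"))
```

Emitting a token and letting training decide its weight avoids hard-coding
which side of the thresholds counts as spam.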

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Fri Mar  7 17:19:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 18:19:33 2003
Subject: [Spambayes] ok, i'm confused
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHKEMEFBAA.tim.one@comcast.net>
References: <15977.2844.223581.728734@montanaro.dyndns.org>
        <BIEJKCLHCIOIHAGOKOLHKEMEFBAA.tim.one@comcast.net>
Message-ID: <15977.10493.531680.816324@montanaro.dyndns.org>


    Tim> I removed that part, in order to make an internal inconsistency
    Tim> clearer: the overall score is

    Tim>             prob = (S-H + 1.0) / 2.0

    Tim> and 0.95 simply doesn't make any sense with H ~= 0.56 and S ~=
    Tim> 0.47.

Problem solved.  The message had already been run through spambayes once, so
it already had X-Spambayes-Classification and X-Spambayes-Debug headers.
The second time I ran it through hammiefilter manually I forgot to set
BAYESCUSTOMIZE, so it didn't add a new debug header.  It did, however,
replace the original classification header with the new one.  (Maybe all
X-Spambayes headers should be deleted by default?)
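
A minimal sketch of that idea using the standard email package: drop every
stale X-Spambayes-* header before the message is scored again.  The function
name is illustrative and this is not existing hammiefilter behaviour.

```python
import email


def strip_spambayes_headers(msg):
    """Delete every X-Spambayes-* header from an email message in place."""
    for name in list(msg.keys()):
        if name.lower().startswith("x-spambayes-"):
            # Deleting by name removes all occurrences of that header.
            del msg[name]
    return msg


raw = (
    "X-Spambayes-Classification: unsure; 0.46\n"
    "X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47\n"
    "Subject: diploma\n"
    "\n"
    "body text\n"
)
msg = strip_spambayes_headers(email.message_from_string(raw))
print(msg.keys())
```

With the old headers gone, a run that fails to add a new debug header can no
longer be mistaken for one that did.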

Here's what the Spambayes headers for that message look like now:

  X-Spambayes-Classification: spam; 1.00
  X-Spambayes-Debug: '*H*': 0.00; '*S*': 1.00; 'charset:us-ascii': 0.17;
          'header:Message-ID:1': 0.34; 'cc:2**2': 0.62; 'header:Mime-Version:1': 0.66;
          'to:addr:bugs': 0.73; 'skip:1 10': 0.76; 'bytes/words: 2': 0.84;
          'cc:addr:bugsmoke': 0.84; 'cc:addr:bugsmom16': 0.84;
          'cc:addr:bugsmom_1982': 0.84; 'from:addr:diplomas.org': 0.84;
          'from:addr:learning': 0.84; 'from:name:marie': 0.84;
          'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84;
          'to:addr:moi.com': 0.84; 'pfxlen:2': 0.87; 'cc:no real name:2**2': 0.87;
          'cc:addr:mojam.com': 0.89; 'cc:addr:yahoo.com': 0.89;
          'header:Received:3': 0.90; 'cc:addr:msn.com': 0.96;
          'cc:addr:gateway.net': 0.97; 'cc:addr:bugs': 0.99

Note there are many more clues than before as well:

  X-Spambayes-Classification: unsure; 0.46
  X-Spambayes-Debug: '*H*': 0.56; '*S*': 0.47; 'subject:none': 0.05;
          'charset:us-ascii': 0.17; 'header:Message-ID:1': 0.35; 'cc:2**2': 0.62;
          'header:Mime-Version:1': 0.65; 'skip:1 10': 0.77; 'header:Received:3': 0.90

Originally it was run against the spambayes software and database I have on
the Mojam web server (something I didn't notice at first either).
I think either the database or the software there is getting a bit
out-of-date.  Note the lack of cc:addr clues in the old output; those are
what put this message squarely in the spam domain.

At this point, I'm going to hold off on the bytes/words ratio stuff.  If
anyone wants to play around with it, I'll be happy to send you a context
diff for tokenize.py.

Skip

From skip at pobox.com  Fri Mar  7 17:37:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 18:37:50 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
Message-ID: <15977.11591.575821.556483@montanaro.dyndns.org>


While retraining today I flubbed at one point and wound up with a bunch of
duplicates in my training sets.  I wrote the attached script to eliminate
the duplicates.  I have a few questions:

    1. Is this worth checking into the contrib directory?

    2. Why did I have to subclass mailbox.PortableUnixMailbox?  It looks on
       the surface like mailbox.PortableUnixMailbox ought to work as-is (it
       has both __iter__() and next()), but if I use it directly without
       subclassing I get this:

            Traceback (most recent call last):
              File "singular.py", line 32, in ?
                main()
              File "singular.py", line 18, in main
                for msg in mbox:
            TypeError: iteration over non-sequence

       (BTW, I get the same error if I iterate over the mbox file using
       mboxutils.getmbox.)

    3. Is there a better way to emit the unique messages that doesn't
       require me to manually escape leading "From " sequences?

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/octet-stream
Size: 722 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030307/b3068bf5/attachment.obj
From popiel at wolfskeep.com  Fri Mar  7 16:52:49 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar  7 19:52:53 2003
Subject: [Spambayes] statistical comparison of enviroment? 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
   of "Fri, 07 Mar 2003 14:50:53 PST." <3E69224D.8010103@parducci.net> 
References: <3E668CA5.3050203@parducci.net>
	<20030306015916.5BEF62DEA4@cashew.wolfskeep.com> <3E66B1D6.90308@parducci.net>
	<20030306040336.77E4E2DEA4@cashew.wolfskeep.com>
	<3E69224D.8010103@parducci.net> 
Message-ID: <20030308005249.A465D2DDC7@cashew.wolfskeep.com>

In message:  <3E69224D.8010103@parducci.net>
             bill parducci <bill@parducci.net> writes:
>
>T. Alexander Popiel wrote:
>> We've actually got a pretty good testing infrastructure set up;
>> for tokenization tests, I personally use timcv.py with each of the
>> tokenization options and then feed the output of the runs into
>> table.py.  This produces some nice tabularizations that you may
>> notice in the mailing list archives.
>
>by any chance do you have an example of how this is initiated? (fyi: it
>seems that there is an issue with the command line 'help' option.)

Argh.  You're running into the same problem I did originally, due to the
testing stuff being in a subdir and the spambayes stuff not being on your
python path.  This is perhaps one of the most annoying bits about the
system.

I just checked in a fix to timcv.py which appropriately mangles the
python path before trying to import the spambayes stuff.  I don't
think this will break anybody... if it does, please tell me the proper
way to mangle the python path for an unprivileged user.  Remember,
I'm a relative python newbie, too.
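
For reference, the general shape of that kind of path mangling is roughly the following.  This is a sketch only, not the change that was checked in; add_parent_to_path is a hypothetical helper, and it assumes the package lives one directory above the test script.

```python
import os
import sys

def add_parent_to_path(script_file):
    """Put the directory above script_file's directory at the front of sys.path."""
    here = os.path.dirname(os.path.abspath(script_file))
    parent = os.path.dirname(here)
    if parent not in sys.path:
        sys.path.insert(0, parent)
    return parent

# A script in a subdirectory would call add_parent_to_path(__file__)
# before its "import spambayes"-style imports.
```

Because it only touches sys.path for the running process, it needs no special privileges.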


As to more general instructions:

1. Set up your corpora in subdirectories named Data/Ham/reservoir and
   Data/Spam/reservoir, with one message per file.  The splitndirs.py
   under utilities may be of help here if you're starting from mboxes,
   or es2hs.py under testtools if you're starting from an MH setup
   like mine.

2. If you're going to do any incremental testing, sort and group the
   corpora with sort+group.py.

3. Decide how many sets you want for your cross-validation.  Personally,
   I use 5.  Then use either rebal.py (from the utilities) or mksets.py
   (from testtools) to populate the sets, depending on whether or not
   you chose to sort+group... mksets.py doesn't like filenames not in the
   special format for incremental testing.

4. Set up an .ini file with whatever options you want to use as baseline.
   Set the BAYESCUSTOMIZE environment variable to that .ini file, then
   run timcv.py and capture the output.

5. Set up another .ini file with whatever options you want to test.
   Set the BAYESCUSTOMIZE environment variable to that .ini file, then
   run timcv.py again and capture the output to a different file.

6. Run table.py on the two output files from timcv.py.  Mail the results
   to the list. :-)
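
Steps 4 and 5 can be scripted along these lines.  Treat it as a sketch under assumptions: the "-n" flag for the number of sets and the script location are guesses to be checked against timcv.py's own usage text, and build_run and run_and_capture are hypothetical names.

```python
import os
import subprocess
import sys

def build_run(ini_path, nsets, script="timcv.py"):
    """Return (command, environment) for one cross-validation run."""
    env = dict(os.environ)
    env["BAYESCUSTOMIZE"] = ini_path   # options file, as in steps 4 and 5
    # Assumption: "-n" selects the number of sets; verify before relying on it.
    cmd = [sys.executable, script, "-n", str(nsets)]
    return cmd, env

def run_and_capture(ini_path, out_path, nsets=5):
    """Run the test driver and save its output for table.py (step 6)."""
    cmd, env = build_run(ini_path, nsets)
    with open(out_path, "w") as out:
        subprocess.run(cmd, env=env, stdout=out, check=True)

# Hypothetical usage, one call per .ini file:
# run_and_capture("baseline.ini", "baseline.out")
# run_and_capture("newtoken.ini", "newtoken.out")
```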

Enjoy.

- Alex

From bill at parducci.net  Fri Mar  7 17:19:01 2003
From: bill at parducci.net (bill parducci)
Date: Fri Mar  7 20:19:05 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <ZUQKMKXSZUC9V09FCTOD8E0975JFWV.3e69249f@myst>
References: <ZUQKMKXSZUC9V09FCTOD8E0975JFWV.3e69249f@myst>
Message-ID: <3E694505.40208@parducci.net>

i know that fixed length delimiting has been tried, but i wonder how well it would work for something like this if all the non 'a-zA-Z0-9' chars were removed first (basically creating 1 'superword' per region). it would seem to speak to a number of issues like:

s p a c e s  i n  p l a c e s

l.o.w..p.r.o.f.i.l.e,,c,h,a,r,s
and_low_profile_chars

CamelCaseTyping

(bracketing){and}[bracketing] 
(a)(n)(d) (b)(r)(a)(c)(k)(e)(t)(i)(n)(g)

fence|posting|!fence!posting

this is the direction of thinking that i started down when i was first confronted with this because of the power of wetware to absorb a MEME; it led me to many hours of fruitless delimiter selection examination. this is not at all to say that this will be the case here but as new ideas are bandied about, i posit that it is a good idea to make sure that previously discarded methodologies be reexamined periodically.
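
a minimal sketch of the 'superword' idea, assuming a region is just a string and that everything outside a-zA-Z0-9 is dropped; superword is a hypothetical name and this has not been tested against any corpus:

```python
import re

def superword(region):
    """Drop every character outside a-zA-Z0-9, leaving one token per region."""
    return re.sub(r"[^a-zA-Z0-9]", "", region)

for sample in ["s p a c e s  i n  p l a c e s",
               "l.o.w..p.r.o.f.i.l.e,,c,h,a,r,s",
               "(a)(n)(d) (b)(r)(a)(c)(k)(e)(t)(i)(n)(g)"]:
    print(superword(sample))
# prints: spacesinplaces, lowprofilechars, andbracketing (one per line)
```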

b


From tim.one at comcast.net  Fri Mar  7 20:20:27 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Mar  7 20:21:03 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.11591.575821.556483@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKJEAAB.tim.one@comcast.net>

[Skip Montanaro]
> While retraining today I flubbed at one point and wound up with a bunch of
> duplicates in my training sets.  I wrote the attached script to eliminate
> the duplicates.  I have a few questions:
>
>     1. Is this worth checking into the contrib directory?

Not for Outlook users <wink>.

>     2. Why did I have to subclass mailbox.PortableUnixMailbox?

You shouldn't have to, and you shouldn't have to check for "msg is None"
either.  Note that some of the earliest scripts in the codebase don't do
either.  For example, from split.py:

    mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
    for msg in mbox:
        if random.random() < percent:
            outfp = bin1out
        else:
            outfp = bin2out
        astext = str(msg)
        assert astext.endswith('\n')
        outfp.write(astext)

> ...
>     3. Is there a better way to emit the unique messages that doesn't
>        require me to manually escape leading "From " sequences?

Looks to me like the email pkg (at least the one in Python CVS) already does
the ">From" bit within msg bodies.  The *leading* "From " isn't supposed to
be escaped -- "From " at the start of a line within a body is supposed to be
escaped precisely so that an unescaped "From " at the start of a line is
recognized as the start of a new msg.


From popiel at wolfskeep.com  Fri Mar  7 18:05:24 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar  7 21:23:37 2003
Subject: [Spambayes] Bytes/words ratio
Message-ID: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>

Skip's bytes/words metatoken seems to be a bust.

-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> <stat> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2052 hams & 3838 spams against 8206 hams & 15350 spams
-> <stat> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
-> <stat> tested 2051 hams & 3837 spams against 8207 hams & 15351 spams
filename:  baseline.out  skiptok.out
ham:spam:   10258:19188  10258:19188
fp total:       16      16
fp %:         0.16    0.16
fn total:       52      52
fn %:         0.27    0.27
unsure t:      296     303
unsure %:     1.01    1.03
real cost: $271.20 $272.60
best cost: $252.40 $254.80
h mean:       0.40    0.39
h sdev:       5.35    5.31
s mean:      99.21   99.19
s sdev:       6.79    6.85
mean diff:   98.81   98.80
k:            8.14    8.12

Not much to say; all it did was make a few more things unsure by
spreading out the spam a bit more.  Blah.
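
The percentage and cost rows can be re-derived from the totals.  The sketch below assumes weights of $10 per false positive, $1 per false negative and $0.20 per unsure; those weights are inferred from the figures in the table, not stated in this message.

```python
def rate(errors, total):
    """Error count as a percentage of the messages of that kind."""
    return 100.0 * errors / total

def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Weighted cost; the default weights are inferred from the table."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

hams, spams = 10258, 19188
for name, unsure in [("baseline", 296), ("skiptok", 303)]:
    print("%s: fp %.2f%%, fn %.2f%%, unsure %.2f%%, cost $%.2f" % (
        name, rate(16, hams), rate(52, spams),
        rate(unsure, hams + spams), real_cost(16, 52, unsure)))
```

With those weights the seven extra unsures account for the whole $1.40 difference in real cost.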

- Alex

From spambayes_discussion at cklowe.com  Sat Mar  8 02:43:11 2003
From: spambayes_discussion at cklowe.com (Chris Lowe)
Date: Fri Mar  7 21:43:14 2003
Subject: [Spambayes] Outlook Express integration
Message-ID: <00cf01c2e51c$75810570$8f526451@blueeyes>

Hello

I'm a newbie in all sorts of ways, so please forgive me for being crass.

I've managed to get Spambayes working with Outlook Express, but it isn't
pretty.  Details are here:

http://www.apt202.net/cgi-bin/wiki.pl?SpamBayesOutlookExpress

Basically I've changed the hammie_header_name to 'To', so OE can filter on
it.  A few minor mods to pop3proxy.py were required because there's usually
another 'To' header present.

I personally think the HTML interface is OK for training, but I can see the
obvious attraction of an integrated solution as offered by the Outlook
plug-in.

The technique also seems to work OK with Netscape, but then again netscape
can cope OK with 'X-Spambayes-Classification' as a custom header.

Would you be so kind as to offer some suggestions on how I could improve
this?

Cheers,

Chris Lowe


From tim.one at comcast.net  Fri Mar  7 22:57:38 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Mar  7 22:58:10 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030307140247.A16563@discworld.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEKPEAAB.tim.one@comcast.net>

[Neil Schemenauer]
>> I disagree.  We should not abandon the rigorous, testing based
>> strategy that got SB to its current state.  Adding more code every
>> time a spammer comes up with a new trick is completely reactionary
>> and will eventually destroy the code base.

[Charles Cazabon]
> Hear, hear.  Don't turn SpamBayes into a convoluted, hocus-pocus
> collection of ad-hoc rules a la SpamAssassin.

Indeed, I'd rather keep it a convoluted, hocus-pocus collection of
tokenization gimmicks <0.9 wink>.  Really, I doubt SpamAssassin has anything
more bizarre than our "skip:" tokens, and I kept the latter because taking
them out hurt results.  I've never been sure why -- and I was never able to
find a way of summarizing thrown-out "too-long tokens" that did as well,
either.  There's magic enough to go around.  Also ego deflaters!  I'm still
convinced that preserving case should help, and also looking at (at least)
bigrams -- unfortunately, the data didn't agree.  It may in the future,
though, if spam gets more sophisticated.

> Keep testing; if a technique doesn't measurably improve the result, toss
> it.

At the time I got yanked from this project, I was looking to remove code
rather than add more.  There are too many tokenization options already, and
it isn't clear that some of them do anyone any good anymore.  The
gary_combining classifier scheme should also go away.


From tim at fourstonesExpressions.com  Fri Mar  7 22:01:42 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar  7 23:01:50 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEKPEAAB.tim.one@comcast.net>
Message-ID: <6ZA6HGVUIC6ZBA85APKQNJETQVQQP83.3e696b26@myst>

3/7/2003 9:57:38 PM, Tim Peters <tim.one@comcast.net> wrote:

> There's magic enough to go around.  Also ego deflaters!  I'm still
>convinced that preserving case should help, and also looking at (at least)
>bigrams -- unfortunately, the data didn't agree.  It may in the future,
>though, if spam gets more sophisticated.

The war will indeed be very interesting  ;)


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Fri Mar  7 22:28:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 23:28:25 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKJEAAB.tim.one@comcast.net>
References: <15977.11591.575821.556483@montanaro.dyndns.org>
        <LNBBLJKPBEHFEDALKOLCEEKJEAAB.tim.one@comcast.net>
Message-ID: <15977.29030.890609.602417@montanaro.dyndns.org>


    >> 2. Why did I have to subclass mailbox.PortableUnixMailbox?

    Tim> You shouldn't have to, and you shouldn't have to check for "msg is
    Tim> None" either.  Note that some of the earliest scripts in the
    Tim> codebase don't do either.  For example, from split.py:

        mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
        for msg in mbox:
            if random.random() < percent:
                outfp = bin1out
        ...

Yeah, I know.  That's how I originally wrote it.  Without the test against
None it just went into an infloop.

    >> 3. Is there a better way to emit the unique messages that doesn't
    >> require me to manually escape leading "From " sequences?

    Tim> Looks to me like the email pkg (at least the one in Python CVS)
    Tim> already does the ">From" bit within msg bodies.  

I figured it must have.  Must be something other than the .as_string()
method though.  It clearly doesn't escape "\nFrom " as "\n>From ".

    Tim> The *leading* "From " isn't supposed to be escaped --

Correct.

    Tim> "From " at the start of a line within a body is supposed to be
    Tim> escaped precisely so that an unescaped "From " at the start of a
    Tim> line is recognized as the start of a new msg.

I guess I was really asking if there's something better than .as_string() to
call when I want to emit a message.  I don't see anything obvious in the
online docs though.

Skip

From skip at pobox.com  Fri Mar  7 22:32:35 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 23:32:38 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>
References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>
Message-ID: <15977.29283.908599.21234@montanaro.dyndns.org>


    Alex> Skip's bytes/words metatoken seems to be a bust.

I take (mild) exception to that.  It was TimP's idea.  Perhaps I implemented
it wrong. ;-) Also, note that Tim indicated it helped in his early testing.

Skip

From skip at pobox.com  Fri Mar  7 22:36:38 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar  7 23:36:40 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.29030.890609.602417@montanaro.dyndns.org>
References: <15977.11591.575821.556483@montanaro.dyndns.org>
        <LNBBLJKPBEHFEDALKOLCEEKJEAAB.tim.one@comcast.net>
        <15977.29030.890609.602417@montanaro.dyndns.org>
Message-ID: <15977.29526.301050.465780@montanaro.dyndns.org>

    >>> 2. Why did I have to subclass mailbox.PortableUnixMailbox?

    Tim> You shouldn't have to, and you shouldn't have to check for "msg is
    Tim> None" either.  Note that some of the earliest scripts in the
    Tim> codebase don't do either.  For example, from split.py:

    Skip>         mbox = mailbox.PortableUnixMailbox(infp, mboxutils.get_message)
    Skip>         for msg in mbox:
    Skip>             if random.random() < percent:
    Skip>                 outfp = bin1out
    Skip>         ...

    Skip> Yeah, I know.  That's how I originally wrote it.  Without the test
    Skip> against None it just went into an infloop.

Yuck, badly worded.  I should have said something like

    Yeah, I know.  That's how I originally wrote it.  After subclassing
    PortableUnixMailbox to get the "for msg in mbox:" to succeed, without
    the test against None in the loop it just went into an infloop.

Skip

From tim.one at comcast.net  Sat Mar  8 00:01:05 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 00:01:37 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.29526.301050.465780@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMELEEAAB.tim.one@comcast.net>

[Skip Montanaro]
> Yuck, badly worded.  I should have said something like
>
>     Yeah, I know.  That's how I originally wrote it.  After subclassing
>     PortableUnixMailbox to get the "for msg in mbox:" to succeed, without
>     the test against None in the loop it just went into an infloop.

Except that you shouldn't have needed to subclass, just as the sample code I
showed didn't need to subclass.  That's where the problem lies.  After you
subclassed it, the None problem was probably due to the subclassing (indeed,
it clearly was due to the subclassing:  you had your subclass __iter__
return self, and self.next() can return None then; the
mailbox.PortableUnixMailbox.__iter__ returns iter(self.next, None), which
cannot return None).

To get anywhere else with this and without benefit of telepathy, you should
create a self-contained small test case and make sure you're using a
self-consistent set of factory-standard <wink> software.  The problem is why
you needed to subclass to begin with:  as you originally noted,
mailbox.PortableUnixMailbox already supplied __iter__, so it makes no sense
that you had to supply your own.  Something else is wrong.
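
The difference between the two __iter__ styles can be seen in a toy model.  This is a sketch in current Python, where the iterator method is spelled __next__; Box and SelfIteratingBox are made-up classes, not the mailbox module's code.

```python
import itertools

class Box:
    """Toy mailbox: next() returns a message, or None when exhausted."""
    def __init__(self, messages):
        self._pending = list(messages)

    def next(self):
        return self._pending.pop(0) if self._pending else None

    def __iter__(self):
        # Two-argument iter(): call self.next repeatedly, stop at None.
        return iter(self.next, None)

class SelfIteratingBox(Box):
    """The problematic variant: the object acts as its own iterator."""
    def __iter__(self):
        return self

    def __next__(self):
        # Never raises StopIteration, so None comes back forever.
        return self.next()

print(list(Box(["a", "b"])))                                  # ['a', 'b']
print(list(itertools.islice(SelfIteratingBox(["a", "b"]), 4)))  # ['a', 'b', None, None]
```

The second loop only terminates because islice cuts it off, which is the infloop in miniature.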


From tim.one at comcast.net  Sat Mar  8 00:05:01 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 00:05:39 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <15977.29283.908599.21234@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGELFEAAB.tim.one@comcast.net>

[Alex]
> Skip's bytes/words metatoken seems to be a bust.

[Skip]
> I take (mild) exception to that.  It was TimP's idea.  Perhaps I
> implemented it wrong. ;-) Also, note that Tim indicated it helped in his
> early testing.

Nope, I said it was a strong spam indicator, but that it made no difference
to error rates.  That's the same outcome Alex just reported (I didn't see a
significant difference in his before-and-after results; no change in FP or
FN, and (just) a few msgs tipped into Unsure).

Another example may help to clarify:  in just about anyone's test data,
"<br>" would be a very strong spam indicator, if the tokenizer produced it.
I expect that adding it into the mix would boost the FP rate, though -- at
least for those of us with sisters <wink>.


From tim.one at comcast.net  Sat Mar  8 00:08:14 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 00:08:53 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <6ZA6HGVUIC6ZBA85APKQNJETQVQQP83.3e696b26@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELFEAAB.tim.one@comcast.net>

[Tim Stone]
> The war will indeed be very interesting  ;)

Starting when <0.9 wink>?

From tim.one at comcast.net  Sat Mar  8 00:20:51 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 00:21:30 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15977.29030.890609.602417@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGELGEAAB.tim.one@comcast.net>

>     Tim> Looks to me like the email pkg (at least the one in Python CVS)
>     Tim> already does the ">From" bit within msg bodies.

[Skip Montanaro]
> I figured it must have.  Must be something other than the .as_string()
> method though.  It clearly doesn't escape "\nFrom " as "\n>From ".

Stick some prints in the code.  In the _handle_text() method, see whether
this block is getting executed (it should be):

        if self._mangle_from_:
            payload = fcre.sub('>From ', payload)

If it isn't, trace it back from there.

> ...
> I guess I was really asking if there's something better than
> .as_string() to call when I want to emit a message.  I don't see anything
> obvious in the online docs though.

I think Barry usually uses str(msg), which is equivalent to

    msg.as_string(unixfrom=True)

Either way, it leads pretty directly to the _mangle_from code quoted above.
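
If the convenience methods don't mangle in a given version, the generator can be asked for it explicitly.  The sketch below uses the current standard-library email package, which may not behave the same as the one under discussion here (as far as I know, current as_string() is documented as not mangling by default); to_mbox_text is a hypothetical helper name.

```python
import email
from email.generator import Generator
from io import StringIO

def to_mbox_text(msg):
    """Flatten msg with a leading "From " line and body "From " lines escaped."""
    buf = StringIO()
    Generator(buf, mangle_from_=True).flatten(msg, unixfrom=True)
    return buf.getvalue()

msg = email.message_from_string("Subject: test\n\nHello\nFrom here on, escaped.\n")
text = to_mbox_text(msg)
print(text)
```

Only the envelope line at the very top should start with an unescaped "From "; the body line comes out as ">From here on, escaped."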


From popiel at wolfskeep.com  Fri Mar  7 21:32:54 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar  8 00:32:59 2003
Subject: [Spambayes] Bytes/words ratio 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15977.29283.908599.21234@montanaro.dyndns.org> 
References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>
	<15977.29283.908599.21234@montanaro.dyndns.org> 
Message-ID: <20030308053254.CCEC52DDC7@cashew.wolfskeep.com>

In message:  <15977.29283.908599.21234@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Alex> Skip's bytes/words metatoken seems to be a bust.
>
>I take (mild) exception to that.  It was TimP's idea.  Perhaps I implemented
>it wrong. ;-) Also, note that Tim indicated it helped in his early testing.

Aye, you're right.  I should have said that it seems to be a bust
for my corpus.  My apologies.

Does anybody else have a decently sized corpus (I believe we were
using a minimum of 2000 each of spam and ham for the last shootout)
who's willing to test this goodie?

- Alex

From mike at plokta.com  Sat Mar  8 08:29:42 2003
From: mike at plokta.com (Mike Scott)
Date: Sat Mar  8 03:29:43 2003
Subject: [Spambayes] Headers and pop3proxy
Message-ID: <1C0EA690-5140-11D7-BE5F-000393DB4B0C@plokta.com>

Is there an easy way (perhaps a parameter in bayescustomize.ini) to get 
pop3proxy to add a header giving the spam probability score, as well as 
the one classifying the message as ham/unsure/spam? This would make it 
easier to fine-tune the min and max scores to get email classified 
correctly -- I get no false negatives at all, and not much in the 
unsure category, but I get  a few false positives. So I need to 
increase the spam cutoff (currently at 0.95), but I don't know how much.
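
To make the question concrete, here is a rough sketch of how the two cutoffs divide up the scores.  The function names, the ham cutoff default and the handling of scores exactly on a boundary are all assumptions for illustration, not pop3proxy's actual code.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.95):
    """Map a spam probability to a label; cutoff defaults are illustrative."""
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"

def header_value(score, **cutoffs):
    """Label plus score, in the style 'spam; 1.00'."""
    return "%s; %.2f" % (classify(score, **cutoffs), score)

print(header_value(0.96))                    # spam; 0.96
print(header_value(0.96, spam_cutoff=0.98))  # unsure; 0.96
```

Knowing the scores of the false positives would show how far the spam cutoff has to move to push them into unsure.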

-- 
Mike Scott
mike@plokta.com


From anthony at interlink.com.au  Sat Mar  8 19:45:00 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Sat Mar  8 03:45:28 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <E0HZXA01U75SP797IPLGBVLHPKRM.3e68f7b9@myst> 
Message-ID: <200303080845.h288j0C16756@localhost.localdomain>


>>> Tim Stone replying to Neil Schemenauer
> >  Adding more code every time a spammer
> >comes up with a new trick is completely reactionary and will eventually
> >destroy the code base.  I'm mystified as to how you can call such an
> >approach proactive.
> 
> Again, I was suggesting that we find the holes before they do. I think
> that we should begin to think like spammers, not like people trying to
> defeat spammers. If we were on the other side, what would we do? Gosh,
> I can think of things, simple things. And if I can find something
> that actually crashes the tokenizer, all the better. I'll look at the
> code, more closely than most on this team ever will. I'll find the
> holes, and blast away. My goal? Not to get spam into mailboxes, but to
> destroy the anti-spam community. Make people give up hope that this
> problem really is/can be solved. That's the way to make you and me go
> away. Simply make it so people don't believe in us.

We're not talking about something that crashes the tokenizer. We're 
talking about a new spam technique that's been seen in a very small 
number of live spams. I've not yet seen one of these, and I get an
absolute shiteload of spam every day. Note also that a lot of people
run spamassassin, and it's absolute death on this technique (called
"gappy text", from memory). The chances of this technique surviving
very long are very small.

We can sit here for days, weeks and months and think of ways to defeat
the existing classifier. We have done that, in the past. But a change that
is not tested and shown to improve existing results, does _not_ belong 
in the code base. It goes against _everything_ that has made this project 
successful. 

Sure - if you find a way to actually crash the tokeniser, then the fix
should go in. But "what if"ing serves no use, and may make things worse.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.

From anthony at interlink.com.au  Sat Mar  8 19:51:22 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Sat Mar  8 03:51:50 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEKPEAAB.tim.one@comcast.net> 
Message-ID: <200303080851.h288pMv16800@localhost.localdomain>


>>> Tim Peters wrote
> At the time I got yanked from this project, I was looking to remove code
> rather than add more.  There are too many tokenization options already, and
> it isn't clear that some of them do anyone any good anymore.  The
> gary_combining classifier scheme should also go away.

I was wondering about that last time I was trying to get some new graphs
for the SB website. Does anyone have any real objections to this going away?
If not, I'll kill it all on monday (I'll put a Last_Gary tag on the version
before the code removal).

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From skip at pobox.com  Sat Mar  8 07:32:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar  8 08:32:54 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGELGEAAB.tim.one@comcast.net>
References: <15977.29030.890609.602417@montanaro.dyndns.org>
        <LNBBLJKPBEHFEDALKOLCGELGEAAB.tim.one@comcast.net>
Message-ID: <15977.61700.812466.667050@montanaro.dyndns.org>


    Tim> Stick some prints in the code.  In the _handle_text() method, see
    Tim> whether this block is getting executed (it should be):

    Tim>         if self._mangle_from_:
    Tim>             payload = fcre.sub('>From ', payload)

Okay, I'll give that a try.  The reason I stuck in the replace() call was
that what it told me the number of messages was (len(d), where d is the dict
using md5 checksums as keys) differed from what "egrep '^From ' out" told me
after it had generated the output file (there were four more "^From " lines
than the number of messages in the dict).  Once I added the replace() call,
they agreed.  Given that, I think there's a bug without inserting prints.
(I had planned to submit a bug report today.)
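
The attachment was scrubbed from the archive, so the following is not that script, only a minimal sketch of the approach described: a dict keyed by md5 checksum of each message's text.  unique_messages is a hypothetical name, and the messages are assumed to be plain strings already.

```python
import hashlib

def unique_messages(messages):
    """Yield each distinct message text once, in first-seen order."""
    seen = {}
    for text in messages:
        digest = hashlib.md5(text.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen[digest] = text
            yield text

msgs = ["Subject: a\n\nhi\n", "Subject: b\n\nyo\n", "Subject: a\n\nhi\n"]
print(len(list(unique_messages(msgs))))  # 2
```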

Skip

From skip at pobox.com  Sat Mar  8 08:36:48 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar  8 09:36:50 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
Message-ID: <15978.0.395098.109027@montanaro.dyndns.org>


    >> 2. Why did I have to subclass mailbox.PortableUnixMailbox?

    Tim> You shouldn't have to...

*sigh*

I come before the bar asking humbly for forgiveness...  I was doing all this
from my ~/tmp directory, which, lo and behold, had a version of mailbox.py
dating from September 2001.  The _Mailbox class had next() but not
__iter__.  Who knows what other semantic differences existed.

Sorry for the wasted bandwidth.

Skip


From tim at fourstonesExpressions.com  Sat Mar  8 09:10:33 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 10:10:40 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <200303080851.h288pMv16800@localhost.localdomain>
Message-ID: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst>

3/8/2003 2:51:22 AM, Anthony Baxter <anthony@interlink.com.au> wrote:

>
>>>> Tim Peters wrote
>> At the time I got yanked from this project, I was looking to remove code
>> rather than add more.  There are too many tokenization options already, and
>> it isn't clear that some of them do anyone any good anymore.  The
>> gary_combining classifier scheme should also go away.
>
>I was wondering about that last time I was trying to get some new graphs
>for the SB website. Does anyone have any real objections to this going away?
>If not, I'll kill it all on monday (I'll put a Last_Gary tag on the version
>before the code removal).

I think we should get rid of any related options, too:  use_gary_combining and 
use_chi_squared_combining.  Perhaps this would be a good time to make 
experimental_ham_spam_imbalance_adjustment permanent?

>
>Anthony
>-- 
>Anthony Baxter     <anthony@interlink.com.au>   
>It's never too late to have a happy childhood.
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Sat Mar  8 09:25:21 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 10:25:27 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <200303080845.h288j0C16756@localhost.localdomain>
Message-ID: <97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst>

3/8/2003 2:45:00 AM, Anthony Baxter <anthony@interlink.com.au> wrote:

>We can sit here for days, weeks and months and think of ways to defeat
>the existing classifier. We have done that, in the past. But a change that
>is not tested and shown to improve existing results, does _not_ belong 
>in the code base. It goes against _everything_ that has made this project 
>successful. 

Ok, so let me summarize what I think our discussion has boiled down to.

1. We will not make changes that regress our results on existing spam. 

2. We will engage in ongoing analysis of spam, keeping our testing corpora up 
to date as best we can.  When significant (we have yet to define significant) 
amounts of FN start happening, we will adjust the tokenizer appropriately.

Point 1 is a given.  There seems to be considerable inertia in the project 
toward using point 2 as an ongoing strategy.  I can live with it, because 
there's tremendous value in what we're doing, and it really does work.  I just 
have to say, though, that from a marketing viewpoint (believe it or not, I was 
a marketer in a former life), this strategy can potentially shoot us in the 
foot, because we aren't the ones finding problems, spammers are, and I think 
this could cause our users to lose faith in our product.  "I trained this 
stuff as spam, and this thing STILL doesn't catch it."  If that happens to a 
user more than a few times, the conclusion will be that it doesn't work.  I'm 
telling you, it doesn't take but one bad article in a ZD publication, and it's 
all over with for us.

Ok, I'm off my soapbox. <smile>  This has been a great discussion.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From N7DR at arrisi.com  Sat Mar  8 08:31:03 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Sat Mar  8 10:31:09 2003
Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes
In-Reply-To: <TSPJNIUSD8NHPOBGDYXHDWQQNTQ43B9.3e680cee@myst>
References: <3E67575B.3086.24A05058@localhost>
Message-ID: <3E69AA47.15581.8DD9511@localhost>

On 6 Mar 2003 at 21:07, Tim Stone - Four Stones Expre wrote:

> 3/6/2003 3:12:43 PM, "D. R. Evans" <N7DR@arrisi.com> wrote:
> 
> >On 6 Mar 2003 at 9:50, Tim Stone - Four Stones Expre wrote:
> >
> >> Nearly as I can tell, your training database has been corrupted.  I'm
> >> not quite sure how this happened, but from what I see in the code,
> >> there is likely no recovery at this point.  When you submit a bug
> >> report, go ahead and attach your training database.
> 
> The database is definitely corrupted.  This is the first time I've seen
> this.  The 'saved state' key in the database (where spamcount and
> hamcount are maintained) has a corrupt value, that kills the unpickler.
> 
> There are >88,000 words in this database, and apparently the machine was
> rebooted without a proper shutdown.  This is bad.
> 

I plugged my handspring into a USB port to do a sync (as usual) and the 
machine completely froze (not as usual). Dead. Could no longer even 
reach it from other machines on the network. So I had to power down.

However, I do note that I was NOT doing any spambayes-related 
operations at the time (unless pop3proxy goes off and does things in 
the background, which I don't think it does).

> D.R. I need you to do a couple things:
> 
> If you have the spam and ham saved in an mbox or something, then you can
> simply delete the database files and retrain from scratch.  This would
> be the best alternative.  If this isn't the case, if you can remember,
> or figure out some way, how many spams and hams were trained into this
> database, I can recover it for you.  Even a rough estimate will likely
> do. 
> 

I'll just reinstall and start all over again. Not a problem. Almost 
certainly much easier than having you try to reconstruct the database.

> And... can you tell me, if you know, what dbm module is in use?  Maybe
> someone can give us a few lines of python you can run that will tell us
> that info.  It's too late for me to bring it to mind...
> 

If someone can post how to find that out, I'll gladly run it.
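
A minimal sketch of one way to answer this, assuming a current Python 3 where the detection helper is `dbm.whichdb` (in the Python 2.2 of this thread the same helper lived in the separate `whichdb` module). The function name `identify_dbm` and the path in the usage comment are hypothetical:

```python
import dbm


def identify_dbm(path):
    """Return a short description of the dbm flavour behind *path*."""
    kind = dbm.whichdb(path)
    if kind is None:
        # The file is missing or cannot be opened at all.
        return "missing or unreadable"
    if kind == "":
        # The file exists but no known dbm format was recognised.
        return "unrecognised format"
    # Otherwise whichdb returns a module name such as "dbm.gnu" or "dbm.dumb".
    return kind


# Hypothetical usage; substitute the real training database path:
# print(identify_dbm("path/to/training.db"))
```

An empty-string result on a file that used to open fine would be consistent with the corruption described above, though it does not prove it.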

  Doc


> c'est moi - TimS
> http://www.fourstonesExpressions.com
> http://wecanstopspam.org
> 
> 

--------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From nas at python.ca  Sat Mar  8 08:43:48 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sat Mar  8 11:34:11 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>
References: <20030308020524.1FDCF2DDC7@cashew.wolfskeep.com>
Message-ID: <20030308164347.GA16439@glacier.arctrix.com>

T. Alexander Popiel wrote:
> Skip's bytes/words metatoken seems to be a bust.

I'll take the blame.  I think neither Skip nor Tim explicitly said it
was a good idea.  Thanks for testing.

  Neil

From skip at pobox.com  Sat Mar  8 11:29:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar  8 12:30:01 2003
Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes
In-Reply-To: <3E69AA47.15581.8DD9511@localhost>
References: <3E67575B.3086.24A05058@localhost>
        <3E69AA47.15581.8DD9511@localhost>
Message-ID: <15978.10381.283357.950737@montanaro.dyndns.org>

    >> There are >88,000 words in this database, and apparently the machine
    >> was rebooted without a proper shutdown.  This is bad.

    Doc> I plugged my handspring into a USB port to do a sync (as usual) and
    Doc> the machine completely froze (not as usual). Dead. Could no longer
    Doc> even reach it from other machines on the network. So I had to power
    Doc> down.

    Doc> However, I do note that I was NOT doing any spambayes-related
    Doc> operations at the time (unless pop3proxy goes off and does things
    Doc> in the background, which I don't think it does).

If pop3proxy was running, even if it wasn't analyzing any messages at that
instant, it probably had the database open.  For performance reasons, the
BerkeleyDB library does a fair amount of caching.  It is quite possible the
database was in an invalid state at the time your machine froze.

All may not be lost however.  Did your BerkeleyDB package come with a
db_recover command?  If so, it may be able to repair the damage.
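
The exposure described here comes from writes sitting in a cache until the library flushes them. A rough sketch of narrowing that window by syncing every few writes, assuming a `shelve`-backed store; `CheckpointedStore` and its `every` parameter are hypothetical, and this is not how pop3proxy manages its database:

```python
import shelve


class CheckpointedStore:
    """Hypothetical wrapper that flushes the underlying dbm file every few writes."""

    def __init__(self, path, every=100):
        self._db = shelve.open(path)
        self._every = every
        self.pending = 0  # writes made since the last sync

    def __setitem__(self, key, value):
        self._db[key] = value
        self.pending += 1
        if self.pending >= self._every:
            self.sync()

    def __getitem__(self, key):
        return self._db[key]

    def sync(self):
        # Ask the backing dbm module to push cached pages to the file.
        self._db.sync()
        self.pending = 0

    def close(self):
        self._db.close()
```

Syncing more often reduces, but does not eliminate, the chance that a hard power-off catches the file in an inconsistent state, and it costs some of the performance the cache was buying.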

For those who haven't investigated all the mysteries of the BerkeleyDB
package, it comes with a number of command-line programs which manipulate
the database in various ways:

    db_archive     db_deadlock    db_load        db_recover     db_upgrade     
    db_checkpoint  db_dump        db_printlog    db_stat        db_verify      

You can read all about them at

    http://www.sleepycat.com/docs/utility/index.html

Does anyone know if the Windows distribution of Python comes with these
utilities?  If not, it probably should.  db_dump, db_load, db_upgrade,
db_verify and db_recover are particularly useful.

Skip

From francois.granger at free.fr  Sat Mar  8 19:17:09 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sat Mar  8 13:17:16 2003
Subject: [Spambayes] Another issue with the email package
Message-ID: <a05200f27ba8fe157bb08@[192.168.1.20]>

Today I got a mail with a "return-space" in the subject field. It was 
not tagged at all. And I can't find it in the cache directories. I 
have a copy of it in my Eudora mailbox. But this is not of much help.

Here is a copy and paste of the headers around this:

[...]
User-Agent: Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en-US; rv:1.0.2) 
Gecko/20021120 Netscape/7.01
X-Accept-Language: en-us, en
MIME-Version: 1.0
To: francis.bebey@free.fr
Subject: demande de renseignements (sans photo attach?
  )
Content-Type: text/plain; format=flowed
Content-Transfer-Encoding: 8bit

<x-flowed>Bonjour,

Je voudrais des informations sur la disponibilite de la chanson "je vous
aime zaime zaime", je ne le trouve ici en Belgique pas dans les
[...]


-- 
http://fgranger.net1.nerim.net:8000/cgi-bin/pyblosxom.cgi

From stephena at hiwaay.net  Sat Mar  8 11:34:58 2003
From: stephena at hiwaay.net (Stephen Anderson)
Date: Sat Mar  8 14:35:38 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGELFEAAB.tim.one@comcast.net>
References: <15977.29283.908599.21234@montanaro.dyndns.org>
Message-ID: <3E69D562.5861.18EF176D@localhost>

On 8 Mar 2003 at 0:05, Tim Peters wrote:

> Another example may help to clarify:  in just about anyone's test data,
> "<br>" would be a very strong spam indicator, if the tokenizer produced
> it. I expect that adding it into the mix would boost the FP rate,
> though -- at least for those of us with sisters <wink>.

Okay Tim, I just can't take it anymore.  My curiosity has gotten the best of me.  Would you 
please ask your sisters to email me a sample of one of their very pretty HTML emails you 
keep referring to.  I have a sister too, but her HTML emails are almost indistinguishable in 
presentation from that of a plain-text one.  So, can you help me with my burning question:  
Just what does pretty email look like?

Geeky regards,
Steve

From tim at fourstonesExpressions.com  Sat Mar  8 13:53:27 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 14:53:37 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <3E69D562.5861.18EF176D@localhost>
Message-ID: <H2LG3WJDTQOKMGTN4XTOPLLKRPB6B7.3e6a4a37@myst>

3/8/2003 1:34:58 PM, "Stephen Anderson" <stephena@hiwaay.net> wrote:

>Okay Tim, I just can't take it anymore.  My curiosity has gotten the best of me.  Would you 
>please ask your sisters to email me a sample of one of their very pretty HTML emails you 
>keep referring to.

Woah... <wink> This ain't no matchmaking mailing list...

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From lists at morpheus.demon.co.uk  Sat Mar  8 21:18:55 2003
From: lists at morpheus.demon.co.uk (Paul Moore)
Date: Sat Mar  8 16:21:26 2003
Subject: [Spambayes] full o' spaces
References: <200303080845.h288j0C16756@localhost.localdomain>
	<97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst>
Message-ID: <n2m-g.zno5wo8w.fsf@morpheus.demon.co.uk>

Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

> 3/8/2003 2:45:00 AM, Anthony Baxter <anthony@interlink.com.au> wrote:
>
>>We can sit here for days, weeks and months and think of ways to defeat
>>the existing classifier. We have done that, in the past. But a change that
>>is not tested and shown to improve existing results, does _not_ belong 
>>in the code base. It goes against _everything_ that has made this project 
>>successful. 
>
> Ok, so let me summarize what I think our discussion has boiled down to.
>
> 1. We will not make changes that regress our results on existing spam. 
>
> 2. We will engage in ongoing analysis of spam, keeping our testing corpora up 
> to date as best we can.  When significant (we have yet to define significant) 
> amounts of FN start happening, we will adjust the tokenizer appropriately.
>
> Point 1 is a given.  There seems to be considerable inertia in the project 
> toward using point 2 as an ongoing strategy.  I can live with it, because 
> there's tremendous value in what we're doing, and it really does work.  I just 
> have to say, though, that from a marketing viewpoint (believe it or not, I was 
> a marketer in a former life), this strategy can potentially shoot us in the 
> foot, because we aren't the ones finding problems, spammers are, and I think 
> this could cause our users to lose faith in our product.  "I trained this 
> stuff as spam, and this thing STILL doesn't catch it."  If that happens to a 
> user more than a few times, the conclusion will be that it doesn't work.  I'm 
> telling you, it doesn't take but one bad article in a ZD publication, and it's 
> all over with for us.
>
> Ok, I'm off my soapbox. <smile>  This has been a great discussion.

Can I borrow that box for a moment? Thanks... :-)

The key point, for me, is that spambayes is the only anti-spam tool I
have ever used that made a real dent in my spam problem. And the
dent it made was pretty much total. While I still get unsures, and
even the occasional FN, in reality I don't have a spam problem any
more.

I don't know why spambayes is so good, but the single most distinctive
aspect of the project is the rigorous analysis of results, and
ruthless refusal to include techniques which don't pull their weight.

When I mention spambayes to friends, my "marketing" approach is,
basically:

1. It works. Really well.
2. It learns what you consider spam, and acts on that.
3. It's been tested on thousands of spam, with error rates so low as
   to be negligible.
4. You do need to maintain it - a little ongoing training helps (but
   it's not a major task, and if you don't bother, you're still going
   to get very impressive results)
5. Er. But it's a bit rough around the edges still. I'll help you
   install it, if you like.

Notice (5). That's what is killing us right now with real people (me,
I'm a figment of your imagination: be very afraid <wink>). Anything
else is minor.

Your point (2) means that we can claim that we know it works - we've
tested it (my point (3)). Pre-emptive attempts to address possible new
spam tricks lose that - you can't *prove* the effectiveness of a new
technique if you don't have corpora with evidence of that technique to
test against. I view the benefit of being able to show proof that the
program works as greater than the risk of being branded reactive.

Oh, and by the way - you use Microsoft's security strategy to
demonstrate that a reactive approach is bad. But that's FUD. Another
business that is (as far as the general public is aware) totally
reactive is the anti-virus business. If you liken the spambayes
approach to an anti-virus strategy, it suddenly looks much better :-)

OK, who wants the box next?

Paul.
-- 
This signature intentionally left blank

From tim at fourstonesExpressions.com  Sat Mar  8 16:38:08 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 17:38:18 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <n2m-g.zno5wo8w.fsf@morpheus.demon.co.uk>
Message-ID: <RQ8621Y61JG6ZYX2D0OM54JFHCZT07.3e6a70d0@myst>

3/8/2003 3:18:55 PM, Paul Moore <lists@morpheus.demon.co.uk> wrote:

>>
>> Ok, I'm off my soapbox. <smile>  This has been a great discussion.
>
>Can I borrow that box for a moment? Thanks... :-)

I yield the floor.

>1. It works. Really well.
>2. It learns what you consider spam, and acts on that.
>3. It's been tested on thousands of spam, with error rates so low as
>   to be negligible.
>4. You do need to maintain it - a little ongoing training helps (but
>   it's not a major task, and if you don't bother, you're still going
>   to get very impressive results)
>5. Er. But it's a bit rough around the edges still. I'll help you
>   install it, if you like.
>
>Notice (5). That's what is killing us right now with real people (me,
>I'm a figment of your imagination: be very afraid <wink>). Anything
>else is minor.

Absolutely.

>If you liken the spambayes
>approach to an anti-virus strategy, it suddenly looks much better :-)

Hmmm... interesting analog, but it only goes so far.  Viruses would be a 
vastly smaller threat had microsoft engaged in the strategy that I'm arguing 
for.  Trojans, worms, etc... the face of the online world would be 
considerably different had they invested in building fundamentally secure 
systems...


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From noreply at sourceforge.net  Sat Mar  8 16:53:30 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sat Mar  8 19:50:16 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98
Message-ID: <E18rp4E-0001JF-00@sc8-sf-web1.sourceforge.net>

Bugs item #700165, was opened at 2003-03-08 19:53
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Tim Peters (tim_one)
Assigned to: Mark Hammond (mhammond)
Summary: MoveFileEx doesn't exist on Win98

Initial Comment:
After a CVS up, Outlook craps out on Win98SE now in 
BayesManager._MigrateFile.

  File "C:\Code\spambayes\Outlook2000\manager.py", 
line 213, in _MigrateFile
    win32con.MOVEFILE_COPY_ALLOWED)
pywintypes.error: (120, 'MoveFileEx', 'This function is 
only valid in Win32 mode.')

which really seems to mean that MoveFileEx isn't 
supported at or before Win98.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

From noreply at sourceforge.net  Sat Mar  8 17:06:44 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sat Mar  8 19:58:48 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98
Message-ID: <E18rpH2-0001fe-00@sc8-sf-web1.sourceforge.net>

Bugs item #700165, was opened at 2003-03-08 19:53
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Tim Peters (tim_one)
Assigned to: Mark Hammond (mhammond)
Summary: MoveFileEx doesn't exist on Win98

Initial Comment:
After a CVS up, Outlook craps out on Win98SE now in 
BayesManager._MigrateFile.

  File "C:\Code\spambayes\Outlook2000\manager.py", 
line 213, in _MigrateFile
    win32con.MOVEFILE_COPY_ALLOWED)
pywintypes.error: (120, 'MoveFileEx', 'This function is 
only valid in Win32 mode.')

which really seems to mean that MoveFileEx isn't 
supported at or before Win98.

----------------------------------------------------------------------

>Comment By: Tim Peters (tim_one)
Date: 2003-03-08 20:06

Message:
Logged In: YES 
user_id=31435

I checked in a patch to Outlook2000/manager.py, rev1.54, 
which worked for me on Win98.  If you're happy with this, 
just close the bug.
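
The patch itself is not reproduced in this notification. As a general illustration of the fallback pattern, here is a sketch that tries a fast move and falls back to copy-then-delete when the platform refuses; `migrate_file` and its `mover` parameter are hypothetical, and `os.replace` merely stands in for the Win32 `MoveFileEx` call:

```python
import os
import shutil


def migrate_file(src, dst, mover=os.replace):
    """Move src to dst, falling back to copy-then-delete if the fast path fails."""
    try:
        mover(src, dst)
    except OSError:
        # Fast path unavailable (for example, unsupported on this platform):
        # copy the bytes across, then remove the original.
        shutil.copyfile(src, dst)
        os.remove(src)
```

With pywin32 the failure surfaces as `pywintypes.error` rather than `OSError`, so a real version would need to catch that exception type instead.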

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

From popiel at wolfskeep.com  Sat Mar  8 17:23:14 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar  8 20:23:18 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: Message from Tim Stone - Four Stones Expressions
	<tim@fourstonesExpressions.com> 
	<RQ8621Y61JG6ZYX2D0OM54JFHCZT07.3e6a70d0@myst> 
References: <RQ8621Y61JG6ZYX2D0OM54JFHCZT07.3e6a70d0@myst> 
Message-ID: <20030309012314.73C832DE92@cashew.wolfskeep.com>

In message:  <RQ8621Y61JG6ZYX2D0OM54JFHCZT07.3e6a70d0@myst>
             <tim@fourstonesExpressions.com> writes:
>3/8/2003 3:18:55 PM, Paul Moore <lists@morpheus.demon.co.uk> wrote:
>
>>>
>>> Ok, I'm off my soapbox. <smile>  This has been a great discussion.
>>
>>Can I borrow that box for a moment? Thanks... :-)
>
>I yield the floor.

Okay, I'll grab the box for a moment...

>>If you liken the spambayes
>>approach to an anti-virus strategy, it suddenly looks much better :-)
>
>Hmmm... interesting analog, but it only goes so far.  Viruses would be a 
>vastly smaller threat had microsoft engaged in the strategy that I'm arguing 
>for.  Trojans, worms, etc... the face of the online world would be 
>considerably different had they invested in building fundamentally secure 
>systems...

To build a fundamentally secure system, though, we'd be replacing
SMTP with something that actively prevented impersonation and
forgery, as well as possibly providing a provable audit trail back
to original sender, along with their identity.  We're not coming
even close to that... so I think that the anti-virus analogy is
quite appropriate.  We're layering a band-aid on top of a
fundamentally insecure system, and patching any leaks as we hear
about them.

Microsoft is not to blame for all the worms and trojans.  Microsoft
is merely the juiciest target at the moment.  Do recall that the
first worm to make headline news (the Morris worm back in 1988)
targeted VAX and Sun 3 systems through sendmail vulnerabilities.
I could rant for a while that it is human nature to build weak
systems and again human nature to abuse such systems... but that's
not a particularly useful thread for the spambayes list.

- Alex

From tim at fourstonesExpressions.com  Sat Mar  8 19:35:35 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 20:35:43 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <20030309012314.73C832DE92@cashew.wolfskeep.com>
Message-ID: <HFB0TROJC041DQO86QLWV2YYT83411V.3e6a9a67@myst>

3/8/2003 7:23:14 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <RQ8621Y61JG6ZYX2D0OM54JFHCZT07.3e6a70d0@myst>
>             <tim@fourstonesExpressions.com> writes:
>>3/8/2003 3:18:55 PM, Paul Moore <lists@morpheus.demon.co.uk> wrote:
>>
>>>>
>>>> Ok, I'm off my soapbox. <smile>  This has been a great discussion.
>>>
>>>Can I borrow that box for a moment? Thanks... :-)
>>
>>I yield the floor.
>
>Okay, I'll grab the box for a moment...
>
>>>If you liken the spambayes
>>>approach to an anti-virus strategy, it suddenly looks much better :-)
>>
>>Hmmm... interesting analog, but it only goes so far.  Viruses would be a 
>>vastly smaller threat had microsoft engaged in the strategy that I'm arguing 
>>for.  Trojans, worms, etc... the face of the online world would be 
>>considerably different had they invested in building fundamentally secure 
>>systems...
>
>To build a fundamentally secure system, though, we'd be replacing
>SMTP with something that actively prevented impersonation and
>forgery, as well as possibly providing a provable audit trail back
>to original sender, along with their identity.  We're not coming
>even close to that... so I think that the anti-virus analogy is
>quite appropriate.  We're layering a band-aid on top of a
>fundamentally insecure system, and patching any leaks as we hear
>about them.

All good, interesting points, but we're not talking about building a secure 
system here.  We're just thinking about a couple of alternative strategies 
going forward for our project.  One alternative is to actively try to find ways 
that spammers can get through our filter and plug those holes before the 
spammers find them.  The other is to wait until a significant amount of spam 
is pouring through the hole, then plug the hole in a much more testable, 
provable manner.

The first has the strength of potentially keeping users happier, but the 
weakness of not having a strong corpus of evolved spam to test against, so the 
effectiveness of changes to the tokenizer is not necessarily provable.

The second has the strength of provability, and the weakness of our software 
potentially appearing to be deficient.  This strategy, which we seem to be 
converging on <sigh>, bears resemblance (imo) to microsoft's "wait till a 
hacker trashes the webserver, figure out how they did it, and post a patch" 
strategy.  


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Sat Mar  8 21:56:07 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 21:56:39 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <97746185C05Z4WKJO17232KGFCZWOM.3e6a0b61@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEMGEAAB.tim.one@comcast.net>

[Tim Stone]
> Ok, so let me summarize what I think our discussion has boiled down to.
>
> 1. We will not make changes that regress our results on existing spam.

There are two error rates, and an unsure rate, and they're all important.
I'm afraid that when someone sees a spam and suggests a gimmick to nail it,
they forget that it's also going to penalize some ham, and affect the unsure
rate too.  It's just human nature to fixate on potential benefits and
discount potential costs.  The point of statistical testing is to look at
all the effects.  A change that's a pure win on all counts has become
exceedingly hard to come up with.
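
To make the three rates concrete, here is a small sketch that tallies them from scored messages. The cutoffs 0.20 and 0.90 and the name `error_rates` are assumptions chosen for illustration:

```python
def error_rates(scored, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return (fp_rate, fn_rate, unsure_rate) for (score, is_spam) pairs."""
    fp = fn = unsure = nham = nspam = 0
    for score, is_spam in scored:
        if is_spam:
            nspam += 1
        else:
            nham += 1
        if ham_cutoff < score < spam_cutoff:
            unsure += 1          # neither clearly ham nor clearly spam
        elif score >= spam_cutoff and not is_spam:
            fp += 1              # ham wrongly called spam
        elif score <= ham_cutoff and is_spam:
            fn += 1              # spam wrongly called ham
    total = nham + nspam
    return (
        fp / nham if nham else 0.0,
        fn / nspam if nspam else 0.0,
        unsure / total if total else 0.0,
    )
```

A proposed tokenizer change would be judged by running this before and after on the same corpus and looking at all three numbers together, not just the one the change was meant to improve.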

> 2. We will engage in ongoing analysis of spam, keeping our
> testing corpora up to date as best we can.  When significant (we have yet
> to define significant) amounts of FN start happening, we will adjust the
> tokenizer appropriately.

Or bad trends in FP or Unsure, and provided someone can dream up a gimmick
that addresses the problem du jour without damaging the things they're *not*
thinking about more than helping the thing they are thinking about.

> Point 1 is a given.  There seems to be considerable inertia in
> the project toward using point 2 as an ongoing strategy.

I watch my spam, ham and unsures closely, and check in a change whenever
there's an identifiable screwup.  For example, that's how the treatment of
embedded nonsense HTML tags got repaired a while ago, and very recently is
how unclosed HTML start-comment tags stopped being a problem.

I'm not seeing any loss of effectiveness in my own email, though, and it's
true I don't spend any time dreaming up ways to defeat the system.  So long
as spam uses the language and artifacts of advertising, and the tokenizer
sees those, it will be damned hard to get spam thru reliably -- and it will
be hard to get solicited commercial email thru too (it's still the case that
the first time or two I get a desired email from a given online business, it
rates Unsure or even as Spam -- it depends on how obnoxious it is).

Exceptions raised by the email pkg now appear to be the easiest approach to
hiding msg content from this particular system, and if I were a spammer
that's what I'd concentrate on.  Python allows very easy ways to catch
exceptions, though, so it's not something I'm frightened of -- we've added
alternative processing paths for email exceptions before, and we can add
more.  There's a systematic spambayes codebase problem, though, in that
people call the email pkg parsing functions directly, and that prevents
centralizing workarounds for pkg weaknesses that get discovered.
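
A sketch of what such a central entry point might look like, assuming callers agree to go through one function instead of calling the email package directly. `safe_parse` and its `parser` parameter are hypothetical names, and the fallback shown (treating the raw text as an opaque payload) is only one possible workaround:

```python
import email
import email.errors
from email.message import Message


def safe_parse(text, parser=email.message_from_string):
    """Parse a message, falling back to an unstructured body on parse errors."""
    try:
        return parser(text)
    except email.errors.MessageError:
        # The message is too damaged to parse: keep the raw text as the
        # payload so the tokenizer still gets to see the content.
        fallback = Message()
        fallback.set_payload(text)
        return fallback
```

Any newly discovered package weakness would then need a fix in one place only. Note that modern versions of the email package record most defects on the message rather than raising, so the `except` branch matters mainly for older parsers or stricter policies.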

> I can live with it, because there's tremendous value in what we're doing,
> and it really does work.  I just have to say, though, that from a
> marketing viewpoint (believe it or not, I was a marketer in a former
> life), this strategy can potentially shoot us in the foot, because we
> aren't the ones finding problems, spammers are,

I've seen no evidence that they're finding anything to exploit here, and
doubt this particular project is popular enough for them to target.  Most
spam damaged enough to make the email pkg complain appears to me to be due
to spammer incompetence, or to bugs in the software they're using to
generate the spam.  If you want to see something break, give it to a 2-year
old <0.9 wink>.  At the moment, I have a grand total of one spam from my
personal email that still breaks the system (causes an email BoundaryError
exception that the Outlook client doesn't protect itself against), and
that's it, out of tens of thousands.  I got that email last December, and
haven't gotten another like it; I conclude it's evidence of a spammer who
didn't know what they were doing.

I confess I haven't fixed this bug, since it turned out to be a one-shot
thing and there are so many other things demanding my time.  Fixing a bug I
don't expect to see again just doesn't rate high enough to get done.

> and I think this could cause our users to lose faith in our product.  "I
> trained this stuff as spam, and this thing STILL doesn't catch it."

That irritation can occur even when the system is working perfectly, alas.
The flip side is that the lack of special cases to *force* classification as
one thing or another also makes it impossible to attack such a subsystem:
"preponderance of evidence" is the only way to get a score out of the
system.

> If that happens to a user more than a few times, the conclusion will be
> that it doesn't work.  I'm telling you, it doesn't take but one bad
> article in a ZD publication, and it's all over with for us.

OTOH, one good article in a ZD publication would kill us with newbie support
requests too <0.5 wink>.


From tim.one at comcast.net  Sat Mar  8 21:56:08 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 21:56:43 2003
Subject: [Spambayes] Database corruption [WAS] pop3proxy crashes
In-Reply-To: <15978.10381.283357.950737@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEMGEAAB.tim.one@comcast.net>

[Skip Montanaro]
> ...
> For those who haven't investigated all the mysteries of the BerkeleyDB
> package, it comes with a number of command-line programs which manipulate
> the database in various ways:
>
>     db_archive     db_deadlock    db_load        db_recover   db_upgrade
>     db_checkpoint  db_dump        db_printlog    db_stat      db_verify
>
> You can read all about them at
>
>     http://www.sleepycat.com/docs/utility/index.html
>
> Does anyone know if the Windows distribution of Python comes with these
> utilities?

It doesn't.

> If not, it probably should.  db_dump, db_load, db_upgrade,
> db_verify and db_recover are particularly useful.

Enhancing the Windows installer is a "spare time" thing for me now, and I
don't have any.  IOW, fine by me, but I won't be doing the work.


From tim.one at comcast.net  Sat Mar  8 21:57:52 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 21:58:23 2003
Subject: [Spambayes] Eliminating duplicates from mbox file
In-Reply-To: <15978.0.395098.109027@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEMHEAAB.tim.one@comcast.net>

[Skip Montanaro]
> I come before the bar asking humbly for forgiveness...  I was 
> doing all this from my ~/tmp directory, which, lo and behold, had a
> version of mailbox.py dating from September 2001.  The _Mailbox class
> had next() but not __iter__.  Who knows what other semantic differences
> existed.

Not me.  Maybe this relates to your problems with From lines too?

> Sorry for the wasted bandwidth.

I haven't trained on msgs from you as spam, so don't sweat it <wink>.

From tim.one at comcast.net  Sat Mar  8 22:05:55 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 22:06:25 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <200303080851.h288pMv16800@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEMIEAAB.tim.one@comcast.net>

[Tim Peters]
>> At the time I got yanked from this project, I was looking to remove code
>> rather than add more.  There are too many tokenization options
>> already, and it isn't clear that some of them do anyone any good
>> anymore.  The gary_combining classifier scheme should also go away.

[Anthony Baxter]
> I was wondering about that last time I was trying to get some new graphs
> for the SB website. Does anyone have any real objections to this
> going away?

Last person I knew was using it was Sean True, but that was last year.  I
flipped between gary_ and chi_ combining a lot myself last year too, until
gaining more confidence in the latter.

> If not, I'll kill it all on monday (I'll put a Last_Gary tag on
> the version before the code removal).

Bless you!  As TimS said, we should also nuke the options specific to it.


From tim.one at comcast.net  Sat Mar  8 22:15:10 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Mar  8 22:15:40 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEMJEAAB.tim.one@comcast.net>

[Tim Stone]
> ...
> I think we should get rid of any related options, too:
> use_gary_combining and use_chi_squared_combining.

Agreed.

> Perhaps this would be a good time to make
> experimental_ham_spam_imbalance_adjustment permanent?

There haven't been enough test reports on that one to decide.  It's True by
default in the Outlook client, but still appears to be False by default
everywhere else.  There are bad visible effects either way (if it's off and
you get a large ratio imbalance, it's too easy for a msg to score
incorrectly as belonging to the more popular category; if it's on and you
get a large ratio imbalance, training on another example from the more
popular category has little effect, exacerbating (for example) the "but I
trained on it and it's *still* called ham!" irritation).
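
The trade-off can be seen in a toy model. The sketch below assumes a Robinson-style smoothed word probability in which the adjustment scales down the counts from the over-represented class; the function name, the formula details, and the constants `s=0.45` and `x=0.5` are assumptions for illustration, not code lifted from the project:

```python
def word_spamprob(spamcount, hamcount, nspam, nham, adjust=False, s=0.45, x=0.5):
    """Smoothed spam probability for one word (illustrative model only)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    if hamratio + spamratio == 0:
        return x  # word never seen: fall back to the prior
    prob = spamratio / (hamratio + spamratio)
    if adjust:
        # Discount evidence from whichever class has more training messages.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam
    # Blend the observed probability with the prior, weighted by evidence n.
    return (s * x + n * prob) / (s + n)


# A word seen in 5 of 1000 hams and 0 of 10 spams:
off = word_spamprob(0, 5, nspam=10, nham=1000, adjust=False)
on = word_spamprob(0, 5, nspam=10, nham=1000, adjust=True)
```

Under these assumptions `off` comes out near 0.04 (strongly hammy on thin evidence) while `on` stays near 0.45, which mirrors both complaints: without the adjustment the popular class wins too easily, and with it another training example from the popular class barely moves the score.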


From nas at python.ca  Sat Mar  8 20:14:47 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sat Mar  8 23:05:11 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEMJEAAB.tim.one@comcast.net>
References: <5LLGA6NM65TPWZV214285M1U622.3e6a07e9@myst>
	<LNBBLJKPBEHFEDALKOLCIEMJEAAB.tim.one@comcast.net>
Message-ID: <20030309041447.GA17672@glacier.arctrix.com>

Tim Peters wrote:
> > Perhaps this would be a good time to make
> > experimental_ham_spam_imbalance_adjustment permanent?
> 
> There haven't been enough test reports on that one to decide.

How do I test it?

  Neil

From tim at fourstonesExpressions.com  Sat Mar  8 22:47:05 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar  8 23:47:17 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEMJEAAB.tim.one@comcast.net>
Message-ID: <PMN06D9HB09RP09JG3YFDTN94LLFVQ.3e6ac749@myst>

3/8/2003 9:15:10 PM, Tim Peters <tim.one@comcast.net> wrote:

>There haven't been enough test reports on that one to decide.  It's True by
>default in the Outlook client, but still appears to be False by default
>everywhere else.  There are bad visible effects either way (if it's off and
>you get a large ratio imbalance, it's too easy for a msg to score
>incorrectly as belonging to the more popular category; if it's on and you
>get a large ratio imbalance, training on another example from the more
>popular category has little effect, exacerbating (for example) the "but I
>trained on it and it's *still* called ham!" irritation).

Rats.  I thought it was True by default.  All this time I've been using it 
thinking it was on... ok, so if I turn it on now, what would I expect?  I have 
a huge ham/spam imbalance in my notes sb database, and have been a bit 
disappointed by the classifier...



c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From bill at parducci.net  Sat Mar  8 22:13:35 2003
From: bill at parducci.net (bill parducci)
Date: Sun Mar  9 01:13:39 2003
Subject: [Spambayes] spaced out spam
Message-ID: <3E6ADB8F.7070709@parducci.net>

well it looks like spacemania is catching on...

b


-------- Original Message --------
Subject: none
Date: Fri, 7 Mar 2003 02:40:14 GMT
From: mcgough  <program@diplomas.org>


U N I V E R S I T Y   D I P L O M A S

O b t a i n   a   p r o s p e r o u s   f u t u r e ,   m o n e y   e a 
r n i n g   p o w e r ,   a n d
t h e   a d m i r a t i o n   o f   a l l .

D i p l o m a s   f r o m   p r e s t i g i o u s ,   n o n - a c c r e 
d i t e d
u n i v e r s i t i e s   b a s e d   o n   y o u r   p r e s e n t   k 
n o w l e d g e   a n d
l i f e   e x p e r i e n c e .

N o   r e q u i r e d   t e s t s,  c l a s s e s ,   b o o k s ,  o r 
  i n t e r v i e w s .

B a c h e l o r s ,   m a s t e r s ,   M B A ,    a n d   d  o  c t o r 
a t e   ( P h D )
d i p l o m a s    a v a i  l a b l e   i n   t h e   f i e l d   o f 
y o u r   c h o i c e .

N o   o n e    i s   t u r n e d   d o w n .

C o n f i d e n t i a l i t y   a s s u r e d .

C A L L   N O W   t o   r e c e i v e   y o u r   d i p l o m a   w i t 
h i n   d a y s ! ! !

1-817-740-5673

C a l l   2 4   h o u r s   a   d a y ,   7   d a y s   a   w e e k , 
i n c l u d i n g
S u n d a y s   a n d   h o l i d a y s .


From tim_one at email.msn.com  Sun Mar  9 01:54:10 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 01:54:51 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030309041447.GA17672@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEKLDPAB.tim_one@email.msn.com>

[Neil Schemenauer, asking about experimental_ham_spam_imbalance_adjustment]
> How do I test it?

One run with True, and another with False.  If you have the same # of ham
and spam in your training data, it shouldn't make any difference.  If you
have an imbalance, it will, and then the question is which setting gives
better results.  I'm not keen on people who don't already have an imbalance
artificially creating one, though -- for example, I think mistake-based
manual training is likely to create imbalance, and that's likely to have
different characteristics than imbalance forced via picking random subsets.


From tim_one at email.msn.com  Sun Mar  9 02:14:22 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 02:15:00 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030309012314.73C832DE92@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKNDPAB.tim_one@email.msn.com>

[T. Alexander Popiel]
> ...
> Do recall that the first worm to make headline news (the Morris worm
> back in 1988) targeted VAX and Sun 3 systems through sendmail
> vulnerabilities.

It's curious that current sendmail holes were the hottest security topic
this week, 15 years later, and that the holes were created by "security
code".  Makes me glad I sleep with a loaded gun <wink>.


From tim_one at email.msn.com  Sun Mar  9 02:13:56 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 02:15:13 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <PMN06D9HB09RP09JG3YFDTN94LLFVQ.3e6ac749@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKNDPAB.tim_one@email.msn.com>

[Tim Peters]
> There haven't been enough test reports on that one to decide.
> It's True by default in the Outlook client, but still appears to be
> False by default everywhere else.  There are bad visible effects
> either way (if it's off and you get a large ratio imbalance, it's too
> easy for a msg to score incorrectly as belonging to the more popular
> category; if it's on and you get a large ratio imbalance, training on
> another example from the more popular category has little effect,
> exacerbating (for example) the "but I trained on it and it's *still*
> called ham!" irritation).

[Tim Stone]
> Rats.  I thought it was True by default.

It is if you're using the Outlook client.

> All this time I've been using it thinking it was on... ok, so if I
> turn it on now, what would I expect?

Did you read the paragraph you quoted?  I've written several small essays on
the topic here, and think the parenthetical comments above are a decent
summary.

> I have a huge ham/spam imbalance in my notes sb database,

Striving for balance is likely a better idea.

> and have been a bit disappointed by the classifier...

I'm short on telepathy tonight.  Perhaps the *way* in which you're
disappointed is related to the comments above?  For example, if you have
much more ham than spam and have a too-high FN rate, or you have much more
spam than ham and have a too-high FP rate, then the comments are directly
applicable.


From tim_one at email.msn.com  Sun Mar  9 02:22:32 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 02:23:10 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <20030308164347.GA16439@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEKODPAB.tim_one@email.msn.com>

[Neil Schemenauer]
> I'll take the blame.  I think neither Skip nor Tim explicitly said it
> was a good idea.  Thanks for testing.

Testing is always a good thing, but I don't get the umbrage and blame thing
here:  *most* ideas turn out to add no value -- and always have, and likely
always will.  Bytes/word didn't help last time I tried 'em either, and that
idea was better than *most* because it didn't hurt either <0.1 wink>.


From tim_one at email.msn.com  Sun Mar  9 02:45:44 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 02:46:27 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <HFB0TROJC041DQO86QLWV2YYT83411V.3e6a9a67@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCEELADPAB.tim_one@email.msn.com>

[Tim Stone]
> ...
> One alternative is to actively try to find ways that spammers can get
> through our filter and plug those holes before the spammers find them.

Instead of arguing about this more, how about we try it once?

I'll note that we have no defense against the "white on white" HTML hiding
trick, but also that that trick hasn't been effective against my personal
classifier (the one spam of that kind I've seen rate solidly Unsure for me
lucked into hiding a news story about the DC-area snipers, after I had
trained on many msgs from friends and relatives also corresponding about
that topic at the time).

Hiding *all* the text in a .gif or .jpg on the Web merely linked to within
the email seemed like a very good trick at the start, but seems ineffective
now too -- there's nothing in the body then to offset spammish clues in the
headers.

Jeremy and Guido were both recipients of cunning spam this system couldn't
stop:  the spam took the form of replies to msgs they posted to public
mailing lists, reproducing their original subject line and quotes from the
bodies of their msgs.  This guaranteed they contained lots of words that
were hammy to them, and also fooled the content-based whitelist boosts
python.org added to its SpamAssassin installation.  That's the cleverest
attack I've seen, but it happened last year and I haven't heard of it
happening again.


From tim_one at email.msn.com  Sun Mar  9 03:14:21 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sun Mar  9 03:15:00 2003
Subject: [Spambayes] spaced out spam
Message-ID: <LNBBLJKPBEHFEDALKOLCEELCDPAB.tim_one@email.msn.com>

[bill parducci]
> well it looks like spacemania is catching on...
> ...

This is actually the same spam, word for word & space for space, that
started the "full o' spaces" thread, here:

    http://mail.python.org/pipermail/spambayes/2003-March/003806.html

Skip later reported that running an up-to-date classifier nailed it as spam
despite the absence of body clues:

    http://mail.python.org/pipermail/spambayes/2003-March/003834.html

I think that last report was also a bit suspicious, though, as the clue
listing appeared to contain hapaxes unique to the msg being scored
(suggesting that the msg had already been trained on as spam); e.g.,

          'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84;

> Subject: none
> Date: Fri, 7 Mar 2003 02:40:14 GMT
> From: mcgough  <program at diplomas.org>
>
>
> U N I V E R S I T Y   D I P L O M A S
>
> O b t a i n   a   p r o s p e r o u s   f u t u r e ,   m o n e y   e a
> r n i n g   p o w e r ,   a n d
> t h e   a d m i r a t i o n   o f   a l l .


From tim at fourstonesExpressions.com  Sun Mar  9 07:37:32 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Mar  9 08:37:38 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKNDPAB.tim_one@email.msn.com>
Message-ID: <RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>

3/9/2003 1:13:56 AM, "Tim Peters" <tim_one@email.msn.com> wrote:

>I'm short on telepathy tonight.  Perhaps the *way* in which you're
>disappointed is related to the comments above?  For example, if you have
>much more ham than spam and have a too-high FN rate, or you have much more
>spam than ham and have a too-high FP rate, then the comments are directly
>applicable.

Can I plead nocturnal insanity?  Maybe it was all the housecleaning fluid 
fumes...

Ok, I train on virtually every piece of mail that comes into my notes inbox.  
The ratio is about 10:1 spam:ham.  I currently have about 600 spam trained 
into the database.  I still get maybe 10%-15% unsure, invariably on spam.  I 
virtually never have a FP.  Maybe I just need to adjust the spam cutoff... 
Mainly thinking out loud, and bemoaning the fact that I've annoyed my 
namesake.
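
For reference, the role of the cutoffs can be sketched as below.  This is
only an illustration: the function name classify is hypothetical, and the
0.20 and 0.90 values are assumed defaults for ham_cutoff and spam_cutoff,
so check your own configuration.  Lowering the spam cutoff moves
high-scoring unsures into the spam bucket, at the risk of more FPs.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a 0.0-1.0 spam score to one of three buckets.

    The cutoff values are assumptions for illustration only.
    """
    if score >= spam_cutoff:
        return "spam"
    if score <= ham_cutoff:
        return "ham"
    return "unsure"


if __name__ == "__main__":
    # The same score lands in different buckets as the cutoff moves.
    for cutoff in (0.90, 0.80):
        print(cutoff, classify(0.85, spam_cutoff=cutoff))
```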

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From wsy at merl.com  Sun Mar  9 08:48:53 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Sun Mar  9 08:49:29 2003
Subject: [Spambayes] spaced out spam
In-Reply-To: <3E6ADB8F.7070709@parducci.net> (message from bill parducci on
	Sat, 08 Mar 2003 22:13:35 -0800)
References: <3E6ADB8F.7070709@parducci.net>
Message-ID: <200303091348.h29DmrC15652@localhost.localdomain>


   From: bill parducci <bill@parducci.net>

   well it looks like spacemania is catching on...

   Subject: none
   Date: Fri, 7 Mar 2003 02:40:14 GMT
   From: mcgough  <program@diplomas.org>


   U N I V E R S I T Y   D I P L O M A S

Nothing we haven't seen before, with hypertextus interruptus.

SBPH feature generation has no trouble with this, as the features of
the wildcarded phrase:

       BUY YOUR GENUINE VIAGRA ONLINE NOW 

are just as significant and unique as:

       V I A G R A
 
(In fact, the latter is probably more so, as nobody I know would
likely use those particular letters spaced that way.  If they
were going to say "viagra" they'd just say it.)
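
A rough sketch of the idea, under stated assumptions: a sliding window of
five tokens, where every feature keeps the newest token and either keeps
or wildcards each older one.  The real implementation hashes its features;
this version keeps them as readable strings, and the names sbph_features
and "<skip>" are made up for the illustration.

```python
from itertools import product


def sbph_features(tokens, window=5):
    """Generate SBPH-style features as strings (a sketch, not CRM114 code).

    For each position, take up to `window` tokens ending at that position.
    The newest token is always kept; each older token is either kept or
    replaced by a "<skip>" placeholder, giving 2**(len(win) - 1) features.
    """
    features = []
    for i in range(len(tokens)):
        win = tokens[max(0, i - window + 1): i + 1]
        newest, older = win[-1], win[:-1]
        for mask in product((False, True), repeat=len(older)):
            parts = [tok if keep else "<skip>"
                     for tok, keep in zip(older, mask)]
            features.append(" ".join(parts + [newest]))
    return features


if __name__ == "__main__":
    for feature in sbph_features("V I A G R A".split()):
        print(feature)
```

With single letters as tokens, features such as "V I A G R" are about as
distinctive as a whole phrase would be, which is the point being made
above.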

     -Bill Yerazunis

From skip at pobox.com  Sun Mar  9 08:25:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 09:25:50 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEELADPAB.tim_one@email.msn.com>
References: <HFB0TROJC041DQO86QLWV2YYT83411V.3e6a9a67@myst>
        <LNBBLJKPBEHFEDALKOLCEELADPAB.tim_one@email.msn.com>
Message-ID: <15979.20201.258754.620902@montanaro.dyndns.org>


    Tim> Jeremy and Guido were both recipients of cunning spam this system
    Tim> couldn't stop: the spam took the form of replies to msgs they
    Tim> posted to public mailing lists, reproducing their original subject
    Tim> line and quotes from the bodies of their msgs....  That's the
    Tim> cleverest attack I've seen, but it happened last year and I haven't
    Tim> heard of it happening again.

Perhaps the cost to create such spam outweighs the potential benefit.  You
have to maintain a fair amount of information about the people you want to
spam.  In addition, it's not at all obvious that the people who post to
public mailing lists and newsgroups:

    * cover the list of candidate spam recipients very well, or

    * are the sorts of people who would be scammed by bigger
      manhood or MLM come-ons.

Maybe it was just a test by a spammer which returned a negative result and
was thus abandoned.

Skip

From skip at pobox.com  Sun Mar  9 08:28:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 09:30:23 2003
Subject: [Spambayes] spaced out spam
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEELCDPAB.tim_one@email.msn.com>
References: <LNBBLJKPBEHFEDALKOLCEELCDPAB.tim_one@email.msn.com>
Message-ID: <15979.20391.307592.911910@montanaro.dyndns.org>


    Tim> Skip later reported that running an up-to-date classifier nailed it
    Tim> as spam despite the absence of body clues:

    Tim>     http://mail.python.org/pipermail/spambayes/2003-March/003834.html

    Tim> I think that last report was also a bit suspicious, though, as the
    Tim> clue listing appeared to contain hapaxes unique to the msg being
    Tim> scored (suggesting that the msg had already been trained on as
    Tim> spam); e.g.,

    Tim>           'message-id:@hkgioexchange1.corp.giordano.com.hk': 0.84;

Well, yes.  It was reported as unsure.  As is my normal practice, I saved it
to my spam collection and trained on it later.  That doesn't negate all the
other clues which were originally missing.  I believe I explained that I was
mixing apples and oranges, comparing the debug header info generated on one
(out-of-date) machine with the classification header generated on my (much
more up-to-date) laptop.

Skip

From skip at pobox.com  Sun Mar  9 08:38:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 09:38:24 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>
References: <LNBBLJKPBEHFEDALKOLCEEKNDPAB.tim_one@email.msn.com>
        <RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>
Message-ID: <15979.20959.993297.864816@montanaro.dyndns.org>


    Tim> Ok, I train on virtually every piece of mail that comes into my
    Tim> notes inbox.  the ratio is about 10:1 spam:ham.  I currently have
    Tim> about 600 spam trained into the database.  I still get maybe
    Tim> 10%-15% unsure, invariably on spam.  I virtually never have a FP.
    Tim> Maybe I just need to adjust the spam cutoff...  Mainly thinking out
    Tim> loud, and bemoaning the fact that I've annoyed my namesake.

Tim,

I know your Notes environment may not allow this, but I do a couple things
to minimize the number of duplicate postings that ever get considered.  At
the very start of my .procmailrc file I remove messages with a message-id
I've seen recently:

    # make sure we don't get two copies of the same message
    :0 Wh: msgid.lock
    | $FORMAIL -D 16384 $HOME/tmp/msgid.cache

Later, after a message has been determined to be spam, I run my loose
checksum script and dump the message if it looks the same as a previous
spam:

    :0
    * ^X-Spambayes-Classification: spam
    {
        ### this recipe gobbles items with matching body checksums (taken
        ### loosely to try and avoid obvious tricks)
        :0 W: cksum.lock
        | $PYCKSUM -v $HOME/tmp/cksum.cache

        :0:
        $SPAM
    }

If I didn't take these steps I'm sure I'd get more spam (and probably see
more mistakes).  Since building my initial large training set, I have
generally only trained on mistakes and unsures.  Accordingly, I have about
12,000 saved hams and 7,000 saved spams.  If the code changes I retrain
completely, but generally only retrain on new messages.

I think either of these techniques (message-id caching and loose checksums)
could be incorporated into pop3proxy without much effort.
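
A minimal sketch of what such a check might look like in Python.  This is
not the checksum script used in the recipes above; the normalization
(lowercase, drop digits and all whitespace) and the names loose_checksum
and SeenCache are assumptions chosen for the illustration.

```python
import hashlib
import re
from collections import OrderedDict


def loose_checksum(body):
    """Hash a message body after discarding details that are cheap to vary."""
    text = body.lower()
    text = re.sub(r"\d+", "", text)   # digits: tracking numbers, counters
    text = re.sub(r"\s+", "", text)   # all whitespace, including line wraps
    return hashlib.md5(text.encode("utf-8")).hexdigest()


class SeenCache:
    """Remember the most recent `maxsize` keys (message-ids or checksums)."""

    def __init__(self, maxsize=1000):
        self.maxsize = maxsize
        self._keys = OrderedDict()

    def is_duplicate(self, key):
        """Return True if key was seen recently; record it either way."""
        if key in self._keys:
            self._keys.move_to_end(key)
            return True
        self._keys[key] = None
        if len(self._keys) > self.maxsize:
            self._keys.popitem(last=False)  # evict the oldest key
        return False
```

The same cache can hold message-ids for the first check and body checksums
for the second; a bounded size plays the role of the 16384-byte limit given
to formail.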

Maybe you could use something like the script I posted the other day to
remove duplicates from your collection and bring your spam:ham ratio into
something closer to 1:1.

Skip


From bill at parducci.net  Sun Mar  9 06:58:03 2003
From: bill at parducci.net (bill parducci)
Date: Sun Mar  9 09:58:07 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15979.20959.993297.864816@montanaro.dyndns.org>
References: <LNBBLJKPBEHFEDALKOLCEEKNDPAB.tim_one@email.msn.com>
	<RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>
	<15979.20959.993297.864816@montanaro.dyndns.org>
Message-ID: <3E6B567B.7080503@parducci.net>

Skip Montanaro wrote:

[...]

> Maybe you could use something like the script I posted the other day to
> remove duplicates from your collection and bring your spam:ham ratio into
> something closer to 1:1.

is there a query that can be run to see what the current ratio of trained messages is?

thanks

b


From skip at pobox.com  Sun Mar  9 09:51:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 10:52:13 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <3E6B567B.7080503@parducci.net>
References: <LNBBLJKPBEHFEDALKOLCEEKNDPAB.tim_one@email.msn.com>
        <RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>
        <15979.20959.993297.864816@montanaro.dyndns.org>
        <3E6B567B.7080503@parducci.net>
Message-ID: <15979.25357.466002.698085@montanaro.dyndns.org>


    >> Maybe you could use something like the script I posted the other day
    >> to remove duplicates from your collection and bring your spam:ham
    >> ratio into something closer to 1:1.

    bill> is there a query that can be run to see what the current ratio of
    bill> trained messages is?

I use mbox-formatted files, so it's fairly easy on Unix-like systems:

    % egrep '^From ' newham.clean.save | wc -l
       11870 
    % egrep '^From ' newspam.clean.save | wc -l
        6994 

Skip
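
Where egrep isn't available, the same count can be sketched in Python.
Like the egrep version, this assumes body lines that begin with "From "
are escaped in the mbox file; the function name is hypothetical and the
file names are the ones shown above.

```python
def count_mbox_messages(path):
    """Count messages in an mbox file by counting 'From ' separator lines."""
    count = 0
    with open(path, "rb") as f:
        for line in f:
            if line.startswith(b"From "):
                count += 1
    return count


if __name__ == "__main__":
    nham = count_mbox_messages("newham.clean.save")
    nspam = count_mbox_messages("newspam.clean.save")
    print("ham: %d  spam: %d" % (nham, nspam))
```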

From nas at python.ca  Sun Mar  9 09:58:16 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar  9 12:48:37 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEKLDPAB.tim_one@email.msn.com>
References: <20030309041447.GA17672@glacier.arctrix.com>
	<LNBBLJKPBEHFEDALKOLCKEKLDPAB.tim_one@email.msn.com>
Message-ID: <20030309175816.GA19182@glacier.arctrix.com>

Tim Peters wrote:
> One run with True, and another with False.  If you have the same # of ham
> and spam in your training data, it shouldn't make any difference.

Okay, I tested with a natural imbalance.  Looks like it doesn't hurt or
help me.

out/unbalanced-bases.txt -> out/unbalanced-adjusts.txt
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
-> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams

false positive percentages
    0.731  0.731  tied          
    0.366  0.366  tied          
    0.183  0.548  lost  +199.45%
    0.183  0.183  tied          
    0.183  0.183  tied          

won   0 times
tied  4 times
lost  1 times

total unique fp went from 9 to 11 lost   +22.22%
mean fp % went from 0.329067641682 to 0.402193784278 lost   +22.22%

false negative percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.257  0.257  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 1 to 1 tied          
mean fn % went from 0.051413881748 to 0.051413881748 tied          

ham mean                     ham sdev
   2.61    2.94  +12.64%       11.66   12.40   +6.35%
   2.66    2.94  +10.53%       11.20   11.87   +5.98%
   2.42    2.71  +11.98%       11.25   12.20   +8.44%
   1.78    2.00  +12.36%        9.05    9.81   +8.40%
   1.92    2.15  +11.98%        9.00    9.68   +7.56%

ham mean and sdev for all runs
   2.28    2.55  +11.84%       10.50   11.26   +7.24%

spam mean                    spam sdev
  99.56   99.63   +0.07%        3.29    2.50  -24.01%
  99.22   99.30   +0.08%        5.03    4.68   -6.96%
  99.63   99.68   +0.05%        2.82    2.55   -9.57%
  99.46   99.55   +0.09%        3.96    3.20  -19.19%
  99.17   99.22   +0.05%        6.41    6.12   -4.52%

spam mean and sdev for all runs
  99.41   99.48   +0.07%        4.50    4.06   -9.78%

ham/spam mean difference: 97.13 96.93 -0.20
547 389

From nas at python.ca  Sun Mar  9 12:08:09 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar  9 14:58:30 2003
Subject: [Spambayes] better Received header tokens
Message-ID: <20030309200808.GA19398@glacier.arctrix.com>

I wasted some time today trying to improve the mine_received_headers
option.  The goal was to generate fewer, more useful tokens.  Also,
I wanted to be resistant to received header forgery.  For the sake of
posterity, here's what I came up with:

    # Body of a tokenizer generator; needs "import re" at module level.
    ippat = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
    received_re = re.compile(r"from .*\b(%s)[)\]].*\b"
                             r"by (\S+)\s+([^;]*)" % ippat, re.M|re.S)
    hops = 0
    network = None
    for hdr in msg.get_all("received", []):
        m = received_re.search(hdr)
        if m:
            ip = m.group(1)
            # keep the first two octets, i.e. the /16 network
            n = '.'.join(ip.split('.')[:2])
            if n != network:
                # count a hop only when the network changes
                hops += 1
                network = n
                yield 'received:%d:%s' % (hops, network)
    yield 'received:%d' % hops

I expected this to do better than the current code.  Testing shows
otherwise.  Perhaps using a more specific or more general network
(instead of /16) would help.
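
To try the idea outside the tokenizer, the fragment can be wrapped in a
standalone generator, with the % operator applied in both yields.  This is
a sketch only: the function name received_tokens and the sample headers
(documentation addresses and example.* hosts) are made up for the test.

```python
import email
import re

IPPAT = r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
RECEIVED_RE = re.compile(r"from .*\b(%s)[)\]].*\b"
                         r"by (\S+)\s+([^;]*)" % IPPAT, re.M | re.S)

# Hypothetical two-hop message used only to exercise the generator.
SAMPLE = (
    "Received: from relay.example.net (relay.example.net [198.51.100.7])\n"
    "\tby mail.example.com with SMTP; Sat, 8 Mar 2003 22:13:35 -0800\n"
    "Received: from client.example.org (client.example.org [192.0.2.10])\n"
    "\tby relay.example.net with ESMTP; Sat, 8 Mar 2003 22:13:30 -0800\n"
    "Subject: test\n"
    "\n"
    "body\n"
)


def received_tokens(msg):
    """Yield one token per change of /16 network, then a hop-count token."""
    hops = 0
    network = None
    for hdr in msg.get_all("received", []):
        m = RECEIVED_RE.search(hdr)
        if m:
            ip = m.group(1)
            n = '.'.join(ip.split('.')[:2])   # first two octets
            if n != network:
                hops += 1
                network = n
                yield 'received:%d:%s' % (hops, network)
    yield 'received:%d' % hops


if __name__ == "__main__":
    for token in received_tokens(email.message_from_string(SAMPLE)):
        print(token)
```

Changing the slice [:2] to [:1] or [:3] is the quickest way to test a more
general or more specific network.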

  Neil

From dan at tobias.name  Sun Mar  9 14:39:53 2003
From: dan at tobias.name (Daniel R. Tobias)
Date: Sun Mar  9 15:03:09 2003
Subject: [Spambayes] SpamBayes
Message-ID: <3E6B9889.4030207@tobias.name>

I'm just trying out the SpamBayes proxy software now; seems like a very 
good idea.

However, I've had some problems with the program tending to have a total 
nervous breakdown any time its data structure gets in any way different 
from what it expects, like the database index getting corrupted due to a 
system crash, or some of the inbound messages getting deleted by a virus 
scanning program during processing.  It seems you haven't programmed any 
sort of graceful recovery when any data file becomes missing or 
corrupted, but just crash the script altogether.  Once, the proxy 
wouldn't even start at all due to some data error and I had to wipe its 
data files out entirely and start over.  Other times (like when the 
virus scanner kills a message in between it being downloaded and being 
reviewed for ham/spam training purposes), a message will show in the 
list of messages to review, but when you try to do anything with it, the 
program crashes.  This process needs improving to reach a level of 
robustness needed to use in a production environment rather than just 
for testing and experimentation purposes.

-- 
== Dan ==
Dan's Web Tips: http://webtips.dan.info/
Dan's Domain Site: http://domains.dan.info/


From noreply at sourceforge.net  Sun Mar  9 00:33:31 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun Mar  9 15:03:15 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-700165 ] MoveFileEx doesn't exist on Win98
Message-ID: <E18rwFP-0004qW-00@sc8-sf-web2.sourceforge.net>

Bugs item #700165, was opened at 2003-03-09 11:53
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Tim Peters (tim_one)
Assigned to: Mark Hammond (mhammond)
Summary: MoveFileEx doesn't exist on Win98

Initial Comment:
After a CVS up, Outlook craps out on Win98SE now in 
BayesManager._MigrateFile.

  File "C:\Code\spambayes\Outlook2000\manager.py", 
line 213, in _MigrateFile
    win32con.MOVEFILE_COPY_ALLOWED)
pywintypes.error: (120, 'MoveFileEx', 'This function is 
only valid in Win32 mode.')

which really seems to mean that MoveFileEx isn't 
supported at or before Win98.

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-09 19:33

Message:
Logged In: YES 
user_id=14198

Thanks!

----------------------------------------------------------------------

Comment By: Tim Peters (tim_one)
Date: 2003-03-09 12:06

Message:
Logged In: YES 
user_id=31435

I checked in a patch to Outlook2000/manager.py, rev1.54, 
which worked for me on Win98.  If you're happy with this, 
just close the bug.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=700165&group_id=61702

From tim.one at comcast.net  Sun Mar  9 15:44:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 15:45:12 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <RLA72YG2Z2VR4XZV898OJSOWRMJ61.3e6b439c@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEOAEAAB.tim.one@comcast.net>

[Tim Stone]
> Can I plead nocturnal insanity?  Maybe it was all the housecleaning fluid
> fumes...

Only if you mainlined them <wink>.

> Ok, I train on virtually every piece of mail that comes into my
> notes inbox.  the ratio is about 10:1 spam:ham.  I currently have about
> 600 spam trained into the database.

Implying that you've trained on a total of about 60 ham?  If so, that's very
light training (for this system).

> I still get maybe 10%-15% unsure, invariably on spam.  I virtually
> never have a FP.

Peculiar!  Try turning on the experimental imbalance adjustment just to see
what happens.  I don't expect it will help, but I wouldn't have expected the
outcome you're getting either.

> Maybe I just need to adjust the spam cutoff...

Can't guess from here.

> Mainly thinking out loud, and bemoaning the fact that I've annoyed my
> namesake.

Na, acting irritated is just fun for Tims <wink>.


From tim.one at comcast.net  Sun Mar  9 15:49:32 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 15:50:04 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <15979.20201.258754.620902@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEOBEAAB.tim.one@comcast.net>

[Tim]
> Jeremy and Guido were both recipients of cunning spam this system
> couldn't stop [as replies to their postings]


[Skip Montanaro]
> ...
> Maybe it was just a test by a spammer which returned a negative result
> and was thus abandoned.

That's what I figured -- any form of targeting spam adds expense.  Targeting
posters to tech mailing lists has got to be close to a zero-response
approach.


From nas at python.ca  Sun Mar  9 13:00:11 2003
From: nas at python.ca (Neil Schemenauer)
Date: Sun Mar  9 15:50:30 2003
Subject: [Spambayes] Integration with qmail?
In-Reply-To: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
References: <E8E5E0D3B5C9D611B23500C00D00E9BC3036EC@CSREESSERVER>
Message-ID: <20030309210011.GA19599@glacier.arctrix.com>

Martinez, Michael - CSREES/ISTM wrote:
> I'm looking to integrate spambayes with a qmail smtp gateway. Any pointers
> would be appreciated.

See http://arctrix.com/nas/qmail/spambayes/ .  The code is still a bit
rough and the instructions were hastily written.

The cool part about the system is that it should be suitable for
deployment at the mail server level.  Users don't need to do anything
and should not have to worry about legitimate email being rejected.
Obviously it does not perform quite as well as a personal filter but
it is much better than no filter at all.

  Neil

From tim.one at comcast.net  Sun Mar  9 16:14:55 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 16:15:28 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <20030309175816.GA19182@glacier.arctrix.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEOEEAAB.tim.one@comcast.net>

[Neil Schemenauer, tests the experimental imbalance adjustment]
> Okay, I tested with a natural imbalance.  Looks like it doesn't hurt or
> help me.
>
> out/unbalanced-bases.txt -> out/unbalanced-adjusts.txt
> -> <stat> tested 547 hams & 389 spams against 2188 hams & 1556 spams
> ...

This is a very mild imbalance, so I don't expect much change.  The option
was introduced when people reported imbalance ratios close to 20; yours is
under 1.5.  Since you have more ham than spam, without the adjustment the
spamprob of a ham word can get closer to 0 than the spamprob of a spam word
can get to 1, effectively giving ham words more strength than spam words.
The effect of the adjustment is to make ham words "less hammy", which should
tend to reduce FN and increase FP.  The larger the imbalance ratio, the more
pronounced these effects should be.
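
A worked sketch of that effect, using the training counts from this test
run.  It assumes a Robinson-style estimate (s*x + n*p) / (s + n) with
s = 0.45 and x = 0.5, and assumes the adjustment works by scaling down
the evidence count n contributed by the more popular category.  The
function name and those constants are assumptions for illustration, not
a quote of the classifier's code.

```python
def spamprob(hamcount, spamcount, nham, nspam, adjust=False, s=0.45, x=0.5):
    """Estimate a word's spam probability (illustrative sketch).

    hamcount/spamcount: messages of each kind containing the word.
    nham/nspam: total trained messages of each kind.
    Requires hamcount + spamcount > 0.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    if adjust:
        # Shrink the evidence from whichever category is over-represented.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    nham, nspam = 2188, 1556
    for adjust in (False, True):
        ham_word = spamprob(nham, 0, nham, nspam, adjust)    # in every ham
        spam_word = spamprob(0, nspam, nham, nspam, adjust)  # in every spam
        print(adjust, ham_word, 1.0 - spam_word)
```

Without the adjustment the all-ham word sits closer to 0 than the all-spam
word sits to 1.  With it, the two distances come out equal under these
assumptions.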

> false positive percentages
>     0.731  0.731  tied
>     0.366  0.366  tied
>     0.183  0.548  lost  +199.45%
>     0.183  0.183  tied
>     0.183  0.183  tied
>
> won   0 times
> tied  4 times
> lost  1 times
>
> total unique fp went from 9 to 11 lost   +22.22%
> mean fp % went from 0.329067641682 to 0.402193784278 lost   +22.22%
>
> false negative percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.257  0.257  tied
>
> won   0 times
> tied  5 times
> lost  0 times
>
> total unique fn went from 1 to 1 tied
> mean fn % went from 0.051413881748 to 0.051413881748 tied
>
> ham mean                     ham sdev
>    2.61    2.94  +12.64%       11.66   12.40   +6.35%
>    2.66    2.94  +10.53%       11.20   11.87   +5.98%
>    2.42    2.71  +11.98%       11.25   12.20   +8.44%
>    1.78    2.00  +12.36%        9.05    9.81   +8.40%
>    1.92    2.15  +11.98%        9.00    9.68   +7.56%
>
> ham mean and sdev for all runs
>    2.28    2.55  +11.84%       10.50   11.26   +7.24%
>
> spam mean                    spam sdev
>   99.56   99.63   +0.07%        3.29    2.50  -24.01%
>   99.22   99.30   +0.08%        5.03    4.68   -6.96%
>   99.63   99.68   +0.05%        2.82    2.55   -9.57%
>   99.46   99.55   +0.09%        3.96    3.20  -19.19%
>   99.17   99.22   +0.05%        6.41    6.12   -4.52%
>
> spam mean and sdev for all runs
>   99.41   99.48   +0.07%        4.50    4.06   -9.78%

Since words look "less hammy" after the adjustment, an increase in both means
is expected, and the appearance of ham words in spam doesn't yank down the
spam scores as much so a decrease in spam sdev is also expected.  OTOH, the
ham words in ham are also less hammy after adjustment, so ham scores are
expected to spread more (-> increase in ham sdev).

So the changes were all qualitatively expected, and overall didn't make a
real bottom-line difference.  Imbalance this mild isn't what the gimmick was
aiming at, though -- it was aimed at stopping disastrous embarrassments for
people with extreme training ratios.

Thank you for trying it!


From tim at fourstonesExpressions.com  Sun Mar  9 15:23:02 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Mar  9 16:23:10 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEOAEAAB.tim.one@comcast.net>
Message-ID: <TOHC4XURQMCBB8FBPO85OUQPNNLSOLK.3e6bb0b6@myst>

3/9/2003 2:44:21 PM, Tim Peters <tim.one@comcast.net> wrote:

>Implying that you've trained on a total of about 60 ham?  If so, that's very
>light training (for this system).

Yeah... but that's the ratio I get.  My notes inbox is not heavily used for 
legitimate mail, but the mail that IS there is extremely important.

>
>> I still get maybe 10%-15% unsure, invariably on spam.  I virtually
>> never have a FP.
>
>Peculiar!  Try turning on the experimental imbalance adjustment just to see
>what happens.  I don't expect it will help, but I wouldn't have expected the
>outcome you're getting either.

I'm going to play with this one, and with the spamcutoff as well.  I'm also 
going to do some clues investigation, which will be a bit of a trick because 
there's no place in notes that I can store a 'header'...  More at 11...


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Sun Mar  9 16:30:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 16:31:14 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <3E69D562.5861.18EF176D@localhost>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEOFEAAB.tim.one@comcast.net>

[Stephen Anderson]
> Okay Tim, I just can't take it anymore.  My curiosity has gotten
> the best of me.  Would you  please ask your sisters to email me a
> sample of one of their very pretty HTML emails you keep referring
> to.

Nope -- if they sent email to people they didn't grow up with, they'd get a
spam problem.  They have no presence in CyberSpace -- not even google can
find them <wink>.

> I have a sister too, but her HTML emails are almost indistinguishable
> in presentation from that of a plain-text one.  So, can you help me
> with my burning question:  Just what does pretty email look like?

It helps if your sister is an artist, can use image and sound manipulation
programs, and doesn't pay much attention to copyright notices on web sites.
One of my sisters even taught herself (just) enough about Java to reuse Java
applets, specifying different parameters.  A pretty email is a coordinated
combination of sound and color or images, and sometimes animation.  At the
guts-of-the-HTML level, it has a lot in common with fancy spam.  At the
human level, though, it's pleasing (or even poignant) instead of obnoxious.
I don't know how to automate telling the difference.

When MSN first started, there used to be a lot of that on their proprietary
newsgroups too.  Dialup speed pretty much killed it (along with MSN's
attempts to sell proprietary "rich" content).


From skip at pobox.com  Sun Mar  9 15:35:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 16:35:08 2003
Subject: [Spambayes] SpamBayes
In-Reply-To: <3E6B9889.4030207@tobias.name>
References: <3E6B9889.4030207@tobias.name>
Message-ID: <15979.45963.260615.349902@montanaro.dyndns.org>

    Dan> It seems you haven't programmed any sort of graceful recovery when
    Dan> any data file becomes missing or corrupted, but just crash the
    Dan> script altogether.  

In the face of a corrupt database, all I think we could do is toss out the
database and work with no clues.  Every message would wind up "unsure" with
a score of 0.50.  Is that acceptable to you?
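
The fallback described above can be sketched as follows.  This is only an illustration, not pop3proxy's actual code: `load_token_db` and `score` are hypothetical names, the database is assumed to be a pickled dict mapping tokens to spam probabilities, and the averaging in `score` is a stand-in for the real combining step.

```python
import pickle


def load_token_db(path):
    """Return the stored token->spamprob dict, or {} if it is missing or corrupt."""
    try:
        with open(path, "rb") as f:
            db = pickle.load(f)
    except Exception:
        # Missing file, truncated pickle, garbage bytes: start with no clues.
        return {}
    return db if isinstance(db, dict) else {}


def score(tokens, db):
    """Toy score: with no known clues every message comes out at 0.5 (unsure)."""
    known = [db[t] for t in tokens if t in db]
    if not known:
        return 0.5
    return sum(known) / len(known)  # stand-in for the real combining scheme
```

With an empty database every message lands at 0.5, which is the "unsure" outcome mentioned above.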

    Dan> Once, the proxy wouldn't even start at all due to some data error
    Dan> and I had to wipe its data files out entirely and start over.

I think that's about all it could do automatically.

    Dan> Other times (like when the virus scanner kills a message in between
    Dan> it being downloaded and being reviewed for ham/spam training
    Dan> purposes), a message will show in the list of messages to review,
    Dan> but when you try to do anything with it, the program crashes.  

I think this could be more easily recovered from.

    Dan> This process needs improving to reach a level of robustness needed
    Dan> to use in a production environment rather than just for testing and
    Dan> experimentation purposes.

Granted.

Skip

From lists at morpheus.demon.co.uk  Sun Mar  9 21:41:43 2003
From: lists at morpheus.demon.co.uk (Paul Moore)
Date: Sun Mar  9 16:50:05 2003
Subject: [Spambayes] full o' spaces
References: <20030309175816.GA19182@glacier.arctrix.com>
	<LNBBLJKPBEHFEDALKOLCCEOEEAAB.tim.one@comcast.net>
Message-ID: <n2m-g.bs0k1alk.fsf@morpheus.demon.co.uk>

Tim Peters <tim.one@comcast.net> writes:

> This is a very mild imbalance, so I don't expect much change.  The option
> was introduced when people reported imbalance ratios close to 20; yours is
> under 1.5.

I have a spam:ham imbalance of 10:1 in my current database. However,
my available corpus is pretty small - under 200 ham and 3500 spam
(I've been retaining spams for a while now, but my saved ham is
basically just mails I found interesting enough to archive). And that
corpus is what I trained on, in any case.

So I don't have any test data. And I don't really understand how to
run tests with what I do have :-(

I'd happily run some tests on this database, if you could give me some
details on how to go about it.

Paul.
-- 
This signature intentionally left blank

From tim at fourstonesExpressions.com  Sun Mar  9 16:04:07 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Mar  9 17:04:13 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEOFEAAB.tim.one@comcast.net>
Message-ID: <WTTS53LILHDWV98OIPNX532U87YUO.3e6bba57@myst>

3/9/2003 3:30:42 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Stephen Anderson]
>> Okay Tim, I just can't take it anymore.  My curiosity has gotten
>> the best of me.  Would you  please ask your sisters to email me a
>> sample of one of their very pretty HTML emails you keep referring
>> to.
>
>Nope -- if they sent email to people they didn't grow up with, they'd get a
>spam problem.  They have no presence in CyberSpace -- not even google can
>find them <wink>.


Saaaaaaaayyyy.... so all this stuff about needing to be easy enough for your 
sisters was just so much smoke?  <wink**2>

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Mon Mar 10 12:16:24 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 18:17:06 2003
Subject: [Spambayes] Outlook Express integration
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7A@its-xchg4.massey.ac.nz>

> I've managed to get Spambayes working with Outlook Express, 
> but it isn't pretty.

Well, nor is OE :)

> Basically I've changed the hammie_header_name to 'To', so OE 
> can filter on
> it.  A few minor mods to pop3proxy.py were required because 
> there's usually another 'To' header present.

I note from the wiki that you're using a2.  Spambayes can now add classification information in the subject line (add "pop3proxy_notate_subject: True" to your config file).  This should avoid having to use the 'To' header, with all the inherent problems.

> I personally think the HTML interface is OK for training, but 
> I can see the obvious attraction of an integrated solution as
> offered by the Outlook plug-in.

There are two (main) problems.  One is that integration into clients is a *lot* of work, and there are a *lot* of clients around.  The other is that OE is such a limited client that just about any integration is either impossible, or even more work.

> Would you be so kind as to offer some suggestions on how I 
> could improve this?

Sure:
1. Get the latest CVS. (I'm thinking that it's time for a3, anyway, especially once the gary_ stuff is gone).
2. Try using pop3proxy_notate_subject - you'll have to rewrite your rules, but it should work better.
3. Try using the smtpproxy (see below).
4. Send your comments & ideas back to the list :)

SMTPProxy (maybe something should be added to the docs?)
--------------------------------------------------------
This is an alternative method for training, which really needs evaluation.  Setup is just like pop3proxy - go to http://localhost:8880 and in the options put your normal smtp server(s) and port(s).  In OE, set the outgoing SMTP server to localhost.

You can now train by forwarding/bouncing mail to special addresses - these default to spambayes_ham@localhost and spambayes_spam@localhost, but you can set them to whatever you like (smtpproxy_ham_address and smtpproxy_spam_address in your config file).  No need to regularly go to the config web page at all.

Note that the smtpproxy is, by default, not active - you'll need to configure it.  Note also, that since OE doesn't treat headers nicely, you'll need to set pop3proxy_add_mailid_to to "body".

Finally, to start the smtpproxy, add "-s" when you start pop3proxy (i.e. "pop3proxy -s", if you don't use any other options).

Hope this helps :)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Mon Mar 10 12:19:06 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 18:19:42 2003
Subject: [Spambayes] Headers and pop3proxy
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C892@its-xchg4.massey.ac.nz>

> Is there an easy way (perhaps a parameter in 
> bayescustomize.ini) to get pop3proxy to add a header giving the
> spam probability score, as well as 
> the one classifying the message as ham/unsure/spam?

As far as I can tell, no.  This would be very simple to add, though.

Quick poll from the list: do I provide this as a patch, or check it in?

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Mon Mar 10 12:26:32 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 18:27:06 2003
Subject: [Spambayes] full o' spaces
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz>

> this is not at all to say 
> that this will be the case here but as new ideas are bandied 
> about, i posit that it is a good idea to make sure that 
> previously discarded methodologies be reexamined periodically.

I would absolutely agree with this.  To grab the box for a minute and add my 2c to the discussion about being reactive or proactive:

I think that we should be as proactive as possible in trying to find new ways to tag mail that distinguish spam & ham - like the bytes/word count, and so on.  But I don't think these should be checked in, unless they do demonstrate that they make a difference.  The important thing is to code them, and test them, and note those tests & code so that later on, (when, for example, white space spam is really common), we can be as quickly reactive as possible, just grabbing code from the archive, re-testing it and deploying it.

Along with this, it would be great if every now and then, some of these rejected ideas were retested against the current code and current ham/spam.  Plus, of course, testing the odd idea that is in the code that might not still need to be there.

Just my thoughts...

=Tony Meyer

From mhammond at skippinet.com.au  Mon Mar 10 10:30:20 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Mar  9 18:31:27 2003
Subject: [Spambayes] Headers and pop3proxy
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C892@its-xchg4.massey.ac.nz>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEFOOFAA.mhammond@skippinet.com.au>

> Quick poll from the list: do I provide this as a patch, or check it in?

If you consider it "safe", the impact would be restricted to the single
application, and you are currently actively maintaining that application,
then go for it!

Mark.


From mhammond at skippinet.com.au  Mon Mar 10 10:42:43 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Mar  9 18:43:48 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEFPOFAA.mhammond@skippinet.com.au>

Here are my current results on the imbalance option.  Interestingly, my
initial "-n2" results looked better than my "-n 5" results below.

FWIW, Outlook users should remember there is an "export.py" script in the
addin directory.  This will export your ham and spam into the
"spambayes\testdata\Data" directory, which is the default place the test
tools use.  Just run this from the command line.

And for everyone else, once you have a "Data" directory, running the tests
means:

* Create testtools\bayescustomize.ini with the options you want to test
* run "timtest.py -n 2 > result1.txt"
* run "rates result1.txt"  - this creates "result1s.txt"
* Repeat the above, changing the options, and redirecting to
  "result2.txt" and getting "result2s.txt" as final output.
* Run "cmp.py result1s.txt result2s.txt"

Well - if it *doesn't* mean that, then you can ignore my results too <wink>.
My results below are for "-n 5".

Mark.

\temp\imbalance_falses.txt -> \temp\imbalance_trues.txt
-> <stat> tested 412 hams & 1004 spams against 429 hams & 1019 spams
-> <stat> tested 440 hams & 1076 spams against 429 hams & 1019 spams
-> <stat> tested 397 hams & 1054 spams against 429 hams & 1019 spams
-> <stat> tested 477 hams & 1056 spams against 429 hams & 1019 spams
-> <stat> tested 429 hams & 1019 spams against 412 hams & 1004 spams
-> <stat> tested 440 hams & 1076 spams against 412 hams & 1004 spams
-> <stat> tested 397 hams & 1054 spams against 412 hams & 1004 spams
-> <stat> tested 477 hams & 1056 spams against 412 hams & 1004 spams
-> <stat> tested 429 hams & 1019 spams against 440 hams & 1076 spams
-> <stat> tested 412 hams & 1004 spams against 440 hams & 1076 spams
-> <stat> tested 397 hams & 1054 spams against 440 hams & 1076 spams
-> <stat> tested 477 hams & 1056 spams against 440 hams & 1076 spams
-> <stat> tested 429 hams & 1019 spams against 397 hams & 1054 spams
-> <stat> tested 412 hams & 1004 spams against 397 hams & 1054 spams
-> <stat> tested 440 hams & 1076 spams against 397 hams & 1054 spams
-> <stat> tested 477 hams & 1056 spams against 397 hams & 1054 spams
-> <stat> tested 429 hams & 1019 spams against 477 hams & 1056 spams
-> <stat> tested 412 hams & 1004 spams against 477 hams & 1056 spams
-> <stat> tested 440 hams & 1076 spams against 477 hams & 1056 spams
-> <stat> tested 397 hams & 1054 spams against 477 hams & 1056 spams
-> <stat> tested 412 hams & 1004 spams against 429 hams & 1019 spams
-> <stat> tested 440 hams & 1076 spams against 429 hams & 1019 spams
-> <stat> tested 397 hams & 1054 spams against 429 hams & 1019 spams
-> <stat> tested 477 hams & 1056 spams against 429 hams & 1019 spams
-> <stat> tested 429 hams & 1019 spams against 412 hams & 1004 spams
-> <stat> tested 440 hams & 1076 spams against 412 hams & 1004 spams
-> <stat> tested 397 hams & 1054 spams against 412 hams & 1004 spams
-> <stat> tested 477 hams & 1056 spams against 412 hams & 1004 spams
-> <stat> tested 429 hams & 1019 spams against 440 hams & 1076 spams
-> <stat> tested 412 hams & 1004 spams against 440 hams & 1076 spams
-> <stat> tested 397 hams & 1054 spams against 440 hams & 1076 spams
-> <stat> tested 477 hams & 1056 spams against 440 hams & 1076 spams
-> <stat> tested 429 hams & 1019 spams against 397 hams & 1054 spams
-> <stat> tested 412 hams & 1004 spams against 397 hams & 1054 spams
-> <stat> tested 440 hams & 1076 spams against 397 hams & 1054 spams
-> <stat> tested 477 hams & 1056 spams against 397 hams & 1054 spams
-> <stat> tested 429 hams & 1019 spams against 477 hams & 1056 spams
-> <stat> tested 412 hams & 1004 spams against 477 hams & 1056 spams
-> <stat> tested 440 hams & 1076 spams against 477 hams & 1056 spams
-> <stat> tested 397 hams & 1054 spams against 477 hams & 1056 spams

false positive percentages
    1.699  1.214  won    -28.55%
    0.909  0.682  won    -24.97%
    1.008  0.756  won    -25.00%
    0.210  0.210  tied
    0.932  0.699  won    -25.00%
    0.682  0.227  won    -66.72%
    1.008  0.504  won    -50.00%
    0.000  0.000  tied
    0.466  0.233  won    -50.00%
    0.243  0.243  tied
    1.259  0.504  won    -59.97%
    0.210  0.000  won   -100.00%
    0.699  0.466  won    -33.33%
    1.456  0.728  won    -50.00%
    1.818  1.591  won    -12.49%
    0.839  0.210  won    -74.97%
    0.466  0.233  won    -50.00%
    0.728  0.485  won    -33.38%
    0.455  0.227  won    -50.11%
    1.259  0.756  won    -39.95%

won  17 times
tied  3 times
lost  0 times

total unique fp went from 40 to 26 won    -35.00%
mean fp % went from 0.817290959648 to 0.49835280855 won    -39.02%

false negative percentages
    0.398  0.498  lost   +25.13%
    0.093  0.186  lost  +100.00%
    0.380  0.474  lost   +24.74%
    0.189  0.189  tied
    0.294  0.294  tied
    0.000  0.372  lost  +(was 0)
    0.190  0.285  lost   +50.00%
    0.379  0.568  lost   +49.87%
    0.491  0.883  lost   +79.84%
    0.896  1.195  lost   +33.37%
    0.664  1.139  lost   +71.54%
    0.189  0.379  lost  +100.53%
    0.294  0.393  lost   +33.67%
    0.498  0.697  lost   +39.96%
    0.093  0.093  tied
    0.189  0.379  lost  +100.53%
    0.687  1.374  lost  +100.00%
    1.195  1.295  lost    +8.37%
    0.651  0.929  lost   +42.70%
    0.474  0.664  lost   +40.08%

won   0 times
tied  3 times
lost 17 times

total unique fn went from 44 to 66 lost   +50.00%
mean fn % went from 0.412283315133 to 0.614303438288 lost   +49.00%

ham mean                     ham sdev
   3.82    2.97  -22.25%       14.77   12.72  -13.88%
   3.31    2.42  -26.89%       13.21   10.90  -17.49%
   3.57    2.69  -24.65%       13.66   11.26  -17.57%
   4.26    3.20  -24.88%       15.98   13.14  -17.77%
   3.37    2.65  -21.36%       14.07   12.03  -14.50%

ham mean and sdev for all runs
   3.67    2.79  -23.98%       14.38   12.05  -16.20%

spam mean                    spam sdev
  98.10   96.94   -1.18%        8.44   10.32  +22.27%
  97.83   96.47   -1.39%        9.10   11.49  +26.26%
  97.63   96.24   -1.42%       10.29   12.59  +22.35%
  98.13   96.83   -1.32%        8.23   10.58  +28.55%
  96.93   95.49   -1.49%       11.94   14.13  +18.34%

spam mean and sdev for all runs
  97.72   96.40   -1.35%        9.71   11.91  +22.66%

ham/spam mean difference: 94.05 93.61 -0.44
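
The percentage column in the listing above is consistent with a plain relative change between the two runs.  A minimal sketch, assuming that formula (`pct_change` is a hypothetical helper, not a function taken from cmp.py); a negative value means the second run had fewer errors:

```python
def pct_change(before, after):
    """Relative change from `before` to `after`, in percent (None if before is 0)."""
    if before == 0:
        return None  # the listing shows this case as "+(was 0)"
    return (after - before) / before * 100.0


print("%+.2f%%" % pct_change(1.699, 1.214))  # first false positive row
print("%+.2f%%" % pct_change(40, 26))        # total unique fp
print("%+.2f%%" % pct_change(44, 66))        # total unique fn
```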


From tim at fourstonesExpressions.com  Sun Mar  9 17:47:48 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun Mar  9 18:47:53 2003
Subject: [Spambayes] Headers and pop3proxy
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCEFOOFAA.mhammond@skippinet.com.au>
Message-ID: <D9JG76UQ51XUOKTORM82A753OJ5265PK.3e6bd2a4@myst>

3/9/2003 5:30:20 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> Quick poll from the list: do I provide this as a patch, or check it in?
>

Check it in, dude!

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Mon Mar 10 12:55:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 18:57:33 2003
Subject: [Spambayes] Headers and pop3proxy
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7B@its-xchg4.massey.ac.nz>

> > Is there an easy way (perhaps a parameter in 
> > bayescustomize.ini) to get pop3proxy to add a header giving the
> > spam probability score, as well as 
> > the one classifying the message as ham/unsure/spam?

Ok, new answer: yes (with the latest CVS).  Set pop3proxy_include_prob to "True" in your config file.

Note: This is such a simple patch that I can't see how it would break anything, *but*, I have not tested anything apart from that it is off by default (so no change for everyone), and that it works if turned on with my rather vanilla system.

It currently changes the header from "X-Spambayes-Classification: Spam" (or whatever) to "X-Spambayes-Classification: Spam, .953246327" (or whatever).  If people would like it in a separate header, or formatted (to 2 decimal places for example), let me know.
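
For anyone filtering on the combined header value, here is a rough sketch of splitting it back into a label and a score.  `parse_classification` is a hypothetical helper, and the "label, probability" layout is assumed only from the example above:

```python
def parse_classification(value):
    """Split 'Spam, .953246327' into ('Spam', 0.953246327).

    Returns (label, None) when no probability is present.
    """
    label, sep, prob = value.partition(",")
    label = label.strip()
    if not sep:
        return label, None
    return label, float(prob)


label, prob = parse_classification("Spam, .953246327")
print(label, "%.2f" % prob)  # the 2-decimal formatting mentioned above
```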

=Tony Meyer

From francois.granger at free.fr  Mon Mar 10 01:15:41 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sun Mar  9 19:15:47 2003
Subject: [Spambayes] full o' spaces
In-Reply-To: 
 <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C893@its-xchg4.massey.ac.nz>
Message-ID: <a05200f3cba9189481821@[192.168.1.20]>

At 12:26 +1300 10/03/2003, in message RE: [Spambayes] full o' spaces, 
Meyer, Tony wrote:
>  > this is not at all to say
>>  that this will be the case here but as new ideas are bandied
>>  about, i posit that it is a good idea to make sure that
>>  previously discarded methodologies be reexamined periodically.
>
>I would absolutely agree with this.  To grab the box for a minute 
>and add my 2c to the discussion about being reactive or proactive:
>
>  But I don't think these should be checked in, unless they do 
>demonstrate that they make a difference.  The important thing is to 
>code them, and test them, and note those tests & code so that later 
>on, (when, for example, white space spam is really common), we can 
>be as quickly reactive as possible, just grabbing code from the 
>archive, re-testing it and deploying it.

This brings up the idea of creating a kind of "plugin" concept for adding 
or removing rules?


-- 
Hofstadter's Law :
It always takes longer than you expect, even when you take into 
account Hofstadter's Law.

From T.A.Meyer at massey.ac.nz  Mon Mar 10 13:18:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 19:19:15 2003
Subject: [Spambayes] full o' spaces
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C897@its-xchg4.massey.ac.nz>

[Tony]
> >  But I don't think these should be checked in, unless they do 
> >demonstrate that they make a difference.  The important thing is to 
> >code them, and test them, and note those tests & code so that later 
> >on, (when, for example, white space spam is really common), we can 
> >be as quickly reactive as possible, just grabbing code from the 
> >archive, re-testing it and deploying it.
[Francois]
> This brings up the idea of creating a kind of "plugin" concept for adding 
> or removing rules?

Oooh.  I hadn't thought of that, but I do like it.  Not as a release type tool, but definitely as a debug type one.  I wonder how this could be done in a simple, non-bloat kind of way.

=Tony Meyer

From popiel at wolfskeep.com  Sun Mar  9 18:29:08 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sun Mar  9 21:29:14 2003
Subject: [Spambayes] 
	Re: [Spambayes-checkins] spambayes/testtools timtest.py,1.2,1.3 
In-Reply-To: Message from "Tony Meyer" <anadelonbrin@users.sourceforge.net> 
	<E18sBWH-0001Uw-00@sc8-pr-cvs1.sourceforge.net> 
References: <E18sBWH-0001Uw-00@sc8-pr-cvs1.sourceforge.net> 
Message-ID: <20030310022908.0883D2DE80@cashew.wolfskeep.com>

In message:  <E18sBWH-0001Uw-00@sc8-pr-cvs1.sourceforge.net>
             "Tony Meyer" <anadelonbrin@users.sourceforge.net> writes:
>
>Modified Files:
>	timtest.py 
>Log Message:
>Mangle path for those without spambayes in pythonpath, like Alex's
>mod of testcv.

Heh.  Glad people thought it was a good idea. ;-)

- Alex

From tim.one at comcast.net  Sun Mar  9 21:31:43 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 21:32:20 2003
Subject: [Spambayes] Bytes/words ratio
In-Reply-To: <WTTS53LILHDWV98OIPNX532U87YUO.3e6bba57@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCEELPDPAB.tim.one@comcast.net>

[Tim Stone]
> Saaaaaaaayyyy.... so all this stuff about needing to be easy
> enough for your sisters was just so much smoke?  <wink**2>

Heh -- you're confusing me with the "ease of use" people.  The only effect
this project will have on my siblings is in whether their email gets
unjustly blocked by someone *else* as spam.  Toward avoiding that outcome, I
don't want to penalize HTML mail just for using HTML.  I expect it's not
possible to make this (or any other visible) system easy enough for them to
*use*-- themselves --with Outlook Express.  If they had spam problems (which
they don't), I'd urge them to switch to Outlook and use Mark's spiffy addin.
I'm pretty sure they could use that one (one sister on her own, the other
with some long-distance phone coaching).


From popiel at wolfskeep.com  Sun Mar  9 18:37:09 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sun Mar  9 21:37:13 2003
Subject: [Spambayes] better Received header tokens 
In-Reply-To: Message from Neil Schemenauer <nas@python.ca> 
	<20030309200808.GA19398@glacier.arctrix.com> 
References: <20030309200808.GA19398@glacier.arctrix.com> 
Message-ID: <20030310023710.F11BA2DE80@cashew.wolfskeep.com>

In message:  <20030309200808.GA19398@glacier.arctrix.com>
             Neil Schemenauer <nas@python.ca> writes:

>I wasted some time today trying to improve the mine_received_headers
>option.  The goal was to generate fewer more useful tokens.  Also,
>I wanted to be resistant to received header forgery. [...]

>I expected this to do better than the current code.  Testing shows
>otherwise.  Perhaps using a more specific or more general network
>(instead of /16) would help.
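
To make the "more specific or more general network" idea concrete, here is a small sketch of collapsing an address to its enclosing network.  It is not Neil's patch (the details are elided above); `network_token` and the "received:net:" prefix are made up for illustration:

```python
import ipaddress


def network_token(ip, prefix_len=16):
    """Collapse an IPv4 address to its enclosing network, e.g. a /16."""
    net = ipaddress.ip_network("%s/%d" % (ip, prefix_len), strict=False)
    return "received:net:%s" % net


print(network_token("192.168.1.20"))      # /16
print(network_token("192.168.1.20", 24))  # more specific
print(network_token("192.168.1.20", 8))   # more general
```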

Something that has occurred to me recently: how many tokens does it
take to significantly change the scores?  Most of the recent tokenizing
experiments have been adding between one and a handful of tokens, or
even reducing token count.  Perhaps our problem is not that the
identification methods we're coming up with are bad (heck, Tim did
indicate that the bytes/word token _was_ a strong indicator... I
didn't look at the values for the token itself), but rather that
these new methods of identification are getting drowned out in the
noise.

Perhaps we should figure out some way to give metatokens extra
weight in the combining calculations?  I'm afraid that I don't
have a strong enough math background to know how to do this.

Alternately, we could drop the limit on the number of tokens looked
at from 150 back down to around 20...
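
A sketch of what such a limit does, assuming it keeps the clues whose spam probabilities are farthest from 0.5 (`strongest_clues` is a hypothetical name, not the classifier's own function):

```python
def strongest_clues(probs, limit):
    """Keep the `limit` probabilities farthest from the neutral value 0.5."""
    return sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:limit]


probs = [0.5, 0.99, 0.4, 0.02, 0.7]
print(strongest_clues(probs, 150))  # limit larger than the list: everything kept
print(strongest_clues(probs, 2))    # only the two most extreme clues survive
```

With a small limit, a single new metatoken only counts if it is among the most extreme clues in the message.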

- Alex

From tim.one at comcast.net  Sun Mar  9 22:13:40 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 22:14:16 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <20030310023710.F11BA2DE80@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEMCDPAB.tim.one@comcast.net>

[T. Alexander Popiel]
> Something that has occurred to me recently: how many tokens does it
> take to significantly change the scores?  Most of the recent tokenizing
> experiments have been adding between one and a handful of tokens, or
> even reducing token count.  Perhaps our problem is not that the
> identification methods we're coming up with are bad (heck, Tim did
> indicate that the bytes/word token _was_ a strong indicator... I
> didn't look at the values for the token itself), but rather that
> these new methods of identification are getting drowned out in the
> noise.

Oddly, I doubt it matters.  The median ham score is near 0, and the median
spam score near 100, so most messages are very solidly at one end.  When a
new token is added, it's not going to have any substantial effect on those,
it's going to affect Unsures, and msgs near the Unsure cutoffs.  One token
is enough to swing a msg near a boundary to the other side.

Note that strong indicators aren't necessarily *good* indicators, either:
if they're strongly correlated with other strong indicators, a bad decision
is easy to get.  That's why we strip HTML decorations, for example.  For
another, about the only spam I see rate unsure anymore is stuff that leaks
thru SpamAssassin via python.org.  spambayes *usually* wouldn't have any
trouble with such spam on its own, but there are a dozen header clues all
effectively saying "this came from python.org" then, and those are all
strong ham clues (thanks to SpamAssassin's usual effectiveness).  However,
they're really all the same clue, and the system has no way to realize that;
treating them as a dozen distinct clues gives them way more credence than
they deserve.
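
The effect of counting one clue a dozen times can be seen in a simplified sketch of chi-squared combining.  This is written from memory of the general scheme, not copied from the classifier, and the clue probabilities are invented for illustration:

```python
import math


def chi2q(x2, df):
    """Survival function of the chi-squared distribution for even df."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, df // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Combine per-clue spam probabilities into one score in [0, 1]."""
    n = len(probs)
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


spam_clues = [0.95, 0.95, 0.95]
one = combine(spam_clues + [0.05])         # the ham clue counted once
many = combine(spam_clues + [0.05] * 12)   # the same clue counted twelve times
print("one ham clue: %.3f   twelve copies: %.3f" % (one, many))
```

Because the combiner treats every clue as independent evidence, the twelve copies pull the score much further toward ham than the single clue does.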


From T.A.Meyer at massey.ac.nz  Mon Mar 10 16:21:09 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Sun Mar  9 22:21:57 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz>

Hmm...I created the data set, and then did these:
> * Create testtools\bayescustomize.ini with the options you 
> want to test
> * run "timtest.py -n 2 > result1.txt"
> * run "rates result1.txt"  - this creates "result1s.txt"
> * Repeat the above, changing the options, and redirecting to
>   "result2.txt" and getting "result2s.txt" as final output.

But when I did this:
> * Run "cmp.py result1s.txt result2s.txt"

cmp.py gave me lots of errors, because the lines were not what was expected.  My results docs started with a copy of the options, so I dumped those, but then it had trouble with everything else as well.  The docs do have nice histograms, but cmp.py doesn't give me what it gave Mark :)

Advice, please?

Thanks,
Tony Meyer

From tim.one at comcast.net  Sun Mar  9 22:39:54 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Mar  9 22:40:32 2003
Subject: [Spambayes] FW: Mhammond, Intelligent antispam IER software
In-Reply-To: <002101c2e498$8cd2eab0$530f8490@eden>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEMEDPAB.tim.one@comcast.net>

[Mark Hammond]
> I had to share this irony :)
>
> I received this spam, selling anti-spam software!  I was a little
> disappointed that spambayes scored it as only a "maybe".

Whereas when I got the same spam, I was disappointed to see it scored as
spam!  I like checking out the competition <wink>.

> So I checked the clues - the top 6 ham clues were:
>
> word                                spamprob         #ham  #spam
> '*H*'                               0.0438937           -      -
> '*S*'                               0.78226             -      -

Those two lines imply the overall score was in the high 80s -- do you have
your spam cutoff set to 90?  (Mine is at 80, BTW -- but then I still look at
every new spam every day, and have no fear of FP)
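
The arithmetic behind "high 80s", assuming the '*H*' and '*S*' rows are the two halves of chi-squared combining and the overall score is (S - H + 1) / 2:

```python
H = 0.0438937  # '*H*' row from the clue listing
S = 0.78226    # '*S*' row from the clue listing

score = (S - H + 1.0) / 2.0
print("%.1f%%" % (score * 100))
```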
From skip at pobox.com  Sun Mar  9 21:51:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 22:51:40 2003
Subject: [Spambayes] better Received header tokens 
In-Reply-To: <20030310023710.F11BA2DE80@cashew.wolfskeep.com>
References: <20030309200808.GA19398@glacier.arctrix.com>
        <20030310023710.F11BA2DE80@cashew.wolfskeep.com>
Message-ID: <15980.3010.940085.926858@montanaro.dyndns.org>


    Alex> Alternately, we could drop the limit on the number of tokens
    Alex> looked at from 150 back down to around 20...

I look at all those tokens as many different ways for a message to exonerate
or incriminate itself.  If the various meta-tokens provide five (just to
pick a number out of thin air) more-or-less independent ways to say, "this
looks like spam", it's less likely that a spammer will successfully figure
out how to circumvent all five schemes.  The only positive effect of a lower
limit that I can imagine is improved performance of the classifier, which
would generally be drowned out by either Python startup costs or networking
overhead.

Skip


From skip at pobox.com  Sun Mar  9 21:55:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Mar  9 22:55:45 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEMCDPAB.tim.one@comcast.net>
References: <20030310023710.F11BA2DE80@cashew.wolfskeep.com>
        <LNBBLJKPBEHFEDALKOLCEEMCDPAB.tim.one@comcast.net>
Message-ID: <15980.3261.634515.784710@montanaro.dyndns.org>


    Tim> For another, about the only spam I see rate unsure anymore is stuff
    Tim> that leaks thru SpamAssassin via python.org.  spambayes *usually*
    Tim> wouldn't have any trouble with such spam on its own, but there are
    Tim> a dozen header clues all effectively saying "this came from
    Tim> python.org" ....

That's correct when considering the rather narrow Python email universe, but
I suspect most people live in a somewhat more diverse electronic world than
that, so the python.org effect won't be quite as strong in the normal case.

Skip

From anthony at interlink.com.au  Mon Mar 10 16:12:12 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Mar 10 00:13:06 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <n2m-g.zno5wo8w.fsf@morpheus.demon.co.uk> 
Message-ID: <200303100512.h2A5CCn08173@localhost.localdomain>


>>> Paul Moore wrote
> 5. Er. But it's a bit rough around the edges still. I'll help you
>    install it, if you like.
> 
> Notice (5). That's what is killing us right now with real people (me,
> I'm a figment of your imagination: be very afraid <wink>). Anything
> else is minor.

Speaking of which, what happened to that alpha-2 release? 

I've got Wednesday off, and can work on it then...


-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthony at interlink.com.au  Mon Mar 10 16:19:37 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Mar 10 00:20:27 2003
Subject: [Spambayes] full o' spaces 
In-Reply-To: <200303100512.h2A5CCn08173@localhost.localdomain> 
Message-ID: <200303100519.h2A5JbR08230@localhost.localdomain>


>>> Anthony Baxter wrote
> 
> >>> Paul Moore wrote
> > 5. Er. But it's a bit rough around the edges still. I'll help you
> >    install it, if you like.
> > 
> > Notice (5). That's what is killing us right now with real people (me,
> > I'm a figment of your imagination: be very afraid <wink>). Anything
> > else is minor.
> 
> Speaking of which, what happened to that alpha-2 release? 
> 
> I've got Wednesday off, and can work on it then...

Hm. I didn't look closely enough - it's there, but the website's not
been updated.

Are we at a point where another release is useful, or should I update
the website to point to the current -a2 release?

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From nas at python.ca  Sun Mar  9 21:39:06 2003
From: nas at python.ca (Neil Schemenauer)
Date: Mon Mar 10 00:29:27 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C898@its-xchg4.massey.ac.nz>
Message-ID: <20030310053906.GA20786@glacier.arctrix.com>

Meyer, Tony wrote:
> cmp.py gave me lots of errors, because the lines were not what was
> expected.

I'm guessing you ran "rates.py test > tests".  rates.py creates its own
output file and writes something a little different to stdout.  cmp.py
can't understand the stdout data.

  Neil

From Paul.Moore at atosorigin.com  Mon Mar 10 09:08:05 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Mon Mar 10 04:09:30 2003
Subject: [Spambayes] Outlook plugin error
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619A0A@UKDCX001.uk.int.atosorigin.com>

[Spam marked with a score of 0%, but with clear spam status in the clues]

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
>> I'm wondering if the problem has anything to do with the fact that the
>> spam field is set before the message is moved.

> Further, when you see this behaviour, can you immediately check the
> Pythonwin debug window for a message?  Each message processed should have a
> message that indicates its spam disposition - the first thing I need to know
> is if such mails fire this debug trace.

Actually, the only time I've seen it happen since that message is when the
plugin is doing "catchup" when I first start Outlook in the morning, and it
goes and processes a load of messages (400-odd this morning) from the server.

I checked the traceutil output, and the message with a 0% score is not in
there. The message has a spam property of 0%, and the clues show 100% spam:

Spam Score: 1


word                                spamprob         #ham  #spam
'*H*'                               0                   -      -
'*S*'                               1                   -      -
'damages'                           0.019311          566      8
'austin,'                           0.184822           13      2
'related'                           0.205474          263     50
'such'                              0.221215          879    184
'these'                             0.321268         1464    511
'skip:c 10'                         0.339304         1975    748
'for'                               0.347324         6260   2457
'makes'                             0.347573          410    161
'skip:f 10'                         0.359639          681    282
'box'                               0.37294           504    221
'list'                              0.374491         2498   1103
[rest omitted...]

Paul.

From joe at rockymountains.net  Sun Mar  9 20:59:08 2003
From: joe at rockymountains.net (Joseph Conrad)
Date: Mon Mar 10 07:45:01 2003
Subject: [Spambayes] Confused
Message-ID: <3E6C0D8C.4030004@rockymountains.net>

Spambayes,

I have a system running the pop3proxy; it's amazingly accurate with very 
little training.  I would like to integrate it into our postfix SMTP 
server as an incoming filter.  It looks simple enough; I already run 
virus scanning.  The thing I am not getting is hammiesrv.  When I try to 
run it I get:

AttributeError: 'module' object has no attribute 'DEFAULTDB'

I have looked at the documentation; there is really not much more than a 
quick mention of hammiesrv.

I'm not at all familiar with python, but I suspect that if someone could 
drop me a hint I could take it from there.

Thanks,
Joseph Conrad


From tim at fourstonesExpressions.com  Mon Mar 10 08:11:40 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 09:11:46 2003
Subject: [Spambayes] SpamBayes
In-Reply-To: <15979.45963.260615.349902@montanaro.dyndns.org>
Message-ID: <LHQK8KE7977265POICSRQIEP94HG.3e6c9d1c@myst>

3/9/2003 3:35:07 PM, Skip Montanaro <skip@pobox.com> wrote:

>
>    Dan> This process needs improving to reach a level of robustness needed
>    Dan> to use in a production environment rather than just for testing and
>    Dan> experimentation purposes.
>

D.R.Evans had a database corruption similar to this a while back.  This is 
going to be an ongoing problem.  I believe we should append records to a log 
file each time a message is trained.  The spamcount and hamcount (at least) 
should be logged, so it can be recovered.  Perhaps even the tokens being 
trained should be logged, but that might make the log quite large...
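
A rough sketch of what such an append-only training log could look like.
Everything here (function names, the tab-separated record layout, the file
name) is hypothetical and not part of SpamBayes; it only illustrates logging
the running spam and ham counts so they can be read back after a corruption.

```python
import os
import tempfile
import time

def log_training(path, is_spam, nspam, nham):
    """Append one tab-separated record: timestamp, label, spam count, ham count."""
    label = "spam" if is_spam else "ham"
    with open(path, "a") as log:
        log.write("%d\t%s\t%d\t%d\n" % (int(time.time()), label, nspam, nham))

def last_counts(path):
    """Return (nspam, nham) from the last parseable record, or None."""
    counts = None
    with open(path) as log:
        for line in log:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # skip a malformed record
            try:
                counts = (int(fields[2]), int(fields[3]))
            except ValueError:
                continue  # skip a record cut short mid-write
    return counts

# Demo against a throwaway file (hypothetical file name).
demo_path = os.path.join(tempfile.mkdtemp(), "training.log")
log_training(demo_path, True, 1, 0)
log_training(demo_path, False, 1, 1)
print(last_counts(demo_path))
```

Logging the trained tokens as well would mean one more field per record,
which is where the size concern comes in.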


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Mon Mar 10 10:27:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 10 11:28:00 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <15980.48391.655560.225683@montanaro.dyndns.org>

I've got a few people here at Northwestern set up with Spambayes now.
Classification is being done by me on the server, not by the users on their
desktops.  I was just chatting with a couple of the admins here who
commented that SpamAssassin's X-Spam-Level header is nice because you can
tell users to just add or delete a star from their Eudora filter to
fine-tune the break between spam and ham. 

That might be a bit weird with Spambayes since it's a three-state system,
but I think it might be useful to add an X-Spambayes-Level header where the
number of stars is equal to int(score*10).  I control the ham and spam
cutoffs, and thus the inclusion of the words "ham", "unsure" and "spam", but
this would make it easy for people to filter on a score basis in their mail
client.  Sort of a fine-tuning knob.
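
As a minimal sketch of the proposed header value (the function name is
hypothetical, and this is not the SpamBayes implementation), the star count
is just int(score*10) for a score between 0 and 1:

```python
def spambayes_level(score, star="*"):
    """Return a thermostat-style string with int(score * 10) stars."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return star * int(score * 10)

for score in (0.03, 0.5, 0.97, 1.0):
    print("X-Spambayes-Level: %s" % spambayes_level(score))
```

Because int() truncates, a score of 0.97 gets nine stars and only a score of
exactly 1.0 gets ten.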

or-a-fake-thermostat-ly, y'rs,

Skip

From tim at fourstonesExpressions.com  Mon Mar 10 11:04:58 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 12:05:04 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15980.48391.655560.225683@montanaro.dyndns.org>
Message-ID: <ON6GC3ZFDBA72QKRMBAZYC8A9VSF0E0.3e6cc5ba@myst>

3/10/2003 10:27:51 AM, Skip Montanaro <skip@pobox.com> wrote:

>I've got a few people here at Northwestern set up with Spambayes now.
>Classification is being done by me on the server, not by the users on their
>desktops.  I was just chatting with a couple of the admins here who
>commented that SpamAssassin's X-Spam-Level header is nice because you can
>tell users to just add or delete a star from their Eudora filter to
>fine-tune the break between spam and ham. 

Funny, I was just thinking about the same thing today.  There was a request 
for the pop3proxy to do this a couple months back.  Never made it as a feature 
request, but I remember it.  Seems like a reasonable thing to do.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Mon Mar 10 12:15:11 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 12:15:53 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <15980.3261.634515.784710@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEONDPAB.tim.one@comcast.net>

[Tim]
> For another, about the only spam I see rate unsure anymore is stuff
> that leaks thru SpamAssassin via python.org.  spambayes *usually*
> wouldn't have any trouble with such spam on its own, but there are
> a dozen header clues all effectively saying "this came from
> python.org" ....

[Skip Montanaro]
> That's correct when considering the rather narrow Python email
> universe, but I suspect most people live in a somewhat more diverse
> electronic world than that, so the python.org effect won't be quite as
> strong in the normal case.

It was an example of harmful correlation, by way of illustrating why a
strong indicator isn't necessarily a desirable indicator.  This particular
example applies pretty directly to any source from which a user rarely (but
not never) gets spam, and leaves clues about itself.


From wsy at merl.com  Mon Mar 10 12:32:21 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Mon Mar 10 12:32:55 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15980.48391.655560.225683@montanaro.dyndns.org> (message from
	Skip Montanaro on Mon, 10 Mar 2003 10:27:51 -0600)
References: <15980.48391.655560.225683@montanaro.dyndns.org>
Message-ID: <200303101732.h2AHWLL19489@localhost.localdomain>


   From: Skip Montanaro <skip@pobox.com>

   Classification is being done by me on the server, not by the users on their
   desktops.  I was just chatting with a couple of the admins here who
   commented that SpamAssassin's X-Spam-Level header is nice because you can
   tell users to just add or delete a star from their Eudora filter to
   fine-tune the break between spam and ham. 

   That might be a bit weird with Spambayes since it's a three-state system,
   but I think it might be useful to add an X-Spambayes-Level header where the
   number of stars is equal to int(score*10).  I control the ham and spam
   cutoffs, and thus the inclusion of the words "ham", "unsure" and "spam", but
   this would make it easy for people to filter on a score basis in their mail
   client.  Sort of a fine-tuning knob.

   or-a-fake-thermostat-ly, y'rs,

I've also had multiple requests for a continuous output match parameter in
CRM114, so I settled on this:

      pR = - (log (Pspam) - log (Pnonspam))

This goes from roughly +350 to -350, and (nicely) the uncertains 
and errors all seem to group around +/- 100 . 

90%+ of the messages come out either > 200 or < -200, so it's an 
effective human-understood representation.

I know the CAMRAM people wanted it pretty badly; expect them to 
start using it soon.

(it's called pR for the same reason pH is called pH - it's the 
negative log of the ratios of the match probabilities, just like
pH is the negative log of the ion ratios.)
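
A small sketch of that formula. The log base is not stated above, so base 10
is an assumption here, and the function name is only illustrative (this is
not CRM114 code):

```python
import math

def pR(p_spam, p_nonspam):
    """Negative log-ratio of the two match probabilities (base 10 assumed)."""
    # Both probabilities must be > 0; math.log10(0) raises ValueError.
    return -(math.log10(p_spam) - math.log10(p_nonspam))

print(pR(0.5, 0.5))      # evenly matched
print(pR(1e-200, 1.0))   # spam match far less likely
print(pR(1.0, 1e-200))   # nonspam match far less likely
```

With the sign as written, a positive pR means the nonspam match is the more
probable one, and values near zero mean the two are close.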

  -Bill Yerazunis


From skip at pobox.com  Mon Mar 10 11:47:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 10 12:47:29 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEONDPAB.tim.one@comcast.net>
References: <15980.3261.634515.784710@montanaro.dyndns.org>
        <LNBBLJKPBEHFEDALKOLCAEONDPAB.tim.one@comcast.net>
Message-ID: <15980.53160.87499.28570@montanaro.dyndns.org>

    Tim> It was an example of harmful correlation, by way of illustrating
    Tim> why a strong indicator isn't necessarily a desirable indicator.
    Tim> This particular example applies pretty directly to any source from
    Tim> which a user rarely (but not never) gets spam, and leaves clues
    Tim> about itself.

True enough.  I'm sure there are lots of such correlations.  But if a
person's incoming mail isn't dominated by one source, such harmful
correlations will have less impact on the final score of any given message,
right?  As an example, I just grep'd my ham collection for the Sender field,
squashed case, sorted and uniq'd, then sorted again.  The tail end looked
like

     150 sender: folkmusic-admin@grassyhill.org
     221 sender: zope-admin@zope.org
     255 sender: folk music presenters <folkvenu@lists.psu.edu>
     450 sender: spambayes-bounces@python.org
     550 sender: python-checkins-admin@python.org
     555 sender: owner-6pack@autox.team.net
     688 sender: python-dev-admin@python.org
     821 sender: spamassassin-talk-admin@lists.sourceforge.net
    1387 sender: cedu-admin@manatee.mojam.com
    3091 sender: python-list-admin@python.org

This is out of 9609 Sender headers (just under 12,000 hams).  If I remember
comments you've made on this topic in the past, I expect your Sender:
headers to be more strongly dominated by Python-related messages than this.
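
For anyone without grep/sort/uniq handy, the same tally can be sketched in
Python. The function names and the sample messages below are made up for
illustration; for a real mbox, the standard library's mailbox.mbox(path)
yields message objects that could be passed in the same way.

```python
import email
from collections import Counter

def count_senders(messages):
    """Count lower-cased Sender header values across message objects."""
    counts = Counter()
    for msg in messages:
        sender = msg.get("Sender")
        if sender is not None:
            counts["sender: " + sender.strip().lower()] += 1
    return counts

def tail(counts, n=10):
    """Return the n largest entries in ascending order, like `sort -n | tail`."""
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))[-n:]

# Made-up sample messages standing in for a ham collection.
demo = [email.message_from_string(text) for text in (
    "Sender: Python-List-Admin@python.org\n\nbody\n",
    "Sender: python-list-admin@python.org\n\nbody\n",
    "Sender: zope-admin@zope.org\n\nbody\n",
    "Subject: no sender here\n\nbody\n",
)]
for key, n in tail(count_senders(demo)):
    print("%7d %s" % (n, key))
```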

Just the presence of a Sender header regardless of where it came from
seems to be a pretty strong ham clue (something spammers could/do exploit?).
My roughly 7,000 spams only have 759 Sender headers.  I haven't experimented
with adding it to Options.options.address_headers, but your comment in
tokenizer.py suggests this probably wouldn't be too wise.

Skip

From phil.west at gtri.gatech.edu  Mon Mar 10 13:41:00 2003
From: phil.west at gtri.gatech.edu (Phil West)
Date: Mon Mar 10 13:55:04 2003
Subject: [Spambayes] 
	Problem installing SpamBayes-Outlook on outlook xp [Unable to
	register spambayes_addin.dll]
Message-ID: <462E202877E3D54AADAF076E175B60A91ADF69@mail.elsys-exchange.elsys>

Hi:
I'm running Outlook 2002 on a win2k pro machine, when I start my python
2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit
(Intel)] on win32. 
 
 When I run the SpamBayes-Outlook-Setup.exe program, it encounters the
error:
 
 
As one would expect, neither retry nor Ignore yield a working
installation.  Any pointers on how to resolve this would be appreciated.
 
Thanks,
Phil
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/bmp
Size: 346006 bytes
Desc: Outlook.bmp
Url : http://mail.python.org/pipermail/spambayes/attachments/20030310/b31b0bac/attachment-0001.bin

From tim at fourstonesExpressions.com  Mon Mar 10 13:03:55 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 14:04:04 2003
Subject: [Spambayes]  Problem installing SpamBayes-Outlook on outlook xp
	[Unable to register spambayes_addin.dll]
In-Reply-To: <462E202877E3D54AADAF076E175B60A91ADF69@mail.elsys-exchange.elsys>
Message-ID: <9771A5YWUSNI65POCAHFPN51YVFEC0IF.3e6ce19b@myst>

3/10/2003 12:41:00 PM, "Phil West" <phil.west@gtri.gatech.edu> wrote:

>Hi:
>I'm running Outlook 2002 on a win2k pro machine, when I start my python
>2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit
>(Intel)] on win32. 
> 
> When I run the SpamBayes-Outlook-Setup.exe program, it encounters the
>error:

Are we missing something here?  I don't see an error.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From bill at parducci.net  Mon Mar 10 11:11:42 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 10 14:12:00 2003
Subject: [Spambayes] single message test question
Message-ID: <3E6CE36E.5040903@parducci.net>

would someone be so kind as to instruct me in what the most straightforward way to test my current filter against a single message would be? i have a note that scored very high in spamminess and i would like to know why. (i have the note isolated into a single mbox file at the moment.) 

thanks

b


From tim.one at comcast.net  Mon Mar 10 14:22:15 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 14:23:02 2003
Subject: [Spambayes]  Problem installing SpamBayes-Outlook on outlook
 xp[Unable to register spambayes_addin.dll]
In-Reply-To: <9771A5YWUSNI65POCAHFPN51YVFEC0IF.3e6ce19b@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPHDPAB.tim.one@comcast.net>

[Phil West]
> I'm running Outlook 2002 on a win2k pro machine, when I start my python
> 2.2 IDE it sez: PythonWin 2.2.1 (#34, Apr  9 2002, 19:34:33) [MSC 32 bit
> (Intel)] on win32.
>
> When I run the SpamBayes-Outlook-Setup.exe program, it encounters the
> error:


[Tim Stone]
> Are we missing something here?  I don't see an error.

There was a giant .bmp file attached, and God only knows what will happen to
that.  It was a Windows error box; the only interesting part said

    Unable to register the DLL/OCX:  DllRegisterServer
                                     failed; code 0x00000000.

The error code is frightening <wink>.


From tim at fourstonesExpressions.com  Mon Mar 10 13:26:03 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 14:26:08 2003
Subject: [Spambayes] single message test question
In-Reply-To: <3E6CE36E.5040903@parducci.net>
Message-ID: <NH5A6C9GSQGFCB8A7GFGDA3XFDBA.3e6ce6cb@myst>

3/10/2003 1:11:42 PM, bill parducci <bill@parducci.net> wrote:

>would someone be so kind as to instruct me in what the most straightforward 
way to test my current filter against a single message would be? i have a note 
that scored very high in spamminess and i would like to know why. (i have the 
note isolated into a single mbox file at the moment.)

For me, the easiest way is to bring up the pop3proxy, with the -u option, and 
use the cut-and-paste entry field to classify the message.
 
>
>thanks
>
>b
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From popiel at wolfskeep.com  Mon Mar 10 11:53:31 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Mar 10 14:53:36 2003
Subject: [Spambayes] single message test question 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
   of "Mon, 10 Mar 2003 11:11:42 PST." <3E6CE36E.5040903@parducci.net> 
References: <3E6CE36E.5040903@parducci.net> 
Message-ID: <20030310195331.9D22D2DDD7@cashew.wolfskeep.com>

In message:  <3E6CE36E.5040903@parducci.net>
             bill parducci <bill@parducci.net> writes:
>would someone be so kind as to instruct me in what the most
>straightforward way to test my current filter against a single
>message would be? i have a note that scored very high in spamminess
>and i would like to know why. (i have the note isolated into a
>single mbox file at the moment.) 

My approach would be to create a config file with hammie_debug_header
set to true, then set the BAYESCUSTOMIZE environment variable to
that config file, then run the message through hammiefilter.  Actually,
I have hammie_debug_header turned on in my default config file, so all
I have to do is look at all the headers for the message (I normally
don't display the debug header as I'm reading mail).

- Alex

From mhammond at skippinet.com.au  Tue Mar 11 08:43:12 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar 10 16:44:14 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEKBOFAA.mhammond@skippinet.com.au>

There have been some reports of problems running the spambayes Outlook
plugin on Python 2.2 with bsddb3.

It turns out that the bsddb3 release itself was bad.  The bsddb3 maintainers
have released a new version of the binary (from the usual place -
http://sourceforge.net/project/showfiles.php?group_id=13900).

If you install this version of bsddb3, spambayes should work fine (and
fast!).  This bsddb module is also built using the same database version as
the Python 2.3 bsddb module, so our database can be freely used between
stock Python 2.3, and Python 2.2+bsddb3.

Remember that there is no pickle->db migration code in Outlook - you are
probably going to need to do a full re-train if you install bsddb3.  If the
startup/shutdown times are annoying you, it is well worth it though.

Mark.


From tim at fourstonesExpressions.com  Mon Mar 10 17:08:58 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 18:09:05 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEKBOFAA.mhammond@skippinet.com.au>
Message-ID: <1WPNVTEDIFYXPOK53WRTPZVJFQODBDA.3e6d1b0a@myst>

3/10/2003 3:43:12 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:


>Remember that there is no pickle->db migration code in Outlook - you are
>probably going to need to do a full re-train if you install bsddb3.  If the
>startup/shutdown times are annoying you, it is well worth it though.

dbExpImp.py can be used to migrate from pickle to db.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Tue Mar 11 12:45:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 18:46:41 2003
Subject: [Spambayes] full o' spaces 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89A@its-xchg4.massey.ac.nz>

> Are we at a point where another release is useful, or should I update
> the website to point to the current -a2 release?

I think we are definitely at the point where another release is useful.  Browsing through the check-ins list, there have been quite a few significant improvements* since a2.

=Tony Meyer

* Not in the way of improving rates, really, but in fixing bugs and adding features.

From mhammond at skippinet.com.au  Tue Mar 11 10:45:43 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar 10 18:46:49 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To: <1WPNVTEDIFYXPOK53WRTPZVJFQODBDA.3e6d1b0a@myst>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIEKHOFAA.mhammond@skippinet.com.au>


> dbExpImp.py can be used to migrate from pickle to db.

I'm sure it can, but I'm also fairly certain that simply running it won't do
the right thing for outlook :)  If someone wants to work out the exact
command to use, that would be great.

Mark.


From tim at fourstonesExpressions.com  Mon Mar 10 18:15:31 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 19:15:40 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIEKHOFAA.mhammond@skippinet.com.au>
Message-ID: <M3XICHFWA7QP0475IHF0ZWV1UB606.3e6d2aa3@myst>

3/10/2003 5:45:43 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>
>> dbExpImp.py can be used to migrate from pickle to db.
>
>I'm sure it can, but I'm also fairly certain that simply running it won't do
>the right thing for outlook :)  If someone wants to work out the exact
>command to use, that would be great.

Well, it certainly doesn't understand any of the other databases... :(  forgot 
about those.  But it can change the main wordinfo database.  If you can send 
me an example of the other databases, I'd be happy to fix it to manage those 
too...

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Mon Mar 10 18:29:38 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 10 19:29:48 2003
Subject: [Spambayes] New bsddb3 for Python 2.2
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIEKHOFAA.mhammond@skippinet.com.au>
Message-ID: <53UQA6GIGBIDTR42EDWVJE74DC7321.3e6d2df2@myst>

3/10/2003 5:45:43 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>
>> dbExpImp.py can be used to migrate from pickle to db.
>
>I'm sure it can, but I'm also fairly certain that simply running it won't do
>the right thing for outlook :)  If someone wants to work out the exact
>command to use, that would be great.
>
Ok, so here I am, replying to the same message twice... braindeath is a 
terrible thing.

The commands would be:

dbExpImp.py -e -d mypickledwordinfo -f mypickledwordinfo.export
dbExpImp.py -i -D mybsddbwordinfo -f mypickledwordinfo.export

or...

dbExpImp.py -h for that and several more scenarios

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Mon Mar 10 21:28:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 21:29:08 2003
Subject: [Spambayes] better Received header tokens
In-Reply-To: <15980.53160.87499.28570@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEBNEAAB.tim.one@comcast.net>

[Tim]
> It was an example of harmful correlation, by way of illustrating
> why a strong indicator isn't necessarily a desirable indicator.
> This particular example applies pretty directly to any
> source from which a user rarely (but not never) gets spam, and
> leaves clues about itself.

[Skip Montanaro]
> True enough.  I'm sure there are lots of such correlations.  But if a
> person's incoming mail isn't dominated by one source, such harmful
> correlations will have less impact on the final score of any
> given message, right?

Strictly less, yes, but it's a second-order distinction and would have
trouble being *significantly* less.  Say you have H total ham and S total
spam, and that a particular token appears in h ham and s spam.  The
unadjusted spamprob for that token is then

   s/S
---------
s/S + h/H

which can be rearranged as

    H
----------
H + (h/s)S

The magnitudes of h and s don't matter to the result, nor even the
magnitudes of h and s relative to H and S -- all that matters is the ratio
of h to s.  So it makes no difference at this level whether the token
appears in 99% of your training data, or in 0.0001% of it:  if it appears in
(say) 20 times more ham msgs than spam msgs, the first-order spamprob guess
is the same whether that's a total of 20 msgs or 20 million. Or, IOW, if 1%
of my python.org mail is spam, and 1% of my guysnamedtim.com mail is spam,
and 1% of my friendsofskip.org mail is spam, a clue unique to any of those
sources gets the same first-order spamprob, and regardless of what
percentages of my total email derive from these sources.
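
A quick numeric check of the two forms, as a sketch. The function names and
the corpus sizes are illustrative only (H and S are loosely based on the
totals Skip quoted), and this is not the classifier's code:

```python
import math

def raw_spamprob(h, s, H, S):
    """Unadjusted spamprob: (s/S) / (s/S + h/H)."""
    return (s / S) / (s / S + h / H)

def rearranged(h, s, H, S):
    """The same quantity written as H / (H + (h/s)*S); needs s > 0."""
    return H / (H + (h / s) * S)

H, S = 12000, 7000
for h, s in ((20, 1), (2000, 100), (20e6, 1e6)):
    print(h, s, raw_spamprob(h, s, H, S), rearranged(h, s, H, S))
```

All three rows have h/s = 20, so they print the same spamprob (to within
rounding) no matter how large h and s are.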

The Bayesian adjustment goes on to fiddle the guess, taking *some* measure
of the magnitude of h+s into account, but as h+s increases it has a smaller
and smaller effect.  If I only have one msg total from guysnamedtim.com, the
adjustment is large, but unknown_word_strength is under 0.5 by default and
we approach the by-counting spamprob guess quickly as h+s increases.
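
To see how fast the adjustment fades, here is a sketch that assumes the usual
Robinson-style form, (S*x + n*p) / (S + n) with n = h+s. The default values
used (strength 0.45, unknown-word probability 0.5) and the function name are
assumptions for illustration; check Options.py for the real settings.

```python
def adjusted_spamprob(p, n, strength=0.45, unknown_prob=0.5):
    """Blend the by-counting guess p with the prior, weighted by n = h + s."""
    return (strength * unknown_prob + n * p) / (strength + n)

p = 1.0 / 21  # by-counting guess for a token seen in 20x more ham than spam
for n in (1, 5, 21, 210, 2100):
    print(n, round(adjusted_spamprob(p, n), 4))
```

With a single message the result is pulled well toward 0.5, and by a few
dozen messages it is already close to the by-counting guess.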

> As an example, I just grep'd my ham collection for the
> Sender field, squashed case, sorted and uniq'd, then sorted again.  The
> tail end looked like
>
>      150 sender: folkmusic-admin@grassyhill.org
>      221 sender: zope-admin@zope.org
>      255 sender: folk music presenters <folkvenu@lists.psu.edu>
>      450 sender: spambayes-bounces@python.org
>      550 sender: python-checkins-admin@python.org
>      555 sender: owner-6pack@autox.team.net
>      688 sender: python-dev-admin@python.org
>      821 sender: spamassassin-talk-admin@lists.sourceforge.net
>     1387 sender: cedu-admin@manatee.mojam.com
>     3091 sender: python-list-admin@python.org
>
> This is out of 9609 Sender headers (just under 12,000 hams).  If
> I remember comments you've made on this topic in the past, I expect
> your Sender:  headers to be more strongly dominated by Python-related
> messages than this.

They are, but, as above, that has a minor effect on spamprobs.  What's worse
about python.org mail is that there are so *many* tokens unique to it,
and they're (equally) strong ham clues.  Of course there are two sides to
the story:  while that makes it easy for spam from python.org to rate
unsure, it also virtually guarantees that ham from python.org never rates
unsure.

> Just the presence of a Sender header regardless of where it came from
> seems to be a pretty strong ham clue (something spammers could/do
> exploit?).
> My roughly 7,000 spams only have 759 Sender headers.

Then they're not very consistent in exploiting it <wink>.

> I haven't experimented with adding it to Options.options.address_headers,
> but your comment in tokenizer.py suggests this probably wouldn't be too
> wise.

It's on by default in the Outlook client.  It's deadly for research on
mixed-source corpora, but for live email I expect it to help.  This wasn't
formally tested, though, and should be.  I can testify from experience
that it's not deadly in real-life Outlook use <wink>.


From T.A.Meyer at massey.ac.nz  Tue Mar 11 15:42:00 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:42:49 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>

Ok, the following headers are now available in pop3proxy:

X-Spambayes-Classification: {ham | spam | unsure}
X-Spambayes-Spam-Probability: (message score)
X-Spambayes-Level: (thermostat, one * = 10%)
X-Spambayes-Evidence: (list of clues, like hammie's debug)
X-Spambayes-MailId: (unique id for the message)

Apart from Classification, all of these are off by default.  The rest can be turned on via the configuration page in the ui, or via the following options in a config file:
pop3proxy_include_prob: {True | False}
pop3proxy_include_thermostat: {True | False}
pop3proxy_include_evidence: {True | False}
pop3proxy_add_mailid_to: {"" | "header" | "body" | "header body" | "body header"}
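
On the client side, a "header contains N stars" rule behaves like a threshold,
because a longer run of stars always contains the shorter one. A sketch of
that matching, using a hypothetical helper name and a made-up message:

```python
import email

def at_least(msg, stars, header="X-Spambayes-Level"):
    """True if the level header contains at least `stars` asterisks in a row."""
    return "*" * stars in msg.get(header, "")

# Made-up message with a seven-star level header.
msg = email.message_from_string("X-Spambayes-Level: *******\n\nbody\n")
print(at_least(msg, 5), at_least(msg, 8))
```

Adding or removing one star in the rule moves the cutoff by 10%.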

You can, of course, change any of the header names - look in Options.py for the details.

As when I committed the prob header, I've done limited testing here.  Nothing changes as far as I can tell until you change the default settings, so those that don't want these options should find nothing different.  Each header seems to add what it should.  I didn't really know what other tests to do!  Let me/the list know if something isn't right.

Enjoy :)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Tue Mar 11 15:50:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:51:37 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89E@its-xchg4.massey.ac.nz>

> Meyer, Tony wrote:
> > cmp.py gave me lots of errors, because the lines were not what was
> > expected.
[Neil]
> I'm guessing you ran "rates.py test > tests".  rates.py 
> creates its own
> output file and writes something a little different to stdout.  cmp.py
> can't understand the stdout data.

Ah, this is exactly what I did.  I should have read Mark's instructions somewhat more closely.

Thanks for the help, and apologies for the stupidity ;)

Cheers,
Tony

From T.A.Meyer at massey.ac.nz  Tue Mar 11 15:54:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 10 21:55:41 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89F@its-xchg4.massey.ac.nz>

> Here are my current results on the imbalance option.  
And here are mine.

imbalance_false4s.txt -> imbalance_true4s.txt
-> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
-> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams
-> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
-> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  4 times
lost  0 times

total unique fp went from 0 to 0 tied          
mean fp % went from 0.0 to 0.0 tied          

false negative percentages
    6.250  6.250  tied          
    0.000  0.000  tied          
    6.250  6.250  tied          
    3.922  3.922  tied          

won   0 times
tied  4 times
lost  0 times

total unique fn went from 8 to 8 tied          
mean fn % went from 4.10539215686 to 4.10539215686 tied          

ham mean                     ham sdev
   0.39    0.39   +0.00%        3.46    3.46   +0.00%
   0.09    0.09   +0.00%        0.91    0.91   +0.00%
   0.65    0.65   +0.00%        4.57    4.57   +0.00%
   1.40    1.40   +0.00%        7.93    7.93   +0.00%

ham mean and sdev for all runs
   0.62    0.62   +0.00%        4.87    4.87   +0.00%

spam mean                    spam sdev
  87.62   87.62   +0.00%       28.34   28.34   +0.00%
  90.83   90.83   +0.00%       18.01   18.01   +0.00%
  91.17   91.17   +0.00%       25.61   25.61   +0.00%
  85.65   85.65   +0.00%       25.97   25.97   +0.00%

spam mean and sdev for all runs
  88.85   88.85   +0.00%       24.68   24.68   +0.00%

ham/spam mean difference: 88.23 88.23 +0.00

My ham:spam ratio is about 7:1 (Mark's was about 1:2.5).  Forgive the newbie question, but does this mean that:
(a) for my corpus, the option makes no difference at all?
(b) I haven't tested with a big enough corpus?
(c) I did something wrong ;)

Thanks,
Tony Meyer

From tim.one at comcast.net  Mon Mar 10 22:10:09 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 10 22:11:04 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89F@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCKECBEAAB.tim.one@comcast.net>

[Meyer, Tony]
> imbalance_false4s.txt -> imbalance_true4s.txt
> -> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
> -> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
> -> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
> -> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams
> -> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
> -> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
> -> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
> -> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams
>
> false positive percentages
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>     0.000  0.000  tied
>
> won   0 times
> tied  4 times
> lost  0 times
>
> total unique fp went from 0 to 0 tied
> mean fp % went from 0.0 to 0.0 tied
>
> false negative percentages
>     6.250  6.250  tied
>     0.000  0.000  tied
>     6.250  6.250  tied
>     3.922  3.922  tied
>
> won   0 times
> tied  4 times
> lost  0 times
>
> total unique fn went from 8 to 8 tied
> mean fn % went from 4.10539215686 to 4.10539215686 tied
>
> ham mean                     ham sdev
>    0.39    0.39   +0.00%        3.46    3.46   +0.00%
>    0.09    0.09   +0.00%        0.91    0.91   +0.00%
>    0.65    0.65   +0.00%        4.57    4.57   +0.00%
>    1.40    1.40   +0.00%        7.93    7.93   +0.00%
>
> ham mean and sdev for all runs
>    0.62    0.62   +0.00%        4.87    4.87   +0.00%
>
> spam mean                    spam sdev
>   87.62   87.62   +0.00%       28.34   28.34   +0.00%
>   90.83   90.83   +0.00%       18.01   18.01   +0.00%
>   91.17   91.17   +0.00%       25.61   25.61   +0.00%
>   85.65   85.65   +0.00%       25.97   25.97   +0.00%
>
> spam mean and sdev for all runs
>   88.85   88.85   +0.00%       24.68   24.68   +0.00%
>
> ham/spam mean difference: 88.23 88.23 +0.00
>
> My ham:spam ratio is about 7:1 (Mark's was about 1:2.5).  Forgive
> the newbie question, but does this mean that:
> (a) for my corpus, the option makes no difference at all?
> (b) I haven't tested with a big enough corpus?
> (c) I did something wrong ;)

(d) Something went wrong somewhere.  The listings of means and sdevs are
supremely sensitive to even the tiniest changes:  I've never seen them all
zero unless the classifiers and tokenizers going into them were actually
identical.

Given that you have more ham than spam, the expected effect of enabling the
option is to decrease your FN rate (which, at 4%, is high), and possibly
increase your FP rate (which is 0).


From tony-bayes at lownds.com  Mon Mar 10 21:25:47 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 00:26:18 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: 
 <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
Message-ID: <a05200f5dba9316553395@[204.162.121.55]>

Hi,

<lurk mode off>

How about putting these in a separate namespace? I have been writing 
a GUI for spambayes using PyObjC and it could benefit from more 
regular option names here.

header_spam_probability: {True|False}
header_level: {True|False}
header_evidence: {True|False}
header_mailid: {True|False}
pop3proxy_mailid_notate_body: {True|False}
pop3proxy_classification_notate_to: {True|False}

Ok, the last two options aren't very regularly named, but then again, 
they do irregular things.

There may be some bugs lurking, I'm now getting 
"X-Spambayes-Classification: ham" in the body of my emails. Also, 
this bit of code around line 163 in pop3proxy.py doesn't account for 
the extra possible headers.

# HEADER_EXAMPLE is the longest possible header - the length of this one
# is added to the size of each message.
HEADER_FORMAT = '%s: %%s\r\n' % options.hammie_header_name
HEADER_EXAMPLE = '%s: xxxxxxxxxxxxxxxxxxxx\r\n' % options.hammie_header_name

BTW, I'm pretty excited about the mailid stuff you have done. Being 
able to correct a single message without seeing all of my mail again 
will be great.

-Tony Lownds

At 3:42 PM +1300 3/11/03, Meyer, Tony wrote:
>Ok, there's now the following headers available in pop3proxy:
>
>X-Spambayes-Classification: {ham | spam | unsure}
>X-Spambayes-Spam-Probability: (message score)
>X-Spambayes-Level: (thermostat, one * = 10%)
>X-Spambayes-Evidence: (list of clues, like hammie's debug)
>X-Spambayes-MailId: (unique id for the message)
>
>Apart from Classification, all of these are off by default.  The 
>rest can be turned on via the configuration page in the ui, or via 
>the following options in a config file:
>pop3proxy_include_prob: {True | False}
>pop3proxy_include_thermostat: {True | False}
>pop3proxy_include_evidence: {True | False}
>pop3proxy_add_mailid_to: {"" | "header" | "body" | "header body" | 
>"body header"}
>
>You can, of course, change any of the header names - look in 
>Options.py for the details.
>
>As when I committed the prob header, I've done limited testing here. 
>Nothing changes as far as I can tell until you change the default 
>settings, so those that don't want these options should find nothing 
>different.  Each header seems to add what it should.  I didn't 
>really know what other tests to do!  Let me/the list know if 
>something isn't right.
>
>Enjoy :)
>
>=Tony Meyer


From noreply at sourceforge.net  Tue Mar 11 00:48:17 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 11 07:48:33 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-701413 ] dbExpImp.py fails (python 2.2, win XP)
Message-ID: <E18sfQn-0007Pe-00@sc8-sf-web3.sourceforge.net>

Bugs item #701413, was opened at 2003-03-11 09:48
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Nobody/Anonymous (nobody)
Summary: dbExpImp.py fails (python 2.2, win XP)

Initial Comment:
dbExpImp.py fails to run, and exits with the following 
error:

C:\Programfiler\_UTIL\spambayes_cvs\spambayes>C:\P
ROGRA~1\_DEV\Python22\python.exe dbExpImp.py
  File "dbExpImp.py", line 98
    from __future__ import generators
SyntaxError: from __future__ imports must occur at the 
beginning of the file

I tried to move the import-statements on top, as 
indicated by the error-msg, and this seemed to work. 
I.E.
MOVE:
####################################
from __future__ import generators

import spambayes.storage
from spambayes.Options import options
import sys, os, getopt, errno, re
import urllib
####################################
OVER:
####################################
try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
####################################


I've never written anything in Python, so I have no clue 
as to what this really means.
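
For readers wondering what the error means: Python requires `from __future__` imports to come before any other statement in a module (only the docstring, comments, and blank lines may precede them). The sketch below is an illustration, not code from dbExpImp.py; the two source strings and the `compiles` helper are made up for the example, and it only demonstrates the ordering rule by compiling both variants.

```python
# Illustrative only: these strings stand in for a module's source text.
MISPLACED = (
    "import sys\n"                          # an ordinary statement comes first...
    "from __future__ import generators\n"   # ...so this future import is too late
)

WELL_PLACED = (
    '"""A docstring may come first."""\n'
    "# So may comments and blank lines.\n"
    "from __future__ import generators\n"
    "import sys\n"
)


def compiles(source):
    """Return True if the source compiles, False if it raises SyntaxError."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print("misplaced future import compiles:", compiles(MISPLACED))
    print("future import at top compiles:", compiles(WELL_PLACED))
```

Moving the import block above the True/False compatibility shim, as described in the report, satisfies this rule.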


os: win XP HOME (norwegian)
python 2.2
bsddb3: 4.1.4
spambayes: latest CVS as of 2003-03-11

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

From noreply at sourceforge.net  Tue Mar 11 05:52:01 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 11 08:50:37 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-701413 ] dbExpImp.py fails (python 2.2, win XP)
Message-ID: <E18skAj-0006Mr-00@sc8-sf-web2.sourceforge.net>

Bugs item #701413, was opened at 2003-03-11 02:48
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

Category: None
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
>Assigned to: Tim Stone (timstone4)
Summary: dbExpImp.py fails (python 2.2, win XP)

Initial Comment:
dbExpImp.py fails to run, and exits with the following 
error:

C:\Programfiler\_UTIL\spambayes_cvs\spambayes>C:\P
ROGRA~1\_DEV\Python22\python.exe dbExpImp.py
  File "dbExpImp.py", line 98
    from __future__ import generators
SyntaxError: from __future__ imports must occur at the 
beginning of the file

I tried to move the import-statements on top, as 
indicated by the error-msg, and this seemed to work. 
I.E.
MOVE:
####################################
from __future__ import generators

import spambayes.storage
from spambayes.Options import options
import sys, os, getopt, errno, re
import urllib
####################################
OVER:
####################################
try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0
####################################


I've never written anything in Python, so I have no clue 
as to what this really means.


os: win XP HOME (norwegian)
python 2.2
bsddb3: 4.1.4
spambayes: latest CVS as of 2003-03-11

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-11 07:52

Message:
Logged In: YES 
user_id=645698

Fixed.  Wish they were all this easy.  Now what dummy put the generators 
import there anyway?

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=701413&group_id=61702

From skip at pobox.com  Tue Mar 11 08:00:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 11 09:00:50 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C89D@its-xchg4.massey.ac.nz>
Message-ID: <15981.60411.170824.504685@montanaro.dyndns.org>


    Tony> Ok, there's now the following headers available in pop3proxy:
    Tony> X-Spambayes-Classification: {ham | spam | unsure}
    Tony> X-Spambayes-Spam-Probability: (message score)
    Tony> X-Spambayes-Level: (thermostat, one * = 10%)
    Tony> X-Spambayes-Evidence: (list of clues, like hammie's debug)
    Tony> X-Spambayes-MailId: (unique id for the message)

Perhaps adding/deleting headers should be controlled by their own section in
the options file and a headers module should be written, so all apps which
tweak headers can say something like:

    from spambayes import headers
    ...
    headers.add_spambayes_headers(msg, ...)
    ...

and not have to worry further about specific headers.

On a related note, it seems to me that if a spambayes tool is going to
delete one of the headers (in case the message has been classified
previously or spammers try to exploit them), then all of them should be
deleted:

    from spambayes import headers
    ...
    headers.delete_spambayes_headers(msg)
    ...
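
A rough sketch of what such a module might look like, using the standard library's email package. Everything here is an assumption for illustration: no `spambayes.headers` module exists at this point in the thread, and the function signatures, the `X-Spambayes-` prefix constant, and the probability formatting are all hypothetical.

```python
from email.message import Message

# Hypothetical: treat every header starting with this prefix as ours.
PREFIX = "X-Spambayes-"


def delete_spambayes_headers(msg):
    """Remove every X-Spambayes-* header, whoever added it."""
    for name in set(msg.keys()):
        if name.lower().startswith(PREFIX.lower()):
            del msg[name]  # deletes all occurrences, case-insensitively


def add_spambayes_headers(msg, classification, probability=None):
    """Drop stale or forged headers first, then add fresh ones."""
    delete_spambayes_headers(msg)
    msg[PREFIX + "Classification"] = classification
    if probability is not None:
        msg[PREFIX + "Spam-Probability"] = "%.4f" % probability


if __name__ == "__main__":
    msg = Message()
    msg["Subject"] = "hello"
    msg["X-Spambayes-Classification"] = "ham"  # e.g. forged by a spammer
    add_spambayes_headers(msg, "spam", 0.9987)
    print(msg.as_string())
```

Deleting by prefix rather than by a fixed list of names is one way to get the "delete all of them" behaviour; a real module would presumably take the names from the options file instead.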

Skip


From T.A.Meyer at massey.ac.nz  Tue Mar 11 17:45:42 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 09:08:07 2003
Subject: [Spambayes] experimental_ham_spam_imbalance_adjustment result
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD7F@its-xchg4.massey.ac.nz>

> (d) Something went wrong somewhere.  The listings of means 
> and sdevs are
> supremely sensitive to even the tiniest changes:  I've never 
> seen them all zero unless the classifiers and tokenizers
> going into them were actually identical.

Which was the case here.  <blush>.  The mistake was that timtest wasn't finding the new config, so it was running the same test twice and comparing the run with itself.  Not surprisingly, option=false did the same as option=false :)  Thanks for the help :)

> Given that you have more ham than spam, the expected effect 
> of enabling the option is to decrease your FN rate (which,
> at 4%, is high), and possibly increase your FP rate (which is 0).

Which is what happened.  From 4% to 1% and from 0% to 0.2%.  The 3 fp's were (1) a "you're almost ready to start using" email from habeas.com (this does better in my personal set since I check for the habeas headers), (2) an announcement from mtnsms.com about their new smspop service, and (3) a "thank you for installing" message from Real.

I think this says that for me, it's a loss.  All three of these (particularly the first two) were important at the time, and I would not have wanted to wade through the spam folder for them.  I would much rather put up with the fn's.

Here are (hopefully) correct results:
imbal_falses.txt -> imbal_trues.txt
-> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
-> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams
-> <stat> tested 372 hams & 48 spams against 983 hams & 155 spams
-> <stat> tested 333 hams & 56 spams against 1022 hams & 147 spams
-> <stat> tested 329 hams & 48 spams against 1026 hams & 155 spams
-> <stat> tested 321 hams & 51 spams against 1034 hams & 152 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.304  lost  +(was 0)
    0.000  0.623  lost  +(was 0)

won   0 times
tied  2 times
lost  2 times

total unique fp went from 0 to 3 lost  +(was 0)
mean fp % went from 0.0 to 0.231751081821 lost  +(was 0)

false negative percentages
    6.250  2.083  won    -66.67%
    0.000  0.000  tied          
    6.250  2.083  won    -66.67%
    3.922  0.000  won   -100.00%

won   3 times
tied  1 times
lost  0 times

total unique fn went from 8 to 2 won    -75.00%
mean fn % went from 4.10539215686 to 1.04166666667 won    -74.63%

ham mean                     ham sdev
   0.39    1.45 +271.79%        3.46    6.76  +95.38%
   0.09    1.30 +1344.44%        0.91    6.05 +564.84%
   0.65    2.56 +293.85%        4.57    9.96 +117.94%
   1.40    3.37 +140.71%        7.93   14.06  +77.30%

ham mean and sdev for all runs
   0.62    2.14 +245.16%        4.87    9.65  +98.15%

spam mean                    spam sdev
  87.62   94.09   +7.38%       28.34   16.45  -41.95%
  90.83   99.06   +9.06%       18.01    3.61  -79.96%
  91.17   94.81   +3.99%       25.61   17.83  -30.38%
  85.65   94.52  +10.36%       25.97   14.35  -44.74%

spam mean and sdev for all runs
  88.85   95.74   +7.75%       24.68   14.10  -42.87%

ham/spam mean difference: 88.23 93.60 +5.37
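
As a sanity check on the summary lines above, the "mean fp %" and "mean fn %" figures can be reproduced as the plain average of the four per-run percentages. The per-run miss counts below are not printed in the output; they are inferred from the percentages and the test-set sizes (for example, 6.250% of 48 spams is 3 messages), so treat them as an assumption.

```python
def mean_pct(misses, totals):
    """Average of the per-run error percentages (each run weighted equally)."""
    rates = [100.0 * m / t for m, t in zip(misses, totals)]
    return sum(rates) / len(rates)


# Test-set sizes taken from the "<stat> tested ..." lines.
hams_tested = [372, 333, 329, 321]
spams_tested = [48, 56, 48, 51]

# Miss counts inferred from the reported percentages (assumption).
fn_before = [3, 0, 3, 2]   # option off: 6.250, 0.000, 6.250, 3.922
fn_after = [1, 0, 1, 0]    # option on:  2.083, 0.000, 2.083, 0.000
fp_after = [0, 0, 1, 2]    # option on:  0.000, 0.000, 0.304, 0.623

if __name__ == "__main__":
    print("mean fn % before:", mean_pct(fn_before, spams_tested))
    print("mean fn % after: ", mean_pct(fn_after, spams_tested))
    print("mean fp % after: ", mean_pct(fp_after, hams_tested))
    print("total fn before/after:", sum(fn_before), sum(fn_after))
    print("total fp after:", sum(fp_after))
```

Under these inferred counts the totals also agree with the "total unique" lines: 8 false negatives before, 2 after, and 3 false positives after.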

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Tue Mar 11 19:07:27 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 09:15:08 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz>

[Bill Yerazunis]
> I've also had multiple requests for a continuous output match 
> parameter in
> CRM114, so I settled on this:
> 
>       pR = - (log (Pspam) - log (Pnonspam))
> 
> This goes from roughly +350 to -350, and (nicely) the uncertains 
> and errors all seem to group around +/- 100 . 

Curious, and (sort of) able to run tests now (thanks Tim & Mark), I changed the "prob = (S-H + 1.0) / 2.0" equation in classifier.py to use this method.  I also had to fiddle with 0's, since log(0) is undefined (how does CRM114 do this?), and I rescaled the result from the -350..+350 range to 0-1.  Surprisingly I got good (well, perfect, actually) results.  Is this just my tiny-weeny sets?  A fluke?  *Another* mistake on my part?

The change I made was to replace line 245 ("prob = (S-H + 1.0) / 2.0") of classifier.py with:
"""
            from math import log
            if H == 0:
                H = 0.00000001
            if S == 0:
                S = 0.00000001
            prob = ((-(log(S) - log(H)))/350) + 0.5
"""

pr_falses.txt -> pr_trues.txt
-> <stat> tested 333 hams & 56 spams against 372 hams & 48 spams
-> <stat> tested 329 hams & 48 spams against 372 hams & 48 spams
-> <stat> tested 321 hams & 51 spams against 372 hams & 48 spams
-> <stat> tested 372 hams & 48 spams against 333 hams & 56 spams
-> <stat> tested 329 hams & 48 spams against 333 hams & 56 spams
-> <stat> tested 321 hams & 51 spams against 333 hams & 56 spams
-> <stat> tested 372 hams & 48 spams against 329 hams & 48 spams
-> <stat> tested 333 hams & 56 spams against 329 hams & 48 spams
-> <stat> tested 321 hams & 51 spams against 329 hams & 48 spams
-> <stat> tested 372 hams & 48 spams against 321 hams & 51 spams
-> <stat> tested 333 hams & 56 spams against 321 hams & 51 spams
-> <stat> tested 329 hams & 48 spams against 321 hams & 51 spams
-> <stat> tested 333 hams & 56 spams against 372 hams & 48 spams
-> <stat> tested 329 hams & 48 spams against 372 hams & 48 spams
-> <stat> tested 321 hams & 51 spams against 372 hams & 48 spams
-> <stat> tested 372 hams & 48 spams against 333 hams & 56 spams
-> <stat> tested 329 hams & 48 spams against 333 hams & 56 spams
-> <stat> tested 321 hams & 51 spams against 333 hams & 56 spams
-> <stat> tested 372 hams & 48 spams against 329 hams & 48 spams
-> <stat> tested 333 hams & 56 spams against 329 hams & 48 spams
-> <stat> tested 321 hams & 51 spams against 329 hams & 48 spams
-> <stat> tested 372 hams & 48 spams against 321 hams & 51 spams
-> <stat> tested 333 hams & 56 spams against 321 hams & 51 spams
-> <stat> tested 329 hams & 48 spams against 321 hams & 51 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.312  0.000  won   -100.00%
    0.000  0.000  tied          
    0.304  0.000  won   -100.00%
    0.935  0.000  won   -100.00%
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.623  0.000  won   -100.00%
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   4 times
tied  8 times
lost  0 times

total unique fp went from 4 to 0 won   -100.00%
mean fp % went from 0.181092520524 to 0.0 won   -100.00%

false negative percentages
    0.000  0.000  tied          
    2.083  0.000  won   -100.00%
    0.000  0.000  tied          
    2.083  0.000  won   -100.00%
    2.083  0.000  won   -100.00%
    0.000  0.000  tied          
    2.083  0.000  won   -100.00%
    0.000  0.000  tied          
    0.000  0.000  tied          
    6.250  0.000  won   -100.00%
    0.000  0.000  tied          
    4.167  0.000  won   -100.00%

won   6 times
tied  6 times
lost  0 times

total unique fn went from 5 to 0 won   -100.00%
mean fn % went from 1.5625 to 0.0 won   -100.00%

ham mean                     ham sdev
   3.64   55.82 +1433.52%       11.61    3.14  -72.95%
   3.68   55.64 +1411.96%       12.69    3.18  -74.94%
   2.84   55.75 +1863.03%       10.59    3.09  -70.82%
   2.08   56.10 +2597.12%        7.78    3.12  -59.90%

ham mean and sdev for all runs
   3.05   55.83 +1730.49%       10.83    3.14  -71.01%

spam mean                    spam sdev
  92.59   45.50  -50.86%       17.72    3.41  -80.76%
  94.02   44.72  -52.44%       16.04    3.48  -78.30%
  93.46   45.01  -51.84%       16.94    3.44  -79.69%
  87.89   45.01  -48.79%       22.86    3.88  -83.03%

spam mean and sdev for all runs
  91.98   45.07  -51.00%       18.75    3.57  -80.96%

ham/spam mean difference: 88.93 -10.76 -99.69

Comments?

=Tony Meyer

From anthony at interlink.com.au  Wed Mar 12 01:22:47 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Mar 11 09:23:11 2003
Subject: [Spambayes] Perhaps a level header would be useful? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz> 
Message-ID: <200303111422.h2BEMlX27103@localhost.localdomain>


>>> "Meyer, Tony" wrote
> Curious, and (sort of) able to now run tests (thanks Tim & Mark), I
> changed the "prob = (S-H + 1.0) / 2.0" equation in classifier.py to
> use this method. I had to also fiddle with 0's since log(0) isn't nice
> (how does CRM114 do this?), plus I moved it from -350..+350 to 0-1.
> Surprisingly I got good (well, perfect, actually) results. Is this
> just my tiny-weeny sets? A fluke? *Another* mistake on my part?

Um, I'd say "mistake". Look at the numbers. Your ham mean has gone
from around 3 to around 55, while the spam mean's gone from around
92 to around 45. So you've moved everything solidly into the "unsure"
bucket. 

This, of course, will remove your FN/FP numbers. But then, dumping
your email directly into the unsure folder without running spambayes
will do that, too <wink>

Worse yet, your spam is scoring, on average, less than your ham! Oops.

Anthony


> ham mean and sdev for all runs
>    3.05   55.83 +1730.49%       10.83    3.14  -71.01%
> 
> spam mean and sdev for all runs
>   91.98   45.07  -51.00%       18.75    3.57  -80.96%


From tim at fourstonesExpressions.com  Tue Mar 11 11:14:21 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 12:14:29 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15981.60411.170824.504685@montanaro.dyndns.org>
Message-ID: <BAEBHG43TRDC1YULJVTNI2Z54KGDAXT.3e6e196d@myst>

>Perhaps adding/deleting headers should be controlled by their own section in
>the options file and a headers module should be written

This is a great idea.  I'll take this one on.  I'll fix pop3proxy and 
notesfilter.  I suppose hammiefilter will need to be adjusted.  I'm not sure 
how interesting this will be to the outlook code.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Tue Mar 11 11:25:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 11 12:26:02 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <BAEBHG43TRDC1YULJVTNI2Z54KGDAXT.3e6e196d@myst>
References: <15981.60411.170824.504685@montanaro.dyndns.org>
        <BAEBHG43TRDC1YULJVTNI2Z54KGDAXT.3e6e196d@myst>
Message-ID: <15982.7198.16444.97558@montanaro.dyndns.org>


    >> Perhaps adding/deleting headers should be controlled by their own
    >> section in the options file and a headers module should be written

    Tim> This is a great idea.  I'll take this one on.  I'll fix pop3proxy
    Tim> and notesfilter.  I suppose hammiefilter will need to be adjusted.

I'll twiddle hammiefilter.

    Tim> I'm not sure how interesting this will be to the outlook code.

If it's checking various header options, they will need changing if the
names are changed.

Skip

From kjellqvist at nordkalak.se  Tue Mar 11 21:04:55 2003
From: kjellqvist at nordkalak.se (=?iso-8859-1?q?G=F6ran=20K=E4llqvist?=)
Date: Tue Mar 11 15:28:17 2003
Subject: [Spambayes] Crash after upgrading KDE
Message-ID: <200303112104.55945.kjellqvist@nordkalak.se>

Hi!
I've just upgraded to KDE 3.1. Have used spambayes with KDE 3.04 for several 
weeks without problem, and it still starts OK:

>gorank@triathlon:~/spambayes> /usr/bin/pop3proxy.py
>Loading database... Done.
>Listener on port 1110 is proxying m1.970.telia.com:110
>User interface url is http://localhost:8880/

But when I try to fetch my mail (with Kmail 1.5) I get the following error:

>error: uncaptured python exception, closing channel 
><__main__.ServerLineReader connected at 0x883bed4> (exceptions.EOFError: 
>[/usr/lib/python2.2/asyncore.py|poll|94] 
>[/usr/lib/python2.2/asyncore.py|handle_read_event|391] 
>[/usr/lib/python2.2/asynchat.py|handle_read|130] 
>[/usr/bin/pop3proxy.py|found_terminator|200] 
>[/usr/bin/pop3proxy.py|onServerLine|268] 
>[/usr/bin/pop3proxy.py|onResponse|342] 
>[/usr/bin/pop3proxy.py|onTransaction|438] [/usr/bin/pop3proxy.py|onRetr|485] 
>[/usr/lib/python2.2/site-packages/spambayes/classifier.py|chi2_spamprob|217] 
>[/usr/lib/python2.2/site-packages/spambayes/classifier.py|_getclues|437] 
>[/usr/lib/python2.2/site-packages/spambayes/storage.py|_wordinfoget|192] 
>[/usr/lib/python2.2/shelve.py|get|66] 
>[/usr/lib/python2.2/shelve.py|__getitem__|71])

The web interface is still working. I'm running spambayes on linux 2.4.18.
Anyone seen a similar problem? And the solution?
Greetings Göran Källqvist

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:07:10 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:11:41 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A6@its-xchg4.massey.ac.nz>

> Perhaps adding/deleting headers should be controlled by their 
> own section in
> the options file and a headers module should be written

Definitely +1 here.  Anything that simplifies the options, or modularises them is good, IMO.  I look forward to seeing it when Tim's done :)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:11:56 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:12:44 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>

> How about putting these in a separate namespace? I have been writing 
> a GUI for spambayes using PyObjC and it could benefit from more 
> regular option names here.

Does Skip's proposal sound ok?

> header_spam_probability: {True|False}
> header_level: {True|False}
> header_evidence: {True|False}
> header_mailid: {True|False}
> pop3proxy_mailid_notate_body: {True|False}
> pop3proxy_classification_notate_to: {True|False}

Personally, I prefer the current method of mailid.  Originally it was like this (for perhaps 24 hours), but there are already way too many options.  So I dropped the T/F add_to_body option and changed the add to a string.  Shouldn't matter to developers, and end-users should have a nice GUI hiding it all anyway.

> There may be some bugs lurking, I'm now getting 
> "X-Spambayes-Classification: ham" in the body of my emails.

I will check this ASAP.

> Also, 
> this bit of code around line 163 in pop3proxy.py doesn't account for 
> the extra possible headers.

Drat.  Good spotting.  Will fix this too.

> BTW, I'm pretty excited about the mailid stuff you have done. Being 
> able to correct a single message without seeing all of my mail again 
> will be great.

We aim to please ;)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 10:40:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 16:41:35 2003
Subject: [Spambayes] Perhaps a level header would be useful? 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD80@its-xchg4.massey.ac.nz>

> Um, I'd say "mistake". Look at the numbers. Your ham mean has gone
> from around 3 to around 55, while the spam mean's gone from around
> 92 to around 45. So you've moved everything solidly into the "unsure"
> bucket. 

<blush again>.  I realised that I'd stuffed it up just after I went home.  Too much rushing at the end of the day.

Looking at:
> pR = - (log (Pspam) - log (Pnonspam))
> This goes from roughly +350 to -350, and (nicely) the uncertains 
> and errors all seem to group around +/- 100 . 

I should have been more careful, since obviously a Pspam and Pnonspam ranging from 0->1 will not end up with many scores near 350, unless there are some *very* accurate floating point numbers.
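
To put numbers on that: a small sketch of the replacement formula from my earlier message, evaluated at its extremes. The function name `pr_prob` is made up here; the 0.00000001 floor and the divide-by-350 scaling are the ones I used. With that floor, log() never goes below about -18.4, so the score cannot move further than roughly 0.05 from 0.5.

```python
from math import log

FLOOR = 0.00000001  # the substitute I used for 0 so that log() is defined


def pr_prob(S, H):
    """The experimental replacement for prob = (S-H + 1.0) / 2.0."""
    if H == 0:
        H = FLOOR
    if S == 0:
        S = FLOOR
    return ((-(log(S) - log(H))) / 350) + 0.5


if __name__ == "__main__":
    print("most spammy input  (S=1, H=0):", pr_prob(1.0, 0.0))
    print("most hammy input   (S=0, H=1):", pr_prob(0.0, 1.0))
    print("largest possible |log| term:  ", -log(FLOOR))
```

This is consistent with the results I posted (ham mean around 56, spam mean around 45): every score lands in a narrow band around 0.5, and because of the leading minus sign the spammiest input gets the lower score.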

Apologies for the foolishness.

=Tony Meyer

From tim.one at comcast.net  Tue Mar 11 16:41:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 11 16:42:13 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <15982.7198.16444.97558@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGFEAAB.tim.one@comcast.net>

[TimS]
> I'm not sure how interesting this will be to the outlook code.

Not to worry -- it shouldn't affect the Outlook client one way or the other.
That stores the spam score as a kind of metadata ("custom field") on the
message object; it doesn't alter the headers.


From tony-bayes at lownds.com  Tue Mar 11 16:59:22 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 19:59:24 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: 
 <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
Message-ID: <a05200f87ba942ddce2cf@[204.162.121.55]>

At 10:11 AM +1300 3/12/03, Meyer, Tony wrote:
>Does Skip's proposal sound ok?

Yes, sounds like a good idea.

>Personally, I prefer the current method of mailid.  Originally it 
>was like this (for perhaps 24 hours), but there are already way too 
>many options.  So I dropped the T/F add_to_body option and changed 
>the add to a string.  Shouldn't matter to developers, and end-users 
>should have a nice GUI hiding it all anyway.

I see what you mean. Maybe someone will want mailid to appear at the 
front of the e-mail body, or maybe in the subject, or....

>> BTW, I'm pretty excited about the mailid stuff you have done. Being
>> able to correct a single message without seeing all of my mail again
>> will be great.
>
>We aim to please ;)
>

Great stuff!

-Tony


From tony-bayes at lownds.com  Tue Mar 11 17:18:50 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Tue Mar 11 20:18:53 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: 
 <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A7@its-xchg4.massey.ac.nz>
Message-ID: <a05200f8cba9439c1ac5d@[204.162.121.55]>

>> There may be some bugs lurking, I'm now getting
>> "X-Spambayes-Classification: ham" in the body of my emails.
>
>I will check this ASAP.
>

This fixes it.

--- pop3proxy.py        11 Mar 2003 02:48:29 -0000      1.65
+++ pop3proxy.py        12 Mar 2003 01:10:36 -0000
@@ -503,7 +503,7 @@

              headers, body = re.split(r'\n\r?\n', messageText, 1)
              messageName = state.getNewMessageName()
-            headers += '\r\n%s: %s\r\n' % (options.hammie_header_name,
+            headers += '\n%s: %s\r\n' % (options.hammie_header_name,
                                             disposition)
              if command == 'RETR' and not state.isTest:
                  if options.pop3proxy_add_mailid_to.find("header") != -1:
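
Why the one-character change matters, as a small standalone sketch (not the pop3proxy code itself; the sample message and the `HEADER_NAME` constant are stand-ins for illustration). When the message uses CRLF line endings, the split pattern consumes the "\n" of the last header line but leaves its "\r" on the end of `headers`, so prepending another "\r\n" produces "\r\r\n".

```python
import re

# Stand-in for options.hammie_header_name (assumption for this example).
HEADER_NAME = "X-Spambayes-Classification"

# A made-up message with CRLF line endings, as a POP3 server would send it.
message_text = "Subject: hi\r\nFrom: a@example.com\r\n\r\nbody text\r\n"

# Same split as in the code being patched.
headers, body = re.split(r'\n\r?\n', message_text, 1)

# Before the patch: the leftover "\r" plus "\r\n" gives "\r\r\n".
old_headers = headers + '\r\n%s: %s\r\n' % (HEADER_NAME, "ham")

# After the patch: the leftover "\r" plus "\n" gives a clean CRLF.
new_headers = headers + '\n%s: %s\r\n' % (HEADER_NAME, "ham")

if __name__ == "__main__":
    print(repr(headers))
    print(repr(old_headers))
    print(repr(new_headers))
```

A mail client that treats the stray "\r\r\n" as a blank line would end the headers there and show the classification line as body text, which matches what I was seeing. Note that for a message with bare LF line endings the patched code would produce mixed line endings, so this may not be the whole story.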

>> Also,
>> this bit of code around line 163 in pop3proxy.py doesn't account for
>> the extra possible headers.
>
>Drat.  Good spotting.  Will fix this too.

I think this bug is deeper :)

-Tony

From bill at parducci.net  Tue Mar 11 17:29:26 2003
From: bill at parducci.net (bill parducci)
Date: Tue Mar 11 20:29:33 2003
Subject: [Spambayes] weighting question
Message-ID: <3E6E8D76.4060206@parducci.net>

is there currently a way to weight the smtp envelope information (in particular, 'mail from') independently from the payload of the message? the reason i ask is that if someone that i work with forwards an obvious spam note to me with little preamble (e.g. 'note the use of underscores') the remaining content of the message forces it right into the spam bucket. i have two other cases where retraining doesn't seem to be improving false positives as well. 

thanks

b

p.s. yes, i know that 'mail from' isn't a reliable authentication assertion. :o)


From tim.one at comcast.net  Tue Mar 11 20:33:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 11 20:34:04 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8A5@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEHIEAAB.tim.one@comcast.net>

[Meyer, Tony]
> ...
> The change I made was to replace line 245 ("prob = (S-H + 1.0) /
> 2.0") of classifier.py with:
> """
>             from math import log
>             if H == 0:
>                 H = 0.00000001
>             if S == 0:
>                 S = 0.00000001
>             prob = ((-(log(S) - log(H)))/350) + 0.5
> """

Apart from the technical glitches you bumped into, there's a reason we don't
want to combine H and S via any expression of this form.  Because the
difference of logs is the log of the quotient, and the negation of a log is
the log of the reciprocal, the heart of this expression is log(H/S), and
it's the H/S part that's undesirable.

If, say, H is 0.99, and S is 0.0099, H/S is 100 and there's no problem with
concluding that we're sure the msg is ham.

But suppose H is .0001 and S is .000001.  Then H/S is also 100, but it's
plain nuts to be exactly as sure that the msg is ham:  H on its own says the
system thinks there's virtually no chance the msg looks like what it's been
taught about ham, and the low S says the same about what it's been taught
about spam:  it doesn't look like either, so Unsure is the "proper"
response.  If the system *had* to guess one or the other, then ham is the
best guess it can make, but H on its own says the system doesn't believe
that guess.  (Note that in pH calculations, small magnitudes don't "say"
anything significant -- a factor of 100 is equally significant in that domain
no matter how small the input magnitudes.)

Rob Hooft crafted the simple combining formula we use to give a high
combined score in the first example and a solid Unsure in the second
example.  We used a different expression involving a ratio before that, and
examples of the second kind are exactly where it screwed up.  Don't want to
do that again <wink>.
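
The two examples above can be checked numerically.  This is only a sketch:
the function names are invented here, combined_score is the expression
quoted from line 245 of classifier.py, and log_ratio is just the log(H/S)
heart of the proposed replacement (without its scaling and offset).

```python
from math import log, isclose

def combined_score(H, S):
    # Expression quoted from classifier.py; larger S pushes the result toward 1.
    return (S - H + 1.0) / 2.0

def log_ratio(H, S):
    # Core of the proposed replacement; it only ever sees the quotient H/S.
    return log(H / S)

confident = (0.99, 0.0099)     # (H, S): looks a lot like ham, not like spam
clueless = (0.0001, 0.000001)  # (H, S): looks like neither

# Both pairs have H/S == 100, so the ratio form cannot tell them apart.
same_ratio = isclose(log_ratio(*confident), log_ratio(*clueless))

# The difference form separates them: about 0.00995 versus about 0.49995.
scores = (combined_score(*confident), combined_score(*clueless))
```

The first score sits at the ham end of the scale and the second sits almost
exactly in the middle, which is the solid Unsure described above.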

BTW, and IIRC, cmp.py never got updated to deal sensibly with unsures.  If
that's right, it shouldn't be used except when spam_cutoff == ham_cutoff.
Then you've got a two-outcome classifier (no unsures), and cmp.py won't
"forget" any msgs.


From skip at pobox.com  Tue Mar 11 19:41:54 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 11 20:42:21 2003
Subject: [Spambayes] weighting question
In-Reply-To: <3E6E8D76.4060206@parducci.net>
References: <3E6E8D76.4060206@parducci.net>
Message-ID: <15982.36962.234270.381858@montanaro.dyndns.org>


    bill> is there currently a way to weight the smtp envelope information
    bill> (in particular, 'mail from') independently from the payload of the
    bill> message?

You can set this in your Options.py file:

    [Tokenizer]
    address_headers: from to cc

You can add other headers (Sender is used by the Outlook plugin) which
contain addresses too.
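
For anyone unsure how a value like that is read, here is a rough sketch
using Python's standard configparser module.  It is not the Spambayes
option-loading code, and the variable names are made up; it only shows that
the setting is a whitespace-separated list of header names.

```python
import configparser

# The same section and option as above, supplied as a string for illustration.
sample = "[Tokenizer]\naddress_headers: from to cc\n"

parser = configparser.ConfigParser()
parser.read_string(sample)

# The value is one string; splitting on whitespace gives the header names.
address_headers = parser.get("Tokenizer", "address_headers").split()
```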

Skip

From bill at parducci.net  Tue Mar 11 18:00:55 2003
From: bill at parducci.net (bill parducci)
Date: Tue Mar 11 21:00:59 2003
Subject: [Spambayes] weighting question
In-Reply-To: <15982.36962.234270.381858@montanaro.dyndns.org>
References: <3E6E8D76.4060206@parducci.net>
	<15982.36962.234270.381858@montanaro.dyndns.org>
Message-ID: <3E6E94D7.7040106@parducci.net>

Skip Montanaro wrote:
> You can set this in your Options.py file:
> 
>     [Tokenizer]
>     address_headers: from to cc

[Tokenizer]
address_headers: from

is the default value on my system, which makes me think that spambayes is already considering the 'mail from' information. (unless another flag needs to be set to enable this: basic_header_tokenize?)

if that is the case then i would think retraining with the original in the spam mbox and the forwarded version in an ham mbox should score the sender (forwarder) very strongly HAM, right? (of course, to do this i would have to hand hack the mbox file, not having the original spam.)  if so, then the question becomes is that enough to qualify subsequent messages from sender as ham? and if it isn't, then i am back full circle to wanting to be able to weight it separately from the message payload! :o)

thanks

b


From T.A.Meyer at massey.ac.nz  Wed Mar 12 15:52:38 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 21:54:04 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AD@its-xchg4.massey.ac.nz>

> Apart from the technical glitches you bumped into, there's a 
> reason we don't
> want to combine H and S via any expression of this form.
[technical explanation cut]

Thanks for that, Tim.  It's been a few years since I've done maths...I was playing around with numbers (a non-broken version), and came to much the same conclusion myself, but without the nice theory.

> BTW, and IIRC, cmp.py never got updated to deal sensibly with 
> unsures.  If
> that's right, it shouldn't be used except when spam_cutoff == 
> ham_cutoff.
> Then you've got a two-outcome classifier (no unsures), and 
> cmp.py won't "forget" any msgs.

I think this is still the case.  If there is going to be a minor increase in testing again, which is the better option, to have ham_cutoff==spam_cutoff, or to update to reveal unsure info?  (I suspect the latter).
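
To make the two options concrete, here is a small sketch of a cutoff-based classifier.  The function names and the boundary handling (< for ham, >= for spam) are assumptions made for illustration, not the code in the test drivers.

```python
def classify(score, ham_cutoff, spam_cutoff):
    # Boundary handling (< and >=) is an assumption made for this sketch.
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def outcomes(scores, ham_cutoff, spam_cutoff):
    # The set of distinct verdicts produced for a batch of scores.
    return {classify(s, ham_cutoff, spam_cutoff) for s in scores}

sample_scores = [0.0, 0.3, 0.5, 0.9, 1.0]
two_way = outcomes(sample_scores, 0.5, 0.5)    # equal cutoffs
three_way = outcomes(sample_scores, 0.2, 0.9)  # a gap between the cutoffs
```

With equal cutoffs no score can land in the unsure band, so a tool that only understands ham and spam sees every message.  With a gap between them, anything in the middle is a third outcome that such a tool would silently drop.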

Thanks again.  [Must think more before posting.  Must think more before posting. Must think...]

=Tony Meyer

From joel at prettyhipprogramming.com  Tue Mar 11 21:57:15 2003
From: joel at prettyhipprogramming.com (Joel Ricker)
Date: Tue Mar 11 22:00:04 2003
Subject: [Spambayes] [OT] Converting Outlook MSGs to mbox
Message-ID: <000201c2e843$17244120$c9e03942@nc.rr.com>


Hi all,

I wanted to know if anyone had any experience with converting Outlook's
e-mail message format into mbox or other format.  My problem is that
I've tried using the Outlook Spambayes add-in but it didn't quite work.
It's more than likely my installation of Outlook rather than the add-in.
My Outlook install flakes out from time to time.  The plugin installed
ok and I was able to define my corpus and it started working.
However, the next time I brought up Outlook, the plug-in was gone and it
won't let me reinstall for some reason.

So for stability's sake and plus as a solution in case I ever decide to
use a different e-mail program, I decided to try the pop3proxy script.
I've got two large folders with my  (corpii?) but Outlook appears to use
some proprietary storage of some sort for the e-mails.  I can save the
messages as text, as long as I do them one at a time and it doesn't
store all of the message headers, only From, To, Subject, and Date.

Has anybody seen a converter for this MSG format?  

Thanks
Joel


From T.A.Meyer at massey.ac.nz  Wed Mar 12 16:09:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 22:10:20 2003
Subject: [Spambayes] Perhaps a level header would be useful?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AE@its-xchg4.massey.ac.nz>

> >  > There may be some bugs lurking, I'm now getting
> >>  "X-Spambayes-Classification: ham" in the body of my emails.
> >
> >I will check this ASAP.
> This fixes it.
[patch]

Thanks.  My cvs access is a bit spotty today (actually I think it's my network access in general), but it should hopefully go through soon.  This was me not reading the regex closely enough when I updated things.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 16:44:15 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 22:44:53 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD81@its-xchg4.massey.ac.nz>

Is Richie around at the moment?  I get the feeling he would be most help here.  TimS maybe?

An issue that Tony Lownds brought up is that pop3proxy currently has HEADER_EXAMPLE, which is used in response to a pop3 STAT or LIST command to calculate the new size of the message, in case the mailer needs to know, and asks.

With the new headers, this is a problem.  Level and MailId are easy enough, but evidence (i.e. hammie_debug) could be just about any size.

What's the collective answer?  I do recall from previous messages that Richie was originally much more careful about making things the right size ("No " and "Yes", for example), and then IIRC decided to give this up and fix it if anyone broke.

Do any mailers use STAT or LIST for something important like allocating a certain amount of memory?

Advice appreciated :)

Along similar lines, HEADER_FORMAT used to define the header format, which is now hard coded.  Should the decision be to wipe HEADER_FORMAT out, or to have a HEADER_FORMAT for each header?  (This could go into the header module that TimS is building).  It stopped being used at r1.36, without any comments in the checkin about why (it was a big checkin, though).

=Tony Meyer

From tim at fourstonesExpressions.com  Tue Mar 11 21:52:13 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 22:53:26 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD81@its-xchg4.massey.ac.nz>
Message-ID: <OK7WSA0WSSQFK1W2YVQD997LK2XGC.3e6eaeed@myst>

3/11/2003 9:44:15 PM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>An issue that Tony Lownds brought up is that pop3proxy currently has 
HEADER_EXAMPLE, which is used in response to a pop3 STAT or LIST command to 
calculate the new size of the message, in case the mailer needs to know, and 
asks.
>
>With the new headers, this is a problem.  Level and MailId are easy enough, 
but evidence (i.e. hammie_debug) could be just about any size.
>
>What's the collective answer?  I do recall from previous messages that Richie 
was originally much more careful about making things the right size ("No " and 
"Yes", for example), and then IIRC decided to give this up and fix it if 
anyone broke.

I was noticing this very thing today as I started preparing to do that header 
module thing.  This is a problem, because AFAIK mailers expect the pop3proxy 
to give them a buffer size when they do a list, or stat.  One idea here is to 
add the headers willy-nilly, then determine the length of the resulting header 
text.  Another would be to place an upper boundary on how much text we will 
add to the headers, report a header that size, and make sure we never exceed 
that size.  If we do, we could drop headers in some sort of priority order 
until we're under the limit.  I like the first idea better, but I'm not sure 
it works with STAT.  You *could* do a test with all those mailers you have 
installed and see if any of them *use* stat...  If you set options.verbose = 
True, pop3proxy produces a log of all the interactions it proxies...
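
A rough sketch of the second idea (an upper bound, dropping headers in 
priority order).  Nothing here is pop3proxy code: the function name, the 
priority ordering and the limit are all made up for illustration.

```python
def fit_headers(headers, limit):
    """Keep the highest-priority headers whose total size fits in limit.

    headers is a list of (name, value) pairs, most important first.
    """
    def size(pairs):
        # Each header costs "Name: value" plus a CRLF line ending.
        return sum(len("%s: %s\r\n" % pair) for pair in pairs)

    kept = list(headers)
    while kept and size(kept) > limit:
        kept.pop()  # drop the lowest-priority header still present
    return kept

candidate_headers = [
    ("X-Spambayes-Classification", "ham"),
    ("X-Spambayes-Evidence", "x" * 500),  # stands in for a long evidence list
]
kept = fit_headers(candidate_headers, 100)
```

Here the evidence header is dropped and the classification header survives, 
so the size reported for LIST never has to exceed the fixed limit.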


>Along similar lines, HEADER_FORMAT used to define the header format, which is 
now hard coded.  Should the decision be to wipe HEADER_FORMAT out, or to have 
a HEADER_FORMAT for each header?  (This could go into the header module that 
TimS is building).  It stopped being used at r1.36, without any comments in the 
checkin about why (it was a big checkin, though).

I don't see that there's any value to this field...


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Wed Mar 12 17:00:41 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 23:01:21 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD82@its-xchg4.massey.ac.nz>

It seems to me that saying that a message is bigger than it actually is is *not* a problem, but the reverse would be (if, for example, memory was set aside for it).  So X-Spambayes-MailID is easy, X-Spambayes-Level is easy, X-Spambayes-Prob is easy, and X-Spambayes-Classification is easy.  X-Spambayes-Evidence is the tricky one.

> One idea here is to add the headers willy-nilly, then
> determine the length of the resulting header text.

I'm not sure I get what you are suggesting here.

> Another would be to place an upper boundary on how 
> much text we will add to the headers, report a header that
> size, and make sure we never exceed that size.  If we do,
> we could drop headers in some sort of priority order 
> until we're under the limit.

I guess we could limit the number of words in the evidence, and if more are present, just not include them (or include a "...", or "too many words! go to the web ui!" message).

> You *could* do a test with all those mailers you have 
> installed and see if any of them *use* stat...  If you set 
> options.verbose = True, pop3proxy produces a log of all
> the interactions it proxys...

<sigh>  I supposed I could, at that.

[HEADER_FORMAT]
> I don't see that there's any value to this field...

Nor do I.  +1 to deleting it, then.  It's certainly been a long time since it was used in pop3proxy.

=Tony Meyer

From tim at fourstonesExpressions.com  Tue Mar 11 22:12:55 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 23:13:01 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD82@its-xchg4.massey.ac.nz>
Message-ID: <AFDMI09JESN82QL2XCB98VR1TUQA6NJ.3e6eb3c7@myst>

3/11/2003 10:00:41 PM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>It seems to me that saying that a message is bigger than it actually is is 
*not* a problem, but the reverse would be (if, for example, memory was set 
aside for it).  So X-Spambayes-MailID is easy, X-Spambayes-Level is easy, X-
Spambayes-Prob is easy, and X-Spambayes-Classification is easy.  X-Spambayes-
Evidence is the tricky one.
>
>> One idea here is to add the headers willy-nilly, then
>> determine the length of the resulting header text.
>
>I'm not sure I get what you are suggesting here.

Yeah, I'm suggesting simply adding the headers to the message, then reporting 
how big the resulting message is.  It's a bit of a hack, but it'll be 
accurate.  Upon further rumination, though, it won't work with STAT, cause you 
don't have the message to add headers to.  So STAT is gonna have to make an 
estimate.  But upon further further rumination, I seriously doubt that mailers 
actually use STAT to allocate buffer space, for example.  That doesn't make 
much sense to me.  Probably more to simply put a mail size in the mailer ui, 
or to determine if it's larger than some threshold value set in the mailer 
configuration as the maximum size of a mail to download, etc. etc. etc.

>I guess we could limit the number of words in the evidence, and if more are 
present, just not include them (or include a "...", or "too many words! go to 
the web ui!" message).

That works, too.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Tue Mar 11 22:21:46 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 23:21:53 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B2@its-xchg4.massey.ac.nz>
Message-ID: <C0GB63E0MIMGQPSPX1NM3X4YXRWTS.3e6eb5da@myst>

3/11/2003 10:16:01 PM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>
>I'll <sigh> go through and see if the mailers I have installed use STAT or 
LIST.  But I won't get time until tomorrow (NZ time) to get to this.  I'll 
update the list when I've got to it.

Well, they certainly use LIST, and I'm relatively certain they use STAT.  My 
Opera 6.05 mailer uses 'em both.  Here's the start of a recent pop3proxy log:

OK Cubic Circle's v1.31 1998/05/13 POP3 ready 
<4715000052b56e3e@mail.powweb.com>
USER timstone
+OK timstone selected
PASS f04g0t
+OK Congratulations!
STAT
+OK 1 196987
UIDL
+OK But remember to DELETE messages REGULARLY
1 2e3009d532010300
.
LIST 1
+OK 51 196937
RETR 1
+OK 196937 octets
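
For anyone wanting to check their own mailer the same way, here is a small 
sketch that pulls the client commands out of a log like the one above.  It 
assumes the log is plain text with one command or response per line, as in 
this excerpt; the function name and the command list are made up here.

```python
# A handful of POP3 commands to look for; extend as needed.
POP3_COMMANDS = {"USER", "PASS", "STAT", "LIST", "UIDL", "RETR", "TOP", "DELE", "QUIT"}

def commands_seen(log_text):
    """Return the set of POP3 commands that appear at the start of a line."""
    seen = set()
    for line in log_text.splitlines():
        words = line.split()
        if words and words[0].upper() in POP3_COMMANDS:
            seen.add(words[0].upper())
    return seen

sample_log = (
    "STAT\n+OK 1 196987\nUIDL\n+OK But remember to DELETE messages REGULARLY\n"
    "1 2e3009d532010300\n.\nLIST 1\n+OK 51 196937\nRETR 1\n+OK 196937 octets\n"
)
used = commands_seen(sample_log)
```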

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim.one at comcast.net  Tue Mar 11 23:24:49 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 11 23:25:27 2003
Subject: [Spambayes] Perhaps a level header would be useful?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8AD@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEIFEAAB.tim.one@comcast.net>

[Tim]
> IIRC, cmp.py never got updated to deal sensibly with
> unsures.  If that's right, it shouldn't be used except when spam_cutoff
> == ham_cutoff.  Then you've got a two-outcome classifier (no unsures),
> and cmp.py won't "forget" any msgs.

[Meyer, Tony]
> I think this is still the case.  If there is going to be a minor
> increase in testing again, which is the better option, to have
> ham_cutoff==spam_cutoff, or to update to reveal unsure info?  (I
> suspect the latter).

It depends on what you're trying to accomplish, of course <wink>.  Updating
cmp.py is a project, because it never intended to deal with unsures, and
they don't fit well with its very detailed analysis of FP and FN.  Note that
the less-exhaustive table.py *does* deal with unsures already, and with
automating cutoff analysis (based on your histogram option settings).  After
Alex invented table.py, I rarely used cmp.py again except to zero in on
changes with very small effects.  Using table.py, you can skip the rates.py
step(s) too (table.py works directly with the output files produced by
timtest.py (if you must) or timcv.py (preferred)).

> Thanks again.  [Must think more before posting.  Must think more before
> posting. Must think...]

You're doing fine!  Thinking is overrated <wink>, and if I can't remember
why we did something one way instead of another, we should probably throw it
out and start that part over again.


From T.A.Meyer at massey.ac.nz  Wed Mar 12 17:24:08 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 23:26:44 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>

> Well, they certainly use LIST, and I'm relatively certain 
> they use STAT.  My 
> Opera 6.05 mailer uses 'em both.  Here's the start of a 
> recent pop3proxy log:

Good (now I don't need to test :).  We've got no way of knowing what the mailers do with this information, really (apart from nice open source ones ;).

So is it:
(a) put limits on the size of our headers
(b) no limits, and if someone reports a bug, then we reconsider things :)

=Tony Meyer

From skybow at hotkey.net.au  Wed Mar 12 15:29:58 2003
From: skybow at hotkey.net.au (Geoff Moyle)
Date: Tue Mar 11 23:27:20 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <CMEFJNEIFOHGOFIDIFIIAEPBCFAA.skybow@hotkey.net.au>

Cannot get spambayes to install using drive H:

Laptop win 2000 installs to outlook 2000 ok (c:)

main machine windows 2000 does not appear on outlook. installation appears
to go ok

Geoff Moyle
Knowledge Engineer

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.461 / Virus Database: 260 - Release Date: 10/03/2003


From tim at fourstonesExpressions.com  Tue Mar 11 22:29:40 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 11 23:29:55 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
Message-ID: <MIFAMG2YA9NHURZXB8B02YHFHBBA71SQ.3e6eb7b4@myst>

3/11/2003 10:24:08 PM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>
>Good (now I don't need to test :).

Why does simply figuring things out not occur to me earlier?  Sometimes I'm 
just a stoopidhead.
>
>So is it:
>(a) put limits on the size of our headers
>(b) no limits, and if someone reports a bug, then we reconsider things :)

I vote (b)  :)  Long live user testing!

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Wed Mar 12 17:31:47 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 23:32:51 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B4@its-xchg4.massey.ac.nz>

> Cannot get spambayes to install using drive H:
> Laptop win 2000 installs to outlook 2000 ok (c:)
> main machine windows 2000 does not appear on outlook. 
> installation appears to go ok

I can't think why the drive would matter.  I've installed the outlook addin from C: and D: (both partitions on a single drive), E: (another drive), and H: (a network drive) - with Outlook on D:.

What exactly goes wrong?  What do you mean by "windows 2000 does not appear on outlook"?  Nothing appears when you open Outlook?

What version of spambayes are you using?  The latest CVS?  Alpha1?  Alpha2?  The Outlook plugin installer?

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 12 17:33:45 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 11 23:34:26 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B5@its-xchg4.massey.ac.nz>

> >So is it:
> >(a) put limits on the size of our headers
> >(b) no limits, and if someone reports a bug, then we 
> reconsider things :)
> 
> I vote (b)  :)  Long live user testing!

+1 for me too.  (Why else is this alpha software? ;)  Unless anyone complains, I'll only make one little change - I'll fix it so that an approximate size of the level, prob and mailid headers are added (if those options are checked), but ignore any effect of enabling the evidence header.  It's all off by default, anyway.
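
Roughly what I have in mind, as a sketch only.  The function names and the example header value are placeholders invented for this message, not what is in pop3proxy.py; the idea is just to add a fixed estimate per message to whatever size the real server reports.

```python
def header_size(name, example_value):
    # "Name: value" plus the CRLF line ending.
    return len(name) + len(": ") + len(example_value) + len("\r\n")

def adjust_list_line(line, overhead):
    # "+OK 1 196937" or "1 196937": the last field is one message's size.
    fields = line.split()
    fields[-1] = str(int(fields[-1]) + overhead)
    return " ".join(fields)

def adjust_stat_line(line, overhead):
    # "+OK count total": every message grows by the same estimated overhead.
    status, count, total = line.split()
    return "%s %s %d" % (status, count, int(total) + int(count) * overhead)

overhead = header_size("X-Spambayes-Classification", "unsure")
list_reply = adjust_list_line("+OK 1 196937", overhead)
stat_reply = adjust_stat_line("+OK 3 1000", overhead)
```

Over-reporting a little is harmless by the argument earlier in the thread, so picking the longest plausible value for each enabled header seems the safest estimate.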

=Tony Meyer

From popiel at wolfskeep.com  Tue Mar 11 21:59:28 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Mar 12 00:59:34 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT 
In-Reply-To: Message from "Meyer, Tony" <T.A.Meyer@massey.ac.nz> 
	<1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz> 
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz> 
Message-ID: <20030312055928.983EE2DEA0@cashew.wolfskeep.com>

In message:  <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
             "Meyer, Tony" <T.A.Meyer@massey.ac.nz> writes:
>
>So is it:
>(a) put limits on the size of our headers
>(b) no limits, and if someone reports a bug, then we reconsider things >:)

Or (c) when we get a STAT or LIST or something which requires
reporting the size of the message, we could fetch the message
and analyze it and report the proper size after headers have
been added...

Of course, I'm not volunteering to code it, and I have no
idea whether that would make the proxy too slow/expensive
for people on dialups...

- Alex

From tim at fourstonesExpressions.com  Wed Mar 12 07:07:03 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 12 08:07:09 2003
Subject: [Spambayes] pop3proxy HEADER_EXAMPLE and HEADER_FORMAT 
In-Reply-To: <20030312055928.983EE2DEA0@cashew.wolfskeep.com>
Message-ID: <VR1YD8BPJIB0IHKHTQZV2UPNFBRP3Z.3e6f30f7@myst>

3/11/2003 11:59:28 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B3@its-xchg4.massey.ac.nz>
>             "Meyer, Tony" <T.A.Meyer@massey.ac.nz> writes:
>>
>>So is it:
>>(a) put limits on the size of our headers
>>(b) no limits, and if someone reports a bug, then we reconsider things >:)
>
>Or (c) when we get a STAT or LIST or something which requires
>reporting the size of the message, we could fetch the message
>and analyze it and report the proper size after headers have
>been added...
>
>Of course, I'm not volunteering to code it, and I have no
>idea whether that would make the proxy too slow/expensive
>for people on dialups...

One of the points of STAT is to not have to fetch mails that are excessively 
large unless the user wishes it.  Fetching the mail on a stat really violates 
the protocol.

>
>- Alex
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From mhammond at skippinet.com.au  Thu Mar 13 00:26:32 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 12 08:27:35 2003
Subject: [Spambayes] Windows service version of pop3proxy available
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEAAOGAA.mhammond@skippinet.com.au>

I couldn't resist :)  See the new "windows" directory, and read the comments
in pop3proxy_service.py.  Windows 2000/XP only - no Win9x support.

My intention is to create a single Windows installer for both pop3proxy and
the Outlook plugin.  IMO, a "background" version of pop3proxy for Win9x
would be good (so we can call it a "service" on all Windows versions).  Let
me know if you are interested in helping.

Mark.


From T.A.Meyer at massey.ac.nz  Thu Mar 13 11:13:03 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 17:17:20 2003
Subject: [Spambayes] Windows service version of pop3proxy available
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B7@its-xchg4.massey.ac.nz>

> I couldn't resist :)  See the new "windows" directory, and 
> read the comments
> in pop3proxy_service.py.  Windows 2000/XP only - no Win9x support.

If there's now a windows directory for all windows specific stuff, does this mean that the Outlook directory will move into that?  Given that the plugin (apparently) works for Outlook 2k2, the directory could be renamed, anyway.  Outlook isn't available on any other platforms, right?  (I believe MS have said that the mac Outlook is being dropped and exchange support being built into entourage).

Just a thought...

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 13 11:19:41 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 17:20:19 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B8@its-xchg4.massey.ac.nz>

> using the windows installer from mark's web site. not sure of 
> version ..
> downloaded it yesterday. is installed in program files 
> directory but does not appear on outlook.

Hopefully Mark will chip in, because he'll have more of an idea about this than I will.  (Mark: can any trace information be obtained with the installer version?  Do people have to have Python installed to be able to get this?)

Suggestions:
* This is definitely *Outlook*, and not *Outlook Express*, right?  (I just have to check...)
* Are you displaying the "Standard" toolbar?  This is where the plugin buttons will appear.
* Uninstall the plugin. Reset the toolbars in Outlook.  Reinstall the toolbars.
* Choose "Customize current view" in the inbox.  Look at the "fields defined in this folder" section.  If there is a "Spam" column there, add it to the view.  If this is there & new mail has scores appear, then it's working, but not showing you the GUI.
* Do you have any other Outlook plugins installed?  If so, what?

Hope this helps.

=Tony Meyer

From mhammond at skippinet.com.au  Thu Mar 13 09:44:20 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 12 17:55:11 2003
Subject: [Spambayes] Windows service version of pop3proxy available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B7@its-xchg4.massey.ac.nz>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEBKOGAA.mhammond@skippinet.com.au>

> If there's now a windows directory for all windows specific
> stuff, does this mean that the Outlook directory will move into
> that?  Given that the plugin (apparently) works for Outlook 2k2,
> the directory could be renamed, anyway.  Outlook isn't available
> on any other platforms, right?  (I believe MS have said that the
> mac Outlook is being dropped and exchange support being built
> into entourage).

I thought of the top-level "windows" directory being a kind of "helper", but
not containing complete applications.  Eg, pop3proxy_service just hooks on
the back of pop3proxy, but I don't think it makes sense to have in the
top-level directory.  Things like the installer script etc also make sense
here.

The outlook plugin is a large, stand-alone application, and IMO should be in
its own directory.  I don't really mind if this was moved to *under* the
windows directory, but I see no real need.

I'd still much rather see the top-level directory cleaned up even more, with
pop3proxy and hammie getting their own, application specific directories.
In this case, pop3proxy_service in the pop3proxy directory makes more sense,
and this "windows" directory could be replaced with one simply for the
installer.

Mark.


From T.A.Meyer at massey.ac.nz  Thu Mar 13 12:14:56 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 18:16:35 2003
Subject: [Spambayes] Windows service version of pop3proxy available
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8B9@its-xchg4.massey.ac.nz>

> I thought of the top-level "windows" directory being a kind 
> of "helper", but not containing complete applications.

Ah, I see.

> The outlook plugin is a large, stand-alone application, and 
> IMO should be in its own directory.  I don't really mind
> if this was moved to *under* the windows directory, but
> I see no real need.

Fair enough, I get what you mean now.  Any thoughts about renaming it to Outlook, rather than Outlook2000?  Not worth the bother?

> I'd still much rather see the top-level directory cleaned up 
> even more, with pop3proxy and hammie getting their own,
> application specific directories.

So would I, but I've read the debates about this in the past, and I'm staying clear ;)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 13 13:22:21 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 19:23:48 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C2@its-xchg4.massey.ac.nz>

> Its definitely outlook
> yep std toolbar
> yes reinstalled
> yes I do have another plugin installed which is the avg virus 
> checker. this
> is the only diff between machines so I am assuming that this 
> is probably the problem.

Well, I installed version 6.0 (build 645) of the free AVG virus checker, then the plugin, and it seems ok for me.  If you have a different version of the checker, then it still might be that.

Mark: FYI the AVG virus checker does integrate with Outlook - a button appears on the standard toolbar.

> Any advice

Two things:
* Wait for Mark to come up with the answer ;)  Seriously, while I know a little about the Outlook plugin, I know almost nothing about the installer, and he knows everything ;) about both.

* See if the CVS version works for you.  This is much more complicated, though - you'll need Python installed, and to do an anonymous checkout of the source.

Sorry I haven't been more use (yet!).

=Tony Meyer

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.459 / Virus Database: 258 - Release Date: 25/02/2003
 

From mhammond at skippinet.com.au  Thu Mar 13 11:34:45 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 12 19:35:31 2003
Subject: [Spambayes] Spambayes installation problem
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C2@its-xchg4.massey.ac.nz>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKECGOGAA.mhammond@skippinet.com.au>

Sorry I haven't replied yet.  I have no clue :(  The existing plugin still
redirects all messages to Pythonwin, so to see any debug output you will
need Python+win32all.

I think I will hack together a simple "redirect log to file" when run from
the installer.  I'm currently upgrading the HTML so that users who stumble
on it and aren't real geeks can get it going easily.

Mark.

> -----Original Message-----
> From: Meyer, Tony [mailto:T.A.Meyer@massey.ac.nz]
> Sent: Thursday, 13 March 2003 11:22 AM
> To: Geoff Moyle
> Cc: Mark Hammond; spambayes@python.org
> Subject: RE: [Spambayes] Spambayes installation problem
>
>
> > Its definitely outlook
> > yep std toolbar
> > yes reinstalled
> > yes I do have another plugin installed which is the avg virus
> > checker. this
> > is the only diff between machines so I am assuming that this
> > is probably the problem.
>
> Well, I installed version 6.0 (build 645) of the free AVG virus
> checker, then the plugin, and it seems ok for me.  If you have a
> different version of the checker, then it still might be that.
>
> Mark: FYI the AVG virus checker does integrate with Outlook - a
> button appears on the standard toolbar.
>
> > Any advice
>
> Two things:
> * Wait for Mark to come up with the answer ;)  Seriously, while I
> know a little about the Outlook plugin, I know almost nothing
> about the installer, and he knows everything ;) about both.
>
> * See if the CVS version works for you.  This is much more
> complicated, though - you'll need Python installed, and to do an
> anonymous checkout of the source.
>
> Sorry I haven't been more use (yet!).
>
> =Tony Meyer
>
> ---
> Outgoing mail is certified Virus Free.
> Checked by AVG anti-virus system (http://www.grisoft.com).
> Version: 6.0.459 / Virus Database: 258 - Release Date: 25/02/2003
>


From T.A.Meyer at massey.ac.nz  Thu Mar 13 16:23:40 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 22:26:53 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C6@its-xchg4.massey.ac.nz>

> I have no clue :(  The existing 
> plugin still redirects all messages to Pythonwin, so to see
> any debug output you will
> need Python+win32all.

Geoff: if you're willing, then installing Python and win32all would mean that we could look at whatever error the plugin is throwing up (assuming that it is!).

The longer I run with the thing installed, the more I suspect that it is the virus program.

> I think I will hack together a simple "redirect log to file" 
> when run from the installer.

Sounds like a good plan.

=Tony Meyer


From T.A.Meyer at massey.ac.nz  Thu Mar 13 17:29:28 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 23:30:07 2003
Subject: [Spambayes] UpdatableConfigParser
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz>

Those that were paying attention will recall a discussion a couple of weeks back about config files, particularly updating them.

I've just committed a new module - UpdatableConfigParser.  This extends ConfigParser so that config files can be updated (retaining whitespace and comments), rather than simply rewritten.  It should work fine with multiple config files, like ConfigParser, although there are issues to consider when doing so.
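
The difference between updating and rewriting can be sketched as follows.  This is not the UpdatableConfigParser code (its __doc__ is the place to look for that): update_option() is a hypothetical helper, written for current Python, and it only handles simple single-line "name: value" or "name = value" options.

```python
import re

SECTION_RE = re.compile(r"^\[(?P<name>[^\]]+)\]\s*$")
OPTION_RE = re.compile(
    r"^(?P<key>[^:=\s][^:=]*?)(?P<sep>\s*[:=]\s*)(?P<value>.*)$"
)


def update_option(text, section, option, value):
    """Return `text` with one option's value replaced; all else untouched."""
    out = []
    current = None  # name of the section we are currently inside
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        eol = line[len(body):]  # keep the file's own line endings
        sect = SECTION_RE.match(body)
        if sect:
            current = sect.group("name")
        elif current == section and not body.lstrip().startswith(("#", ";")):
            opt = OPTION_RE.match(body)
            if opt and opt.group("key").strip().lower() == option.lower():
                # Reuse the original key and separator so spacing survives.
                line = opt.group("key") + opt.group("sep") + str(value) + eol
        out.append(line)
    return "".join(out)
```

Comments, blank lines and the other sections pass through byte for byte, which is exactly what a plain ConfigParser write() cannot promise.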

Only those using OptionConfig to change their options should notice any difference at all.  For everyone else the functions are either almost identical to ConfigParser, or are the ConfigParser functions.

Those that do use OptionConfig will now be able to retain comments and whitespace in their ini files.  The Outlook plugin *might* also someday use this module.

I've tried to test this as thoroughly as possible, but no doubt as soon as I commit it, there will be an error.  I'll try to get this fixed ASAP.  Anyone wanting to do more testing with multiple files (there are so many possibilities!) is very welcome to do so (OptionConfig only works with one file, so this will not actually affect anyone currently using it).

The __doc__ has a lot more information.  I would hope this would be useful to any UI that allows modification of config files (are there any apart from OptionConfig at the moment?).

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 13 17:45:12 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 12 23:45:53 2003
Subject: [Spambayes] Storing Options
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>

Ignoring the fact that it's scattered throughout the code base, does anyone like the current method of getting options?

What I personally do not like (in order of dislike):
* That sections are ignored, leading to names like pop3proxy_servers.

* Updating the options object does not update the underlying ConfigParser (now UpdatableConfigParser ;) object, so a write() (or update()) will not write the updated values.

* Having all the defaults in Options.py, rather than a much simpler default config file (IIRC the reason for folding the file in was so that it didn't matter which directory you were running from, but the envar should take care of that, yes?)
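
One possible shape for a scheme that addresses the first two points: keep the section as part of the lookup, and write changes straight through to the parser so that a later write() sees them.  Everything here is hypothetical (the class name SectionedOptions, the sample section and values), and it uses the configparser module of current Python rather than anything in Options.py.

```python
import configparser
import io


class SectionedOptions:
    """Thin wrapper: options[section, name] reads and writes the parser."""

    def __init__(self, parser):
        self._parser = parser

    def __getitem__(self, key):
        section, name = key
        return self._parser.get(section, name)

    def __setitem__(self, key, value):
        section, name = key
        if not self._parser.has_section(section):
            self._parser.add_section(section)
        # ConfigParser stores strings, so convert on the way in.
        self._parser.set(section, name, str(value))

    def dump(self):
        """Return the current state as ini text."""
        buf = io.StringIO()
        self._parser.write(buf)
        return buf.getvalue()


if __name__ == "__main__":
    parser = configparser.ConfigParser()
    parser.read_string("[pop3proxy]\nservers = pop.example.com\n")
    options = SectionedOptions(parser)
    options["pop3proxy", "servers"] = "mail.example.org"
    print(options.dump())
```

Because there is only one copy of each value, the options object and the parser cannot drift apart.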

I know I'm not completely alone here, but I'd like to know if there are lots of people (or even a few of the right people ;) that like it as it is.  If people (a) don't care, or (b) also don't like it, then I'll try and come up with a better scheme (and present it before making any changes!).

=Tony Meyer

From popiel at wolfskeep.com  Wed Mar 12 21:34:31 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Thu Mar 13 00:34:34 2003
Subject: [Spambayes] Storing Options 
In-Reply-To: Message from "Meyer, Tony" <T.A.Meyer@massey.ac.nz> 
	<1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> 
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz> 
Message-ID: <20030313053431.4B4962DE8A@cashew.wolfskeep.com>

In message:  <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>
             "Meyer, Tony" <T.A.Meyer@massey.ac.nz> writes:

[talking about handling options]

>If people (a) don't care, or (b) also don't like it, then I'll try
>and come up with a better scheme (and present it before making any
>changes!).

+1

As a person who juggles many different options sets for doing
testing, I would ask that one of the design constraints be to
make such juggling reasonably easy.
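
One way to keep such juggling cheap is to layer option sets: a base
set plus a small per-experiment override, read in order so that later
values win.  The sketch below is only an illustration; load_layers()
and the option names are made up, and it uses inline text with the
configparser module of current Python so that it is self-contained.

```python
import configparser


def load_layers(*layers):
    """Build one parser from several ini texts; later layers override."""
    parser = configparser.ConfigParser()
    for text in layers:
        parser.read_string(text)
    return parser


if __name__ == "__main__":
    base = "[Classifier]\nham_cutoff = 0.2\nspam_cutoff = 0.9\n"
    experiment = "[Classifier]\nspam_cutoff = 0.8\n"
    options = load_layers(base, experiment)
    print(options.getfloat("Classifier", "spam_cutoff"))
```

With files on disk the same effect comes from passing a list of
filenames to the parser's read() method.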

- Alex

From anthony at interlink.com.au  Thu Mar 13 21:39:53 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Thu Mar 13 05:40:45 2003
Subject: [Spambayes] wanted: malformed email messages.
Message-ID: <200303131040.h2DAdrq18384@localhost.localdomain>


If you've got spam that breaks python's email parser in some way, don't
just gripe - send it to me. I'm going to make a fairly serious go at 
seeing what I can do to make the email parser more robust, and also make
sure it notes what it had to do in order to get the message to parse 
(these notes will almost certainly be very good clues). 
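
From the caller's side, such notes might look like the sketch below.
It relies on the email package shipped with current Python 3, which
records problems in a "defects" list on the message rather than
raising; it says nothing about how the 2003 parser behaves.  The
helper name parse_with_notes() and the sample message are made up.

```python
import email


def parse_with_notes(text):
    """Parse `text`, returning the message and the names of any defects."""
    msg = email.message_from_string(text)
    notes = [type(defect).__name__ for defect in msg.defects]
    return msg, notes


if __name__ == "__main__":
    # A header block interrupted by a line that is neither a header
    # nor a continuation line.
    broken = "Subject: RE: first line\r\nstray second line\r\n\r\nbody\r\n"
    msg, notes = parse_with_notes(broken)
    print(msg["Subject"], notes)
```

The defect names are themselves usable as tokens, which is the "good
clues" idea in code form.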

Please don't just forward them, unless you're sure your mailer can do
_correct_ message/rfc822 encapsulation. If you're not sure, mail a tarball
or zipfile containing the message(s).

Thanks,
Anthony

From mhammond at skippinet.com.au  Thu Mar 13 21:51:00 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Mar 13 05:51:42 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEDNOGAA.mhammond@skippinet.com.au>

[Tony]
...
> I know I'm not completely alone here, but I'd like to know if
> there are lots of people (or even a few of the right people ;)
> that like it as it is.  If people (a) don't care, or (b) also
> don't like it, then I'll try and come up with a better scheme
> (and present it before making any changes!).

I'm certainly +1 on the concept.  I think you should go for it!  We are
still alpha, so we can get away with lots.

Now is better, too - the longer we go, the harder it gets.  I've already
discovered that splitting the database from an inheritance model is much
harder than it looks - largely because there is already a kind of decay in
the code - some __getstate__, some explicit, some pickling, some zodb, etc.

It seems the math is done - if Tim can't measure improvement, I'm sure as
hell that I can't <0.0 wink>.  So the more architecture stuff we can do now,
the better, and the more future users of this technology can benefit.

Mark.


From spambayes at rodland.no  Thu Mar 13 12:16:20 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Thu Mar 13 06:16:33 2003
Subject: [Spambayes] mails that fail when filtered in outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEDNOGAA.mhammond@skippinet.com.au>
Message-ID: <OLEKJBLGLGDHBDLHGIINOEDFCNAA.spambayes@rodland.no>

I've got a couple of emails (already fetched by Outlook) which make the
filtering fail.

I'm unsure how to report this as a bug.

If I forward the mails, the errors seem to go away - probably Outlook
"fixes" the email so that the errors disappear.

There seems to be a different error for these two mails - both listed at the
end of this mail.

I've no idea as to what makes the first one fail.  The other one seems to
have a '\r\n' included in the subject.  I guess this is not good, but it
shouldn't make the plugin fail, should it?

Also - if manual filtering is started, and one e-mail fails, the rest of the
filtering seems to be skipped.  Couldn't the filtering continue, skipping
the message which failed?
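
The behaviour asked for here amounts to catching the failure per
message instead of per run.  A minimal sketch, not the plugin's code:
filter_all() and its filter_one argument are hypothetical stand-ins
for whatever the filter dialog actually calls.

```python
def filter_all(messages, filter_one):
    """Run filter_one on every message; collect failures, don't stop."""
    done, failed = [], []
    for msg in messages:
        try:
            done.append((msg, filter_one(msg)))
        except Exception as exc:
            # Deliberately broad: one bad mail must not abort the run.
            failed.append((msg, exc))
    return done, failed


if __name__ == "__main__":
    def fake_filter(subject):
        if "\r\n" in subject:
            raise ValueError("malformed subject")
        return "ham"

    done, failed = filter_all(["hello", "bad\r\nsubject", "bye"], fake_filter)
    print(len(done), "filtered;", len(failed), "failed")
```

Returning the failed messages alongside the results also answers the
"which of the 2700 mails was it?" question without bisecting by hand.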

I'd appreciate any comments on these.  I'll be happy to post some or all of
these as bugs, but - as I said - I'm unsure how to include a message for
reproducing the errors.

ERROR 1:
Error getting property from stream (-2147221233, 'OLE error 0x8004010f',
None, None)
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 541, in _invokeex_
    return apply(func, args)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py",
line 160, in OnClick
    self.handler(*self.args)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py",
line 225, in ShowClues
    score, clues = mgr.score(msgstore_message, evidence=True)
  File
"c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line
439, in score
    email = msg.GetEmailPackageObject()
  File
"c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py",
line 639, in GetEmailPackageObject
    text = self._GetMessageText()
  File
"c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py",
line 582, in _GetMessageText
    assert msg.is_multipart()
exceptions.AssertionError:


ERROR 2:
FAILED to create email.message from:  'X-Exchange-Message: true\nSubject:
RE: Les p\xe5 www.aftenposten.no: \r\nSinte brennkopper herjer\r\n\nTo:
reidar.rodland@hydro.com\nCC: Mona R\xf8dland\n\n\nDet var nok "bare" - som
legen sa - myggstikk.  Det som er litt skitt er at martin, siste natta p\xe5
Korsika ble igjen angrepet noe s\xe5 til de grader.  Vi kom ut av tellinga
etter \xe5 ha passert 60 stikk p\xe5 hode og hender.  Det ser ikke godt ut,
og er det sikkert heller ikke.  Han klorer seg til blods i \xf8ret, og
skirker en del...  Mona er ogs\xe5 lettere angrepet, mens jeg stort sett har
sluppet ganske billig fra det....\r\n\r\nVi driver med myggreduserende
tiltak om dagen:\r\n-Eurax\r\n-sitronkonsentrat\r\n-ting i stikkkontakter
som skal hodle dem borte\r\n-lemmer for
vinduer\r\n-myggnett\r\n-etc....\r\n\r\ndet er ikke greit det er med myggen;
men ellers er vi ved meget godt mot etter at vi er tilbake i Grasse.  Det
var nok ganske smart \xe5 v\xe6re her noen dager p\xe5 forh\xe5nd, for p\xe5
mange m\xe5ter f\xf8ltes som \xe5 komme hjem i stedet for bare til nok et
"nytt sted med nye rutiner \xe5 etablere".  Vi slapper av. I morgen skal vi
p\xe5 marked i Pleymenade!\r\n\r\n\r\n\r\nF\r\n\r\n\r\n--\r\nFredrik
R\xf8dland           ASTON Technology      Phone: +47 23 28 40
17\r\nTechnical Architect       Stocknet              Fax  : +47 910 73
621\r\nFredrik.Rodland@aston.no  http://www.aston.no   Mob  : +47 992 19
817\r\n \r\n\r\n> -----Original Message-----\r\n> From:
reidar.rodland@hydro.com [mailto:reidar.rodland@hydro.com]\r\n> Sent: 12.
september 2002 08:43\r\n> To: frodland@aston.no\r\n> Subject: Les p\xe5
www.aftenposten.no: Sinte brennkopper herjer \r\n> \r\n> \r\n> Dette er et
tips som reidar.rodland@hydro.com har sendt fra \r\n> Aftenposten
Nettutgaven.\r\n> \r\n> \r\n> Sinte brennkopper herjer\r\n> \r\n> \r\n> En
brennkoppeepidemi er i ferd med \xe5 n\xe5 alle deler av landet. \r\n>
\xc5rets brennkopper er langt mer aggressive enn vanlig og mye \r\n>
vanskeligere \xe5 behandle.\r\n> \r\n> Les mer her: \r\n>
http://www.aftenposten.no/forbruker/helse/article.jhtml?articleID=397867\r\n
> \r\n> -------------------------------------------------\r\n> Beskjed fra
reidar.rodland@hydro.com:\r\n> Ref les varicelles. Til info!. Brennkopper
starter gjerne med \r\n> vannkopper og s\xe5 g\xe5r det betennelse i
s\xe5rene.\r\n> http://www.aftenposten.no \r\n>
P\r\n> -------------------------------------------------\r\n> '
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File
"C:\PROGRA~1\_DEV\Python22\lib\site-packages\win32com\server\policy.py",
line 541, in _invokeex_
    return apply(func, args)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py",
line 160, in OnClick
    self.handler(*self.args)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py",
line 225, in ShowClues
    score, clues = mgr.score(msgstore_message, evidence=True)
  File
"c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line
439, in score
    email = msg.GetEmailPackageObject()
  File
"c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py",
line 641, in GetEmailPackageObject
    msg = email.message_from_string(text)
  File "C:\PROGRA~1\_DEV\Python22\lib\email\__init__.py", line 52, in
message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 75, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 62, in parse
    self._parseheaders(root, fp)
  File "C:\PROGRA~1\_DEV\Python22\lib\email\Parser.py", line 128, in
_parseheaders
    raise Errors.HeaderParseError(
email.Errors.HeaderParseError: Not a header, not a continuation: ``Sinte
brennkopper herjer''


F


--
Fredrik Rødland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From spambayes at djl.freeuk.com  Thu Mar 13 11:22:57 2003
From: spambayes at djl.freeuk.com (David Leftley)
Date: Thu Mar 13 06:23:04 2003
Subject: [Spambayes] wanted: malformed email messages.
In-Reply-To: <200303131040.h2DAdrq18384@localhost.localdomain>
References: <200303131040.h2DAdrq18384@localhost.localdomain>
Message-ID: <rdp07v09b8e3hl0eo3idlv78rveurbu1as@4ax.com>

On Thu, 13 Mar 2003 21:39:53 +1100, Anthony Baxter
<anthony@interlink.com.au> wrote:
>
>If you've got spam that breaks python's email parser in some way, don't
>just gripe - send it to me. I'm going to make a fairly serious go at 
>seeing what I can do to make the email parser more robust, and also make
>sure it notes what it had to do in order to get the message to parse 
>(these notes will almost certainly be very good clues). 

I was just about to send some messages that the Outlook plugin was
choking on. These messages have malformed headers, either with
unexpected lines of Base64 or header lines broken across several
lines.

But having just upgraded to version 2.5b1 of the email package, all
the dodgy messages I have received to date are now processed without
errors. The only further improvement I would like to see regarding
these messages is to try and decode the Base64 in the headers rather
than just discarding it - currently I have a few spam messages with
very low scores, presumably because spambayes has thrown away all the
clues in the body of the message.
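
One way such salvaging could work is to try each stray line as Base64
and keep whatever decodes cleanly.  This is only a sketch under
assumptions: salvage_base64() is a made-up helper, the minimum length
is an arbitrary guard, and neither the email package nor spambayes is
known to do anything like this.

```python
import base64
import binascii


def salvage_base64(lines, min_length=16):
    """Return decoded text for every line that is valid Base64."""
    recovered = []
    for line in lines:
        candidate = line.strip()
        # Skip short lines: ordinary words can be valid Base64 by accident.
        if len(candidate) < min_length:
            continue
        try:
            raw = base64.b64decode(candidate, validate=True)
        except (binascii.Error, ValueError):
            continue  # not Base64 after all; leave it alone
        # latin-1 never fails to decode, which suits a tokenizer feed.
        recovered.append(raw.decode("latin-1"))
    return recovered


if __name__ == "__main__":
    stray = base64.b64encode(b"Buy cheap pills now, limited offer")
    print(salvage_base64(["X-Mailer: something else", stray.decode("ascii")]))
```

Real header lines contain colons and spaces, so strict validation
rejects them and only the stray Base64 is decoded.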

David.


From spambayes at djl.freeuk.com  Thu Mar 13 11:29:59 2003
From: spambayes at djl.freeuk.com (David Leftley)
Date: Thu Mar 13 06:30:04 2003
Subject: [Spambayes] mails that fail when filtered in outlook
In-Reply-To: <OLEKJBLGLGDHBDLHGIINOEDFCNAA.spambayes@rodland.no>
References: <LCEPIIGDJPKCOIHOBJEPGEDNOGAA.mhammond@skippinet.com.au>
	<OLEKJBLGLGDHBDLHGIINOEDFCNAA.spambayes@rodland.no>
Message-ID: <ujq07vk8uae4qe3vn6iqn8pb1v3hospucf@4ax.com>

On Thu, 13 Mar 2003 12:16:20 +0100, "Fredrik Rodland"
<spambayes@rodland.no> wrote:
>I've got a couple of emails (allready fetched by outlook) which make the
>filterering fail.

>I've no idea as to what makes the first one fail.  The other one seems to
>have a '\r\n' included in the subject.  I guess this is not good, but it
>shouldn't make the plugin fail, should it?

There seem to be some big improvements in the handling of malformed
headers in the latest python email package. I was getting an error
similar to the second of yours until I upgraded to version 2.5b1, from
http://sourceforge.net/project/showfiles.php?group_id=25568
>
>also - If manual filtering is started, and one e-mail fails, the rest of the
>filetering seems to be skipped.  couldn't the filtering continue, skipping
>the message which failed?

Yes, this is something I would like to see as well. It can sometimes
be tricky to work out which of the 2000 messages in the spam corpus is
causing filtering to fail!

David.

From spambayes at rodland.no  Thu Mar 13 13:05:38 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Thu Mar 13 07:05:45 2003
Subject: [Spambayes] wanted: malformed email messages.
In-Reply-To: <rdp07v09b8e3hl0eo3idlv78rveurbu1as@4ax.com>
Message-ID: <OLEKJBLGLGDHBDLHGIINMEDHCNAA.spambayes@rodland.no>

On Thu, 13 Mar 2003 21:39:53 +1100, Anthony Baxter
<anthony@interlink.com.au> wrote:
>
>If you've got spam that breaks python's email parser in some way, don't
>just gripe - send it to me. I'm going to make a fairly serious go at
>seeing what I can do to make the email parser more robust, and also make
>sure it notes what it had to do in order to get the message to parse
>(these notes will almost certainly be very good clues).

I'd love to - but as I wrote in my other post - outlook (which is the MUA I
use at the moment ) fixes these messages, so that they don't fail anymore.

Does anybody have any tips on how to save/send a message with all of its
original content from Outlook (2000)?


F


--
Fredrik Rodland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From spambayes at rodland.no  Thu Mar 13 13:09:14 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Thu Mar 13 07:09:21 2003
Subject: FW: [Spambayes] mails that fail when filtered in outlook
Message-ID: <OLEKJBLGLGDHBDLHGIINEEDICNAA.spambayes@rodland.no>


> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of David Leftley
> Sent: 13. mars 2003 12:30
> To: spambayes@python.org
> Subject: Re: [Spambayes] mails that fail when filtered in outlook
>
>
> Yes, this is something I would like to see as well. It can sometimes
> be tricky to work out which of the 2000 messages in the spam corpus is
> causing filtering to fail!

Exactly.  I had to split the mails I wanted to filter (a total of 2700) into
smaller portions to narrow down to the email that actually failed.  Actually
it was 2 of them - both described in my original post.


F


--
Fredrik Rodland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From spambayes at rodland.no  Thu Mar 13 13:18:22 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Thu Mar 13 07:18:29 2003
Subject: [Spambayes] mails that fail when filtered in outlook
In-Reply-To: <ujq07vk8uae4qe3vn6iqn8pb1v3hospucf@4ax.com>
Message-ID: <OLEKJBLGLGDHBDLHGIINCEDJCNAA.spambayes@rodland.no>


> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of David Leftley
> Sent: 13. mars 2003 12:30
> To: spambayes@python.org
> Subject: Re: [Spambayes] mails that fail when filtered in outlook
>
>
> On Thu, 13 Mar 2003 12:16:20 +0100, "Fredrik Rodland"
> <spambayes@rodland.no> wrote:
>
> >I've no idea as to what makes the first one fail.  The other one seems to
> >have a '\r\n' included in the subject.  I guess this is not good, but it
> >shouldn't make the plugin fail, should it?
>
> There seem to be some big improvements in the handling of malformed
> headers in the latest python email package. I was getting an error
> similar to the second of yours until I upgraded to version 2.5b1, from
> http://sourceforge.net/project/showfiles.php?group_id=25568


Thanks - I followed your advice and upgraded the email package to 2.5b1.
This fixed the message which had \r\n in the subject.

However, the other one (with the AssertionError) still fails with the same
error.


F


--
Fredrik Rodland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From noreply at sourceforge.net  Wed Mar 12 21:32:26 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 07:46:22 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702758 ] When manually filtering the results are
	not right.
Message-ID: <E18tLKM-0006kI-00@sc8-sf-web1.sourceforge.net>

Bugs item #702758, was opened at 2003-03-13 18:32
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Mark Hammond (mhammond)
Summary: When manually filtering the results are not right.

Initial Comment:
When doing a manual filter (via the filter dialog), the 
results displayed (found x ham, x spam, x unsure) are 
for the last folder filtered only, not the total over all 
folders, as one would expect.

This is because in filter.py the update() function of the 
dictionary is used, and the docs have this as a[x] = b[x], 
not a[x] += b[x], which is what would be wanted here.

Unless this is changed in a later version of Python, then 
this should really be fixed.  I might get to it :)
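
The difference is easy to see in a small sketch.  add_counts() is a 
hypothetical helper, not the code in filter.py, and the key 
names are made up:

```python
def add_counts(total, new):
    """Add the counts in `new` to `total`, key by key (a[x] += b[x])."""
    for key, value in new.items():
        total[key] = total.get(key, 0) + value
    return total


if __name__ == "__main__":
    overwritten = {"ham": 3, "spam": 1}
    overwritten.update({"ham": 2, "unsure": 4})  # replaces ham, loses the 3

    summed = {"ham": 3, "spam": 1}
    add_counts(summed, {"ham": 2, "unsure": 4})  # ham becomes 5

    print(overwritten["ham"], summed["ham"])
```

Calling add_counts() once per folder gives the total over all 
folders that the dialog is expected to show.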

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702

From noreply at sourceforge.net  Thu Mar 13 04:36:16 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 07:46:28 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder
Message-ID: <E18tRwW-0004TU-00@sc8-sf-web1.sourceforge.net>

Bugs item #642740, was opened at 2002-11-23 15:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
Status: Open
Resolution: Works For Me
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin:

Selecting "Recover From Spam" recovers the selected
message to the Inbox folder - which is not necessarily
where it came from.  The filterer will need to save the
folder it came from before we can do this.

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-13 13:36

Message:
Logged In: YES 
user_id=724871

I haven't seen this since I entered my previous comment.  I 
guess I was working on an old message, as I mentioned...

I guess you could close this bug...

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 12:03

Message:
Logged In: YES 
user_id=724871

OK - I've tested some more.  This seems to work sometimes, 
and sometimes not.  It may be related to the other bug you're 
referring to, but I'll try to walk through an example.

- I've got a message in a folder (inbox/maillister/locker).  The 
message was filtered by Outlook's rules to this folder this 
morning - i.e. I've never viewed either the message or the 
clues from any other folder.
- I run a manual filter on this folder (which returns with 1 good 
msg as expected) - WILL THIS FORGET THE FOLDER OF 
THIS MSG?
- I press the "delete as spam" button, and the message 
appears in my SPAM-folder.
- I enter my spam-folder and press the "recover from spam"-
button.
- the message appears in my INBOX

The message was ORIGINALLY (this morning local time) 
filtered using the 1.0.a2 version of spambayes, while I now 
use the latest CVS-version.

the following appears in the trace-collector:
Deleting and spam training message '[Lockergnome Penguin 
Shell]  Network Shutdown' -  trained as spam
Recovering to folder 'Inbox' and ham training 
message '[Lockergnome Penguin Shell]  Network Shutdown' -
  trained as ham

If you add some more debug, I'll be happy to run some tests 
on this msg.  Is there anyway to check whether this message 
actually 


----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 11:43

Message:
Logged In: YES 
user_id=14198

Can you post an example of something that fails?

Note that a remaining potential problem is out of our
control: occasionally the "Inbox" will see a message before
the builtin rules.  In this case, we filter it from the
Inbox, not from where the Outlook rule would have moved it.
 Thus, when we recover, we see the inbox as the source.

Note that I also fixed another bug related to this -
previously, simply scoring a message would store that folder
name as the "source" of the message.  Thus, if you had
previously viewed the clues for a message once in the wrong
folder, the correct source folder would have been lost.  So
please ensure you are testing with mail received since I
said I fixed this.

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-02-04 07:23

Message:
Logged In: YES 
user_id=14198

/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v  <-- 
addin.py
new revision: 1.48; previous revision: 1.47
/cvsroot/spambayes/spambayes/Outlook2000/filter.py,v  <-- 
filter.py
new revision: 1.16; previous revision: 1.15
/cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v  <--
 msgstore.py
new revision: 1.39; previous revision: 1.38


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

From noreply at sourceforge.net  Thu Mar 13 04:38:40 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 07:46:36 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702920 ] Manual filtering (Outlook) fails if one
	message fails
Message-ID: <E18tRyq-0004Xt-00@sc8-sf-web1.sourceforge.net>

Bugs item #702920, was opened at 2003-03-13 13:38
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering (Outlook) fails if one message fails

Initial Comment:
I've posted this question on the mailing list, and with (at 
least) one positive response, I'm entering it here:

If manual filtering is started, and one e-mail fails, the 
rest of the filtering seems to be skipped.  

Couldn't the filtering of the remaining messages 
continue, skipping the message which failed?


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

From tim at fourstonesExpressions.com  Thu Mar 13 07:07:20 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 08:07:26 2003
Subject: [Spambayes] wanted: malformed email messages.
In-Reply-To: <200303131040.h2DAdrq18384@localhost.localdomain>
Message-ID: <EDW82WS732VEACZW1WCWVY65JLG.3e708288@myst>

Anthony, I've been working on the Parser myself for a couple days. I've 
attached my version of it.  I have to tell you that I think the parser is 
fairly poorly written.  I haven't done any of the formal regression tests on 
it as of yet.

There is a mail attached to spambayes bug #695142 that has a malformed 
continuation header (text starts in column 1).


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Parser.py
Type: application/octet-stream
Size: 12630 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030313/c758eeab/Parser.obj

From skip at pobox.com  Thu Mar 13 07:14:46 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 13 08:15:12 2003
Subject: [Spambayes] UpdatableConfigParser
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1318CD87@its-xchg4.massey.ac.nz>
Message-ID: <15984.33862.819218.177867@montanaro.dyndns.org>


    Tony> Those that were paying attention will recall a discussion a couple
    Tony> of weeks back about config files, paticularly updating them.

Hmmm...  What applications modify config files?  That usually seems to me to
be the province of special config file editors or humans armed with text
editors.  You're not proposing that applications like pop3proxy should
modify them are you?

Skip

From skip at pobox.com  Thu Mar 13 07:15:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 13 08:16:04 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>
Message-ID: <15984.33926.350804.451009@montanaro.dyndns.org>

    Tony> I know I'm not completely alone here, but I'd like to know if
    Tony> there are lots of people (or even a few of the right people ;)
    Tony> that like it as it is.

I like it the way it is.  I'd prefer to fiddle my options with a text
editor.

Skip


From tim at fourstonesExpressions.com  Thu Mar 13 07:30:30 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 08:30:36 2003
Subject: [Spambayes] UpdatableConfigParser
In-Reply-To: <15984.33862.819218.177867@montanaro.dyndns.org>
Message-ID: <PNA8WHEPN5YWSRQYLH1VWSYUYT85RM.3e7087f6@myst>

I'm reasonably sure that there is code in several places that modifies 
specific options temporarily, counting on the fact that those modifications 
are not permanent. options.verbose modification is one of those things that 
gets twiddled every now and then.  I suppose persisting option changes should 
be explicit.
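
Temporary twiddling of this kind can be made safe and obvious with a 
context manager that always restores the old value, leaving anything 
persistent to a separate, explicit call.  A sketch only: temporarily() is 
a made-up helper, and SimpleNamespace stands in for the real options 
object.

```python
import contextlib
from types import SimpleNamespace


@contextlib.contextmanager
def temporarily(options, **overrides):
    """Set attributes on `options` for one block, then restore them."""
    saved = {name: getattr(options, name) for name in overrides}
    try:
        for name, value in overrides.items():
            setattr(options, name, value)
        yield options
    finally:
        # Restore even if the block raised.
        for name, value in saved.items():
            setattr(options, name, value)


if __name__ == "__main__":
    options = SimpleNamespace(verbose=False)  # stand-in for the real object
    with temporarily(options, verbose=True):
        print("inside:", options.verbose)
    print("after:", options.verbose)
```

Nothing in the block touches the config file, so code that relies on 
its changes being temporary keeps working.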

3/13/2003 7:14:46 AM, Skip Montanaro <skip@pobox.com> wrote:

>
>    Tony> Those that were paying attention will recall a discussion a couple
>    Tony> of weeks back about config files, paticularly updating them.
>
>Hmmm...  What applications modify config files?  That usually seems to me to
>be the province of special config file editors

We have one of those.

> or humans armed with text editors.

We have one of those.

>  You're not proposing that applications like pop3proxy should
>modify them are you?

It already does.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Thu Mar 13 07:31:25 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 08:31:33 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <15984.33926.350804.451009@montanaro.dyndns.org>
Message-ID: <RMNH531T1VLJJD03WLFTS2XROMHKJCB.3e70882d@myst>

3/13/2003 7:15:50 AM, Skip Montanaro <skip@pobox.com> wrote:

>    Tony> I know I'm not completely alone here, but I'd like to know if
>    Tony> there are lots of people (or even a few of the right people ;)
>    Tony> that like it as it is.
>
>I like it the way it is.  I'd prefer to fiddle my options with a text
>editor.

Nothing that Tony is proposing precludes this as a possibility.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From tim at fourstonesExpressions.com  Thu Mar 13 07:33:09 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 08:33:13 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8C9@its-xchg4.massey.ac.nz>
Message-ID: <41RLMINJ98C72197WT54SQPK3VUTMGEC.3e708895@myst>

3/12/2003 10:45:12 PM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>Ignoring the fact that it's scattered throughout the code base, does anyone
>like the current method of getting options?
>
>What I personally do not like (in order of dislike):
>* That sections are ignored, leading to names like pop3proxy_servers.
>
>* Updating the options object does not update the underlying ConfigParser
>(now UpdatableConfigParser ;) object, so a write() (or update()) will not
>write the updated values.
>
>* Having all the defaults in Options.py, rather than a much simpler default
>config file (IIRC the reason for folding the file in was so that it didn't
>matter which directory you were running from, but the envar should take care
>of that, yes?)

The only thing I really like about Options.py is the cracker, which returns an 
object of the correct type given the stringness, numberness, or booleanness of 
the option.
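
A minimal sketch of what such a cracker could look like, assuming each option
has a declared type and a converter for it.  The CRACKERS table, the
example_* option names, and the crack() helper are made up for illustration;
this is not the code in Options.py.

```python
def crack_bool(text):
    """Turn a config-file string into a bool; reject anything unrecognised."""
    lowered = text.strip().lower()
    if lowered in ("true", "yes", "on", "1"):
        return True
    if lowered in ("false", "no", "off", "0"):
        return False
    raise ValueError("not a boolean: %r" % (text,))


# Hypothetical table: option name -> converter for its declared type.
CRACKERS = {
    "verbose": crack_bool,
    "pop3proxy_servers": str,
    "example_threshold": float,  # made-up option name
    "example_count": int,        # made-up option name
}


def crack(name, raw):
    """Return the raw string converted to the option's declared type."""
    return CRACKERS[name](raw)


cracked = {
    "verbose": crack("verbose", "True"),
    "pop3proxy_servers": crack("pop3proxy_servers", "mail.example.com"),
    "example_threshold": crack("example_threshold", "0.9"),
    "example_count": crack("example_count", "150"),
}
```

Callers then get a bool, str, float, or int back without converting strings
by hand at every use.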

>
>I know I'm not completely alone here, but I'd like to know if there are lots
>of people (or even a few of the right people ;) that like it as it is.  If
>people (a) don't care, or (b) also don't like it, then I'll try and come up
>with a better scheme (and present it before making any changes!).

+1 for me.

>
>=Tony Meyer
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Thu Mar 13 08:37:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 13 09:37:20 2003
Subject: [Spambayes] UpdatableConfigParser
In-Reply-To: <PNA8WHEPN5YWSRQYLH1VWSYUYT85RM.3e7087f6@myst>
References: <15984.33862.819218.177867@montanaro.dyndns.org>
        <PNA8WHEPN5YWSRQYLH1VWSYUYT85RM.3e7087f6@myst>
Message-ID: <15984.38805.111595.613581@montanaro.dyndns.org>


    >> You're not proposing that applications like pop3proxy should modify
    >> them are you?

    Tim> It already does.

I meant modify them and save those modifications to the underlying config
file.  I realize that the options get suitably modified at runtime.  I'm
concerned that if I set the verbose flag on the command line that my config
file will get modified so that verbose is then the default.  I definitely
don't want that.
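
One way to keep the two kinds of change apart, as a rough sketch only: hold
runtime overrides in a separate dictionary so that writing the config file
never picks them up.  The Options class, its method names, and the
[pop3proxy] section below are hypothetical, and the sketch uses the Python 3
configparser module rather than anything in the Spambayes tree.

```python
import configparser
import io


class Options:
    """Options with temporary (runtime-only) and persistent values."""

    def __init__(self, text):
        self._parser = configparser.ConfigParser()
        self._parser.read_string(text)
        self._overrides = {}  # never written back to the file

    def get(self, section, name):
        # A runtime override wins over the stored value.
        if (section, name) in self._overrides:
            return self._overrides[(section, name)]
        return self._parser.get(section, name)

    def set_temporary(self, section, name, value):
        # E.g. a command-line flag: affects this run only.
        self._overrides[(section, name)] = str(value)

    def set_persistent(self, section, name, value):
        # E.g. the user pressed "save" on a configuration page.
        self._overrides.pop((section, name), None)
        self._parser.set(section, name, str(value))

    def write(self, fp):
        self._parser.write(fp)


opts = Options("[pop3proxy]\nverbose = False\n")
opts.set_temporary("pop3proxy", "verbose", True)

saved = io.StringIO()
opts.write(saved)
runtime_value = opts.get("pop3proxy", "verbose")
saved_text = saved.getvalue()
```

Here the running program sees verbose as True, while the text that would be
written to disk still says False.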

Skip


From tim at fourstonesExpressions.com  Thu Mar 13 08:44:51 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 09:44:57 2003
Subject: [Spambayes] UpdatableConfigParser
In-Reply-To: <15984.38805.111595.613581@montanaro.dyndns.org>
Message-ID: <43C8Q1ZRPC7VQGC3DA1YDB73WT975Z.3e709963@myst>

3/13/2003 8:37:09 AM, Skip Montanaro <skip@pobox.com> wrote:

>
>    >> You're not proposing that applications like pop3proxy should modify
>    >> them are you?
>
>    Tim> It already does.
>
>I meant modify them and save those modifications to the underlying config
>file.  I realize that the options get suitably modified at runtime.  I'm
>concerned that if I set the verbose flag on the command line that my config
>file will get modified so that verbose is then the default.  I definitely
>don't want that.

For sure on that one.  The pop3proxy has an Option Configuration 
page, where options that pertain to the proxy can be manipulated by 
the user.  Those manipulations actually do modify the ini file.

>
>Skip
>
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From bill at parducci.net  Thu Mar 13 08:24:15 2003
From: bill at parducci.net (bill parducci)
Date: Thu Mar 13 11:24:18 2003
Subject: [Spambayes] training issues
Message-ID: <3E70B0AF.5080303@parducci.net>

i receive a couple of newsletters and [travel] updates that i cannot get trained properly for the life of me. every time one of them comes in it is classified as spam (high 90s not uncommon) and dumped into my spam folder. 

i move the message into my inbox and fire off mboxtrain each time this happens. in looking at the note afterwards i see

X-Spambayes-Trained: ham

in the header. 

however, the next time that a similar message arrives, it is dumped into spam. my guess is that the weighting of the content (e.g. state department travel warnings bear a tremendous degree of similarity with the scams from nigeria if you just look at the occurrences of 'low freq' words) overcomes the effect of the training (which i am guessing acts by raising the header information to high ham probabilities as a result of much of the other information being previously trained as spam).

the bottom line is that i am not sure how to correct for this. suggestions? 

thanks

b


From trebor at animeigo.com  Thu Mar 13 10:20:16 2003
From: trebor at animeigo.com (Robert Woodhead)
Date: Thu Mar 13 11:25:11 2003
Subject: [Spambayes] Email Certificates of Approval
In-Reply-To: <E18tS6z-0003IJ-01@mail.python.org>
References: <E18tS6z-0003IJ-01@mail.python.org>
Message-ID: <a05210216ba9651b6143a@[192.168.1.101]>

Guys,

Been toying with a new, complementary idea for spam reduction. 
Wanted to pass it by you before unleashing it on the unsuspecting 
masses.

http://www.madoverlord.com/Projects/SPAMIDEA.t

Comments much appreciated, of course.

Best
R

Crossposted; spambayes & spam-l

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."

From db3l at fitlinxx.com  Thu Mar 13 14:25:09 2003
From: db3l at fitlinxx.com (David Bolen)
Date: Thu Mar 13 14:25:14 2003
Subject: [Spambayes] Re: wanted: malformed email messages.
References: <200303131040.h2DAdrq18384@localhost.localdomain>
	<rdp07v09b8e3hl0eo3idlv78rveurbu1as@4ax.com>
Message-ID: <u4r67m5m2.fsf@fitlinxx.com>

David Leftley <spambayes@djl.freeuk.com> writes:

(...)
> But having just upgraded to version 2.5b1 of the email package, all
> the dodgy messages I have received to date are now processed without
> errors.  (...)

I had similar behavior - I started getting a large rash of messages
that would fail to parse due to bad continuation lines (often
containing HTML comments or some such noise in the headers).  In my
case I actually switched to Python 2.3a2 for the add-in (which looks
like it has 2.5a1 of the e-mail package) and all the parsing problems
went away.

So at the very least, I think we would want to stress the need to be
using a very current email package, since for me in the span of a few
days I went from having an occasional such message to having a good
percentage each day (must have been a new format some spam-bot is
using or something).

In the context of the Outlook plugin, it also made me think that it
might be nice if the plugin didn't abort on an individual message
failure, but kept working on any remaining messages so as to at least
process as many as possible.
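
Roughly, that could look like the sketch below: wrap the per-message work in
its own try/except, note the failure, and carry on.  The process_all() and
fake_filter() names are invented for the example and do not correspond to
anything in the Outlook add-in.

```python
def process_all(messages, process_one):
    """Run process_one on every message; collect failures instead of aborting."""
    done = 0
    failures = []
    for index, msg in enumerate(messages):
        try:
            process_one(msg)
        except Exception as exc:  # one bad message must not stop the batch
            failures.append((index, exc))
        else:
            done += 1
    return done, failures


def fake_filter(msg):
    # Stand-in for real parsing/scoring: pretend malformed headers blow up.
    if msg.startswith("BAD"):
        raise ValueError("malformed continuation line")


done, failures = process_all(["ok 1", "BAD headers", "ok 2"], fake_filter)
```

The failures list can then be reported to the user once the whole batch has
been handled.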

-- David


From tim at fourstonesExpressions.com  Thu Mar 13 15:12:39 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 16:12:45 2003
Subject: [Spambayes] Are we ready for alpha 3?
Message-ID: <SOWUHCXFCKI1UPLFBAXV75DD04CB.3e70f447@myst>

Give me some votes and I'll release alpha 3 tonight, if the votes are aye 
<wink>

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From mhammond at skippinet.com.au  Fri Mar 14 08:11:54 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Mar 13 16:12:52 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <15984.33926.350804.451009@montanaro.dyndns.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEGBOGAA.mhammond@skippinet.com.au>

[Skip]
>     Tony> I know I'm not completely alone here, but I'd like to know if
>     Tony> there are lots of people (or even a few of the right people ;)
>     Tony> that like it as it is.
>
> I like it the way it is.  I'd prefer to fiddle my options with a text
> editor.

My understanding is that an updatable options class would allow the
pop3proxy configuration page to save its options back to a file.  I don't
think there is any suggestion that we try and get clever by "remembering"
options implicitly.

I like this idea for Outlook - I would prefer to have the options maintained
by the Outlook GUI be stored back in a text based options file.


Mark.


From noreply at sourceforge.net  Thu Mar 13 13:28:14 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 16:22:28 2003
Subject: [Spambayes] [ spambayes-Bugs-699063 ] pop3proxy.py crashes
Message-ID: <E18taFK-0004EK-00@sc8-sf-web1.sourceforge.net>

Bugs item #699063, was opened at 2003-03-06 17:11
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: D. R. Evans (n7dr)
>Assigned to: Tim Stone (timstone4)
Summary: pop3proxy.py crashes

Initial Comment:
pop3proxy.py worked fine for a couple of weeks.

I then rebooted my Linux box (Mandrake 8.1), and since then pop3proxy.py produces the following output 
on the console:

Loading database...
Traceback (most recent call last):
  File "./pop3proxy.py", line 1577, in ?
    run()
  File "./pop3proxy.py", line 1551, in run
    state.createWorkers()
  File "./pop3proxy.py", line 1161, in createWorkers
    self.bayes = storage.DBDictClassifier(filename)
  File "./spambayes/storage.py", line 140, in __init__
    self.load()
  File "./spambayes/storage.py", line 152, in load
    t = self.db[self.statekey]
  File "/usr/local/lib/python2.2/shelve.py", line 71, in __getitem__
    return Unpickler(f).load()
EOFError

The database files are attached.

  Doc


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-13 15:28

Message:
Logged In: YES 
user_id=645698

We currently have no way of recovering from this kind of error should it 
occur.  We believe, however, that the defect is actually a bsddb defect 
that has been corrected in a subsequent release of bsddb.
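
As a sketch only: the sketch below does not repair anything, but catching the
error at load time at least lets the program explain what happened instead of
dying with a bare traceback.  load_state() is a hypothetical helper written
against Python 3's pickle module, not the code in storage.py.

```python
import pickle


def load_state(data):
    """Return (state, None) on success or (None, message) if data is damaged."""
    try:
        return pickle.loads(data), None
    except (EOFError, pickle.UnpicklingError) as exc:
        message = "classifier database looks truncated or corrupt: %r" % (exc,)
        return None, message


good_state, good_error = load_state(pickle.dumps({"nspam": 3, "nham": 5}))
bad_state, bad_error = load_state(b"")  # an empty record, as after a bad write
```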

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699063&group_id=61702

From noreply at sourceforge.net  Thu Mar 13 13:29:02 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 16:22:35 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-699174 ] mboxtrain only trains on cur in maildir
Message-ID: <E18taG6-0001ft-00@sc8-sf-web2.sourceforge.net>

Bugs item #699174, was opened at 2003-03-06 21:56
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702

Category: None
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Matthew Cowles (mdcowles)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain only trains on cur in maildir

Initial Comment:
When training on a maildir, mboxtrain trains only on
the messages in the subdirectory cur. It ignores
messages in the subdirectory new. Since new is for
messages that haven't been seen, I think it's worth
looking there since at least some spam will have been
filed unseen.

I'll upload a patch that makes it train on both.
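
The idea can be sketched as follows.  This is an illustration under
assumptions, not the attached patch: maildir_message_paths() is a made-up
helper, and it simply lists the files in both cur and new while skipping tmp.

```python
import os
import tempfile


def maildir_message_paths(maildir, subdirs=("cur", "new")):
    """Return paths of all message files in the given maildir subdirectories."""
    paths = []
    for sub in subdirs:
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            full = os.path.join(folder, name)
            if os.path.isfile(full):
                paths.append(full)
    return paths


# Small demonstration on a throwaway maildir.
with tempfile.TemporaryDirectory() as root:
    for sub, name in (("cur", "seen-msg"), ("new", "unseen-msg"), ("tmp", "partial")):
        os.makedirs(os.path.join(root, sub), exist_ok=True)
        with open(os.path.join(root, sub, name), "w") as handle:
            handle.write("Subject: test\n\nbody\n")
    found = [os.path.basename(p) for p in maildir_message_paths(root)]
```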

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-13 15:29

Message:
Logged In: YES 
user_id=645698

This is a feature request.  If this remains as a requirement, please 
resubmit as such.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=699174&group_id=61702

From tim at fourstonesExpressions.com  Thu Mar 13 15:29:06 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 16:29:12 2003
Subject: [Spambayes] Federal Trade Commission Workshop on Spam
Message-ID: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst>

Well, maybe the feds are starting to wake up... April 30 to May 2, the FTC 
will be having a workshop on the spam problem.  Anybody in that general 
vicinity?

http://www.ftc.gov/bcp/workshops/spam/index.html

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From skip at pobox.com  Thu Mar 13 15:48:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 13 16:49:02 2003
Subject: [Spambayes] Federal Trade Commission Workshop on Spam
In-Reply-To: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst>
References: <1VTRKG87YU87FCD8UPB6RY1T2XHP.3e70f822@myst>
Message-ID: <15984.64708.540798.130761@montanaro.dyndns.org>


    Tim> Well, maybe the feds are starting to wake up... April 30 to May 2,
    Tim> the FTC will be having a workshop on the spam problem.  Anybody in
    Tim> that general vicinity?

Well, PythonLabs is in that general vicinity.  I suspect none of them could
spare all three days though.  

Skip


From trebor at animeigo.com  Thu Mar 13 20:02:36 2003
From: trebor at animeigo.com (Robert Woodhead)
Date: Thu Mar 13 20:03:08 2003
Subject: [Spambayes] Email Certificates of Approval
Message-ID: <a05210234ba96da4d17cf@[192.168.1.101]>

Forgot to post this to the list

At 2:48 PM -0500 3/13/03, Eric S. Johansson wrote:
>Robert Woodhead wrote:
>>Guys,
>>
>>Been toying with a new, complementary idea for spam reduction. 
>>Wanted to pass it by you before unleashing it on the unsuspecting 
>>masses.
>>
>>http://www.madoverlord.com/Projects/SPAMIDEA.t
>>
>>Comments much appreciated, of course.
>
>several major problems with this proposal.  It fails if:
>
>a registrar fails to list a spammer as a spammer

Well, SSL fails if a registrar doesn't do his job and issues bogus 
certs.  At some point, you have to trust someone.

>CRL reporting latency is too great

Not really that much of an issue.  Remember, this is just another 
data point.  If someone starts broadly spamming using a cert, enough 
users will note it to get the word out before most of the recipients 
grab it from their mailserver.

>virus lifts certificates from various machines
>the implementation follows all of the usual security human factors 
>failures (i.e. passphrases etc.)

Yes, but you're going to have these problems with any system.  Heck, 
a virus could grab your mailserver password and the evil spammers 
could reconfigure you as a relay.

At a certain point, you have to just say "this is good enough", and 
you can get there.

>
>it also fails because it doesn't allow truly anonymous speech and 
>opens the door for elected and non elected governance controlling 
>your ability to e-mail.

First of all, one could create anonymous certificates.  But the flip 
side is, users may decide to give less weight to an email from an 
anonymous source.  That's their choice.

99.999999999% of all email is not anonymous.  And note that anonymous 
remailers could get certs and certify that the source of the email 
(one of their users) is not a spammer.  Finally, all such a scheme is 
really saying is "if you are willing to certify who you are, your 
recipients will be more willing to trust that what you are sending is 
not spam".

Bluntly, the anonymity of email -- or more precisely, the ease of 
obscuring the origin of email -- is one of the major flaws in the 
current email system design that makes spam so easy to inflict on us 
all.

>
>simple proof of work stamps get around all of these problems and 
>still put a big burden on the spammer.

Proof of work is an interesting idea.  But if it is worth enough, 
custom silicon can easily give 100x or even 1000x the throughput of a 
general-purpose processor.  And also keep in mind that there are 
legitimate emailers who need to send out a lot of email.
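
For readers unfamiliar with the idea, a proof-of-work stamp can be sketched
as below.  The thread does not say which scheme is meant; this assumes a
hashcash-style stamp in which the sender searches for a counter that gives a
SHA-1 hash with a set number of leading zero bits.  The function names and
the stamp layout are invented for the example.

```python
import hashlib
from itertools import count


def leading_zero_bits(digest):
    """Count the zero bits at the start of a bytes digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits


def stamp_bits(stamp):
    return leading_zero_bits(hashlib.sha1(stamp.encode("utf-8")).digest())


def mint(resource, bits):
    """Search for a stamp for `resource`; costs about 2**bits hashes on average."""
    for counter in count():
        stamp = "%s:%d" % (resource, counter)
        if stamp_bits(stamp) >= bits:
            return stamp


def verify(stamp, resource, bits):
    """Checking a stamp costs one hash, however long it took to mint."""
    return stamp.startswith(resource + ":") and stamp_bits(stamp) >= bits


# 12 bits keeps the demonstration fast; a real deployment would want far more.
stamp = mint("user@example.com", 12)
```

The asymmetry is the point: minting is expensive, verifying is cheap.  The
worry about custom silicon above is that the minting cost is not the same for
everyone.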

At 12:05 PM -0800 3/13/03, T. Alexander Popiel wrote:
>1. A single fee of $50 per registration would not be sufficient to
>    support the registrar; there are ongoing costs which only an
>    up-front fee cannot address, unless certificates expire... which
>    is horrible for the reputation aspects of the system.  The
>    registrar would have to be a subscription service, much like the
>    DNS registrars... but likely with higher costs because they
>    wouldn't be able to securely delegate authority (or perhaps
>    they could securely delegate... but people wouldn't believe it
>    was secure, so wouldn't trust it).  I doubt people will want to
>    pay $50 every couple months just to have a reputation.

It's unclear at present what the costs might be, but this is a valid 
concern.  I'm not thinking of something that has to be frequently 
renewed (unless it gets revoked).  Also, most cert owners would be 
mailserver operators anyway.

>
>2. The registrar could be infiltrated, bribed, or otherwise compromised.
>    Not helpful.  There would be no provable protections against such,
>    so the registrar would have to be a trusted party (in the negative
>    connotation of "you trust them because there's no way to verify
>    their veracity").  I think that keeping my own database would be
>    preferable.

So can DNS registrars.  So can SSL registrars.  But note that such a 
compromise will be immediately obvious, so there is a great incentive 
for the registrar to play fair.

>
>3. Many people pay good money to be jerks.  That's pretty much the
>    definition of email marketing... the spamhouses charge a pretty
>    penny for running one of the blast-o-grams.  An additional $50
>    per blast-o-gram for a new reputation token is minor compared
>    to the $1k-$10k+ per mailing...

Point well taken.  But note that I was talking about people buying 
certs to use to trash the reps of others.  As for the blastogram 
operators, after a while they'll find they can't buy certs from the 
registrar anymore.

>
>4. Adding a message signature to the Received headers (which is
>    effectively what you're doing) would be a wonderful thing... but
>    there's no need to centralize the signature keys.  Even if each
>    mail handler had their own privately kept & guarded keys, it'd
>    help tracking immensely.

True.

>
>5. If people are foolish enough to not scan their tagged-as-spam
>    mailboxes for important things like their boss's name as sender,
>    then they deserve to have the company go belly-up. ;-)  More
>    generally, completely ignoring the tagged-as-spam stuff is
>    dangerous and dumb, because _NO_ system is going to be perfect.
>    Sorting the messages by sender makes it fairly easy and quick to
>    dispose of them.

I agree, but even the vigilant screw up occasionally.  I scan the 
sender/title of all of my spam (it gets filtered to the bottom of the 
inbox), but even with whitelisting, an occasional email from a legit 
sender (who has never emailed before, and is clueless enough to not 
put a nice descriptive subject line) gets by.

>
>6. I also think you're overstating the reaction speed of such a system;
>    if a spammer has a new certificate for each mailing (or each day),
>    then most people will not have read the message (and registered it
>    as spam) at the time when other people's mailers or procmail scripts
>    need to classify it... classification always happens before reading,
>    and classification is usually immediately after receipt, while
>    reading is delayed some arbitrary amount.

With enough users, this problem goes away.

>
>7. If you put in something saying that certificates under a certain age
>    (or with only a few votes) are suspect, then you unfairly penalize
>    new (or casual) email users, while merely inconveniencing the
>    spammers (who then have to pre-buy and age their certificates, and/or
>    ballot-stuff them).

Hadn't thought to do that.

>
>8. Ballot-stuffing would be a major problem, and if done at a reasonable
>    rate it'd be nearly impossible to detect.  (How do you distinguish
>    between you and forty clone machines ballot-stuffing at about 20
>    votes per day vs. someone who regularly communicates via email to
>    everyone in his workplace or on a local mailing list?)

Ballot stuffing is an issue, but consider that to stuff more than one 
ballot on a particular email, you'd have to have multiple 
certificates.  Unless someone puts together a gang of like-minded 
dipwads, all stuffing, the "popiel is a lousy scumbag" votes are 
going to get overwhelmed by the "popiel is a nice guy" votes. 
There's probably some cute things that can be done to detect stuffing.

Appreciate the comments, keep them coming.

R

-- 

===========================================================
Robert Woodhead, CEO, AnimEigo     http://www.animeigo.com/
===========================================================
http://selfpromotion.com/   The Net's only URL registration
SHARESERVICE.  A power tool for power webmasters.

From noreply at sourceforge.net  Thu Mar 13 14:57:53 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 13 20:12:18 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur
	in maildir
Message-ID: <E18tbe5-00043C-00@sc8-sf-web4.sourceforge.net>

Feature Requests item #703283, was opened at 2003-03-13 16:57
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Matthew Cowles (mdcowles)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain only trains on cur in maildir

Initial Comment:
When training on a maildir, mboxtrain trains only on
the messages in the subdirectory cur. It ignores
messages in the subdirectory new. Since new is for
messages that haven't been seen, I think it's worth
looking there since at least some spam will have been
filed unseen.

This is the same as bug 699174 which Tim Stone closed
saying, "This is a feature request.  If this remains as
a requirement, please
resubmit as such."

The patch attached to that bug report fixes the
behavior which I still consider a bug.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

From tim.one at comcast.net  Thu Mar 13 20:26:50 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Mar 13 20:28:20 2003
Subject: [Spambayes] Federal Trade Commission Workshop on Spam
In-Reply-To: <15984.64708.540798.130761@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAOEBAB.tim.one@comcast.net>

[TimS]
> Well, maybe the feds are starting to wake up... April 30
> to May 2, the FTC will be having a workshop on the spam problem.
> Anybody in that general vicinity?

[SkipM]
> Well, PythonLabs is in that general vicinity.  I suspect none of
> them could spare all three days though.

I doubt our employer would agree to one hour -- we're not in the spam
business.  I suppose that can be read more than one way, some more obviously
true than others <wink>.


From T.A.Meyer at massey.ac.nz  Fri Mar 14 15:13:39 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 21:14:29 2003
Subject: [Spambayes] UpdatableConfigParser
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD89@its-xchg4.massey.ac.nz>

Oh well, it's the thought that counts ;)

For those that don't read the check-ins, I've reverted Options.py and OptionConfig.py to ConfigParser, not UpdatableConfigParser.

Two main problems:
* ConfigParser has changed in more recent Python (I'll take a look at the new version and see how exactly).
* Without starting a debate, there are issues about hooking into 'private' attributes.

I've left UpdatableConfigParser.py there, although it's not imported by any module.  I'll tinker with it and maybe get it so that it's acceptable :)  I still stand by the idea ;)

I would have got to this faster, but you people that live on the wrong side of the world found the problems when I was asleep :)  Those that cvs-up'd since my update (21 hours ago) should do so again.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Fri Mar 14 15:15:25 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 21:16:47 2003
Subject: [Spambayes] Are we ready for alpha 3?
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8D0@its-xchg4.massey.ac.nz>

> Give me some votes and I'll release alpha 3 tonight, if the 
> votes are aye <wink>

+1 as long as it takes Options.py and OptionConfig.py after I dropped UpdatableConfigParser.

Don't forget to update the website to note that a3 is there.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Fri Mar 14 15:51:50 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 21:52:59 2003
Subject: [Spambayes] Storing Options
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD8B@its-xchg4.massey.ac.nz>

I don't think I was very clear.  Let me try again:

[Skip]
> Hmmm...  What applications modify config files?
pop3proxy (via OptionConfig) and the Outlook plugin.  These are the only two applications with a ui at the moment, aren't they?  So they'd be the only ones that do it.

> That usually seems to me to be the province of special
> config file editors or humans armed with text
> editors.  You're not proposing that applications
> like pop3proxy should modify them are you?
Those applications aimed more at end user type people will have some sort of capability to change options, and will need to be able to store these somehow, so the applications will edit them.  This doesn't remove the ability to manually edit them - and in some applications (those that use hammie, for example), this (hand-edit) would probably always be the only option.

[TimS]
> I'm reasonably sure that there is code in several places that 
> modifies specific options temporarily
[...]
> I suppose persisting option changes should be explicit.
I'm not proposing changes of that magnitude!  In operation, I would think nothing much would change.  The config file(s) would be (and are) changed only when the user clicks the save button in the web ui config page, or clicks OK in the Outlook manager dialog, or ...

[Mark]
> My understanding is that an updatable options class would allow the
> pop3proxy configuration page to save its options back to a 
> file.
Which is exactly what it does now (although not as nicely as it could).  This idea is to improve how this is done, behind the scenes (before we get to beta and it's too late!).

=Tony Meyer

From anthony at interlink.com.au  Fri Mar 14 13:55:17 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Thu Mar 13 21:55:58 2003
Subject: [Spambayes] Are we ready for alpha 3? 
In-Reply-To: <SOWUHCXFCKI1UPLFBAXV75DD04CB.3e70f447@myst> 
Message-ID: <200303140255.h2E2tHU12723@localhost.localdomain>


>>> Tim Stone - Four Stones Expressions wrote
> Give me some votes and I'll release alpha 3 tonight, if the votes are aye 
> <wink>

There needs to be documentation for people upgrading from earlier versions.

The website should be updated when the release is made, as should PyPI.

Anthony


From tim.one at comcast.net  Thu Mar 13 22:28:29 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Mar 13 22:29:04 2003
Subject: [Spambayes] wanted: malformed email messages.
In-Reply-To: <OLEKJBLGLGDHBDLHGIINMEDHCNAA.spambayes@rodland.no>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEBJEBAB.tim.one@comcast.net>

[Fredrik Rodland]
> I'd love to - but as I wrote in my other post - outlook (which is
> the MUA I use at the moment ) fixes these messages, so that they don't
> fail anymore.
>
> does anybody have any tips on how to save/send a message with all of its
> original content from outlook (2000)?

Outlook doesn't store the original content, so it's not possible.  Just look
at the code in the Outlook2000 directory of this project to see all the pain
it takes to partially reconstruct the original!  Outlook simply wasn't
designed with current Internet email standards in mind, and scatters the
message it gets into a large number of fields and properties that seem
originally designed for a proprietary MS email format.  Some things can't be
recovered at all (e.g., the original MIME armor is *almost* always lost).


From T.A.Meyer at massey.ac.nz  Fri Mar 14 16:29:57 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 22:31:17 2003
Subject: [Spambayes] 
	RE: [Spambayes-checkins] spambayes/spambayes Options.py,1.22,1.23 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8D7@its-xchg4.massey.ac.nz>

> [Tony - please do something with your mailer to keep lines under 80
> columns]

I'm not sure I can.  I use Outlook with Exchange.  "Internet email"
is set to wrap at 74 chars, but there isn't a setting AFAIK to wrap
mail sent through exchange.  If anyone else knows differently, please
let me know.  I'll try and remember to hard wrap lines myself :(.

My check-in messages are also not wrapped - these are generated by
TortoiseCVS.  I've posted a request to wrap them, but you know open-source,
who knows when/if it will get done ;)  I'll try to remember to hard
wrap these too.

=Tony Meyer

From tim at fourstonesExpressions.com  Thu Mar 13 21:33:30 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 13 22:33:37 2003
Subject: [Spambayes] Are we ready for alpha 3? 
In-Reply-To: <200303140255.h2E2tHU12723@localhost.localdomain>
Message-ID: <DBI32E99653A0EA8MIUQQLKIZVA9XT.3e714d8a@myst>

3/13/2003 8:55:17 PM, Anthony Baxter <anthony@interlink.com.au> wrote:

>
>>>> Tim Stone - Four Stones Expressions wrote
>> Give me some votes and I'll release alpha 3 tonight, if the votes are aye 
>> <wink>
>
>There needs to be documentation for people upgrading from earlier versions.

Good point.  Won't be tonight... ;)

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From trebor at animeigo.com  Thu Mar 13 20:03:23 2003
From: trebor at animeigo.com (Robert Woodhead)
Date: Thu Mar 13 22:40:46 2003
Subject: [Spambayes] Email Certificates of Approval
In-Reply-To: <3E711F51.10105@harvee.org>
References: <E18tS6z-0003IJ-01@mail.python.org>
 <a05210216ba9651b6143a@[192.168.1.101]> <3E70E094.9060902@harvee.org>
 <a05210225ba969de2ee8a@[192.168.1.101]> <3E711F51.10105@harvee.org>
Message-ID: <a05210232ba96d114ee44@[192.168.1.101]>

At 7:16 PM -0500 3/13/03, Eric S. Johansson wrote:
>I hope you realize that as I play devil's advocate, there is 
>absolutely no animosity towards you or your idea.  This "think like 
>a spammer" role-playing was an essential part of the process in 
>making camram (a sender pays antispam system) more robust.

Oh of course.  I do that all the time myself.

If I think you're being a dickhead, you'll be the second to know.  ;^)

>trusting someone is the fundamental leverage point a con artist counts on.

Actually, this is not entirely correct.  Con artists depend on the 
greed of the marks.

>Now one thing to consider is the reputation damage the industry 
>would have given a sufficiently motivated set of spammers.  They can 
>keep setting up certificate authorities faster than you can knock 
>them down.  If they stayed legit long enough, they can burn a whole 
>bunch of people and get them to cease trusting certificates and 
>certificate authorities.
>
>reputation capital such a wonderful thing...

Obviously, there would have to be a central registrar who is 
responsible for certifying subregistrars.  We have that already with 
DNS.  And clearly, the amount of vetting required to become a 
subregistrar would be significant.

>how will they get the word out?  Are you envisioning some form of 
>peer-to-peer reporting structure?

Not quite clear on this, but it will involve some central servers 
keeping track of the votes.

>   How do you deal with false reports?  Imagine someone collecting 
>certificates and then a network of people report them as spammers.

No, a certificate holder can only vote that a particular email from 
another cert holder is spam, he would have to register his vote 
within a reasonable period of time, and his vote (and the yea votes) 
would decay over time.  So to get tagged as a spammer you would have 
to get voted against by a significant fraction of the electorate 
(those receiving emails from you) in a short period of time [there 
would have to be some provision that if A emails B, B can't forward 
to C and both B and C vote it to be spam].
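
The decaying-vote idea above could be sketched roughly as follows.
This is only an illustration under stated assumptions: the 24 hour
half-life, the 0.5 threshold and all function names are made up here
(the thread gives no figures), and the "B can't forward to C" rule is
not modelled at all.

```python
# Sketch only: HALF_LIFE_HOURS and SPAM_FRACTION are assumed values,
# not numbers taken from the proposal.
HALF_LIFE_HOURS = 24.0
SPAM_FRACTION = 0.5


def decayed_weight(age_hours, half_life=HALF_LIFE_HOURS):
    """Weight of a vote cast age_hours ago; it halves every half_life hours."""
    return 0.5 ** (age_hours / half_life)


def spam_score(votes, now_hours):
    """Decayed fraction of spam votes among all votes for one sender.

    votes is an iterable of (cast_at_hours, is_spam) pairs, one pair per
    recipient who voted.  Returns 0.0 when nobody has voted.
    """
    spam = ham = 0.0
    for cast_at, is_spam in votes:
        weight = decayed_weight(now_hours - cast_at)
        if is_spam:
            spam += weight
        else:
            ham += weight
    total = spam + ham
    return spam / total if total else 0.0


def looks_like_spammer(votes, now_hours):
    """True when the decayed spam fraction reaches the assumed threshold."""
    return spam_score(votes, now_hours) >= SPAM_FRACTION
```

Because old votes fade, a sender only stays tagged while recent
recipients keep voting against him.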

This provides a form of traffic analysis.  Spammers email to a lot of 
people over a short period of time.  Hammers email to a much smaller 
number of people -- almost all known to them -- over longer periods.

The only reasonable targets for a smear campaign are large legit bulk 
emailers (say, amazon.com) and mailing lists.

>  Instantly, you can cause loss of e-mail access to a large number of 
>people. Also, what about indirect reputation trashing.  Someone gets 
>a certificate with your name and identity.  Obviously, it won't 
>match your certificate but most people won't know that.

The cert isn't intended to identify you, though in most cases it can. 
It's used to tell you "emailer 23734282932823732732929 is regarded as 
a spammer".  Rarely will end-users bother, or need, to find out 
that 23734282932823732732929 is dipwad27@hotmail.com.

This isn't supposed to be a replacement for other systems of spam 
detection, just another data point used in deciding what to do.  The 
more orthogonal detection methods we have, the harder it will be to 
spam.  What this does is give you an estimate of the reputation of 
someone you've never heard of before.

>
>On the latency issue, a spammer can get out an awful lot of e-mail 
>in a small number of hours.  The distribution of a certificate 
>revocation notice worldwide will need to be under 10 minutes in 
>order for it to be only moderately effective.  I suggest you do the 
>math of propagation and figure out how far and how fast spam can go 
>in only a few minutes.

You're missing something.  From the standpoint of an enduser, it 
doesn't matter how fast the spam gets to his mailserver.  All that 
matters is how long it is between the start of the spam run and the 
time his mailreader downloads the email from the mailserver and 
checks its reputation.  For most email users, this averages several 
hours, enough time for the earlybirds who check their email every 5 
minutes to vote on the reputation.

>The problem with certificates and this kind of identity theft is 
>that it directly affects your reputation.  You can be barred from 
>ever having access to e-mail again based on this form of identity 
>theft.  You could potentially even be barred from accessing the 
>Internet ever.  How you repudiate something that's  supposedly 
>something you can't repudiate?  After all, it's your electronic 
>identity.  I don't know about you but I want to deny that I ever 
>wrote some of my e-mail. ;-)

No, not at all.  Worst case, you buy another certificate.  Or even, 
if reputations decay, just wait a couple of days and you'll have a 
decent rep again.

Consider the horrible case, a worm that goes around stealing 
certificates and giving them to spammers.  What happens?  The 
reputation system becomes unreliable for a few days until the apps 
get patched.  If we're clever in the implementation, in such a case, 
the cert holders could get a fresh cert at no charge if they wanted.

>I agree with most of which you say.  I think that certificate based 
>or, more correctly stated, identity based e-mail antispam filters 
>can be made to work if you make them decentralized and based on who 
>you know.  The trouble comes when you try to send e-mail to someone 
>that you don't know.

Which is the problem I'm addressing.

>   If you assume web of trust, then you have a "six degrees from 
>Kevin Bacon" type problem as you try to find someone you know who's 
>willing to introduce you to someone who knows the person you're 
>trying to get in touch with.

Right.  It's an issue for new users.

>
>but you still haven't dealt with the issue of elected and unelected 
>governance and their influence on your ability to generate e-mail.

Nobody is stopping you from emailing to your heart's content.  Your 
readers are merely making a recommendation to new people you might 
want to email as to what kind of guy you are.

Nobody is being forced to stop emailing.  Nobody is being forced to 
not read an email.  It's just a suggestion.  No more, or less, 
important than "your bayesian filter  thinks this is spammy" or "your 
dnsbl says this comes from a known spam source".

End users can be stupid and trash emails based on the recommendation. 
But they can also do that based on what their spamfilter says.

>Now how does this apply to legitimate bulk e-mail?  All bulk e-mail 
>should be opt-in.  Therefore, after you have established a 
>relationship with a bulk e-mail source, they are now defined as 
>"friend" and sign their messages to you. Otherwise, they can just 
>sit there and generate stamps.

It's an interesting approach.

>actually, if you want identity based systems to control spam, you 
>have to have an identity associated with every e-mail account. 
>At a $50 price point, it ain't going to happen. *I* wouldn't spend 
>$50 on a certificate when I know they cost pennies to generate. 
>Given a sufficiently high price point, you create an opportunity for 
>folks to come in and trash the reputation capital of the entire 
>system.

The $ is really for running the reputation database.  If that can 
effectively be distributed, then that problem goes away.

>Actually, this raises an important point.  Why should I spend money 
>to clean up somebody else's mess.  Certificate based systems such as 
>you propose further increase the receiver pays nature of e-mail.  I 
>pay when Spam comes in, I pay to keep Spam out.

No.  Only senders who want a cert pay.  You can receive email and use 
the system without a cert (but you don't get a vote).

>>So can DNS registrars.  So can SSL registrars.  But note that such 
>>a compromise will be immediately obvious, so there is a great 
>>incentive for the registrar to play fair.
>
>how does it become immediately obvious?

Because certs that should quickly get tagged as belonging to spammers won't.

>   Will there be worldwide bulletins on CNN?  Will the Attorneys 
>General of 15 states lead a SWAT team into some small tropical 
>country to shut down the naughty registrar?  And if the registrar 
>has managed to accrue a few hundred thousand customers who are 
>legitimate?  What happens to them?  How do you get people to change 
>their certificates when the user interface to add them is so 
>painfully horrible that they won't use them in the first place?

That's an implementation issue.  But note that while you may have 
resellers, there needs to be a central registrar who has the database 
of certs (like the DNS root servers).  So that's the point of 
compromise.

>>Point well taken.  But note that I was talking about people buying 
>>certs to use to trash the reps of others.  As for the blastogram 
>>operators, after a while they'll find they can't buy certs from the 
>>registrar anymore.
>
>so they will form their own.

And have to go through background checks.  And put up a bond. 
Spammers won't do this.

>the trick is knowing when the transition happens and even experts 
>screwed up some of the time.  How can you ever hope to get someone 
>who is uninterested to give that level of attention to detail?

You don't.  They will tend to freeride on the power-users who will 
form the voting elite.  The whole point, as I've said before, is to 
have another orthogonal detection system, reducing false positives 
for the clueless, who are most likely to get bitten by them.

>like I said.  You need extremely fast propagation of information, 
>reliable dissemination points and reliable connections to those 
>dissemination points.  I still believe it's 10 minutes propagation 
>worldwide with full redundancy on all connections.

See above.  The rest-stop on the mailserver before the POP session 
gives you the time.

>
>also, what happens if someone can get to the dissemination points? 
>Does all the e-mail get held up, or do they get all of their 
>e-mail regardless?

Sure they do.  They just don't get the benefit of an opinion.  It 
fails gracefully.

>
>And who is going to pay for all this infrastructure?  Could it be 
>the receiver of the e-mail?  The Spammer isn't paying.

The cert purchasers.

>you can deal with one form of ballot stuffing from the certificate 
>identity which prevents multiple votes by the same certificate on 
>the same topic.  Then you can also use source IP address to see if 
>you get a lot of certificates from the same address.  Unfortunately, 
>this test would fail for organizations with address translation 
>gateways.

Note that if you're sending to someone who has a cert, you could 
encode that info into the header line, so that only the holder of the 
cert (the recipient) could vote against you.

This would also, btw, have the nice feature of having a robust 
X-Original-Recipient field in the email, great for detecting bounces 
from clueless isps who return bizarre bounce messages.

R

-- 

===========================================================
Robert Woodhead, CEO, AnimEigo     http://www.animeigo.com/
===========================================================
http://selfpromotion.com/   The Net's only URL registration
SHARESERVICE.  A power tool for power webmasters.

From T.A.Meyer at massey.ac.nz  Fri Mar 14 16:53:41 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 22:55:19 2003
Subject: [Spambayes] 
	RE: [Spambayes-checkins] spambayes/spambayes Options.py,1.22,1.23 
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD8D@its-xchg4.massey.ac.nz>

> See above - that the ConfigParser didn't expose this interface is 
> probably just an indication that no-one had needed to do this before.
> There's been a lot of changes in it for Python2.3, so it seems like
> you're not the first person to run into this.

Indeed, almost as an aside (so it seems) this was done.  <sigh> If 
only I'd checked the Python CVS first...

> If the 2.3 ConfigParser class is better, there's nothing 
> saying we can't include it in the package (we already do this with
> the sets and heapq module).

Well, my UpdatableConfigParser still adds functionality - most
particularly, it lets OptionConfig.py (and 'one day soon' Outlook)
update config files without stripping comments.  It will work as is
with Python 2.2.2, but it has the deplorable ;) hooks into the private
attributes.  It should work exactly the same with the CVS Python 
(without the hooks, changing __sections and __read to _sections and _read).

So, do we:
(a) include the latest ConfigParser, so that the code can be all
    the same?
(b) have a version check that does the ugly hooking if we're pre
    2.3, and otherwise is nice?
(c) get Tony to give up on this ;)
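
Option (b) might look something like the sketch below.  It assumes
the private attribute in the pre-2.3 module is reached through its
name-mangled form on a class called ConfigParser; the helper names
sections_attr and get_sections are invented for the example and are
not part of UpdatableConfigParser.

```python
import sys


def sections_attr(version_info=None):
    """Name of the attribute holding the parsed sections.

    Before Python 2.3 the attribute is the private __sections, which
    name mangling exposes as _ConfigParser__sections (assuming the
    class is called ConfigParser); from 2.3 on it is plain _sections.
    """
    if version_info is None:
        version_info = sys.version_info
    if tuple(version_info)[:2] < (2, 3):
        return "_ConfigParser__sections"
    return "_sections"


def get_sections(parser, version_info=None):
    """Fetch the sections mapping from parser, whichever name it uses."""
    return getattr(parser, sections_attr(version_info))
```

The ugly part stays confined to one function, so the rest of the code
is the same on both versions.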

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Fri Mar 14 17:26:47 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 13 23:27:34 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD90@its-xchg4.massey.ac.nz>

[I'm throwing this back to the list in the hope that someone will have a good idea]

To summarise: Geoff is unable to get the Outlook plugin running.  He's using the installer that Mark put together.  It goes through and creates the C:\Program Files\Outlook Plugin\ directory (whatever it is called), and it also seems to register the COM plugin since it appears in Outlook's list of such plugins.  However, when opening Outlook, the GUI doesn't appear.  The trace window shows nothing (not even the "loading" lines).

Geoff does have a couple of other plugins that might be causing the problem.  One is a virus checker called AVG, which adds a button to the toolbar and adds text to messages.  I installed the free version of this (6.0), but the spambayes plugin still worked for me.  Geoff might be using a different version of AVG, however.  He also has a synchronisation plugin.

IIRC Mark did say that the installer would fail if there was already a COM plugin that was written in Python.  Is there any chance that either of these might be?

[Geoff]
> It is in the COM add-ins but its checkbox is not ticked. Ticking and
> reloading makes no difference.
It should definitely be ticked.  Just to check, after you tick it, and close & reopen Outlook, are you looking at a mail folder (like the Inbox) and not something else (like Outlook Today)?  The items won't appear until you do.

[Geoff]
> However there is a synchronisation add-in which I believe is 
> from a palm pde
Ticked, or unticked?

Unless someone on the list has ideas, the only one I have left is that Geoff progresses past the nice package Mark put together and gets the CVS version.

=Tony Meyer

From popiel at wolfskeep.com  Thu Mar 13 21:23:56 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar 14 00:24:00 2003
Subject: [Spambayes] Email Certificates of Approval 
In-Reply-To: Message from Robert Woodhead <trebor@animeigo.com> 
	<a05210232ba96d114ee44@[192.168.1.101]> 
References: <E18tS6z-0003IJ-01@mail.python.org>
	<a05210216ba9651b6143a@[192.168.1.101]> <3E70E094.9060902@harvee.org>
	<a05210225ba969de2ee8a@[192.168.1.101]> <3E711F51.10105@harvee.org>
	<a05210232ba96d114ee44@[192.168.1.101]> 
Message-ID: <20030314052356.1A7112DE88@cashew.wolfskeep.com>

In message:  <a05210232ba96d114ee44@[192.168.1.101]>
             Robert Woodhead <trebor@animeigo.com> writes:
>>
>>On the latency issue [...]
>
>You're missing something.  From the standpoint of an enduser, it 
>doesn't matter how fast the spam gets to his mailserver.  All that 
>matters is how long it is between the start of the spam run and the 
>time his mailreader downloads the email from the mailserver and 
>checks its reputation.  For most email users, this averages several 
>hours, enough time for the earlybirds who check their email every 5 
>minutes to vote on the reputation.

So... the people who form the basis for the judgements of the
system (those that check their email every 5 minutes) are exactly
those people who get no benefit from it (because there hasn't
yet been enough input to form a good judgement).  Sounds like
there's no incentive to participate and actually make the system
work.

It also doesn't do a bloody thing for those of us who get their
mail delivered realtime to the *nix mailserver with procmail
segregating it into MH mailboxes (or similar).  Yeah, I know it's
horribly anachronistic to actually have a login account on the
mailserver and not use POP or IMAP... but it's far easier to
grep through 30000 message mailboxes that way.  I suppose there's
not many of us classic users left, though.

- Alex

From spambayes at rodland.no  Fri Mar 14 08:58:19 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Fri Mar 14 03:00:37 2003
Subject: [Spambayes] Re: wanted: malformed email messages.
In-Reply-To: <u4r67m5m2.fsf@fitlinxx.com>
Message-ID: <OLEKJBLGLGDHBDLHGIINOEFLCNAA.spambayes@rodland.no>


> David Leftley <spambayes@djl.freeuk.com> writes:
>
> In the context of the Outlook plugin, it also made me think that it
> might be nice if the plugin didn't abort on an individual message
> failure, but kept working on any remaining messages so as to at least
> process as many as possible.

I've posted bug #702920 which addresses this problem.  It could be argued
that this should be a feature request, though....


Fredrik


--
Fredrik Rodland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From trebor at animeigo.com  Fri Mar 14 07:01:47 2003
From: trebor at animeigo.com (Robert Woodhead)
Date: Fri Mar 14 07:22:50 2003
Subject: [Spambayes] Email Certificates of Approval
In-Reply-To: <20030314052356.1A7112DE88@cashew.wolfskeep.com>
References: <E18tS6z-0003IJ-01@mail.python.org>
 <a05210216ba9651b6143a@[192.168.1.101]> <3E70E094.9060902@harvee.org>
 <a05210225ba969de2ee8a@[192.168.1.101]> <3E711F51.10105@harvee.org> 
 <a05210232ba96d114ee44@[192.168.1.101]>
 <20030314052356.1A7112DE88@cashew.wolfskeep.com>
Message-ID: <a0521020aba9774007a45@[192.168.1.101]>

At 9:23 PM -0800 3/13/03, T. Alexander Popiel wrote:
>In message:  <a05210232ba96d114ee44@[192.168.1.101]>
>              Robert Woodhead <trebor@animeigo.com> writes:
>>>
>>>On the latency issue [...]
>>
>>You're missing something.  From the standpoint of an enduser, it
>>doesn't matter how fast the spam gets to his mailserver.  All that
>>matters is how long it is between the start of the spam run and the
>>time his mailreader downloads the email from the mailserver and
>>checks its reputation.  For most email users, this averages several
>>hours, enough time for the earlybirds who check their email every 5
>>minutes to vote on the reputation.
>
>So... the people who form the basis for the judgements of the
>system (those that check their email every 5 minutes) are exactly
>those people who get no benefit from it (because there hasn't
>yet been enough input to form a good judgement).  Sounds like
>there's no incentive to participate and actually make the system
>work.

Not quite, it's a probabilistic thing.  Someone who checks their 
email every 5 minutes is more likely to look at it before an opinion 
has been formed, but it is not a sure thing.  It all depends on 
whether they were early or late in the spam run, for example.

Again, keep in mind this is not intended to be a be-all-end-all 
method.  It is intended to be part of a suite of methods used to make 
life hard for the spammer.  I'll repeat the mantra: orthogonality.

>
>It also doesn't do a bloody thing for those of us who get their
>mail delivered realtime to the *nix mailserver with procmail
>segregating it into MH mailboxes (or similar).  Yeah, I know it's
>horribly anachronistic to actually have a login account on the
>mailserver and not use POP or IMAP... but it's far easier to
>grep through 30000 message mailboxes that way.  I suppose there's
>not many of us classic users left, though.

True.  You neanderthals will simply have to suffer.  ;^)
-- 

===========================================================
Robert Woodhead, CEO, AnimEigo     http://www.animeigo.com/
===========================================================
http://selfpromotion.com/   The Net's only URL registration
SHARESERVICE.  A power tool for power webmasters.

From noreply at sourceforge.net  Fri Mar 14 10:02:41 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri Mar 14 12:54:40 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-695142 ] Email does not render subject in the
	"Review" Page
Message-ID: <E18ttVx-0008RT-00@sc8-sf-web3.sourceforge.net>

Bugs item #695142, was opened at 2003-02-28 10:40
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: David Shaw (dshaw)
Assigned to: Tim Stone (timstone4)
>Summary: Email does not render subject in the "Review" Page

Initial Comment:
I received the attached email.  When I go to the "review" web page of pop3proxy.py, all it shows is:

Messages classified as Unsure:       From:  
(none)                                            (none)

It acts as though the message has no "from" or "subject", even though they exist.  The user is not given any way to classify this message other than to click on the first "(none)" and read the raw message to determine its contents.  I will attach the message below.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-14 12:02

Message:
Logged In: YES 
user_id=645698

We are now actively engaged in improving the email package parser, 
which should resolve these malformation related errors.

----------------------------------------------------------------------

Comment By: Tim Stone (timstone4)
Date: 2003-03-06 17:51

Message:
Logged In: YES 
user_id=645698

This is another email package parsing 'error' caused by a malformed 
header in the attached email.  The content-type header has an embedded 
\r\n, which causes the email package to barf and discard all the 
headers.

IMO, the email package is being used in Spambayes in 
ways that it was never intended for.  Malformed mail is gonna be the death 
of us, and the email package just doesn't seem to handle it very 
well.

I'm gonna leave this bug open, but there's virtually nothing 
that can be done to make things better, at least not AFAIK.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695142&group_id=61702

From noreply at sourceforge.net  Fri Mar 14 15:38:22 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri Mar 14 18:28:26 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-642740 ] "Recover from Spam" wrong folder
Message-ID: <E18tyko-00063G-00@sc8-sf-web3.sourceforge.net>

Bugs item #642740, was opened at 2002-11-24 01:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
>Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin:

Selecting "Recover From Spam" recovers the selected
message to the Inbox folder - which is not necessarily
where it came from.  The filterer will need to save the
folder it came from before we can do this.

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-13 23:36

Message:
Logged In: YES 
user_id=724871

I haven't seen this after I entered my previous comment.  I 
guess I was working on an old message, as I mentioned...

I guess you could close this bug...

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-04 22:03

Message:
Logged In: YES 
user_id=724871

OK - I've tested some more.  This seems to work sometimes, 
and sometimes not.  It may be related to the other bug you're 
referring to, but I'll try to walk through an example.

- I've got a message in a folder (inbox/maillister/locker).  The 
message was filtered by outlooks rules to this folder this 
morning - i.e. I've never viewed either the message or the 
clues from any other folder.
- I run a manual filter on this folder (which returns with 1 good 
msg as expected) - WILL THIS FORGET THE FOLDER OF 
THIS MSG?
- I press the "delete as spam" button, and the message 
appears in my SPAM-folder.
- I enter my spam-folder and press the "recover from spam"-
button.
- the message appears in my INBOX

The message was ORIGINALLY (this morning local time) 
filtered using the 1.0.a2 version of spambayes, while I now 
use the latest CVS-version.

the following appears in the trace-collector:
Deleting and spam training message '[Lockergnome Penguin 
Shell]  Network Shutdown' -  trained as spam
Recovering to folder 'Inbox' and ham training 
message '[Lockergnome Penguin Shell]  Network Shutdown' -
  trained as ham

If you add some more debug, I'll be happy to run some tests 
on this msg.  Is there anyway to check whether this message 
actually 


----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-04 21:43

Message:
Logged In: YES 
user_id=14198

Can you post an example of something that fails?

Note that a remaining potential problem is out of our
control: occasionally the "Inbox" will see a message before
the builtin rules.  In this case, we filter it from the
Inbox, not from where the Outlook rule would have moved it.
 Thus, when we recover, we see the inbox as the source.

Note that I also fixed another bug related to this -
previously, simply scoring a message would store that folder
name as the "source" of the message.  Thus, if you had
previously viewed the clues for a message once in the wrong
folder, the correct source folder would have been lost.  So
please ensure you are testing with mail received since I
said I fixed this.

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-02-04 17:23

Message:
Logged In: YES 
user_id=14198

/cvsroot/spambayes/spambayes/Outlook2000/addin.py,v  <-- 
addin.py
new revision: 1.48; previous revision: 1.47
/cvsroot/spambayes/spambayes/Outlook2000/filter.py,v  <-- 
filter.py
new revision: 1.16; previous revision: 1.15
/cvsroot/spambayes/spambayes/Outlook2000/msgstore.py,v  <--
 msgstore.py
new revision: 1.39; previous revision: 1.38


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

From noreply at sourceforge.net  Fri Mar 14 15:39:44 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri Mar 14 18:28:31 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702920 ] Manual filtering (Outlook) fails if one
	message fails
Message-ID: <E18tym8-00065T-00@sc8-sf-web3.sourceforge.net>

Bugs item #702920, was opened at 2003-03-13 23:38
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering (Outlook) fails if one message fails

Initial Comment:
I've posted this question on the maillist, and with (at 
least) one positive feedback, I enter it here:

If manual filtering is started, and one e-mail fails, the 
rest of the filtering seems to be skipped.  

couldn't the filtering of the remaining messages 
continue, skipping the message which failed?
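
The behaviour being asked for could be sketched like this.  It is not
the plugin's actual code: filter_all and the filter_one callback are
hypothetical names used only to show the per-message try/except.

```python
def filter_all(messages, filter_one):
    """Run filter_one on every message, carrying on past failures.

    Returns (done, failed): done holds (message, result) pairs, failed
    holds (message, exception) pairs for messages that raised.
    """
    done = []
    failed = []
    for msg in messages:
        try:
            done.append((msg, filter_one(msg)))
        except Exception as exc:
            # Remember the failure and move on to the next message.
            failed.append((msg, exc))
    return done, failed
```

The failed list can then be reported once at the end, so a single bad
message no longer hides the rest of the run.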


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-15 10:39

Message:
Logged In: YES 
user_id=14198

Can you please post a traceback? (and sorry if I missed it
on the list)

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

From noreply at sourceforge.net  Sat Mar 15 09:03:07 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sat Mar 15 18:27:10 2003
Subject: [Spambayes] [ spambayes-Patches-704188 ] non-interactive hammie
Message-ID: <E18uF3r-0001g0-00@sc8-sf-web3.sourceforge.net>

Patches item #704188, was opened at 2003-03-15 17:03
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=704188&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Toby Dickenson (htrd)
Assigned to: Nobody/Anonymous (nobody)
Summary: non-interactive hammie

Initial Comment:
When hammie is training, it displays a message counter to 
stdout when processing every message in the mailbox. 
 
I have recently updated the training phase of my 
procmail integration to run under cron, and this verbose 
output is unwelcome. 
 
This attached patch causes hammie to only update the 
counter for every message if stdout is a tty. If it is not 
(such as when run under cron) it only displays the final 
total at the end of processing a mailbox. 
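
A minimal sketch of the idea, not the attached patch itself: the
train function and its arguments are invented for illustration, and
the real training step is left out.

```python
import sys


def train(messages, out=sys.stdout):
    """Count messages, showing a running counter only on a terminal."""
    interactive = hasattr(out, "isatty") and out.isatty()
    count = 0
    for _msg in messages:
        count += 1              # the real training step would go here
        if interactive:
            # Rewrite the counter in place for a human watching.
            out.write("\r%d" % count)
            out.flush()
    if interactive:
        out.write("\n")
    else:
        # Under cron or a pipe, print just the final total.
        out.write("%d\n" % count)
    return count
```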

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=704188&group_id=61702

From dan at tobias.name  Sat Mar 15 19:45:21 2003
From: dan at tobias.name (Daniel R. Tobias)
Date: Sat Mar 15 22:35:14 2003
Subject: [Spambayes] It crapped out yet again...
Message-ID: <3E73C921.6040204@tobias.name>

Once again, my ham/spam database seems to have gone belly-up, and I 
can't get the proxy to start up.  This is the error I get:

Traceback (most recent call last):
   File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1577, in ?
     run()
   File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1551, in run
     state.createWorkers()
   File "C:\Program Files\spambayes-1.0a2\pop3proxy.py", line 1161, in 
createWorkers
     self.bayes = storage.DBDictClassifier(filename)
   File "C:\Program Files\spambayes-1.0a2\spambayes\storage.py", line 
140, in __init__
     self.load()
   File "C:\Program Files\spambayes-1.0a2\spambayes\storage.py", line 
152, in load
     t = self.db[self.statekey]
   File "C:\Python22\lib\shelve.py", line 71, in __getitem__
     return Unpickler(f).load()
EOFError
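
For reference, an EOFError from the unpickler means the value stored
under that key ended before a complete pickle was read, i.e. it is
empty or truncated.  A sketch of detecting that case follows; it does
not repair anything, load_state is a hypothetical helper rather than
part of storage.py, and other kinds of corruption may raise other
exception types.

```python
import pickle


def load_state(db, key):
    """Return the unpickled value for key, or None if missing or unreadable.

    db is any mapping from keys to pickled bytes.
    """
    try:
        raw = db[key]
    except KeyError:
        return None
    try:
        return pickle.loads(raw)
    except (EOFError, pickle.UnpicklingError):
        # Empty or truncated pickle: treat the entry as unusable.
        return None
```

A caller that gets None back could then offer to start a fresh
database instead of dying with a traceback.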


-- 
== Dan ==
Dan's Web Tips: http://webtips.dan.info/
Dan's Domain Site: http://domains.dan.info/


From lists at webcrunchers.com  Sun Mar 16 18:00:50 2003
From: lists at webcrunchers.com (John D.)
Date: Sun Mar 16 22:20:37 2003
Subject: [Spambayes] Other pop3proxy options
In-Reply-To: <039101c2d847$3b1eff40$a100a8c0@zlichstein>
Message-ID: <v031107a8ba9adc2779bb@[192.168.0.2]>

Hi,

been away from the list for a while, but want to comment on some of the earlier postings from a long time ago....  So excuse the lateness of this posting.

>I would like to extend the options for how disposition is identified by the pop3proxy implementation.  In particular, I would like the option of
>
>A. X-Spambayes-Classification: <disposition>  as now
>B. To: XXXXX <disposition> as is in CVS now
>C. Subject line munging to append <disposition>
>
>Is there any reason that was not included? (beside the obvious potential for a spammer to slip in a workaround)  I use Outlook Express, and obviously can't use the arbitrary header technique - and am most interested in adding a [***SPAM***] header so that I can correctly bucketize those messages - but leave [***UNSURE***] in my primary box, and not molest ham messages at all.
>
>Is there any reason not to do this?  Would you accept it if I did?  Is there any reason why you aren't using the email module Parser API to crack the headers?  I have found a certain number of messages are not parsed correctly by the re that you are using.  They show up as From: (none) Subj: (none) in the UI - but I haven't determined why just yet (though I can see that some part of the message is getting stuck with the header by your re.split(r'\n\r?\n', messageText, 1) expression).

So - the reason why we are changing this,  is to accommodate Outlook users who can't filter on the "X-Spambayes-Classification"?
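
For the sake of discussion, option C from the quoted message
(subject-line munging) could be sketched as below.  This is not what
pop3proxy does; tag_subject is a made-up name, and the lower-case
disposition strings are an assumption.  The tag text follows the
[***SPAM***] / [***UNSURE***] form from the quoted message, and ham
is returned untouched.

```python
from email import message_from_string

# Dispositions that get a tag; anything else (ham) is left alone.
TAGGED = ("spam", "unsure")


def tag_subject(raw, disposition):
    """Append e.g. [***SPAM***] to the Subject of a raw message string."""
    if disposition not in TAGGED:
        return raw
    msg = message_from_string(raw)
    tag = "[***%s***]" % disposition.upper()
    subject = msg.get("Subject", "")
    del msg["Subject"]          # harmless if there was no Subject header
    msg["Subject"] = ("%s %s" % (subject, tag)).strip()
    return msg.as_string()
```

Note that a spammer can put the same tag text in his own subject
line, which is the workaround risk mentioned in the quoted message.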

John


From lists at webcrunchers.com  Sun Mar 16 18:43:47 2003
From: lists at webcrunchers.com (John D.)
Date: Sun Mar 16 22:20:44 2003
Subject: [Spambayes] Use of email package
In-Reply-To: <15955.48207.421755.891103@gargle.gargle.HOWL>
References: <GDHDGB7VSTP8URQMDBPKYWZWQEBL.3e53a6df@myst>
Message-ID: <v031107a9ba9ade2bf2d0@[192.168.0.2]>

Barry writes....

>>>>>> "TS" == Tim Stone <tim@fourstonesexpressions.com> writes:
>
>    TS> We've got to either seriously harden our code so it knows what
>    TS> to do when the email package raises an exception, or consider
>    TS> not using the email package.  I think I'll be reworking
>    TS> pop3proxy so that it no longer uses the email package for
>    TS> anything.  The Corpus stuff currently has most (all?) the
>    TS> function that is needed by pop3proxy anyway.
>
>Let me take this opportunity to elaborate on the architecture of the
>email package.  There was a deliberate separation between the
>representation of email messages and the parsing of flat text to that
>object model (and in generating flat text from the object model, but
>that may not be relevant).
>
>Thus, it was designed with an eye toward the use of application
>specific parsers, and it may well be that the default parsers (both
>the strict and the lax parsers) may not be appropriate for an
>application that tends to see intentionally ill-formed messages.  My
>suggestion would be to write a parser that can handle the really bad
>messages, then use the default lax parser for most things, and fall
>back to the "adaptive parser" for the really horrendous messages.
>
>Then donate that parser back to Python. <wink>
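
The fallback arrangement described in the quoted text could be sketched roughly as follows. This is only an illustration: crude_parse() is a hypothetical stand-in for the "adaptive parser", and the set of exceptions the primary parser may raise is an assumption.

```python
import email
from email.errors import MessageError
from email.message import Message

def crude_parse(text):
    """Hypothetical last-resort parser: everything up to the first
    blank line is treated as headers, the rest as an opaque body.
    Continuation lines without a colon are simply dropped."""
    msg = Message()
    head, _, body = text.partition('\n\n')
    for line in head.splitlines():
        name, sep, value = line.partition(':')
        if sep and name.strip():
            msg[name.strip()] = value.strip()
    msg.set_payload(body)
    return msg

def parse_robustly(text, primary=email.message_from_string):
    """Try the normal parser first; fall back to crude_parse() on error."""
    try:
        return primary(text)
    except (MessageError, TypeError, ValueError):
        # Assumption: these are the failures worth recovering from.
        return crude_parse(text)
```

The caller always gets a Message object back, so the rest of the mail processing does not need to know which parser was used.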

I've already spent a lot of time developing my system using the "email" package and the classic "Message" classes.

I'm also aware of the bugs in the email.Parser, especially when it comes to parsing MIME type messages, in particular the KlezH viruses I keep getting, which in most cases GAG my mail processing system.

Right now,   I skip processing these messages,   and leave them on the POP server,  and manually deal with them.   

I'm hoping we can still use these packages, because we already spent a lot of time using them, but let's just try and fix the Parser to work right.

I'm still using a much earlier version of the SpamBayes project, and I know I need to catch up, but was planning to hold off on doing that until I can get another OpenBSD box on our co-location rack, which we plan to earmark specifically for SpamBayes development.

On top of that,  I'm also working on our SMS (Spam Management System) under Open Source,  where we plan to "Collect" spam into a SQL database,  with the idea of developing a spam processing system.   This involves building the Database,  then as spam comes in,  to PROCESS it so we can keep track of REPEAT spam,  and be able to do really cool things to allow sending the spam to SpamCop,  FTC,  etc.   It's also going to test the opt out mechanisms of the spam and further classify it in order to identify the really bad ones.

Each database entry allows one to take specific notes on the spam, to allow for easy tracing of the spammer and locating them through "whois" lookups on the sites they hawk in the spam.

I've already got some pretty solid code to extract URLs and opt-out addresses, and other routines to test the validity of the opt-out URLs. So the idea is to be able to instantly look up specific spams I report to the authorities to verify whether gateways are still open, or bring up notes on pending investigations against spammers, and also bring up the Whois contact info on the domains...

I'm doing this manually right now, but want to get this automation working soon. I have already gotten about 10 spammers' domains shut down because their Whois is bogus, so it would automatically link to the domain name issuers' complaint form pages, keeping track of the "ticket numbers" and allowing me to easily follow up on my complaints to ensure they revoke the spammer's domain name, or put in accurate Whois contact info.

I have all of this almost working on my LOCAL box on my LAN, and hopefully within a few weeks, want to bring up a "spamcruncher.com" server box with a web site, Postgres, Python, and the SMS libraries and CGIs that drive the web-based GUI, and set up a few alpha and beta testers. On it would be a pop3proxy, SMTP proxy, database, Spambayes, etc. I would then be looking for anyone wanting to participate in our SMS development.

Any comments?   Forward them to "crunch@shopip.com" as I use this mail address specifically for my Mailing lists,  and I download all my list mail every week.

John


From anthony at interlink.com.au  Mon Mar 17 14:30:27 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Sun Mar 16 22:31:44 2003
Subject: [Spambayes] Use of email package 
In-Reply-To: <v031107a9ba9ade2bf2d0@[192.168.0.2]> 
Message-ID: <200303170330.h2H3USq15462@localhost.localdomain>


[John, if possible please keep your email messages to less than 80 columns
per line - thanks!]

>>> "John D." wrote
> I've already spent a lot of time developing my system using the
> "email" package and the classic "Message" classes.
>
> I'm also aware of the bugs in the email.Parser, especially when it
> comes to parsing MIME type messages, in particular the KlezH viruses I
> keep getting, which in most cases GAG my mail processing system.

If this is still broken under email 2.5b1, can you send me a complete 
sample of the broken message? 

Thanks,
Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From lists at webcrunchers.com  Sun Mar 16 19:34:39 2003
From: lists at webcrunchers.com (John D.)
Date: Sun Mar 16 22:34:42 2003
Subject: [Spambayes] 
	On Merging System wide corpuses with specific User's Corpuses.
Message-ID: <v031107acba9af1627685@[192.168.0.2]>

Have there been any thoughts or discussions about the idea of "merging" system-wide spam corpuses with "local" ones?

For instance, as was discussed earlier, people are not willing to submit their personal mail to the "ham" corpus (at least not all of it), but in instances where a domain has multiple users, I think it would be nice in the training phase to be able to mark an item to go into a "system-wide" pool of spam or ham, or into a "local" corpus specific to a particular user.

But when classifying, treat the two corpuses as a "single" file.
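
To make the idea concrete, here is a minimal sketch of combining the two pools at classification time. It assumes each corpus boils down to a mapping of token -> (spam count, ham count); that layout and the merge_counts() helper are assumptions for illustration, not how SpamBayes stores its data.

```python
def merge_counts(system_counts, user_counts):
    """Combine system-wide and per-user token counts into one view.

    Both arguments map token -> (spam_count, ham_count). Neither
    input is modified, so the two pools stay separate on disk.
    """
    merged = dict(system_counts)
    for token, (spam, ham) in user_counts.items():
        old_spam, old_ham = merged.get(token, (0, 0))
        merged[token] = (old_spam + spam, old_ham + ham)
    return merged

# Made-up example data.
system_wide = {'viagra': (10, 0), 'python': (0, 2)}
local = {'python': (1, 30)}
combined = merge_counts(system_wide, local)
```

Training would still write to only one of the two pools; only the classifier would see the merged view.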

John


From lists at webcrunchers.com  Sun Mar 16 20:21:11 2003
From: lists at webcrunchers.com (John D.)
Date: Sun Mar 16 23:21:46 2003
Subject: [Spambayes] wanted: malformed email messages.
In-Reply-To: <rdp07v09b8e3hl0eo3idlv78rveurbu1as@4ax.com>
References: <200303131040.h2DAdrq18384@localhost.localdomain>
 <200303131040.h2DAdrq18384@localhost.localdomain>
Message-ID: <v031107adba9afd163693@[192.168.0.2]>

I have some malformed messages that the Parser fails to handle.  Most are
KlezH virus attachments that fail to put an additional space between
the sections as the RFC requires.  Do you want these as well?

John


From noreply at sourceforge.net  Sun Mar 16 17:50:39 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 09:48:53 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702758 ] When manually filtering the results are
	not right.
Message-ID: <E18ujlv-00061l-00@sc8-sf-web1.sourceforge.net>

Bugs item #702758, was opened at 2003-03-13 18:32
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702

Category: Outlook
Group: None
>Status: Closed
Resolution: None
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Mark Hammond (mhammond)
Summary: When manually filtering the results are not right.

Initial Comment:
When doing a manual filter (via the filter dialog), the 
results displayed (found x ham, x spam, x unsure) are 
for the last folder filtered only, not the total over all 
folders, as one would expect.

This is because in filter.py the update() function of the 
dictionary is used, and the docs have this as a[x] = b[x], 
not a[x] += b[x], which is what would be wanted here.

Unless this is changed in a later version of Python, then 
this should really be fixed.  I might get to it :)
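
The difference can be shown with a short sketch. The counts are made up, and add_counts() is a hypothetical helper, not the code that went into filter.py.

```python
def add_counts(totals, folder_counts):
    """Accumulate per-folder counts into totals (a[x] += b[x])."""
    for key, count in folder_counts.items():
        totals[key] = totals.get(key, 0) + count
    return totals

first_folder = {'ham': 3, 'spam': 1}
second_folder = {'ham': 2, 'unsure': 4}

# dict.update() replaces values for keys that already exist...
overwritten = dict(first_folder)
overwritten.update(second_folder)

# ...whereas summing keeps a running total over all folders.
summed = add_counts(dict(first_folder), second_folder)
```

Here overwritten['ham'] ends up as the second folder's count only, while summed['ham'] is the total over both folders.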

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-17 13:50

Message:
Logged In: YES 
user_id=552329

v1.9 of filter.py fixes this (well, works for me).  Thanks Mark.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702758&group_id=61702

From noreply at sourceforge.net  Mon Mar 17 03:06:19 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 09:49:00 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702920 ] Manual filtering (Outlook) stops if one
	message fails
Message-ID: <E18usRf-0001ih-00@sc8-sf-web2.sourceforge.net>

Bugs item #702920, was opened at 2003-03-13 13:38
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
>Summary: Manual filtering (Outlook) stops if one message fails

Initial Comment:
I've posted this question on the mailing list, and with (at 
least) one positive response, I enter it here:

If manual filtering is started, and one e-mail fails, the 
rest of the filtering seems to be skipped.  

Couldn't the filtering of the remaining messages 
continue, skipping the message which failed?
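
A minimal sketch of the requested behaviour, assuming the per-message work can be wrapped in a single call. filter_all() and its arguments are hypothetical names for illustration, not the functions in filter.py.

```python
def filter_all(messages, filter_one):
    """Filter every message, recording failures instead of stopping.

    filter_one(message) is expected to return a disposition string
    such as 'ham', 'spam' or 'unsure', or to raise on a bad message.
    """
    dispositions = {}
    failures = []
    for message in messages:
        try:
            disposition = filter_one(message)
        except Exception as exc:
            # Remember the failure and carry on with the next message.
            failures.append((message, exc))
            continue
        dispositions[disposition] = dispositions.get(disposition, 0) + 1
    return dispositions, failures
```

The failures list could then be shown to the user at the end of the run instead of aborting it.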


----------------------------------------------------------------------

>Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-17 12:06

Message:
Logged In: YES 
user_id=724871

I (slightly) changed the summary.

I've included one traceback.  However, I've run into several 
different ones in the past when filtering manually, and all seem 
to stop the actual filter process.  What I want/wish is that the 
filtering process continues with the remaining messages even 
if one message fails.  There have also been several other 
comments on this subject on the list.

the actual traceback as requested:
Error getting property from stream (-2147221233, 'OLE error 0x8004010f', None, None)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 100, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 80, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line 439, in score
    email = msg.GetEmailPackageObject()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 639, in GetEmailPackageObject
    text = self._GetMessageText()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 582, in _GetMessageText
    assert msg.is_multipart()
AssertionError

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-15 00:39

Message:
Logged In: YES 
user_id=14198

Can you please post a traceback? (and sorry if I missed it
on the list)

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

From noreply at sourceforge.net  Mon Mar 17 03:14:44 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 09:49:06 2003
Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails 
Message-ID: <E18usZo-0004zj-00@sc8-sf-web1.sourceforge.net>

Bugs item #704921, was opened at 2003-03-17 12:14
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: "Train now" (outlook) fails 

Initial Comment:
I updated to the latest CVS version - which has the option 
of re-scoring messages after training.

However, when clicking on "train now" in the main plugin 
dialog, the following traceback is caught.

The training dialog seems "dead" and does not react to 
the "train now" button.

traceback:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog
    self.UpdateStatus()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus
    if self.config.rescore:
AttributeError: _ConfigurationContainer instance has no attribute 'rescore'
win32ui: OnInitDialog() virtual handler (<bound method TrainingDialog.OnInitDialog of <dialogs.TrainingDialog.TrainingDialog instance at 0x02AAC1D8>>) raised an exception
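
A plausible cause (an assumption, not confirmed in this report) is a configuration object created before the new option existed. One defensive way to read such an option is sketched below; this is not the actual fix, ConfigContainer is a hypothetical stand-in for the plugin's configuration class, and the default of False is an assumption.

```python
class ConfigContainer(object):
    """Hypothetical stand-in for the plugin's configuration object."""
    pass

def wants_rescore(config):
    """Read the 'rescore' option, treating a missing attribute as off."""
    return bool(getattr(config, 'rescore', False))

old_config = ConfigContainer()   # has no 'rescore' attribute
rescore_default = wants_rescore(old_config)
```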


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

From acunningham at rsasecurity.com  Mon Mar 17 15:11:05 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Mon Mar 17 10:05:57 2003
Subject: [Spambayes] Outlook 2002
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F16F3@exuk01>

Hi There.

I just installed spambayes with python 2.2, latest email package, and
win32all build 152.  I'm using Outlook 2002 on Windows 2000 Professional.
Looking through the archives of this mailing list, it seems like this should
work, but I can't get any of the Folder lists to display - I just get an
empty dialog box.

The debug trace shows the following error when I try to bring up a folder
list:

Traceback (most recent call last):
  File "C:\andyc\Install\spambayes\spambayes-1.0a2\Outlook2000\dialogs\FolderSelector.py", line 310, in OnInitDialog
    tree = BuildFolderTreeMAPI(self.manager.message_store.session)
  File "C:\andyc\Install\spambayes\spambayes-1.0a2\Outlook2000\dialogs\FolderSelector.py", line 93, in BuildFolderTreeMAPI
    msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, None)
win32ui: OnInitDialog() virtual handler (<bound method FolderSelector.OnInitDialog of <dialogs.FolderSelector.FolderSelector instance at 0x034D4650>>) raised an exception

Can anyone help?

--
Andy Cunningham
Senior IS Consultant
RSA Security UK Ltd

From David.Vaughan at trizetto.com  Mon Mar 17 13:49:15 2003
From: David.Vaughan at trizetto.com (Vaughan, David)
Date: Mon Mar 17 16:15:25 2003
Subject: [Spambayes] setup
Message-ID: <F0CFA9BF0DAD3B4FAAD47A49793452A756649C@s-coengl-e06>


	I did not have python so I set it up on Win2k for the first time.

	I also have spambayes-1.0a2 but don't know how to run either
setup.py build or setup.py install.  Kindly point me in the right direction.

	I am hoping to use spambayes with my netscape email account
VaughanDA@Netscape.net .  Any information you happen to have on how to
connect netscape mail client to the netscape mail server would be
appreciated.  Today, I just use http to use netscape mail.  I'd prefer to
use the netscape mail client but have never set it up and don't quite know
how.

	Thank you for your help.  I look forward to your response.

DVaughan19@SprintPCS.com                    M:(678) 478-5983
David.Vaughan@TriZetto.com W:(770) 225-3057   (877) 751-6025


From mhammond at skippinet.com.au  Tue Mar 18 08:47:51 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar 17 17:06:44 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <418A63CAEBF2D4118A1A00508BB1A0B8029F16F3@exuk01>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCECOOHAA.mhammond@skippinet.com.au>

>     msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
> pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, None)

The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good!

I have checked in a change so that any errors when walking the folder tree
are ignored.  However, this same error is going to happen, so that part of
your folder tree will *not* appear in the dialog.  Hopefully only a small
part of your tree is corrupt, so the folders you want will still be there -
you will have to try it and see.
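
The effect of that change can be illustrated with a small sketch. This is not the FolderSelector.py code: build_tree() and list_children() are hypothetical names, and a plain callable stands in for the MAPI calls.

```python
def build_tree(name, list_children):
    """Return a nested (name, children) tuple for a folder.

    list_children(name) returns the child folder names, or raises if
    the folder cannot be opened. A folder that cannot be opened is
    left out of the result, together with everything beneath it.
    """
    try:
        child_names = list(list_children(name))
    except Exception:
        return None
    children = []
    for child in child_names:
        subtree = build_tree(child, list_children)
        if subtree is not None:
            children.append(subtree)
    return (name, children)
```

The readable part of the tree is still returned, which matches the behaviour described above: the corrupt part simply does not show up.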

Mark.


From tim at fourstonesExpressions.com  Mon Mar 17 16:26:50 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 17 17:26:58 2003
Subject: [Spambayes] setup
In-Reply-To: <F0CFA9BF0DAD3B4FAAD47A49793452A756649C@s-coengl-e06>
Message-ID: <21MKZX53IFPRM1T8598VPSO3YOIEVR.3e764baa@myst>

Ok, David, the first step for you is going to be to set up the netscape mailer.  
This is completely independent of spambayes (at this point).  I googled on 
'netscape pop3 setup' and turned up a number of pages where there are 
instructions on how to do this.  The first was at 
http://documentation.ascinet.com/www/print.asp?CourseNumber=1040.  Once you 
get that set up and working, then drop us a line, and we'll get you going on 
getting spambayes set up.  In the meantime, be sure you look closely at 
http://spambayes.sourceforge.net/, our homepage.  There is considerable 
information there on how to set up spambayes, including setting up and running 
the pop3proxy, which is what you'll need.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From T.A.Meyer at massey.ac.nz  Tue Mar 18 11:13:55 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Mon Mar 17 18:26:54 2003
Subject: [Spambayes] Other pop3proxy options
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8F1@its-xchg4.massey.ac.nz>

> >I would like to extend the options for how disposition is 
> identified by the pop3proxy implementation.  In particular, I 
> would like the option of
> >A. X-Spambayes-Classification: <disposition>  as now
> >B. To: XXXXX <disposition> as is in CVS now
> >C. Subject line munging to append <disposition>
[...]
> So - the reason why we are changing this,  is to accommodate 
> Outlook users who can't filter on the "X-Spambayes-Classification"?

This wasn't so much a change as an addition.  The default behaviour is still to just add the classification header and nothing else.  If you want to, however, you can munge the To: or Subject: lines as well.

This was added to accommodate Outlook *Express* users in particular (Outlook has better spambayes integration than just about anything), but also users of any other 'thin' clients.
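
As a rough illustration of the options discussed (not the pop3proxy implementation), the sketch below always adds the classification header and optionally appends a tag to the Subject: line, leaving ham untouched. The tag_message() function and its munge_subject flag are hypothetical names.

```python
from email import message_from_string

def tag_message(text, disposition, munge_subject=False):
    """Add the classification header; optionally tag the subject."""
    msg = message_from_string(text)
    msg['X-Spambayes-Classification'] = disposition
    if munge_subject and disposition != 'ham':
        old_subject = msg['Subject'] or ''
        # Delete first: assigning to an existing header would add a
        # second Subject: line rather than replace the old one.
        del msg['Subject']
        msg['Subject'] = '%s [***%s***]' % (old_subject, disposition.upper())
    return msg.as_string()

tagged = tag_message("Subject: hi\n\nbody\n", "spam", munge_subject=True)
```

A mail client that cannot filter on arbitrary headers can then match on the tag in the subject.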

=Tony Meyer

From noreply at sourceforge.net  Mon Mar 17 16:17:33 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 19:36:59 2003
Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails 
Message-ID: <E18v4nN-0007pU-00@sc8-sf-web1.sourceforge.net>

Bugs item #704921, was opened at 2003-03-17 23:14
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
>Summary: "Train now" (outlook) fails 

Initial Comment:
I updated to the latest CVS version - which has the option 
of re-scoring messages after training.

However, when clicking on "train now" in the main plugin 
dialog, the following traceback is caught.

The training dialog seems "dead" and does not react to 
the "train now" button.

traceback:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog
    self.UpdateStatus()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus
    if self.config.rescore:
AttributeError: _ConfigurationContainer instance has no attribute 'rescore'
win32ui: OnInitDialog() virtual handler (<bound method TrainingDialog.OnInitDialog of <dialogs.TrainingDialog.TrainingDialog instance at 0x02AAC1D8>>) raised an exception


----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-18 12:17

Message:
Logged In: YES 
user_id=552329

r1.7 of config.py should fix this bug.  Please test if this works 
for you.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

From noreply at sourceforge.net  Mon Mar 17 19:37:45 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 22:37:21 2003
Subject: [Spambayes] [ spambayes-Bugs-704921 ] "Train now" (outlook) fails 
Message-ID: <E18v7v7-0000Ls-00@sc8-sf-web3.sourceforge.net>

Bugs item #704921, was opened at 2003-03-17 22:14
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
>Summary: "Train now" (outlook) fails 

Initial Comment:
I updated to the latest CVS version - which has the option 
of re-scoring messages after training.

However, when clicking on "train now" in the main plugin 
dialog, the following traceback is caught.

The training dialog seems "dead" and does not react to 
the "train now" button.

traceback:

Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 70, in OnInitDialog
    self.UpdateStatus()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\TrainingDialog.py", line 103, in UpdateStatus
    if self.config.rescore:
AttributeError: _ConfigurationContainer instance has no attribute 'rescore'
win32ui: OnInitDialog() virtual handler (<bound method TrainingDialog.OnInitDialog of <dialogs.TrainingDialog.TrainingDialog instance at 0x02AAC1D8>>) raised an exception


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-18 14:37

Message:
Logged In: YES 
user_id=14198

Fixed in filter.py, rev 1.20.

The "dead dialog" problem seems a little deeper than this,
and affects all dialogs - I will open a new bug.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-18 11:17

Message:
Logged In: YES 
user_id=552329

r1.7 of config.py should fix this bug.  Please test if this works 
for you.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=704921&group_id=61702

From noreply at sourceforge.net  Mon Mar 17 19:39:47 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 22:37:28 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-705378 ] Cancelled "full train" leaves bad
	database.
Message-ID: <E18v7x5-0000Of-00@sc8-sf-web3.sourceforge.net>

Bugs item #705378, was opened at 2003-03-18 14:39
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705378&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Cancelled "full train" leaves bad database.

Initial Comment:
If you go to the training dialog, select "rebuild
entire database", start the train, then cancel it, the
database is left in a useless state.

We should probably train to a new database, then move
it over once complete.
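
A minimal sketch of that approach, assuming the database is a single file that can be written in one go. rebuild_database() and its build callback are hypothetical names, and pickle is used here only as a placeholder storage format.

```python
import os
import pickle
import tempfile

def rebuild_database(path, build):
    """Write a freshly built database next to path, then swap it in.

    build() returns the new database contents and may raise (for
    example when the user cancels). The existing file at path is
    only replaced once the new one has been written completely.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wb') as f:
            pickle.dump(build(), f)
        os.replace(tmp_path, path)
    except BaseException:
        # Training failed or was cancelled: discard the partial file
        # and leave the old database untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Creating the temporary file in the same directory keeps the final rename on one filesystem.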

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705378&group_id=61702

From noreply at sourceforge.net  Mon Mar 17 19:43:10 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 17 22:37:36 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-705379 ] Outlook dialogs sometimes become
	unresponsive
Message-ID: <E18v80M-0005Na-00@sc8-sf-web1.sourceforge.net>

Bugs item #705379, was opened at 2003-03-18 14:43
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705379&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook dialogs sometimes become unresponsive

Initial Comment:
The training and filtering dialogs sometimes become
unresponsive during filtering/training.  They
shouldn't, as hoops are jumped through to keep the UI
and worker in separate threads.  Further, it only seems
to happen on "large" folders - eg, I can provoke it on
my Inbox, but not on smaller folders.  I'm guessing
some bullshit COM/Outlook thread rule I am breaking.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=705379&group_id=61702

From mhammond at skippinet.com.au  Tue Mar 18 19:03:08 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Mar 18 03:04:12 2003
Subject: [Spambayes] New Outlook binary available
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEENOHAA.mhammond@skippinet.com.au>

I have made a new Outlook installer binary on my starship page -
http://starship.python.net/crew/mhammond/spambayes/  (Should I be putting
these on the main spambayes page, even though they aren't official releases?
I'm happy to!)

This version fixes a lot of problems in the first version - both problems
that exist in the source-code version, and installer-specific problems.  We
have better docs aimed more at the first time user, output is redirected to
a log file, the apply() warnings have gone, etc.

If you are running the old version, please uninstall and try the new one -
the uninstall will not delete your databases.

Thanks,

Mark.


From spambayes at djl.freeuk.com  Tue Mar 18 15:06:17 2003
From: spambayes at djl.freeuk.com (David Leftley)
Date: Tue Mar 18 10:06:24 2003
Subject: [Spambayes] Why was this e-mail's body ignored?
Message-ID: <3vce7vclqjg0f9q0b93e3058b9n9eum4mr@4ax.com>

I was surprised to see the message below appear towards the lower end
of my "possible spam" range - but looking at the breakdown of how the
message was classified, it turns out that spambayes is ignoring the
entire message body.

What is it about this message that makes spambayes think it has no
relevant content? Is it simply that we don't try to handle
multipart/alternative messages?
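
For reference, a multipart/alternative message can be walked with the email package to reach its text parts. The sketch below is not the SpamBayes tokenizer and does not explain why the body was skipped here; the sample message and the text_parts() helper are made up for illustration.

```python
from email import message_from_string

# Made-up message with the same overall structure as the one below.
RAW = (
    'MIME-Version: 1.0\n'
    'Content-Type: multipart/alternative; boundary="XYZ"\n'
    '\n'
    'This is a multi-part message in MIME format.\n'
    '\n'
    '--XYZ\n'
    'Content-Type: text/plain\n'
    'Content-Transfer-Encoding: quoted-printable\n'
    '\n'
    'Money isn=27t everything\n'
    '--XYZ--\n'
)

def text_parts(msg):
    """Return the decoded payload of every text/* part in msg."""
    texts = []
    for part in msg.walk():
        if part.get_content_maintype() == 'text':
            payload = part.get_payload(decode=True)  # undoes quoted-printable
            charset = part.get_content_charset() or 'ascii'
            texts.append(payload.decode(charset, 'replace'))
    return texts

parts = text_parts(message_from_string(RAW))
```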

David.


>Return-path: <ijooksk883q@juno.com>
>Delivery-date: Tue, 18 Mar 2003 14:51:57 +0000
>Received: from hypnos.uk.clara.net ([213.253.16.103])
>	by chaos.uk.clara.net with esmtp (Exim 4.12)
>	id 18vIRZ-0002zI-00; Tue, 18 Mar 2003 14:51:57 +0000
>Received: from [200.86.159.217] (helo=213.253.16.103)
>	by hypnos.uk.clara.net with smtp (Exim 3.33 #2)
>	id 18vINe-000O5C-00; Tue, 18 Mar 2003 14:47:55 +0000
>Received: from 1v5.tbom9.net [45.100.189.59] by 213.253.16.103; Tue, 18 Mar 2003 18:39:48 +0400
>Message-ID: <g4n4$$$ftzx3@jerpy>
>From: "Leigh Skaggs" <ijooksk883q@juno.com>
>To: <addresses@removed>
>Subject: Who said money won't get you laid?
>Date: Tue, 18 Mar 03 18:39:48 GMT
>X-Priority: 3
>X-MSMail-Priority: Normal
>X-Mailer: The Bat! (v1.52f) Business
>MIME-Version: 1.0
>Content-Type: multipart/alternative;
>	boundary="1._1C_DFD.9.C4AFD"
>X-RBL-Warning: (bl.spamcop.net) Blocked - see http://spamcop.net/bl.shtml?200.86.159.217
>X-UIDL: 1047999119.11657.chaos.uk.clara.net
>X-RCPT: djl
>Status: U 
>X-Spambayes-Classification: unsure
>X-Spambayes-Spam-Probability: 0.235233369482
>
>This is a multi-part message in MIME format.
>
>--1._1C_DFD.9.C4AFD
>Content-Type: text/plain
>Content-Transfer-Encoding: quoted-printable
>
>Money isn't everything...or so you were told right? 
>Well we bet you that it is! Take a look at these girls 
>who would do absolutely ANYTHING to win over a self made billionaire!
>
>http://www.hotxxxpass.net/pass2/
>
>These hardworking girls think it's their lucky day because Max 
>the billionaire suitor has fallen upon them and their wonderful 
>talents! Watch them show themselves off to impress Max!
>
>You won't believe what these girls will do to get a piece of Max's pie!
>
>http://www.hotxxxpass.net/pass2/
>jpbbvp hn
> mf
> rcjoz
>--1._1C_DFD.9.C4AFD--
>
>

Spam clues for this message:
*H* 0.0310695580167 
*S* 0.968984419484 
subject:? 0.155172413793 
content-type:multipart/alternative 0.706000895656 
subject:' 0.738095238095 
subject:get 0.844827586207 
subject:money 0.844827586207 
subject:you 0.934782608696 
to:2**2 0.969798657718 


From MMARTINEZ at intranet.reeusda.gov  Tue Mar 18 10:34:21 2003
From: MMARTINEZ at intranet.reeusda.gov (Martinez, Michael - CSREES/ISTM)
Date: Tue Mar 18 10:48:07 2003
Subject: [Spambayes] 
	Suggestion: interface to qmail/qmail-scanner smtp gateway
Message-ID: <E8E5E0D3B5C9D611B23500C00D00E9BC303720@CSREESSERVER>

It'd be great if you could write a small, lightweight interface to
qmail-scanner/qmail. Something like what "spamd/spamc" is for
"SpamAssassin."

I would be running spambayes on my smtp gateway right now, except that no
one has written the interface.

Martinez, Michael
CSREES/ISTM/USDA
(202) 720-6223 

From acunningham at rsasecurity.com  Tue Mar 18 17:16:18 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Tue Mar 18 12:11:26 2003
Subject: [Spambayes] Outlook 2002
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F1709@exuk01>

Mark

Thanks for the fast response.  I actually managed to beat one of the nightly
builds (2003-03-17) into working after I'd sent this, but I will update to
your new code tomorrow and see if that works as well.

AndyC 

-----Original Message-----
From: Mark Hammond [mailto:mhammond@skippinet.com.au] 
Sent: 17 March 2003 21:48
To: Cunningham, Andy; spambayes@python.org
Subject: RE: [Spambayes] Outlook 2002


>     msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
> pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, 
> None)

The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good!

I have checked in a change so that any errors when walking the folder tree
are ignored.  However, this same error is going to happen, so that part of
your folder tree will *not* appear in the dialog.  Hopefully only a small
part of your tree is corrupt, so the folders you want will still be there -
you will have to try it and see.

Mark.

From noreply at sourceforge.net  Tue Mar 18 08:24:36 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 18 12:27:07 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-695632 ] MySQL Digest Causes Spambayes to Crash
Message-ID: <E18vJtE-000482-00@sc8-sf-web3.sourceforge.net>

Bugs item #695632, was opened at 2003-03-01 10:48
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Richard Scott (rich1)
Assigned to: Nobody/Anonymous (nobody)
Summary: MySQL Digest Causes Spambayes to Crash

Initial Comment:
The main mysql e-mail list (digest version) and the
mysql bugs e-mail list (digest version) always cause
Spambayes to crash.  It appears that the error occurs
in Generator.py.
Here is the output:

Training ham (/home/richard/Mail/inbox):
  Reading as MH mailbox
  /home/richard/Mail/inbox/2        
  /home/richard/Mail/inbox/5        
  /home/richard/Mail/inbox/6        
  /home/richard/Mail/inbox/724        
  /home/richard/Mail/inbox/29        
  /home/richard/Mail/inbox/751        
Traceback (most recent call last):
  File "/home/richard/spambayes/mboxtrain.py", line 278, in ?
    main()
  File "/home/richard/spambayes/mboxtrain.py", line 265, in main
    train(h, g, False, force)
  File "/home/richard/spambayes/mboxtrain.py", line 207, in train
    mhdir_train(h, path, is_spam, force)
  File "/home/richard/spambayes/mboxtrain.py", line 190, in mhdir_train
    f.write(msg.as_string())
  File "/usr/lib/python2.2/site-packages/email/Message.py", line 107, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten
    self._write(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write
    self._dispatch(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch
    meth(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 243, in _handle_multipart
    g.flatten(part, unixfrom=False)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten
    self._write(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write
    self._dispatch(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch
    meth(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 212, in _handle_text
    raise TypeError, 'string payload expected: %s' % type(payload)
TypeError: string payload expected: <type 'list'>
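
The traceback shows _handle_text being handed a list payload, i.e. a part whose Content-Type says text but whose payload holds sub-messages. A minimal sketch for spotting such parts before calling as_string() might look like the following. It is written against the current standard-library email package rather than the Python 2.2 one in the report, mismatched_parts is a hypothetical helper name, and the digest message is built by hand purely for illustration.

```python
from email.message import Message


def mismatched_parts(msg):
    """Return parts holding a list payload under a non-container type."""
    bad = []
    for part in msg.walk():
        is_container = part.get_content_maintype() in ("multipart", "message")
        if part.is_multipart() and not is_container:
            bad.append(part)
    return bad


# Hand-built example: a text/plain message that nevertheless carries
# an attached sub-message, so its payload is a list.
digest = Message()
digest["Content-Type"] = "text/plain"
entry = Message()
entry.set_payload("one digest entry\n")
digest.attach(entry)

suspects = mismatched_parts(digest)
```

A training script could skip or log the messages this returns instead of crashing part-way through a mailbox.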


----------------------------------------------------------------------

Comment By: Chuck Bearden (cfbearden)
Date: 2003-03-18 10:24

Message:
Logged In: YES 
user_id=499555

I am experiencing the same problem with the axkit digest and also with the
monthly log files for a LISTSERV list that I run.  Perhaps it's the presence
of so many From: To: Subject: Received: etc. lines within one email message.

I can fix this problem for myself by inserting a procmail recipe for the
digests before the spambayes recipes, but I'm not sure how well that approach
will scale to the 100+ folks I'd like to deploy this for.  Also, it could
cause problems for POP proxy users, since I don't see how they can prevent
their digest traffic from being considered by the spambayes filters on the
proxy.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

From noreply at sourceforge.net  Tue Mar 18 10:48:56 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Tue Mar 18 14:00:32 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-695632 ] MySQL Digest Causes Spambayes to Crash
Message-ID: <E18vM8u-0008I3-00@sc8-sf-web1.sourceforge.net>

Bugs item #695632, was opened at 2003-03-01 10:48
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Richard Scott (rich1)
Assigned to: Nobody/Anonymous (nobody)
Summary: MySQL Digest Causes Spambayes to Crash

Initial Comment:
The main mysql e-mail list (digest version) and the
mysql bugs e-mail list (digest version) always cause
Spambayes to crash.  It appears that the error occurs
in Generator.py.
Here is the output:

Training ham (/home/richard/Mail/inbox):
  Reading as MH mailbox
  /home/richard/Mail/inbox/2        
  /home/richard/Mail/inbox/5        
  /home/richard/Mail/inbox/6        
  /home/richard/Mail/inbox/724        
  /home/richard/Mail/inbox/29        
  /home/richard/Mail/inbox/751        
Traceback (most recent call last):
  File "/home/richard/spambayes/mboxtrain.py", line 278, in ?
    main()
  File "/home/richard/spambayes/mboxtrain.py", line 265, in main
    train(h, g, False, force)
  File "/home/richard/spambayes/mboxtrain.py", line 207, in train
    mhdir_train(h, path, is_spam, force)
  File "/home/richard/spambayes/mboxtrain.py", line 190, in mhdir_train
    f.write(msg.as_string())
  File "/usr/lib/python2.2/site-packages/email/Message.py", line 107, in as_string
    g.flatten(self, unixfrom=unixfrom)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten
    self._write(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write
    self._dispatch(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch
    meth(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 243, in _handle_multipart
    g.flatten(part, unixfrom=False)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 100, in flatten
    self._write(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 128, in _write
    self._dispatch(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 154, in _dispatch
    meth(msg)
  File "/usr/lib/python2.2/site-packages/email/Generator.py", line 212, in _handle_text
    raise TypeError, 'string payload expected: %s' % type(payload)
TypeError: string payload expected: <type 'list'>


----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-18 12:48

Message:
Logged In: YES 
user_id=645698

I believe this problem has been resolved for pop3proxy.

----------------------------------------------------------------------

Comment By: Chuck Bearden (cfbearden)
Date: 2003-03-18 10:24

Message:
Logged In: YES 
user_id=499555

I am experiencing the same problem with the axkit digest and also with the
monthly log files for a LISTSERV list that I run.  Perhaps it's the presence
of so many From: To: Subject: Received: etc. lines within one email message.

I can fix this problem for myself by inserting a procmail recipe for the
digests before the spambayes recipes, but I'm not sure how well that approach
will scale to the 100+ folks I'd like to deploy this for.  Also, it could
cause problems for POP proxy users, since I don't see how they can prevent
their digest traffic from being considered by the spambayes filters on the
proxy.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=695632&group_id=61702

From tim at fourstonesExpressions.com  Tue Mar 18 13:36:34 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 18 14:36:39 2003
Subject: [Spambayes] setup
In-Reply-To: <F0CFA9BF0DAD3B4FAAD47A49793452A75664AF@s-coengl-e06>
Message-ID: <2VWNUQ09SP959809QLPK72DCD0XJD.3e777542@myst>

3/18/2003 1:29:15 PM, "Vaughan, David" <David.Vaughan@trizetto.com> wrote:

>
>	It's not supposed to be this hard :-)
>
>	I'll keep trying but presently am unable to set up POP3.  I get the
>message "Connection to server imap.mail.netcenter.com timed out." but can
>not find in the Netscape 7.02 preferences where to set the server name.

pop3proxy does not support imap servers at this time.  For that matter, there 
isn't any imap support in spambayes at this point in time... :(

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org


From ducky at webfoot.com  Tue Mar 18 15:43:52 2003
From: ducky at webfoot.com (Kaitlin Duck Sherwood)
Date: Tue Mar 18 18:38:34 2003
Subject: [Spambayes] OT: email developer posting at OSAF
Message-ID: <p05100303ba9d5c1314e3@[10.0.0.2]>

(Hi, sorry for being a bit off-topic, but this seems like an 
outstanding place to look for people who know about open source 
projects, email, and python.)

We at the Open Source Applications Foundation are looking for an 
experienced and self-motivated person to join our development team in 
the San Francisco area.  The ideal person would have a knowledge of 
e-mail protocols and standards as well as experience producing 
end-user software.  User interface experience is very valuable.

We have just posted this job to http://www.osafoundation.org/employment.htm

Interested parties can send information to jobs@osafoundation.org.


(Note that this is my "home" email account, not my OSAF account, so 
don't reply to this account.)

From T.A.Meyer at massey.ac.nz  Wed Mar 19 12:14:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 19:15:16 2003
Subject: [Spambayes] New Outlook binary available
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C8FD@its-xchg4.massey.ac.nz>

> I have made a new Outlook installer binary on my starship page -      
> http://starship.python.net/crew/mhammond/spambayes/ (Should           
> I be putting these on the main spambayes page, even though they       
> aren't official releases? I'm happy to!)                              

IMO, yes (I don't see how they are any less official than the alpha1
and alpha2 that have gone out).

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 19 12:38:44 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 19:39:19 2003
Subject: [Spambayes] setup
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C900@its-xchg4.massey.ac.nz>

> pop3proxy does not support imap servers at this time. For             
> that matter, there isn't any imap support in spambayes                
> at this point in time... :(                                           

How hard would this be to address? I've read all the past messages about
how complex IMAP is, but what about just hooking into spambayes the way
pop3proxy does? (Obviously the ui, which will probably be a separate module
someday anyway, could be reused.)

Is there anyone with an IMAP account who would be willing to work on this?
Apart from the impossible case of webmail, IMAP users seem to be the last
big group that can't use spambayes.
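
As a very rough sketch of what a minimal hook could look like (not existing spambayes code): fetch unseen messages, score each one, and copy the spammy ones to another folder. The select/search/fetch/copy/store calls are the standard-library imaplib ones; filter_unseen, the score callback, the folder name and the cutoff are assumptions for illustration, and FakeIMAP is only a stand-in so the sketch can be tried without a server.

```python
import email


def filter_unseen(conn, score, spam_folder="Spam", cutoff=0.9):
    """Copy unseen messages scoring at or above cutoff to spam_folder."""
    conn.select("INBOX")
    typ, data = conn.search(None, "UNSEEN")
    moved = []
    for num in data[0].split():
        # BODY.PEEK[] fetches the message without marking it as seen.
        typ, parts = conn.fetch(num, "(BODY.PEEK[])")
        msg = email.message_from_bytes(parts[0][1])
        if score(msg) >= cutoff:
            conn.copy(num, spam_folder)
            conn.store(num, "+FLAGS", "\\Deleted")
            moved.append(num)
    return moved


class FakeIMAP:
    """Stand-in exposing the few imaplib.IMAP4 methods used above."""

    def __init__(self, messages):
        self.messages = messages
        self.copied = []

    def select(self, mailbox):
        return "OK", [str(len(self.messages)).encode()]

    def search(self, charset, criterion):
        return "OK", [b" ".join(self.messages)]

    def fetch(self, num, parts):
        return "OK", [(num + b" (BODY[]", self.messages[num]), b")"]

    def copy(self, num, folder):
        self.copied.append((num, folder))
        return "OK", [None]

    def store(self, num, command, flags):
        return "OK", [None]


def toy_score(msg):
    """Hypothetical scorer; a real one would ask the trained classifier."""
    return 1.0 if "remedy" in msg.get("Subject", "").lower() else 0.0


fake = FakeIMAP({
    b"1": b"Subject: lunch?\r\n\r\nNoon works.\r\n",
    b"2": b"Subject: Spam Remedy v1.5\r\n\r\nBuy now.\r\n",
})
moved = filter_unseen(fake, toy_score)
```

With a real server, conn would be an imaplib.IMAP4 (or IMAP4_SSL) object after login. Training and the web ui are left out entirely.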

=Tony Meyer

From david at theresistance.net  Tue Mar 18 21:00:49 2003
From: david at theresistance.net (David Shaw)
Date: Tue Mar 18 21:02:21 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <418A63CAEBF2D4118A1A00508BB1A0B8029F1709@exuk01>
Message-ID: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net>

I got the spam below today.  Spambayes said it was ham.  I trained it  
as spam.  I reclassified.  Spambayes still said it was ham.  I had to  
classify it as spam 6 times before it would recognize it as such.  I  
think this list makes spam about antispam software get by 100% of the  
time (this list comprises over half of my daily ham).

----
 From stiffypop17291@mindless.com Tue Mar 18 14:48:05 2003
Return-Path: <stiffypop17291@mindless.com>
Received: from ZZ (67.97.202.131) by theresistance.net with SMTP (Eudora
  Internet Mail Server 3.0.3) for <david@theresistance.net>;
  Tue, 18 Mar 2003 14:33:31 -0500
To: David <david@theresistance.net>
From: stiffypop17291@mindless.com
Reply-To: tallstranger00897@engineer.com
Sender: jeffrey_dunlap3581@paris.com
X-Mailer: OutLook Express IMO, 59
Subject: David, Intelligent antispam IER software
MIME-Version: 1.0
Content-type: text/html
Content-Transfer-Encoding: 8bit
Date: Tue, 18 Mar 2003 14:33:31 -0500
Message-ID: <1164106485-1165210066@theresistance.net>

<HTML><HEAD><TITLE>Spam Remedy</TITLE>

<BODY>
<CENTER>
<TABLE cellSpacing=0 cellPadding=0 width=540>
   <TBODY>
   <TR>
     <TD width=538 bgColor=#ffaa99 colSpan=2><FONT size=2><I><FONT
       style="FILTER: dropshadow(color=#336699, offx=3, offy=4,  
positive=1); WIDTH: 550px; COLOR: #ffffff; FONT-FAMILY: Arial Black;  
POSITION: relative">TheVeryBest
       - Software Downloads</FONT></I></FONT></TD></TR>
   <TR>
     <TD align=middle width=538 bgColor=#0066cc colSpan=2><SPAN
       style="WIDTH: 60px">&nbsp;</SPAN><FONT face=verdana
       color=#ffffff><B>Top-Rank Software Download Site on the
       Internet</B></FONT><SPAN style="WIDTH: 60px">&nbsp;</SPAN> <FONT
       face=verdana color=#ffffff size=1>
       <SCRIPT language=JavaScript>
<!--
  var days = new Array("","Sun","Mon","Tue","Wed","Thu","Fri","Sat");
  var months = new  
Array("","Jan","Feb","Mar","Apr","May","June","July","Aug","Sep","Oct"," 
Nov","Dec");
  var dateObj = new Date()
  var day = days[dateObj.getDay() + 1]
  var month = months[dateObj.getMonth() + 1]
  var date = dateObj.getDate()
  document.write(day + ", " + month + " " + date)
  // -->
</SCRIPT>
       </FONT></TD></TR>
   <TR>
     <TD vAlign=top width=748>
       <TABLE width=748>
         <TBODY>
         <TR>
           <TD width=740 colSpan=2><FONT face=arial size=2><A
              
href="http://www.Siliconeparadise.com/remedy/ 
index.html?Utw2EJz3u7">Internet</A>-&gt;<A
              
href="http://www.Siliconeparadise.com/remedy/ 
index.html?hWT14FrUkz">Email</A>-&gt;<A
              
href="http://www.Siliconeparadise.com/remedy/index.html?atGiTvEygc">Spam
             Remedy v1.5 PRO</A></FONT></TD></TR>
         <TR>
           <TD width=534><BR><FONT face=arial size=2><IMG height=32
             src="http://siliconeparadise.com/ads/logo.gif" width=32  
border=0></FONT><FONT
             face=arial color=#00aa66 size=4>Spam Remedy</FONT><FONT  
face=arial
             size=2>&nbsp;&nbsp;&nbsp;&nbsp;<A
              
href="http://www.Siliconeparadise.com/remedy/index.html?20voRZ0tTk"><IMG
             height=19 src="http://siliconeparadise.com/ads/buy.gif"  
width=55
             border=0></A>&nbsp;&nbsp;&nbsp;&nbsp;<A
              
href="http://www.Siliconeparadise.com/remedy/index.html?g07NR2iaLv"><IMG
             height=19  
src="http://siliconeparadise.com/ads/download.gif" width=55
             border=0></A>(3.17MB)<BR>
             <HR color=#236801 SIZE=1>
             </FONT></TD></TR>
         <TR>
           <TD width=534><A
              
href="http://www.Siliconeparadise.com/remedy/index.html?zLD6WmqqLx"><IMG
             height=236 src="http://siliconeparadise.com/ads/screen.gif"  
width=270 align=right
             border=0></A><FONT face=arial size=2><B>Description:</B>
             <BR><BR><B>The powerful, effective and intelligent anti-spam
             tool.<BR>It automatically cleans spam messages out of your  
mailbox
             before you receive or read them. </B><BR><BR>Features:<BR>
             <UL>
               <LI>Automatically Blocking Spam<BR>Spam Remedy  
automatically
               checks your mail boxes and <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?tEnNU7EJfm">filters</A>
               unwanted, dangerous, or offensive mail messages to save  
your time
               from manually detecting and organizing mail messages.
               <LI>Effectively Spam <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?WkeU0WL9Xn">Detecting</A><BR>A
               complex <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?vm1YiMbA4M">Aritificial
               Intelligence</A> algorithm has been used in Spam Remedy  
product to
               detecting legitimate mail messages and spam messages,the  
technique
               has more precision than other filter-based and  
keyword-based <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?Yntd22bxDh">anti-spam
               technologies</A>.
               <LI>Be Sure You Get Your Right Mail Messages<BR>Spam  
Remedy
               doesn't confirm a spam message by a single keyword in mail
               content. It <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?KprOaiCLip">examines</A>
               the entire message - source, headers and mail content to  
confirm
               whether it is a spam message.
               <LI>Supports Multiple Email Types and Almost All Email  
Clients
               <BR>Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and  
MAPI email
               accounts,Directly works with almost all email  
clients(Outlook
               Express, Becky Mail,Foxmail,Outlook, The bat!, Eudora  
etc.),
               espacially includes support for web-based Hotmail/MSN  
email
               clients. Nothing you need to change to your email clients!
               <LI><A
                
href="http://www.Siliconeparadise.com/remedy/index.html?0EDRrMwueq">Easy
               to use</A>&nbsp; - You don't need to set any complex  
filter rules,
               just add your email accounts to Spam Remedy and then it  
works.
               <LI>Friends List and <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?73Fg6x97UC">Rejecting
               List</A><BR>With Friends List and <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?evRxbRHEuP">Rejecting
               List,</A>you have the chance to decide who are never  
blocked or
               directly treat their mail messages as spam.
               <LI>Keep your inbox <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?Ukjn4C2pW6">clean</A><BR>Spam
               Remedy places all intercepted spam messages to its i<A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?HU0c0WA0Uk">nterval
               mail database</A> so that your inbox remains uncluttered  
and free
               of spam.If for some reason a legitimate email is flagged  
as spam,
               you can <A
                
href="http://www.Siliconeparadise.com/remedy/ 
index.html?rZHbmIUUdf">easily
               recover</A> in multiple ways. <BR><BR>Editor's Rating:<A
                
href="http://www.Siliconeparadise.com/remedy/index.html?AYqKdkd5B0"><IMG
               height=18  
src="http://siliconeparadise.com/ads/4horse.gif" width=100 border=0>
               </A></FONT></LI></UL></TD></TR></TBODY></TABLE></TD></TR>
   <TR>
     <TD align=middle width=538 bgColor=#0066cc colSpan=2 height=20><FONT
       face=verdana size=1>Copyright ?2002-2003 <A
        
href="http://www.Siliconeparadise.com/remedy/ 
index.html?29KRcCFDrp">DarkSoft
       Group</A>&nbsp; All Rights Reserved.
</FONT></TD></TR></TBODY></TABLE></CENTER></BODY></HTML>


From mhammond at skippinet.com.au  Wed Mar 19 13:09:53 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Mar 18 21:10:59 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIEIHOHAA.mhammond@skippinet.com.au>

Can you mail me the "spam clues" for one such message?  Although the
behaviour you describe is possibly correct, I would like to make sure we are
seeing all the payload etc.

Mark.

> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of David Shaw
> Sent: Wednesday, 19 March 2003 1:01 PM
> To: spambayes@python.org
> Subject: Re: [Spambayes] Outlook 2002
>
>
> I got the spam below today.  Spambayes said it was ham.  I trained it
> as spam.  I reclassified.  Spambayes still said it was ham.  I had to
> classify it as spam 6 times before it would recognize it as such.  I
> think this list makes spam about antispam software get by 100% of the
> time (this list comprises over half of my daily ham).
>
> ----
>  From stiffypop17291@mindless.com Tue Mar 18 14:48:05 2003
> Return-Path: <stiffypop17291@mindless.com>
> Received: from ZZ (67.97.202.131) by theresistance.net with SMTP (Eudora
>   Internet Mail Server 3.0.3) for <david@theresistance.net>;
>   Tue, 18 Mar 2003 14:33:31 -0500
> To: David <david@theresistance.net>
> From: stiffypop17291@mindless.com
> Reply-To: tallstranger00897@engineer.com
> Sender: jeffrey_dunlap3581@paris.com
> X-Mailer: OutLook Express IMO, 59
> Subject: David, Intelligent antispam IER software
> MIME-Version: 1.0
> Content-type: text/html
> Content-Transfer-Encoding: 8bit
> Date: Tue, 18 Mar 2003 14:33:31 -0500
> Message-ID: <1164106485-1165210066@theresistance.net>
>
> <HTML><HEAD><TITLE>Spam Remedy</TITLE>
>
> <BODY>
> <CENTER>
> <TABLE cellSpacing=0 cellPadding=0 width=540>
>    <TBODY>
>    <TR>
>      <TD width=538 bgColor=#ffaa99 colSpan=2><FONT size=2><I><FONT
>        style="FILTER: dropshadow(color=#336699, offx=3, offy=4,
> positive=1); WIDTH: 550px; COLOR: #ffffff; FONT-FAMILY: Arial Black;
> POSITION: relative">TheVeryBest
>        - Software Downloads</FONT></I></FONT></TD></TR>
>    <TR>
>      <TD align=middle width=538 bgColor=#0066cc colSpan=2><SPAN
>        style="WIDTH: 60px">&nbsp;</SPAN><FONT face=verdana
>        color=#ffffff><B>Top-Rank Software Download Site on the
>        Internet</B></FONT><SPAN style="WIDTH: 60px">&nbsp;</SPAN> <FONT
>        face=verdana color=#ffffff size=1>
>        <SCRIPT language=JavaScript>
> <!--
>   var days = new Array("","Sun","Mon","Tue","Wed","Thu","Fri","Sat");
>   var months = new
> Array("","Jan","Feb","Mar","Apr","May","June","July","Aug","Sep","Oct","
> Nov","Dec");
>   var dateObj = new Date()
>   var day = days[dateObj.getDay() + 1]
>   var month = months[dateObj.getMonth() + 1]
>   var date = dateObj.getDate()
>   document.write(day + ", " + month + " " + date)
>   // -->
> </SCRIPT>
>        </FONT></TD></TR>
>    <TR>
>      <TD vAlign=top width=748>
>        <TABLE width=748>
>          <TBODY>
>          <TR>
>            <TD width=740 colSpan=2><FONT face=arial size=2><A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?Utw2EJz3u7">Internet</A>-&gt;<A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?hWT14FrUkz">Email</A>-&gt;<A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?atGiTvEygc">Spam
>              Remedy v1.5 PRO</A></FONT></TD></TR>
>          <TR>
>            <TD width=534><BR><FONT face=arial size=2><IMG height=32
>              src="http://siliconeparadise.com/ads/logo.gif" width=32
> border=0></FONT><FONT
>              face=arial color=#00aa66 size=4>Spam Remedy</FONT><FONT
> face=arial
>              size=2>&nbsp;&nbsp;&nbsp;&nbsp;<A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?20voRZ0tTk"><IMG
>              height=19 src="http://siliconeparadise.com/ads/buy.gif"
> width=55
>              border=0></A>&nbsp;&nbsp;&nbsp;&nbsp;<A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?g07NR2iaLv"><IMG
>              height=19
> src="http://siliconeparadise.com/ads/download.gif" width=55
>              border=0></A>(3.17MB)<BR>
>              <HR color=#236801 SIZE=1>
>              </FONT></TD></TR>
>          <TR>
>            <TD width=534><A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?zLD6WmqqLx"><IMG
>              height=236 src="http://siliconeparadise.com/ads/screen.gif"
> width=270 align=right
>              border=0></A><FONT face=arial size=2><B>Description:</B>
>              <BR><BR><B>The powerful, effective and intelligent anti-spam
>              tool.<BR>It automatically cleans spam messages out of your
> mailbox
>              before you receive or read them. </B><BR><BR>Features:<BR>
>              <UL>
>                <LI>Automatically Blocking Spam<BR>Spam Remedy
> automatically
>                checks your mail boxes and <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?tEnNU7EJfm">filters</A>
>                unwanted, dangerous, or offensive mail messages to save
> your time
>                from manually detecting and organizing mail messages.
>                <LI>Effectively Spam <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?WkeU0WL9Xn">Detecting</A><BR>A
>                complex <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?vm1YiMbA4M">Aritificial
>                Intelligence</A> algorithm has been used in Spam Remedy
> product to
>                detecting legitimate mail messages and spam messages,the
> technique
>                has more precision than other filter-based and
> keyword-based <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?Yntd22bxDh">anti-spam
>                technologies</A>.
>                <LI>Be Sure You Get Your Right Mail Messages<BR>Spam
> Remedy
>                doesn't confirm a spam message by a single keyword in mail
>                content. It <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?KprOaiCLip">examines</A>
>                the entire message - source, headers and mail content to
> confirm
>                whether it is a spam message.
>                <LI>Supports Multiple Email Types and Almost All Email
> Clients
>                <BR>Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and
> MAPI email
>                accounts,Directly works with almost all email
> clients(Outlook
>                Express, Becky Mail,Foxmail,Outlook, The bat!, Eudora
> etc.),
>                espacially includes support for web-based Hotmail/MSN
> email
>                clients. Nothing you need to change to your email clients!
>                <LI><A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?0EDRrMwueq">Easy
>                to use</A>&nbsp; - You don't need to set any complex
> filter rules,
>                just add your email accounts to Spam Remedy and then it
> works.
>                <LI>Friends List and <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?73Fg6x97UC">Rejecting
>                List</A><BR>With Friends List and <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?evRxbRHEuP">Rejecting
>                List,</A>you have the chance to decide who are never
> blocked or
>                directly treat their mail messages as spam.
>                <LI>Keep your inbox <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?Ukjn4C2pW6">clean</A><BR>Spam
>                Remedy places all intercepted spam messages to its i<A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?HU0c0WA0Uk">nterval
>                mail database</A> so that your inbox remains uncluttered
> and free
>                of spam.If for some reason a legitimate email is flagged
> as spam,
>                you can <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?rZHbmIUUdf">easily
>                recover</A> in multiple ways. <BR><BR>Editor's Rating:<A
>
> href="http://www.Siliconeparadise.com/remedy/index.html?AYqKdkd5B0"><IMG
>                height=18
> src="http://siliconeparadise.com/ads/4horse.gif" width=100 border=0>
>                </A></FONT></LI></UL></TD></TR></TBODY></TABLE></TD></TR>
>    <TR>
>      <TD align=middle width=538 bgColor=#0066cc colSpan=2 height=20><FONT
>        face=verdana size=1>Copyright ?2002-2003 <A
>
> href="http://www.Siliconeparadise.com/remedy/
> index.html?29KRcCFDrp">DarkSoft
>        Group</A>&nbsp; All Rights Reserved.
> </FONT></TD></TR></TBODY></TABLE></CENTER></BODY></HTML>


From david at theresistance.net  Tue Mar 18 22:19:41 2003
From: david at theresistance.net (David Shaw)
Date: Tue Mar 18 22:21:12 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIEIHOHAA.mhammond@skippinet.com.au>
Message-ID: <9FE05FE8-59B9-11D7-8825-000393582EF6@theresistance.net>

I think this is what you want.  I restored my backup of the hammie.db  
to before I trained on the message in question, and here's what I got:

*H*	1.0	
*S*	3.33066907388e-16	
headers	0.00257289879931	
spam.	0.00310559006211	
content.	0.00517836593786	
inbox	0.00585175552666	
source,	0.00634696755994	
remains	0.00850661625709	
recover	0.00884086444008	
technique	0.00884086444008	
mailbox	0.00959488272921	
boxes	0.0104895104895	
keyword	0.0115681233933	
anti-spam	0.0115681233933	
eudora	0.0115681233933	
description:	0.0136778115502	
clients.	0.0155709342561	
web-based	0.0167286245353	
algorithm	0.0167286245353	
express,	0.0167286245353	
features:	0.0196506550218	
rules,	0.0215311004785	
spam	0.0223177887819	
detecting	0.0238095238095	
precision	0.0266272189349	
etc.),	0.0266272189349	
reason	0.0289726436155	
offensive	0.0348837209302	
examines	0.0348837209302	
subject:software	0.0348837209302	
complex	0.038637702312	
spam,	0.0392551056175	
dangerous,	0.0412844036697	
pop3,	0.0412844036697	
messages	0.0430328965312	
filter	0.0465043869016	
subject:antispam	0.0505617977528	
subject:Intelligent	0.0505617977528	
subject:IER	0.0505617977528	
remedy	0.0505617977528	
doesn't	0.0585953340706	
message.	0.0640451247568	
editor's	0.0652173913043	
intercepted	0.0652173913043	
hotmail/msn	0.0652173913043	
top-rank	0.0652173913043	
espacially	0.0652173913043	
?2002-2003	0.0652173913043	
v1.5	0.0652173913043	
url:buy	0.0652173913043	
unwanted,	0.0652173913043	
uncluttered	0.0652173913043	
spam.if	0.0652173913043	
rejecting	0.0652173913043	
rating:	0.0652173913043	
messages,the	0.0652173913043	
mapi	0.0652173913043	
imap4	0.0652173913043	
hotmail/msn,	0.0652173913043	
filter-based	0.0652173913043	
(3.17mb)	0.0652173913043	
bat!,	0.0652173913043	
becky	0.0652173913043	
clients!	0.0652173913043	
checks	0.0710347118419	
filters	0.0803222637998	
clients	0.0855282287063	
supports	0.0871172200062	
aritificial	0.0918367346939	
theverybest	0.0918367346939	
darksoft	0.0918367346939	
sure	0.0992330392591	
almost	0.102559626457	
works.	0.111519301848	
clean	0.120768412967	
them.	0.121151958915	
single	0.122015711876	
works	0.125061861969	
add	0.126036514129	
tool.	0.132129948073	
skip:m 20	0.138677557916	
list	0.141137401539	
url:download	0.152090210781	
content	0.152386830957	
multiple	0.152995044376	
manually	0.153866187624	
set	0.155020414893	
url:screen	0.155172413793	
url:remedy	0.155172413793	
url:4horse	0.155172413793	
-&gt;	0.155172413793	
database	0.161594300822	
mail	0.166700111476	
blocked	0.18061971205	
used	0.184085816436	
entire	0.185547009874	
keep	0.192563678333	
whether	0.194129600717	
support	0.197733550434	
directly	0.201527083185	
powerful,	0.211649404564	
then	0.220440381137	
its	0.226466533203	
change	0.229694796147	
intelligent	0.23151590252	
ways.	0.236278160711	
some	0.242962413214	
types	0.243161455013	
read	0.247014415533	
nothing	0.257912566324	
confirm	0.273987887728	
places	0.273987887728	
use	0.275639337316	
blocking	0.278210951417	
need	0.283254780348	
accounts	0.287308467645	
group	0.288249835829	
don't	0.295123702457	
software	0.299203612621	
treat	0.299715643894	
has	0.299716644058	
pro	0.305490905645	
before	0.309520509917	
message	0.315550066847	
skip:k 10	0.320136815288	
flagged	0.321924580422	
download	0.323406630897	
right	0.324779887317	
that	0.333199816514	
easy	0.333347052849	
than	0.335342627679	
easily	0.339718352497	
been	0.345234674073	
url:siliconeparadise	0.655538429202	
url:www	0.65776955901	
free	0.712682333876	
from:no real name:2**0	0.729567209811	
url:html	0.774652055141	
header:Received:1	0.795755497715	
copyright	0.796002970816	
x-mailer:outlook express imo, 59	0.832612041334	
receive	0.834103482178	
rights	0.856909611825	
reserved.	0.887970350732	
url:logo	0.888597344878	
url:index	0.89243705346	
subject:David	0.916407352467	
virus:<script	0.939714334058	
virus:</script	0.950128292744	
content-type:text/html	0.966531304437	
url:gif	0.980286583576	
url:ads	0.983271375465
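
For anyone wondering how a handful of strong spam clues can lose to this many weak ham clues, here is a simplified sketch of chi-squared combining, which is what the *H* and *S* lines above summarise. It is an approximation for illustration only, not the project's actual classifier code, and the clue values in the example are made up to resemble the list above.

```python
import math


def chi2q(x2, v):
    """Chi-squared survival function for an even number v of degrees of freedom."""
    assert v % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (H, S, score) for a list of per-clue spam probabilities."""
    n = len(probs)
    if n == 0:
        return 0.0, 0.0, 0.5
    sum_ln_p = sum(math.log(p) for p in probs)
    sum_ln_q = sum(math.log(1.0 - p) for p in probs)
    S = 1.0 - chi2q(-2.0 * sum_ln_q, 2 * n)  # evidence for spam
    H = 1.0 - chi2q(-2.0 * sum_ln_p, 2 * n)  # evidence for ham
    return H, S, (S - H + 1.0) / 2.0


# Made-up clue set: many strongly hammy words, a few strongly spammy ones.
hammy_mix = [0.01] * 60 + [0.98] * 5
H, S, score = combine(hammy_mix)
```

With that mix H comes out near 1 and S near 0, so the score stays far below any spam cutoff. The ham-side words have to be pulled up by training before the spammy URL and header clues can tip the balance.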

On Tuesday, March 18, 2003, at 09:09  PM, Mark Hammond wrote:

> Can you mail me the "spam clues" for one such message?  Although the
> behaviour you describe is possibly correct, I would like to make sure  
> we are
> seeing all the payload etc.
>
> Mark.
>
>> -----Original Message-----
>> From: spambayes-bounces@python.org
>> [mailto:spambayes-bounces@python.org]On Behalf Of David Shaw
>> Sent: Wednesday, 19 March 2003 1:01 PM
>> To: spambayes@python.org
>> Subject: Re: [Spambayes] Outlook 2002
>>
>>
>> I got the spam below today.  Spambayes said it was ham.  I trained it
>> as spam.  I reclassified.  Spambayes still said it was ham.  I had to
>> classify it as spam 6 times before it would recognize it as such.  I
>> think this list makes spam about antispam software get by 100% of the
>> time (this list comprises over half of my daily ham).
>>
>> ----
>>  From stiffypop17291@mindless.com Tue Mar 18 14:48:05 2003
>> Return-Path: <stiffypop17291@mindless.com>
>> Received: from ZZ (67.97.202.131) by theresistance.net with SMTP  
>> (Eudora
>>   Internet Mail Server 3.0.3) for <david@theresistance.net>;
>>   Tue, 18 Mar 2003 14:33:31 -0500
>> To: David <david@theresistance.net>
>> From: stiffypop17291@mindless.com
>> Reply-To: tallstranger00897@engineer.com
>> Sender: jeffrey_dunlap3581@paris.com
>> X-Mailer: OutLook Express IMO, 59
>> Subject: David, Intelligent antispam IER software
>> MIME-Version: 1.0
>> Content-type: text/html
>> Content-Transfer-Encoding: 8bit
>> Date: Tue, 18 Mar 2003 14:33:31 -0500
>> Message-ID: <1164106485-1165210066@theresistance.net>
>>
>> <HTML><HEAD><TITLE>Spam Remedy</TITLE>
>>
>> <BODY>
>> <CENTER>
>> <TABLE cellSpacing=0 cellPadding=0 width=540>
>>    <TBODY>
>>    <TR>
>>      <TD width=538 bgColor=#ffaa99 colSpan=2><FONT size=2><I><FONT
>>        style="FILTER: dropshadow(color=#336699, offx=3, offy=4,
>> positive=1); WIDTH: 550px; COLOR: #ffffff; FONT-FAMILY: Arial Black;
>> POSITION: relative">TheVeryBest
>>        - Software Downloads</FONT></I></FONT></TD></TR>
>>    <TR>
>>      <TD align=middle width=538 bgColor=#0066cc colSpan=2><SPAN
>>        style="WIDTH: 60px">&nbsp;</SPAN><FONT face=verdana
>>        color=#ffffff><B>Top-Rank Software Download Site on the
>>        Internet</B></FONT><SPAN style="WIDTH: 60px">&nbsp;</SPAN>  
>> <FONT
>>        face=verdana color=#ffffff size=1>
>>        <SCRIPT language=JavaScript>
>> <!--
>>   var days = new Array("","Sun","Mon","Tue","Wed","Thu","Fri","Sat");
>>   var months = new
>> Array("","Jan","Feb","Mar","Apr","May","June","July","Aug","Sep","Oct" 
>> ,"
>> Nov","Dec");
>>   var dateObj = new Date()
>>   var day = days[dateObj.getDay() + 1]
>>   var month = months[dateObj.getMonth() + 1]
>>   var date = dateObj.getDate()
>>   document.write(day + ", " + month + " " + date)
>>   // -->
>> </SCRIPT>
>>        </FONT></TD></TR>
>>    <TR>
>>      <TD vAlign=top width=748>
>>        <TABLE width=748>
>>          <TBODY>
>>          <TR>
>>            <TD width=740 colSpan=2><FONT face=arial size=2><A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?Utw2EJz3u7">Internet</A>-&gt;<A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?hWT14FrUkz">Email</A>-&gt;<A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?atGiTvEygc">Spam
>>              Remedy v1.5 PRO</A></FONT></TD></TR>
>>          <TR>
>>            <TD width=534><BR><FONT face=arial size=2><IMG height=32
>>              src="http://siliconeparadise.com/ads/logo.gif" width=32
>> border=0></FONT><FONT
>>              face=arial color=#00aa66 size=4>Spam Remedy</FONT><FONT
>> face=arial
>>              size=2>&nbsp;&nbsp;&nbsp;&nbsp;<A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?20voRZ0tTk"><IMG
>>              height=19 src="http://siliconeparadise.com/ads/buy.gif"
>> width=55
>>              border=0></A>&nbsp;&nbsp;&nbsp;&nbsp;<A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?g07NR2iaLv"><IMG
>>              height=19
>> src="http://siliconeparadise.com/ads/download.gif" width=55
>>              border=0></A>(3.17MB)<BR>
>>              <HR color=#236801 SIZE=1>
>>              </FONT></TD></TR>
>>          <TR>
>>            <TD width=534><A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?zLD6WmqqLx"><IMG
>>              height=236  
>> src="http://siliconeparadise.com/ads/screen.gif"
>> width=270 align=right
>>              border=0></A><FONT face=arial size=2><B>Description:</B>
>>              <BR><BR><B>The powerful, effective and intelligent  
>> anti-spam
>>              tool.<BR>It automatically cleans spam messages out of  
>> your
>> mailbox
>>              before you receive or read them.  
>> </B><BR><BR>Features:<BR>
>>              <UL>
>>                <LI>Automatically Blocking Spam<BR>Spam Remedy
>> automatically
>>                checks your mail boxes and <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?tEnNU7EJfm">filters</A>
>>                unwanted, dangerous, or offensive mail messages to save
>> your time
>>                from manually detecting and organizing mail messages.
>>                <LI>Effectively Spam <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?WkeU0WL9Xn">Detecting</A><BR>A
>>                complex <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?vm1YiMbA4M">Aritificial
>>                Intelligence</A> algorithm has been used in Spam Remedy
>> product to
>>                detecting legitimate mail messages and spam  
>> messages,the
>> technique
>>                has more precision than other filter-based and
>> keyword-based <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?Yntd22bxDh">anti-spam
>>                technologies</A>.
>>                <LI>Be Sure You Get Your Right Mail Messages<BR>Spam
>> Remedy
>>                doesn't confirm a spam message by a single keyword in  
>> mail
>>                content. It <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?KprOaiCLip">examines</A>
>>                the entire message - source, headers and mail content  
>> to
>> confirm
>>                whether it is a spam message.
>>                <LI>Supports Multiple Email Types and Almost All Email
>> Clients
>>                <BR>Spam Remedy supports POP3, Hotmail/MSN, IMAP4 and
>> MAPI email
>>                accounts,Directly works with almost all email
>> clients(Outlook
>>                Express, Becky Mail,Foxmail,Outlook, The bat!, Eudora
>> etc.),
>>                espacially includes support for web-based Hotmail/MSN
>> email
>>                clients. Nothing you need to change to your email  
>> clients!
>>                <LI><A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?0EDRrMwueq">Easy
>>                to use</A>&nbsp; - You don't need to set any complex
>> filter rules,
>>                just add your email accounts to Spam Remedy and then it
>> works.
>>                <LI>Friends List and <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?73Fg6x97UC">Rejecting
>>                List</A><BR>With Friends List and <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?evRxbRHEuP">Rejecting
>>                List,</A>you have the chance to decide who are never
>> blocked or
>>                directly treat their mail messages as spam.
>>                <LI>Keep your inbox <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?Ukjn4C2pW6">clean</A><BR>Spam
>>                Remedy places all intercepted spam messages to its i<A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?HU0c0WA0Uk">nterval
>>                mail database</A> so that your inbox remains  
>> uncluttered
>> and free
>>                of spam.If for some reason a legitimate email is  
>> flagged
>> as spam,
>>                you can <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?rZHbmIUUdf">easily
>>                recover</A> in multiple ways. <BR><BR>Editor's  
>> Rating:<A
>>
>> href="http://www.Siliconeparadise.com/remedy/ 
>> index.html?AYqKdkd5B0"><IMG
>>                height=18
>> src="http://siliconeparadise.com/ads/4horse.gif" width=100 border=0>
>>                 
>> </A></FONT></LI></UL></TD></TR></TBODY></TABLE></TD></TR>
>>    <TR>
>>      <TD align=middle width=538 bgColor=#0066cc colSpan=2  
>> height=20><FONT
>>        face=verdana size=1>Copyright ?2002-2003 <A
>>
>> href="http://www.Siliconeparadise.com/remedy/
>> index.html?29KRcCFDrp">DarkSoft
>>        Group</A>&nbsp; All Rights Reserved.
>> </FONT></TD></TR></TBODY></TABLE></CENTER></BODY></HTML>
>>
>>
>>
>> _______________________________________________
>> Spambayes mailing list
>> Spambayes@python.org
>> http://mail.python.org/mailman/listinfo/spambayes
>>
>


From tim at fourstonesExpressions.com  Tue Mar 18 21:32:49 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 18 22:32:59 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <9B8C1314-59AE-11D7-8825-000393582EF6@theresistance.net>
Message-ID: <PND83Y6YXSNKELJ72PO3OQN7362RN.3e77e4e1@myst>

3/18/2003 8:00:49 PM, David Shaw <david@theresistance.net> wrote:

> I think this list makes spam about antispam software get by 100% of the  
>time (this list comprises over half of my daily ham).

Hmmmm... interesting.  Perhaps we should put whitelist rules in the system 
<wink>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't


From T.A.Meyer at massey.ac.nz  Wed Mar 19 15:56:09 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 22:56:59 2003
Subject: [Spambayes] Outlook 2002
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90A@its-xchg4.massey.ac.nz>

> (this list comprises over half of my daily ham).                      
> spam. 0.00310559006211                                                
> anti-spam 0.0115681233933                                             
> spam 0.0223177887819                                                  
> spam, 0.0392551056175                                                 
> subject:antispam 0.0505617977528                                      
> spam.if 0.0652173913043                                               
Perhaps you should train less on this list and more on the remaining    
ham you have? More ham isn't necessarily better, if the ham contains    
a lot of spam clues (as I understand it). Presumably these six clues    
(even the odd 'spam.if' clue) resulted from training on this list.      

> espacially 0.0652173913043                                            
> aritificial 0.0918367346939                                           
It seems quite strange to me that these two misspelled words score so   
low. Do you get a lot of ham that has poorly spelled words?             

> ?2002-2003 0.0652173913043                                            
This also seems strange; do you really have a lot of ham with this sort 
of copyright info?                                                      

> v1.5 0.0652173913043                                                  
Or this version number?                                                 

> url:buy 0.0652173913043                                               
Or email with 'buy' in an embedded URL?                                 

> theverybest 0.0918367346939                                           
> darksoft 0.0918367346939                                              
These seem even stranger. I didn't read the email, but Darksoft is the  
manufacturer, right? Any idea what ham contributed such low scores to   
these words?                                                            

It almost looks to me like you have a similar email in your ham
somewhere that was mistrained - I don't see how a lot of these clues
could result from this list.
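
For anyone wondering how a handful of very low clues can outweigh strong
spam clues such as url:ads or url:gif, here is a minimal sketch of
chi-squared combining, the scheme behind the classifier method that shows
up as chi2_spamprob in tracebacks on this list. It is a simplified
illustration, not the SpamBayes source: the names chi2q and combine and
the sample clue values are assumptions made up for the example.

```python
import math

def chi2q(x2, dof):
    """Probability that a chi-squared variable with even `dof` exceeds x2."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0, so clamp it.
    return min(total, 1.0)

def combine(probs):
    """Fold per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_one_minus_p = sum(math.log(1.0 - p) for p in probs)
    ln_p = sum(math.log(p) for p in probs)
    spam_evidence = 1.0 - chi2q(-2.0 * ln_one_minus_p, 2 * n)
    ham_evidence = 1.0 - chi2q(-2.0 * ln_p, 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0

# Made-up clue values: four ham-leaning tokens and two strong spam tokens.
mixed = [0.03, 0.05, 0.065, 0.09, 0.97, 0.98]
print(round(combine(mixed), 2))
```

With these made-up values the combined score stays below 0.5, because the
ham-side evidence from the four low clues is stronger than the spam-side
evidence from the two high ones.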

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 19 16:00:59 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 23:01:36 2003
Subject: [Spambayes] Spambayes installation problem
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90C@its-xchg4.massey.ac.nz>

[Geoff's problems installing the Outlook plugin]

> The new file installs correctly ... we have lift off   ... 
> thanks for your help

No worries, I'm glad it works now - and it's not that I actually did anything in the end!

I've crossed this to the list & Mark so that we know that it works now, and so if anyone has anything similar they know to try the new installer.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Wed Mar 19 16:03:59 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Tue Mar 18 23:04:34 2003
Subject: [Spambayes] Beta status checklist
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90D@its-xchg4.massey.ac.nz>

With a new version of the Outlook plugin released, and TimS close to
finishing up alpha3, I was wondering how close things were to beta.

I was wondering if we could come up with a list of 'to do's that the
consensus agreed needed to be implemented/fixed before we would consider
that spambayes was ready for a first beta release.

So, to start off:                                                       
* Much better documentation for the SMTP proxy training option.         

Anyone care to add to the list?

=Tony Meyer

From tim at fourstonesExpressions.com  Tue Mar 18 22:09:32 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 18 23:09:39 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90D@its-xchg4.massey.ac.nz>
Message-ID: <ML53NLPXWQMBA6H93NL2UOJGCJHQL.3e77ed7c@myst>

>So, to start off:                                                       
>* Much better documentation for the SMTP proxy training option. 
* Incorporation of integration.txt (and probably other text files) into the 
website, and maybe a review of the mailing list for faq type information        
* Installation with some level of migration from previous release.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't


From tim.one at comcast.net  Tue Mar 18 23:10:18 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Mar 18 23:11:44 2003
Subject: [Spambayes] Outlook 2002
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C90A@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIOEBAB.tim.one@comcast.net>

[Meyer, Tony]
> ...
> It almost looks to me like you have a similar email in your ham
> somewhere that was mistrained -

Or many.  The oddball clues have spamprobs too low to be due to hapaxes.
Other oddities:

messages,the	0.0652173913043
spam.if	0.0652173913043
theverybest	0.0918367346939
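
As a rough check on the hapax remark, here is a small sketch. It assumes
the classifier applies a Robinson-style Bayesian adjustment with an
unknown-word strength of 0.45 and an unknown-word probability of 0.5
(believed to be the defaults, but treat both as assumptions). The helper
name and the nspam/nham totals are made up for illustration.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham,
                      strength=0.45, unknown_prob=0.5):
    """Token spam probability, pulled toward unknown_prob when evidence is thin.

    Assumes the token was seen at least once (spamcount + hamcount > 0).
    """
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (strength * unknown_prob + n * prob) / (strength + n)

# A token never seen in spam, seen 1, 2 or 3 times in ham.
for hamcount in (1, 2, 3):
    print(hamcount, adjusted_spamprob(0, hamcount, nspam=500, nham=500))
```

Under those assumptions a ham-only token scores 0.225 / (0.45 + n). A true
hapax (n = 1) comes out at about 0.1552, which matches the 0.155172413793
clues in the list, while 0.0918367346939 and 0.0652173913043 correspond to
two and three ham sightings respectively.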


From tim_one at email.msn.com  Tue Mar 18 23:53:19 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Tue Mar 18 23:55:22 2003
Subject: [Spambayes] New Outlook binary available
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMEENOHAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMELAEBAB.tim_one@email.msn.com>

[Mark Hammond]
> I have made a new Outlook installer binary on my starship page -
> http://starship.python.net/crew/mhammond/spambayes/  (Should I be putting
> these on the main spambayes page, even though they aren't official
> releases?  I'm happy to!)

+1, if it increases visibility and/or distribution, and I expect it does
both to make the installer available from both.


From acunningham at rsasecurity.com  Wed Mar 19 10:22:14 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Wed Mar 19 05:17:02 2003
Subject: [Spambayes] Beta status checklist (or this turning into new feature requests?)
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F1722@exuk01>

I'd add the Outlook Binary installer as a part of the release.  I think this
is going to make a huge difference to takeup within the Windows world - it
will be a pre-requisite for any kind of corporate use.... 

What are people's thoughts on some kind of predefined training database
(like ham/spam terms that appear in more than x% of submitted training
databases)?

The other feature that I would personally like to see is the ability to
send an NDR when the message is identified as definitely being spam - what
are people's thoughts on this?

AndyC 



From tim at fourstonesExpressions.com  Wed Mar 19 06:53:32 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 07:53:42 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To: <ML53NLPXWQMBA6H93NL2UOJGCJHQL.3e77ed7c@myst>
Message-ID: <NKNJJF52ZX09YUHCB9RTQHFC0TR2Y6.3e78684c@myst>

3/18/2003 10:09:32 PM, Tim Stone - Four Stones Expressions 
<tim@fourstonesExpressions.com> wrote:

>>So, to start off:                                                       
>>* Much better documentation for the SMTP proxy training option. 
>* Incorporation of integration.txt (and probably other text files) into the 
>website, and maybe a review of the mailing list for faq type information        
>* Installation with some level of migration from previous release.
* Prerequisite checking for email and dbm modules (at least)


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't


From noreply at sourceforge.net  Wed Mar 19 02:03:48 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar 19 08:14:29 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-706170 ] Execute test suite fails in Outlook
Message-ID: <E18vaQG-0008My-00@sc8-sf-web1.sourceforge.net>

Bugs item #706170, was opened at 2003-03-19 11:03
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Execute test suite fails in Outlook

Initial Comment:
The test suite fails in outlook.

I've retrained messages from a spam and a ham folder.

I think this may be related to moving the database-files 
from the spambayes to the default docs-folders in 
windows a couple of weeks ago.

the following traceback is shown in PythonWin:

Executing automated tests...
Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 308, in Tester
    tester.test(manager)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 306, in test
    TestSpamFilter(driver)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 173, in TestSpamFilter
    msg, words = driver.CreateTestMessageInFolder(SPAM, driver.folder_watch)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 132, in CreateTestMessageInFolder
    msg, words = self.CreateTestMessage(spam_status)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 145, in CreateTestMessage
    words.update(FindTopWords(self.manager.bayes, 50, True))
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 64, in FindTopWords
    for word, info in extractor(bayes):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 46, in DBExtractor
    key = bayes.dbm.next()[0]
  File "C:\PROGRA~1\_DEV\Python22\Lib\site-packages\bsddb3\__init__.py", line 122, in next
    rv = self.dbc.next()
DBNotFoundError: (-30991, 'DB_NOTFOUND: No matching key/data pair found')
Tests FAILED.  Sorry about that.  If I were you, I would do a full re-train ASAP
Please delete any test messages from your Spam, Unsure or Inbox folders first.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702

From ilumb at platform.com  Wed Mar 19 11:09:29 2003
From: ilumb at platform.com (Ian Lumb)
Date: Wed Mar 19 11:10:46 2003
Subject: [Spambayes] Beta status checklist (or this turning into new
	feature requests?)
Message-ID: <4AB0624F069DAD4E90F18B13A818EEFE076690@catoexm04.noam.corp.platform.com>

Add the Outlook binary installer as part of the release?  Absolutely!

Pre-defined training db?  Not sure. Why? Many mail servers already use spam-filtering technologies. Currently, client-side spambayes complements what exists on the server-side. 

BTW, are there plans to develop server-side spambayes? (Apologies if this is a FAQ.) I know that it can eclipse what we are currently using on our Exchange server :-)

-Ian


From db3l at fitlinxx.com  Wed Mar 19 11:32:37 2003
From: db3l at fitlinxx.com (David Bolen)
Date: Wed Mar 19 12:20:24 2003
Subject: [Spambayes] Outlook addin delay in updating and resetting unread
Message-ID: <uy93biafu.fsf@fitlinxx.com>

I know this has come up in the past on this list, but I just installed
the most recent installer-based version of the Outlook add-in on a
co-worker's machine, and am seeing the problem pretty consistently,
whereas before it was more hit or miss (or so I thought).  Originally
I thought more so than on my own machine, but after paying more
attention, I now seem to be able to reproduce this consistently on my
machine too - not sure if this is easier with the installer-version
(which I just switched to trying, having been using the source release
up to now).

The behavior is that when a new message arrives, it shows as new in
Outlook (this is Outlook with a corporate Exchange server), but the
spam column is not filled in (the behavior occurs with or without
display of the column, but it's easy to see with the column present).
The log file shows that the message has been classified so it
certainly seems to be an Outlook issue.

We've waited for over a minute with no change.  If you interact with
Outlook in various ways (switch to a different folder and back, often
just opening the message or creating a reply) the field will update,
but at the same time the message gets re-marked as unread even if you
had just opened it and marked it read.

This of course is annoying to the user because they're currently
reading the message but it will still show as unread when they are
done.

Interestingly enough, whatever latency or update problem exists is
always behind by one message - if a new message arrives, the prior
message will have its Spam field updated.  If the newly arrived
message is filtered by the addin, it does move or take whatever
operation, so again the filter appears to be running.

It seems clear that this is some latency issue with Outlook updating
the status of a message - it's not clear if resetting the read bit is
because the delayed status includes an explicit unread bit, or if
Outlook is just refreshing the status of the message as of the delayed
update.

I'm going to switch back to the source version to see if it's just as
reproducible there (or if I just got used to it without realizing it)
to play around a little, but was wondering if anyone had any other
ideas or knew of any workarounds for my co-worker in the meantime?

Thanks.

-- David


From Paul.Moore at atosorigin.com  Wed Mar 19 17:33:35 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Wed Mar 19 12:34:59 2003
Subject: [Spambayes] Outlook addin delay in updating and resetting unread
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D992@UKDCX001.uk.int.atosorigin.com>

From: David Bolen [mailto:db3l@fitlinxx.com]
> It seems clear that this is some latency issue with Outlook
> updating the status of a message - it's not clear if resetting
> the read bit is because the delayed status includes an explicit
> unread bit, or if outlook is just refreshing the status of the
> message as of the delayed update.

I've seen this problem as well, also on an Exchange server. I'm
quite a way behind on releases (haven't updated from CVS for a
few weeks), so it's not a new thing.

For me, it's always been pretty random (as far as I've been able
to tell) so I've never been able to offer much to go on. But yes,
it's irritating.

I've not heard anyone hitting this except on Exchange, so maybe
it's an issue with how Outlook interacts with Exchange rather than
a pure Outlook issue...?

Paul.

From db3l at fitlinxx.com  Wed Mar 19 13:05:41 2003
From: db3l at fitlinxx.com (David Bolen)
Date: Wed Mar 19 13:05:47 2003
Subject: [Spambayes] 
	Re: Outlook addin delay in updating and resetting unread
References: <16E1010E4581B049ABC51D4975CEDB880113D992@UKDCX001.uk.int.atosorigin.com>
Message-ID: <uu1dzi64q.fsf@fitlinxx.com>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> I've not heard anyone hitting this except on Exchange, so maybe
> it's an issue with how Outlook interacts with Exchange rather than
> a pure Outlook issue...?

Could be - I only work with an Exchange server.  I've determined that
a workaround that seems solid at this point is to execute another
SaveChanges() call when changing the Spam property.  At this point I'm
testing it via another msg.Save() call up in the filter module.

I'm not sure why, but since that results in two saves during filtering
(one after the spam property is updated and another after the mail
folder information as part of the actions) the display is always
updated immediately.  It's probably a bit more overhead, but I'm not
sure how much and if it fixes the issue, I'm willing to spend the
extra API call (which presumably may result in another round trip to
the server).

I've only been able to test on my machine so far since I'm having
trouble exactly replicating the binary installer package, but it's
working for me.

-- David


From noreply at sourceforge.net  Wed Mar 19 12:46:25 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar 19 15:39:13 2003
Subject: [Spambayes] [ spambayes-Bugs-706520 ] assert fails in classifier
Message-ID: <E18vkS9-0008Ig-00@sc8-sf-web4.sourceforge.net>

Bugs item #706520, was opened at 2003-03-19 12:46
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Adam Glass (adamglass)
Assigned to: Nobody/Anonymous (nobody)
Summary: assert fails in classifier

Initial Comment:
This morning, I noticed that my emails no longer had an
X-Spambayes-Classification header, so I looked through
my procmail logs, and sure enough, hammiefilter.py is
giving a traceback when an assertion fails.  This
happens on all messages now; it is not specific to a
single message, or intermittent.  Therefore, I suspect
my .hammiedb is corrupted... I can supply it to anyone
who would like to investigate it for debugging purposes.

I am using Spambayes 1.0a2, installed on a system with
Python 2.2.1, with the new version of the email library
(as per the install docs.)

Please contact me if you require any further details.

Example of how to generate the error follows, along
with traceback:

adam$ /usr/local/bin/hammiefilter.py -f -d $HOME/.hammiedb < example
Traceback (most recent call last):
  File "/usr/local/bin/hammiefilter.py", line 179, in ?
    main()
  File "/usr/local/bin/hammiefilter.py", line 175, in main
    action(msg)
  File "/usr/local/bin/hammiefilter.py", line 113, in filter
    return h.filter(msg)
  File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 108, in filter
    prob, clues = self._scoremsg(msg, True)
  File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 217, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 441, in _getclues
    prob = self.probability(record)
  File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 304, in probability
    assert spamcount <= nspam
AssertionError


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702

From tim at fourstonesExpressions.com  Wed Mar 19 14:48:35 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 15:48:41 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To: <NKNJJF52ZX09YUHCB9RTQHFC0TR2Y6.3e78684c@myst>
Message-ID: <53HESNHEAOJKIKFVT3294ZWC8SN1YHE.3e78d7a3@myst>

3/19/2003 6:53:32 AM, Tim Stone - Four Stones Expressions 
<tim@fourstonesExpressions.com> wrote:

>3/18/2003 10:09:32 PM, Tim Stone - Four Stones Expressions 
><tim@fourstonesExpressions.com> wrote:
>
>>>So, to start off:                                                       
>>>* Much better documentation for the SMTP proxy training option. 
>>* Incorporation of integration.txt (and probably other text files) into the 
>>website, and maybe a review of the mailing list for faq type information        
>>* Installation with some level of migration from previous release.
>* Prerequisite checking for email and dbm modules (at least)
* Some kind of recovery from wordinfo database corruption (nham and nspam are 
lost on an increasingly frequent basis)  Bug 706520 *MUST* be fixed.
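
A rough sketch of the kind of consistency check a recovery step might start
from (not SpamBayes code). The assertion in the bug 706520 traceback,
spamcount <= nspam, fails when a token's spam count exceeds the total number
of spam messages trained. The wordinfo layout (token -> (spamcount,
hamcount)) and the name find_inconsistent_tokens are assumptions made for
this example only.

```python
def find_inconsistent_tokens(wordinfo, nspam, nham):
    """Return tokens whose counts exceed the message totals.

    wordinfo is assumed to map token -> (spamcount, hamcount); this
    layout is hypothetical and chosen only for illustration.
    """
    bad = []
    for token, (spamcount, hamcount) in wordinfo.items():
        # Mirrors the failing assertion: a token cannot have been seen
        # in more spam (or ham) messages than were ever trained.
        if spamcount > nspam or hamcount > nham:
            bad.append(token)
    return sorted(bad)


# Totals reset to zero (as if nspam/nham were lost) while the
# per-token counts survive: every token is now inconsistent.
wordinfo = {"viagra": (12, 0), "meeting": (0, 7)}
print(find_inconsistent_tokens(wordinfo, nspam=0, nham=0))
print(find_inconsistent_tokens(wordinfo, nspam=20, nham=10))
```

A check like this only detects the problem; what to do next (retrain, or
rebuild the totals from a message database) is a separate decision.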

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't


From Phil.Cox at SystemExperts.com  Wed Mar 19 14:01:53 2003
From: Phil.Cox at SystemExperts.com (Phil Cox)
Date: Wed Mar 19 17:20:52 2003
Subject: [Spambayes] Not getting the icon in the tool bar
Message-ID: <000201c2ee63$26bf9d20$0500000a@jiloa.com>

The application seems to be working, but I don't get the icon in the
toolbar to configure it. Any thoughts?

Here is my log file:

SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Documents and Settings\pcc\Application
Data\SpamBayes\default_bayes_database.db'
Loaded message database from 'C:\Documents and Settings\pcc\Application
Data\SpamBayes\default_message_database.db'
Bayes database initialized with 0 spam and 0 good messages
Loaded databases in 2.70174ms


Phil


From bplist at www.wormy.org  Wed Mar 19 19:00:21 2003
From: bplist at www.wormy.org (BP List)
Date: Wed Mar 19 17:42:53 2003
Subject: [Spambayes] mboxtrain.py error
Message-ID: <Pine.LNX.4.44.0303191852550.23103-100000@www.wormy.org>

I created the database file with "hammiefilter.py -n".  It seems that
every mailbox file I run mboxtrain.py on results in an error
similar to this:

www:/home/bryan# ./mboxtrain.py -d /home/bryan/.hammiedb -g /home/bryan/mail/Mailbox -s /home/bryan/mail/SPAM
Training ham (/home/bryan/mail/Mailbox):
  Reading as Unix mbox
Traceback (most recent call last):
  File "./mboxtrain.py", line 284, in ?
    main()
  File "./mboxtrain.py", line 271, in main
    train(h, g, False, force)
  File "./mboxtrain.py", line 209, in train
    mbox_train(h, path, is_spam, force)
  File "./mboxtrain.py", line 166, in mbox_train
    fcntl.lockf(f, fcntl.LOCK_UN)
IOError: [Errno 16] Device or resource busy


I have tried this as root and as the user.  I assume that there is really
nothing wrong with mboxtrain.py, but I don't have the faintest idea where
to start looking.  I've tried several mailbox files all with the same
result.  I am also sure that no one had the mailbox open.  I've just
installed all the latest supporting applications that are listed in the
spambayes documentation.  Please let me know if you need any specific
details.

Thanks in advance!

-- Bryan
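
The traceback above fails in fcntl.lockf(f, fcntl.LOCK_UN). As a minimal
sketch of the lock/read/unlock pattern involved (Unix only, and not the
actual mboxtrain.py code), the example below takes an exclusive lock,
reads the file, and tolerates a failing unlock. The helper name
read_locked and the temporary demo file are assumptions for illustration;
the sketch does not explain why the unlock fails on Bryan's system.

```python
import fcntl
import os
import tempfile


def read_locked(path):
    """Read a file while holding an exclusive POSIX lock on it."""
    # lockf needs the file open for writing to take an exclusive lock.
    with open(path, "r+b") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)
        try:
            return f.read()
        finally:
            try:
                fcntl.lockf(f, fcntl.LOCK_UN)
            except OSError as exc:
                # If the explicit unlock fails, report it and carry on;
                # closing the file releases the lock anyway.
                print("unlock failed:", exc)


# Demo on a throwaway file standing in for a mailbox.
fd, demo_path = tempfile.mkstemp()
os.write(fd, b"From nobody Wed Mar 19 00:00:00 2003\n\nhello\n")
os.close(fd)
data = read_locked(demo_path)
os.remove(demo_path)
```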


From T.A.Meyer at massey.ac.nz  Thu Mar 20 10:49:27 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 17:50:46 2003
Subject: [Spambayes] Not getting the icon in the tool bar
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9B@its-xchg4.massey.ac.nz>

> The application seems to be working, but I don't get the icon in the
> toolbar to configure it. Any thoughts?

Well, there's nothing in the log file, so you're right, it does seem to be working.  I would suggest that you try:
* making sure that you're in the inbox, and not in something like "Outlook Today"
* ensuring that the standard toolbar is visible
* resetting the standard toolbar (right-click on it, choose Customize, and then Reset) and restarting Outlook.

Which version of the plugin are you using?
(a) The most recent (002) installer (binary) version from Mark's website?
(b) The older (001) installer (binary) version from Mark's website?  (this is known to have this sort of bug, so if so, please get the newer version)
(c) The most recent CVS source
(d) Old CVS source.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 20 10:52:53 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 17:57:50 2003
Subject: [Spambayes] Beta status checklist (not new feature requests!)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz>

> Add the Outlook binary installer as part of the release?  Absolutely!

I've discussed this with TimS off-list, but what's the general consensus here?

(History-wise, I believe that alpha1 didn't have the plugin, but alpha2 does, and alpha3 will).

I *don't* think that the Outlook plugin should be part of a beta release.  The installer that Mark's created does a much better job, I think, and only installs those bits that the plugin needs, not pop3proxy and all the rest.

IMO, a (potential) user should download *either* the Outlook installer, *or* a beta release of everything else.

Thoughts? (Mark?)

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 20 11:03:51 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 18:04:25 2003
Subject: [Spambayes] Beta status checklist (or this turning into new
	feature requests?)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz>

[Andy Cunningham]
> What are people's thoughts on some kind of predefined
> training database
> (like ham/spam terms that appear in more than x% of submitted
> training databases)?
This has had plenty of discussion previously; I'd suggest people flick
through the archives if they haven't read them already.

To add my 2c, since I haven't previously, I would say that this is a
*bad idea*, unless a system is developed to expunge this pre-defined
set at some point after the user has collected their own data. You only
have to train a single message and the system will do better than a
coin-toss, so there's no need to have pre-defined stuff - it catches
onto the sort of messages that would be in a pre-defined db so quickly I
don't think there's any point. People that don't want to train would be
better off with SpamAssassin, or something like that.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 20 11:06:33 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Wed Mar 19 18:07:08 2003
Subject: [Spambayes] Beta status checklist (or this turning into
	newfeature requests?)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C917@its-xchg4.massey.ac.nz>

> BTW, are there plans to develop server-side spambayes?                
> (Apologies if this is a FAQ.) I know that it can eclipse what         
> we are currently using on our Exchange server :-)                     

This is a FAQ, and does lend weight to TimS's suggestion that we need a
list of answers for FAQs :)

I have no experience with using spambayes in a server type situation,
but from reading the messages on the list, I believe that this can
be done now, to a certain extent. The real question is how you want
training to be done - does the admin do it? Does everyone contribute? Do
you want users to have a shared definition of spam, or individual?

=Tony Meyer

From B-Morgan at concentric.net  Wed Mar 19 16:10:20 2003
From: B-Morgan at concentric.net (Brad Morgan)
Date: Wed Mar 19 18:10:55 2003
Subject: [Spambayes] Beta status checklist (not new feature requests!)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz>
Message-ID: <NABBJOOEOFODEALNMJAJMEOEHNAA.B-Morgan@concentric.net>

> IMO, a (potential) user should download *either* the Outlook installer,
> *or* a beta release of everything else.

> Thoughts? (Mark?)

> =Tony Meyer

This sounds reasonable to me.

Regards,

Brad Morgan


From mhammond at skippinet.com.au  Thu Mar 20 10:22:28 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 19 18:23:04 2003
Subject: [Spambayes] Beta status checklist (not new feature requests!)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C915@its-xchg4.massey.ac.nz>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIELBOHAA.mhammond@skippinet.com.au>

> > Add the Outlook binary installer as part of the release?  Absolutely!
>
> I've discussed this with TimS off-list, but what's the general
> consensus here?
>
> (History-wise, I believe that alpha1 didn't have the plugin, but
> alpha2 does, and alpha3 will).
>
> I *don't* think that the Outlook plugin should be part of a beta
> release.  The installer that Mark's created does a much better
> job, I think, and only installs those bits that the plugin needs,
> not pop3proxy and all the rest.
>
> IMO, a (potential) user should download *either* the Outlook
> installer, *or* a beta release of everything else.

Sounds fine to me - except I would raise the bar a little - why not make a
pop3proxy *binary* release for Windows too - then the problem becomes moot:
on Windows you get a binary.

I realize time is an issue, so this strategy sounds OK for beta2, but maybe
we could aim for a beta3 with binaries before v1.

Mark.


From skip at pobox.com  Wed Mar 19 17:43:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 19 18:44:06 2003
Subject: [Spambayes] Beta status checklist (or this turning into new
        feature requests?)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1318CD9C@its-xchg4.massey.ac.nz>
Message-ID: <15993.181.61574.726@montanaro.dyndns.org>


    Tony> [Andy Cunningham]
    >> What are people's thoughts on some kind of predefined training
    >> database (like ham/spam terms that appear in more than x% of
    >> submitted training databases)?

    Tony> To add my 2c, since I haven't previously, I would say that this is
    Tony> a *bad idea*, unless a system is developed to expunge this
    Tony> pre-defined set at some point after the user has collected their
    Tony> own data.

My 2c...  I am currently manually training for a couple other people here at
Northwestern.  I do want to get more victims^H^H^H^H^H^H^H early adopters,
but it generally seems to be working pretty well.  I had to encourage them a
little to send me ham which was correctly classified (spam is no problem, I
have fountains full of the stuff).  I think they seemed to expect the system
to work properly from the get-go and were only sending me stuff that was
misclassified or which wound up marked unsure.  Accordingly, they were a bit
confused at a few of the mistakes it made.  At the moment I have just 152
hams and 135 spams in the training database.  Things seem to be working okay
though I haven't been tracking it in any formal sense, just in the sense
that they aren't complaining. ;-)

Skip


From N7DR at arrisi.com  Wed Mar 19 16:59:21 2003
From: N7DR at arrisi.com (D. R. Evans)
Date: Wed Mar 19 18:59:26 2003
Subject: [Spambayes] database corruption
Message-ID: <3E78A1E9.29639.DCE8D7@localhost>

Just to let folk know that the database corruption that I reported and 
filed a while back has happened again (#699063).

I live in Colorado and some of you may know that we just had a massive 
storm. The power here was intermittent for a couple of hours, and as a 
result the Linux box running pop3proxy.py went down a couple of times 
due to loss of power.

When everything came back up and seemed stable, I tried to restart 
pop3proxy.py but was unable to because of database corruption. As before, 
there was no mail activity going on at the time of the crash. 

Tim suspected in the resolution of the bug report that switching to a 
newer version of bsddb would fix the problem. I'm not in a position to 
do that at the moment (maybe I'll try again after Mandrake 9.1 is 
released), so I will have to try to switch to some other bayesian 
filtering system instead, at least for a while 
:-(

  Doc


 --------------------------------------------------------------
Phone:  +1 303 494 0394
Mobile: +1 720 839 8462
Fax:    +1 781 240 0527
--------------------------------------------------------------


From mhammond at skippinet.com.au  Thu Mar 20 11:11:15 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 19 19:12:21 2003
Subject: [Spambayes] database corruption
In-Reply-To: <3E78A1E9.29639.DCE8D7@localhost>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAELFOHAA.mhammond@skippinet.com.au>

> Just to let folk know that the database corruption that I reported and
> filed a while back has happened again (#699063).

How about we do a db sync after we perform a train?  This shouldn't be too
painful, won't affect scoring performance, and should always leave the DB
consistent.  Only drawback I see is that after a huge retrain, my fast
machine takes a number of seconds to save the DB - OTOH, paying this penalty
*during* the retrain operation is more appealing than paying it at shutdown
anyway.
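
A minimal sketch of the "sync after each train" idea, using the standard
shelve module instead of the project's own classifier classes. The class
name SyncingTrainer, its train() signature, and the (spam, ham) tuple
layout are assumptions for illustration. Note that sync() only asks the
database layer to flush; it is not by itself a guarantee against
corruption on power loss.

```python
import os
import shelve
import tempfile


class SyncingTrainer:
    """Toy token-count store that flushes to disk after every train."""

    def __init__(self, path):
        self.db = shelve.open(path)

    def train(self, tokens, is_spam):
        for token in set(tokens):
            spam, ham = self.db.get(token, (0, 0))
            if is_spam:
                self.db[token] = (spam + 1, ham)
            else:
                self.db[token] = (spam, ham + 1)
        # Pay the write cost now, during training, not at shutdown.
        self.db.sync()

    def close(self):
        self.db.close()


demo_dir = tempfile.mkdtemp()
trainer = SyncingTrainer(os.path.join(demo_dir, "demo_db"))
trainer.train(["cheap", "pills"], is_spam=True)
trainer.train(["cheap", "flights"], is_spam=False)
counts = dict(trainer.db)
trainer.close()
```

The trade-off is the one described above: each train call becomes a little
slower, but scoring is unaffected because it never writes.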

Mark.


From tim at fourstonesExpressions.com  Wed Mar 19 18:18:29 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 19:18:35 2003
Subject: [Spambayes] database corruption
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAELFOHAA.mhammond@skippinet.com.au>
Message-ID: <96H5JIA8HFKJB73V3VQL73PL4XHEOJ.3e7908d5@myst>

3/19/2003 6:11:15 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> Just to let folk know that the database corruption that I reported and
>> filed a while back has happened again (#699063).
>
>How about we do a db sync after we perform a train?  This shouldn't be too
>painful, won't affect scoring performance, and should always leave the DB
>consistent.  Only drawback I see is that after a huge retrain, my fast
>machine takes a number of seconds to save the DB - OTOH, paying this penalty
>*during* the retrain operation is more appealing than paying it at shutdown
>anyway.

I thought that the proxy does this already, but a cursory inspection of the 
code doesn't look like that's there.  I'll check in a fix.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim at fourstonesExpressions.com  Wed Mar 19 18:22:26 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 19:22:33 2003
Subject: [Spambayes] database corruption
In-Reply-To: <96H5JIA8HFKJB73V3VQL73PL4XHEOJ.3e7908d5@myst>
Message-ID: <04A5F1YHGFBZU04NL952VFDDAYWLJXV.3e7909c2@myst>

3/19/2003 6:18:29 PM, Tim Stone - Four Stones Expressions 
<tim@fourstonesExpressions.com> wrote:

>3/19/2003 6:11:15 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:
>
>>> Just to let folk know that the database corruption that I reported and
>>> filed a while back has happened again (#699063).
>>
>>How about we do a db sync after we perform a train?  This shouldn't be too
>>painful, won't affect scoring performance, and should always leave the DB
>>consistent.  Only drawback I see is that after a huge retrain, my fast
>>machine takes a number of seconds to save the DB - OTOH, paying this penalty
>>*during* the retrain operation is more appealing than paying it at shutdown
>>anyway.
>
>I thought that the proxy does this already, but a cursory inspection of the 
>code doesn't look like that's there.  I'll check in a fix.

Well, after a closer look, it really does.  The DBDictClassifier 
implementation does a db.sync() as well...


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From mhammond at skippinet.com.au  Thu Mar 20 12:35:32 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 19 20:36:27 2003
Subject: [Spambayes] database corruption
In-Reply-To: <04A5F1YHGFBZU04NL952VFDDAYWLJXV.3e7909c2@myst>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIELJOHAA.mhammond@skippinet.com.au>

> >I thought that the proxy does this already, but a cursory inspection of
> >the code doesn't look like that's there.  I'll check in a fix.
>
> Well, after a closer look, it really does.  The DBDictClassifier
> implementation does a db.sync() as well...

It does a db.sync() during a store, but that is all I can see.  It does not
sync after an individual train, which is what I was suggesting.

Mark.


From tim at fourstonesExpressions.com  Wed Mar 19 20:06:57 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 21:07:07 2003
Subject: [Spambayes] database corruption
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIELJOHAA.mhammond@skippinet.com.au>
Message-ID: <52NIUSHCA9SOSQUP5ZZVOMA9IH52PLZW.3e792241@myst>

3/19/2003 7:35:32 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:


>It does a db.sync() during a store, but that is all I can see.  It does not
>sync after an individual train, which is what I was suggesting.

The pop3proxy initiates the store after a train, at line 945.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From mhammond at skippinet.com.au  Thu Mar 20 13:21:29 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 19 21:22:40 2003
Subject: [Spambayes] database corruption
In-Reply-To: <52NIUSHCA9SOSQUP5ZZVOMA9IH52PLZW.3e792241@myst>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEELMOHAA.mhammond@skippinet.com.au>

> >It does a db.sync() during a store, but that is all I can see.  It does
> >not sync after an individual train, which is what I was suggesting.
>
> The pop3proxy initiates the store after a train, at line 945.

Interesting.  Then I wonder how this problem could occur.  Presumably the
original poster was not performing a train operation as the machine went
down (certainly not *every* time this has happened).  So assuming that a
sync() was done at least a few seconds ago, what could cause the database
to get into a corrupt state?  How would the file ever change after the last
train had completed?

Mark.


From tim at fourstonesExpressions.com  Wed Mar 19 20:27:00 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 19 21:27:07 2003
Subject: [Spambayes] database corruption
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEELMOHAA.mhammond@skippinet.com.au>
Message-ID: <ZW3VVTSQVUZTOLLH1V4Z2U0942D8ZPJ.3e7926f4@myst>

3/19/2003 8:21:29 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>Interesting.  Then I wonder how this problem could occur.  Presumably the
>original poster was not performing a train operation as the machine went
>down (certainly not *every* time this has happened).  So assuming that a
>sync() was done at least a few seconds ago, what could cause the database
>to get into a corrupt state?  How would the file ever change after the last
>train had completed?

The only answer I can come up with is that there is a bug in whatever dbm 
implementation that D.R.Evans (and others) are currently using.  Is there a 
way to determine what dbm implementation gets used by these guys?
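
One way to answer that question, sketched with current Python: the
standard library's dbm.whichdb() guesses which dbm module created a given
file (Python 2 of this era shipped a comparable whichdb module). The
helper name describe_db and the demo database path are assumptions for
illustration, and the demo uses dbm.dumb only because it is always
available; it says nothing about what D. R. Evans's system uses.

```python
import dbm
import dbm.dumb
import os
import tempfile


def describe_db(path):
    """Return the dbm module name that created path, or a short note."""
    kind = dbm.whichdb(path)
    if kind is None:
        # The file does not exist or cannot be read.
        return "missing or unreadable"
    if kind == "":
        # The file exists but its format could not be guessed.
        return "unrecognised format"
    return kind


# Create a small throwaway database and ask what made it.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_db")
db = dbm.dumb.open(demo_path, "c")
db[b"token"] = b"1"
db.close()
print(describe_db(demo_path))
```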

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From noreply at sourceforge.net  Wed Mar 19 16:31:47 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar 19 22:30:39 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-706170 ] Execute test suite fails in Outlook
Message-ID: <E18vnyF-0000Iz-00@sc8-sf-web4.sourceforge.net>

Bugs item #706170, was opened at 2003-03-19 21:03
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Execute test suite fails in Outlook

Initial Comment:
The test suite fails in outlook.

I've retrained messages from a spam and a ham folder.

I think this may be related to moving the database-files 
from the spambayes to the default docs-folders in 
windows a couple of weeks ago.

the following traceback is shown in PythonWin:

Executing automated tests...
Traceback (most recent call last):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\addin.py", line 308, in Tester
    tester.test(manager)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 306, in test
    TestSpamFilter(driver)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 173, in TestSpamFilter
    msg, words = driver.CreateTestMessageInFolder(SPAM, driver.folder_watch)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 132, in CreateTestMessageInFolder
    msg, words = self.CreateTestMessage(spam_status)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 145, in CreateTestMessage
    words.update(FindTopWords(self.manager.bayes, 50, True))
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 64, in FindTopWords
    for word, info in extractor(bayes):
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\tester.py", line 46, in DBExtractor
    key = bayes.dbm.next()[0]
  File "C:\PROGRA~1\_DEV\Python22\Lib\site-packages\bsddb3\__init__.py", line 122, in next
    rv = self.dbc.next()
DBNotFoundError: (-30991, 'DB_NOTFOUND: No matching key/data pair found')
Tests FAILED.  Sorry about that.  If I were you, I would do a full re-train ASAP
Please delete any test messages from your Spam, Unsure or Inbox folders first.

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-20 11:31

Message:
Logged In: YES 
user_id=14198

This seems to be a bsddb3 problem.  The code in question:

        try:
            key = bayes.dbm.next()[0]
        except bsddb.error:

already attempts to catch this error.  Further, the docs for
DBNotFoundError state that it derives from bsddb.error,
meaning my except statement should work.

I will try and get to using my Python 2.2 version for the
plugin to fix this.
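
For readers following the reasoning: an except clause only catches an
exception whose class is the named class or a subclass of it, so an error
class with a similar name from a different module would pass straight
through. The sketch below shows this with stand-in classes; PackageAError,
PackageBError and NotFoundError are hypothetical and are not the real
bsddb or bsddb3 classes. Whether this is what happens in the plugin is not
established by the report.

```python
class PackageAError(Exception):
    """Stand-in for the error class named in the except clause."""


class PackageBError(Exception):
    """Stand-in for an unrelated error class with a similar role."""


class NotFoundError(PackageBError):
    """Stand-in for the exception that is actually raised."""


def next_key():
    raise NotFoundError("no matching key/data pair found")


def try_next():
    try:
        return next_key()
    except PackageAError:
        # Only reached if NotFoundError derives from PackageAError.
        return "caught"


# issubclass() answers the question the docs were consulted for.
caught_by_a = issubclass(NotFoundError, PackageAError)

try:
    outcome = try_next()
except NotFoundError:
    outcome = "escaped"
```

Checking issubclass() against the classes actually imported at runtime is
a quick way to confirm or rule out this kind of mismatch.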

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706170&group_id=61702

From noreply at sourceforge.net  Wed Mar 19 16:32:34 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar 19 22:30:46 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-702920 ] Manual filtering (Outlook) stops if one
	message fails
Message-ID: <E18vnz0-0000Jd-00@sc8-sf-web4.sourceforge.net>

Bugs item #702920, was opened at 2003-03-13 23:38
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Fredrik Rodland (fmmr)
Assigned to: Mark Hammond (mhammond)
Summary: Manual filtering (Outlook) stops if one message fails

Initial Comment:
I've posted this question on the mailing list, and with (at 
least) one positive response, I'm entering it here:

If manual filtering is started, and one e-mail fails, the 
rest of the filtering seems to be skipped.  

Couldn't the filtering of the remaining messages 
continue, skipping the message which failed?
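
The behaviour being requested can be sketched as a loop that handles each
message's failure on its own. This is only an illustration, not the fix
that was checked in; filter_all, demo_filter and the string "messages" are
hypothetical stand-ins for the plugin's real objects.

```python
def filter_all(messages, filter_one):
    """Apply filter_one to each message, collecting failures
    instead of stopping at the first one."""
    dispositions = []
    failures = []
    for message in messages:
        try:
            dispositions.append(filter_one(message))
        except Exception as exc:
            # Record the failure and move on to the next message.
            failures.append((message, exc))
    return dispositions, failures


def demo_filter(message):
    # Hypothetical scorer: one message is rigged to fail.
    if message == "broken":
        raise ValueError("could not read message")
    return (message, "ham")


done, failed = filter_all(["a", "broken", "b"], demo_filter)
```

Keeping the failures in a list means they can still be reported to the
user once the whole folder has been processed.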


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-20 11:32

Message:
Logged In: YES 
user_id=14198

Checked the fix in for this a couple of days ago.

----------------------------------------------------------------------

Comment By: Fredrik Rodland (fmmr)
Date: 2003-03-17 22:06

Message:
Logged In: YES 
user_id=724871

I (slightly) changed the summary.

I've included one traceback.  However, I've run into several 
different ones in the past when filtering manually, and all seem 
to stop the actual filter process.  What I want/wish is that the 
filtering process continues with the remaining messages even 
if one message fails.  There have also been several other 
comments on this subject on the list.

the actual traceback as requested:
Error getting property from stream (-2147221233, 'OLE error 0x8004010f', None, None)
Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\PROGRA~1\_DEV\Python22\lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 100, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 80, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\manager.py", line 439, in score
    email = msg.GetEmailPackageObject()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 639, in GetEmailPackageObject
    text = self._GetMessageText()
  File "c:\Programfiler\_UTIL\spambayes_cvs\spambayes\Outlook2000\msgstore.py", line 582, in _GetMessageText
    assert msg.is_multipart()
AssertionError

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-15 10:39

Message:
Logged In: YES 
user_id=14198

Can you please post a traceback? (and sorry if I missed it
on the list)

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=702920&group_id=61702

From noreply at sourceforge.net  Wed Mar 19 16:33:37 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed Mar 19 22:30:53 2003
Subject: [Spambayes] [ spambayes-Bugs-677842 ] COM error on access denied
Message-ID: <E18vo01-0000KE-00@sc8-sf-web4.sourceforge.net>

Bugs item #677842, was opened at 2003-01-31 10:21
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=677842&group_id=61702

Category: Outlook
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Tony Meyer (anadelonbrin)
Assigned to: Mark Hammond (mhammond)
Summary: COM error on access denied

Initial Comment:
Some folders (public ones in particular) may not allow 
the user access to create the spam field.  This also 
seems to cause an 'access denied' com error later on.  
An example traceback is below.

Warning: failed to create the Outlook user-property in folder 'MCN Newsletter'
 (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', "You don't have appropriate permission to perform this operation.", None, 0, -2147024891), None)
 This is probably because the code has recently been changed, but it will
 have no effect on the filtering or scoring.
AntiSpam: Watching for new messages in folder MCN Newsletter
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Error processing missed messages!
Traceback (most recent call last):
  File "D:\CVS Modules\spambayes\Outlook2000\addin.py", line 610, in OnConnection
    self.ProcessMissedMessages()
  File "D:\CVS Modules\spambayes\Outlook2000\addin.py", line 884, in ProcessMissedMessages
  File "D:\CVS Modules\spambayes\Outlook2000\addin.py", line 129, in ProcessMessage
    if msgstore_message.GetField(manager.config.field_score_name) is not None:
  File "D:\CVS Modules\spambayes\Outlook2000\msgstore.py", line 651, in GetField
    prop = self.mapi_object.GetIDsFromNames(props, 0)[0]
com_error: (-2147024891, 'Access is denied.', None, None)


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-20 11:33

Message:
Logged In: YES 
user_id=14198

This was fixed a while ago too - it was the same
problem that caused Hotmail messages to fail.  Please reopen
if you have problems.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-02-05 06:03

Message:
Logged In: YES 
user_id=552329

Of course I don't need to wait until mail arrives, I can 'filter 
now'...sigh (it is early yet, I'm not really awake).

I made the change and tried to filter a folder without write-
access.  This is what I got:

Warning: failed to create the Outlook user-property in folder 'MCN Newsletter'
 (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', "You don't have appropriate permission to perform this operation.", None, 0, -2147024891), None)
 This is probably because the code has recently been changed, but it will
 have no effect on the filtering or scoring.
Exception in thread Thread-1:
Traceback (most recent call last):
  File "D:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "D:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "D:\CVS Modules\spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "D:\CVS Modules\spambayes\Outlook2000\dialogs\FilterDialog.py", line 375, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 85, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 65, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "D:\CVS Modules\spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "D:\CVS Modules\spambayes\Outlook2000\manager.py", line 384, in score
    email = msg.GetEmailPackageObject()
  File "D:\CVS Modules\spambayes\Outlook2000\msgstore.py", line 595, in GetEmailPackageObject
    text = self._GetMessageText()
  File "D:\CVS Modules\spambayes\Outlook2000\msgstore.py", line 472, in _GetMessageText
    hr, data = self.mapi_object.GetProps(prop_ids,0)
com_error: (-2147024891, 'Access is denied.', None, None)

The more I think about it, the more I am of the opinion that 
filtering (and scoring) should not be allowed unless the user 
has write access to the folder.  This would be simple enough 
to implement I presume (somewhere in folderselector.py, a 
check to see that access is available when the user selects a 
folder).

This would also leave someone else to do public folder 
testing, since I don't have write access to any :)

Apologies again for the multiple messages - like I said, it's 
early :)

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-02-05 05:56

Message:
Logged In: YES 
user_id=552329

ack.  my stupid browser (because of my stupid actions) 
resent my comment many times.  my apologies.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-02-05 05:50

Message:
Logged In: YES 
user_id=552329

Hi Mark

I don't really want to do anything with public folders!  But 
there was a message (from Neale from memory) about a user 
having trouble so I tried playing round with them and got this 
problem.

I would want to filter a public folder that I didn't have write 
access to so that I could see/rank the spam scores I guess.  
Although the worthwhileness (is that a word? ;) of this does 
seem a bit dubious.  Maybe the 'solution' is to disallow all 
filtering on folders without write access?

I'll have a go reproducing the exception with the change in 
code and let you know how it goes.  I'll have to wait until 
about midday (NZ) for any mail to arrive in the public folder.

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-02-04 23:01

Message:
Logged In: YES 
user_id=14198

Hi Tony,
  I didn't realize you were an antipode ;)

I'm wondering why you want to filter public folders that you
don't have write access to?  Or is the point that you can
*move* the message, just can't save fields?

Interestingly, your exception points at:

    if msgstore_message.GetField(manager.config.field_score_name) is not None:

which implies that this error is actually on the *following*
message, not the one that is actually failing.  This does
make sense, as we pass mapi.MAPI_DEFERRED_ERRORS to all mapi
functions.  I'm wondering if you can easily repro this
exception?  If so, I would be interested to see what
changing msgstore.py, line 666 (eeek!!!) in current CVS from:

    self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE |
                                 USE_DEFERRED_ERRORS)
to:
    self.mapi_object.SaveChanges(mapi.KEEP_OPEN_READWRITE)

has on this exception, and if indeed the exception is now
raised from the "save" operation rather than a following one.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=677842&group_id=61702

From skip at pobox.com  Wed Mar 19 22:29:26 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 19 23:29:31 2003
Subject: [Spambayes] database corruption
In-Reply-To: <3E78A1E9.29639.DCE8D7@localhost>
References: <3E78A1E9.29639.DCE8D7@localhost>
Message-ID: <15993.17318.29523.61558@montanaro.dyndns.org>


    Doc> When everything came back up and seemed stable, I restarted
    Doc> pop3proxy.py and was unable to restart pop3proxy.py because of
    Doc> database corruption. As before, there was no mail activity going on
    Doc> at the time of the crash.

What version of Berkeley DB are you using?  Try this command:

    rpm -qa | egrep '^(lib)?db'

It might report something like

    db1-devel-1.85-6mdk
    libdb3.2-devel-3.2.9-2mdk
    db1-1.85-6mdk
    libdbtcl3.2-3.2.9-2mdk
    db2-2.4.14-3mdk
    libdb3.2-3.2.9-2mdk

Note that I don't have anything like "libdb3.2-utils-3.2.9-2mdk".  If you
don't but have your Mandrake CD around, install that RPM.  That will give
you a bunch of commands which begin with "db_".  Try running db_recover on
your corrupt database file and see if it fixes the problem.

What does ldd say about the version of libdb linked into your bsddb.so file?
Try something like this:

    % ldd /usr/local/lib/python2.2/lib-dynload/bsddb.so 
            libdb-3.2.so => /lib/libdb-3.2.so (0x4001c000)
            libc.so.6 => /lib/libc.so.6 (0x400a3000)
            /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x80000000)

You want the -utils RPM which corresponds to the libdb version bsddb.so was
linked against.  On Mandrake systems you can install multiple versions of
libdb simultaneously.

Skip

From mhammond at skippinet.com.au  Thu Mar 20 15:53:52 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Mar 19 23:54:49 2003
Subject: [Spambayes] database corruption
In-Reply-To: <15993.17318.29523.61558@montanaro.dyndns.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEMCOHAA.mhammond@skippinet.com.au>

bsddb.db.version() tells us the version too.
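
A minimal sketch of that check, for anyone who wants to script it. It assumes the Berkeley DB binding is importable either as bsddb (bundled with older Pythons) or as bsddb3 (the separately packaged version of the same binding); the helper name is made up for illustration.

```python
import importlib


def berkeley_db_version():
    """Return the Berkeley DB version tuple, or None if no binding is found."""
    # Module names are an assumption: "bsddb" on older Pythons, "bsddb3" when
    # the binding is installed as a separate package.
    for module_name in ("bsddb.db", "bsddb3.db"):
        try:
            db = importlib.import_module(module_name)
        except ImportError:
            continue
        return db.version()
    return None


print(berkeley_db_version())
```

The tuple it prints should match the libdb version that ldd reports for bsddb.so.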

Mark.


From T.A.Meyer at massey.ac.nz  Thu Mar 20 17:19:10 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 20 00:19:49 2003
Subject: [Spambayes] Beta status checklist (not new feature requests!)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C920@its-xchg4.massey.ac.nz>

> > IMO, a (potential) user should download *either* the Outlook        
> > installer, *or* a beta release of everything else.                  
>
> Sounds fine to me - except I would raise the bar a little -           
> why not make a pop3proxy *binary* release for Windows too -
> then the problem becomes moot - on Windows you get a binary.

I notice that this is also listed in the "short term plans" in the
readme in the windows directory. Does this mean that you are working on
it, or that someone else should?

> I realize time is an issue, so this strategy sounds OK for            
> beta2, but maybe we could aim for a beta3 with binaries before v1.    

Well, TimS is only doing _alpha_ 3 at the moment, unless the list of
prereq's for beta 1 ends up really short, so there should be time. But
otherwise, yes.

=Tony Meyer

From T.A.Meyer at massey.ac.nz  Thu Mar 20 19:09:40 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 20 02:10:45 2003
Subject: [Spambayes] Storing Options
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz>

Ok, here's a draft proposal for changes to create a new way of storing  
options. I'm not going to implement any of this unless there is a       
consensus that it's a good thing, so don't panic.                       

There are four main changes, each outlined below. They can be
implemented separately, but together they make the most sense.

I would not change:                                                     
* The search path for options (i.e. defaults, then envar, then          
current/home directory).                                                
* Storing the defaults inside a .py file rather than having a
'bayes.ini' file (reading the archives, the reasons behind this make
sense).

So, here's what I do propose. Note that these are significant changes   
and would require changes (improvements ;) all over the code. The user  
would notice nothing, however. When you get a chance, please read       
through these and comment.                                              

1. Change from using getattr to getitem

This means using 'options["pop3proxy_servers"]' rather than
'options.pop3proxy_servers'. This avoids the possible problems with
conflicts with existing OptionsClass attribute names, and allows #2 and
(more easily) #3.

2. Use the section data.

This means using 'options[("pop3proxy", "servers")]'. For backwards     
compatibility 'options["pop3proxy_servers"]' would return the value of
any option named "pop3proxy_servers", whichever section it was in.      

This is tidier, and allows neat things later on (like maybe only        
loading option sections that are relevant). For the most part it is     
already set up this way, it's just that Options currently throws away   
all the section information.                                            

3. Setting values propagates through to ConfigParser

This means that 'options.pop3proxy_add_evidence_header = True' (with
#1 & #2, 'options[("pop3proxy", "add_evidence_header")] = True')
would not just change the Options object, but also the ConfigParser
object that it inherits from.

This *does not* mean that any files would be changed, but *does*
mean that they could be updated on demand, via the write() function
(or via the update() function in UpdatableConfigParser).

4. Detailed options.

Each option has the following attributes:                               
* a name                                                                
* a nice name                                                           
* a default value                                                       
* explanation text                                                      
* either a tuple or a regex of allowed values                           
* the current value                                                     
* whether it should be restored on a 'return to defaults' command       

Two simple examples:

"pop3proxy_servers", "Servers", "", "These are the servers that will be
proxied blah blah...", r"\w", "pop.example.com", False

"add_evidence_header", "Clues Header", True, "This option adds a header 
with the spam clues blah blah blah", (True, False), False, True         

These would be accessed as follows:                                     
nice name: options.display_name(sect, opt)                              
default: options.default(sect, opt) - these would also be the values of 
all options prior to loading any config file                            
explanation text: options.doc(sect, opt)                                
allowed values: options.valid_input(sect, opt)                          
current value: either via options[(sect, opt)], or via                  
options.get(sect, opt)                                                  
restore on revert: options.no_restore(sect, opt)                        

Also provided would be options.is_valid(sect, opt, value) which would   
return True iff the value was valid for that option.                    

OR options[(sect, opt)] / options.get(sect, opt) returns an Option
object that has these things. This is nicer, but is more work to just
get the current value, which is what is wanted most of the time.
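
A rough sketch of how changes #1 to #4 might fit together is given here. It is an illustration only, written against the modern Python configparser module rather than the current Options code. The Option and OptionsSketch names, the constructor arguments, the regex used for the servers option, and the "section_option" rule for old-style keys are all assumptions made for the example, not anything in CVS.

```python
import re
from configparser import ConfigParser


class Option:
    """Metadata for a single option (hypothetical layout, after item 4)."""

    def __init__(self, name, nice_name, default, doc, allowed, restore=True):
        self.name = name
        self.nice_name = nice_name
        self.default = default
        self.doc = doc
        self.allowed = allowed    # tuple of allowed values, or a regex string
        self.restore = restore    # reset by a 'return to defaults' command?
        self.value = default      # current value, before any config is loaded

    def is_valid(self, value):
        if isinstance(self.allowed, tuple):
            return value in self.allowed
        return re.fullmatch(self.allowed, str(value)) is not None


class OptionsSketch:
    """Item access keyed on (section, option), mirrored into a ConfigParser."""

    def __init__(self):
        self._options = {}
        self.parser = ConfigParser()

    def define(self, section, option):
        self._options[(section, option.name)] = option
        if not self.parser.has_section(section):
            self.parser.add_section(section)
        self.parser.set(section, option.name, str(option.default))

    def _lookup(self, key):
        if isinstance(key, tuple):
            return key, self._options[key]
        # Backwards compatibility (item 2): accept "section_option" strings.
        for (section, name), option in self._options.items():
            if key == section + "_" + name:
                return (section, name), option
        raise KeyError(key)

    def __getitem__(self, key):
        return self._lookup(key)[1].value

    def __setitem__(self, key, value):
        (section, name), option = self._lookup(key)
        if not option.is_valid(value):
            raise ValueError("%r is not valid for %s" % (value, name))
        option.value = value
        # Item 3: keep the ConfigParser in step; no file is written here.
        self.parser.set(section, name, str(value))

    def display_name(self, section, name):
        return self._options[(section, name)].nice_name

    def default(self, section, name):
        return self._options[(section, name)].default

    def is_valid(self, section, name, value):
        return self._options[(section, name)].is_valid(value)


options = OptionsSketch()
options.define("pop3proxy", Option(
    "servers", "Servers", "", "The servers that will be proxied.",
    r"[\w.:,-]*", restore=False))
options.define("pop3proxy", Option(
    "add_evidence_header", "Clues Header", True,
    "Adds a header listing the spam clues.", (True, False)))

options[("pop3proxy", "servers")] = "pop.example.com"
print(options["pop3proxy_servers"])                  # old-style key still works
print(options.parser.get("pop3proxy", "servers"))    # parser sees the change
```

This takes the first of the two alternatives (item access returns the plain value, with metadata behind separate methods), since the current value is what callers want most of the time.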

From T.A.Meyer at massey.ac.nz  Thu Mar 20 19:25:58 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 20 02:26:33 2003
Subject: [Spambayes] Beta status checklist
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C92A@its-xchg4.massey.ac.nz>

> * Some kind of recovery from wordinfo database corruption

If the database itself is corrupt, is there really anything we can do,  
other than point them towards the db recovery tools? (Unless we expect  
people to hold onto ham & spam to retrain on).                          

I would suggest that if the db is dead, all we can do is rename it (for 
recovery purposes) and create a new, empty db.                          
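
As a sketch of that fallback (an assumption about how it could look, not existing spambayes code): try to open the database, and if that fails, rename the file with a timestamp suffix and open a fresh one. The function name and the opener argument are hypothetical; opener stands for whatever call normally opens or creates the wordinfo database.

```python
import os
import time


def open_or_replace(path, opener):
    """Open the database at path; if that fails, set it aside and start empty.

    opener is any callable that opens (or creates) a database file and raises
    an exception when the file cannot be read.  Returns (db, backup_path);
    backup_path is None when nothing had to be renamed.
    """
    try:
        return opener(path), None
    except Exception:
        if not os.path.exists(path):
            raise                    # nothing to set aside; a different problem
        backup = "%s.corrupt-%d" % (path, int(time.time()))
        os.rename(path, backup)      # keep the old file for the recovery tools
        return opener(path), backup
```

The renamed file is left untouched, so db_recover can still be tried on it later.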

> (nham and nspam are lost on an increasingly frequent basis)

It seems (via a grep for 'nham' or 'nspam') like the only things that   
use nham and nspam are:                                                     
* testing code (the user wouldn't be using this)                        
* experimental_ham_spam_imbalance (off by default)                      

If this is correct, does it really matter if nham and/or nspam are      
incorrect? (Not that the bugs shouldn't be traced down, however).       

=Tony Meyer

From acunningham at rsasecurity.com  Thu Mar 20 09:11:07 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Thu Mar 20 04:06:03 2003
Subject: [Spambayes] Beta status checklist (or this turning into newfe
	ature requests?)
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F172C@exuk01>

I wonder if something like the following would work:

1) Each user has a private spam database stored on the server.
2) A scheduled task will compute some kind of "average" of the databases.
This might include some kind of threshold (e.g., if more than 20% of users
say it's spam), or just straight averaging. 
3) Systems admins can train directly on the system database to provide
feedback as to whether a message is, in fact, spam or not.  I would also
have some kind of whitelist/blacklist built in.
4) Incoming mail is checked against both the user and the system database,
and scored against each, to get two scores: a user score (u-score) and a
system score (s-score).   Then you can apply both scores:

		u-score > 90% or s-score > 90% ==> spam
		u-score < 15% or s-score < 15% ==> ham


I guess it would take some kind of analysis to determine the best averaging
process.  This means that you end up with hundreds of people training the
database.
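
The two-score rule in step 4 could be sketched as follows. The rule as written does not say what happens when one score is above 90% and the other below 15%; treating that case as "unsure" is an assumption of this sketch, as are the function and parameter names.

```python
def combine(u_score, s_score, spam_cutoff=0.90, ham_cutoff=0.15):
    """Combine a user score and a system score (both in the range 0.0 to 1.0)."""
    spam_vote = u_score > spam_cutoff or s_score > spam_cutoff
    ham_vote = u_score < ham_cutoff or s_score < ham_cutoff
    if spam_vote and ham_vote:
        return "unsure"    # the two databases disagree (assumed behaviour)
    if spam_vote:
        return "spam"
    if ham_vote:
        return "ham"
    return "unsure"


print(combine(0.95, 0.50))
print(combine(0.95, 0.05))
```

How the system database itself is averaged from the user databases is left open here, since that is the part that needs analysis.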

AndyC 

-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer@massey.ac.nz] 
Sent: 19 March 2003 23:07
To: Spambayes
Subject: RE: [Spambayes] Beta status checklist (or this turning into
newfeature requests?)


> BTW, are there plans to develop server-side spambayes?                
> (Apologies if this is a FAQ.) I know that it can eclipse what         
> we are currently using on our Exchange server :-)                     

This is a FAQ, and does lend weight to TimS's suggestion that we need a list
of answers for FAQs :)

I have no experience with using spambayes in a server type situation, but
from reading the messages on the list, I believe that this can be done now,
to a certain extent. The real question is how you want training to be done -
does the admin do it? Does everyone contribute? Do you want users to have a
shared definition of spam, or individual?

=Tony Meyer

_______________________________________________
Spambayes mailing list
Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes

From acunningham at rsasecurity.com  Thu Mar 20 09:36:12 2003
From: acunningham at rsasecurity.com (Cunningham, Andy)
Date: Thu Mar 20 04:31:02 2003
Subject: [Spambayes] Outlook 2002
Message-ID: <418A63CAEBF2D4118A1A00508BB1A0B8029F172D@exuk01>

Mark

I tried out your change - in fact, I tried out several variants of newer
code, and in all of them I now seem to be getting a different error.  This
is based on the latest CVS build checked out at around 9AM GMT this morning,
though the same thing happens in the a2 release as well, now that I have
removed the source of the error (I traced the problem below to a moved .pst
file which hadn't been modified in outlook - so continuing the folder tree
walk on that error is probably a Good Thing.)


Traceback (most recent call last):
  File "C:\andyc\Install\spambayes\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 97, in OnInitDialog
    self.UpdateControlStatus()
  File "C:\andyc\Install\spambayes\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 143, in UpdateControlStatus
    watch_names = self.mgr.FormatFolderNames(
  File "C:\andyc\Install\spambayes\spambayes\Outlook2000\manager.py", line 222, in FormatFolderNames
    folder = self.message_store.GetFolder(eid)
  File "C:\andyc\Install\spambayes\spambayes\Outlook2000\msgstore.py", line 242, in GetFolder
    folder_id = self.NormalizeID(folder_id)
  File "C:\andyc\Install\spambayes\spambayes\Outlook2000\msgstore.py", line 195, in NormalizeID
    assert False, "We expect fully qualified IDs - second branch"
AssertionError: We expect fully qualified IDs - second branch
win32ui: OnInitDialog() virtual handler (<bound method ManagerDialog.OnInitDialog of <dialogs.ManagerDialog.ManagerDialog instance at 0x03141328>>) raised an exception
SpamAddin - Disconnecting from Outlook
Bayes database is not dirty - not writing
Addin terminating: 1 COM client and 2 COM servers exist.

The " - second branch" comment is where I modified the two identical assert
statements in NormalizeID so that I could tell which one was getting
triggered.  This is the second instance.  Commenting out the assertion seems
to allow everything to work properly, though I don't understand the code
well enough to ensure that I'm not storing problems up for later.....

AndyC 


-----Original Message-----
From: Mark Hammond [mailto:mhammond@skippinet.com.au] 
Sent: 17 March 2003 21:48
To: Cunningham, Andy; spambayes@python.org
Subject: RE: [Spambayes] Outlook 2002


>     msgstore = session.OpenMsgStore(0, eid, None, mapi.MDB_NO_MAIL |
> pywintypes.com_error: (-2147219968, 'OLE error 0x80040600', None, 
> None)

The error code for this is MAPI_E_CORRUPT_STORE, which doesn't sound good!

I have checked in a change so that any errors when walking the folder tree
are ignored.  However, this same error is going to happen, so that part of
your folder tree will *not* appear in the dialog.  Hopefully only a small
part of your tree is corrupt, so the folders you want will still be there -
you will have to try it and see.

Mark.

From popiel at wolfskeep.com  Thu Mar 20 07:52:05 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Thu Mar 20 10:52:08 2003
Subject: [Spambayes] Storing Options 
In-Reply-To: Message from "Meyer, Tony" <T.A.Meyer@massey.ac.nz> 
	<1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> 
References: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz> 
Message-ID: <20030320155205.428B62DE9E@cashew.wolfskeep.com>

In message:  <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz>
             "Meyer, Tony" <T.A.Meyer@massey.ac.nz> writes:

>Ok, here's a draft proposal for changes to create a new way of storing
>options. I'm not going to implement any of this unless there is a
>consensus that it's a good thing, so don't panic.

Looked good to me.

- Alex

From tim at fourstonesExpressions.com  Thu Mar 20 10:30:44 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu Mar 20 11:30:49 2003
Subject: [Spambayes] Storing Options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1318CDC1@its-xchg4.massey.ac.nz>
Message-ID: <2X05SOHG95URMGMGB794SRWTBAB9S96.3e79ecb4@myst>

+1 from me.

3/20/2003 1:09:40 AM, "Meyer, Tony" <T.A.Meyer@massey.ac.nz> wrote:

>Ok, here's a draft proposal for changes to create a new way of storing  
>options. I'm not going to implement any of this unless there is a       
>consensus that it's a good thing, so don't panic.                       
>
>There are four main changes, each outlined below. They can be
>implemented separately, but together they make the most sense.
>
>I would not change:                                                     
>* The search path for options (i.e. defaults, then envar, then          
>current/home directory).                                                
>* Storing the defaults inside a .py file rather than having a
>'bayes.ini' file (reading the archives, the reasons behind this make
>sense).
>
>So, here's what I do propose. Note that these are significant changes   
>and would require changes (improvements ;) all over the code. The user  
>would notice nothing, however. When you get a chance, please read       
>through these and comment.                                              
>
>1. Change from using getattr to getitem
>
>This means using 'options["pop3proxy_servers"]' rather than
>'options.pop3proxy_servers'. This avoids the possible problems with
>conflicts with existing OptionsClass attribute names, and allows #2 and
>(more easily) #3.
>
>2. Use the section data.
>
>This means using 'options[("pop3proxy", "servers")]'. For backwards     
>compatibility 'options["pop3proxy_servers"]' would return the value of
>any option named "pop3proxy_servers", whichever section it was in.      
>
>This is tidier, and allows neat things later on (like maybe only        
>loading option sections that are relevant). For the most part it is     
>already set up this way, it's just that Options currently throws away   
>all the section information.                                            
>
>3. Setting values propagates through to ConfigParser
>
>This means that 'options.pop3proxy_add_evidence_header = True' (with
>#1 & #2, 'options[("pop3proxy", "add_evidence_header")] = True')
>would not just change the Options object, but also the ConfigParser
>object that it inherits from.
>
>This *does not* mean that any files would be changed, but *does*
>mean that they could be updated on demand, via the write() function
>(or via the update() function in UpdatableConfigParser).
>
>4. Detailed options.
>
>Each option has the following attributes:                               
>* a name                                                                
>* a nice name                                                           
>* a default value                                                       
>* explanation text                                                      
>* either a tuple or a regex of allowed values                           
>* the current value                                                     
>* whether it should be restored on a 'return to defaults' command       
>
>Two simple examples:
>
>"pop3proxy_servers", "Servers", "", "These are the servers that will be
>proxied blah blah...", r"\w", "pop.example.com", False
>
>"add_evidence_header", "Clues Header", True, "This option adds a header 
>with the spam clues blah blah blah", (True, False), False, True         
>
>These would be accessed as follows:                                     
>nice name: options.display_name(sect, opt)                              
>default: options.default(sect, opt) - these would also be the values of 
>all options prior to loading any config file                            
>explanation text: options.doc(sect, opt)                                
>allowed values: options.valid_input(sect, opt)                          
>current value: either via options[(sect, opt)], or via                  
>options.get(sect, opt)                                                  
>restore on revert: options.no_restore(sect, opt)                        
>
>Also provided would be options.is_valid(sect, opt, value) which would   
>return True iff the value was valid for that option.                    
>
>OR options[(sect, opt)] / options.get(sect, opt) returns an Option
>object that has these things. This is nicer, but is more work to just
>get the current value, which is what is wanted most of the time.
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim.one at comcast.net  Thu Mar 20 16:54:38 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Mar 20 16:59:31 2003
Subject: [Spambayes] Beta status checklist
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13C8C92A@its-xchg4.massey.ac.nz>
Message-ID: <BIEJKCLHCIOIHAGOKOLHKEAOFDAA.tim.one@comcast.net>

> ...
> It seems (via a grep for 'nham' or 'nspam') like the only things that
> use nham and nspam are:
>
> * testing code (the user wouldn't be using this)
> * experimental_ham_spam_imbalance (off by default)
>
> If this is correct,

Nope, they enter into every probability calculation, via
Classifier.probability().  More, they have to.

I expect a real bug got hacked over instead of solved at the time these
int() calls got added to classifier.add_msg():

        if is_spam:
            self.nspam = int(self.nspam) + 1  # account for string nspam
        else:
            self.nham = int(self.nham) + 1   # account for string nham

That is, the database was hosed if these things were ever strings, or
someone hacked around a bad database integration in the wrong place.

Note that it's easy to show that nham and nspam must be ints, provided that
only methods of Classifier muck with a Classifier's instance variables.
Under the same assumption, no word's hamcount can exceed nham, or its
spamcount nspam.
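
To make the dependence on nham and nspam concrete, here is a simplified sketch in the spirit of the per-word probability calculation. It is not the actual Classifier.probability() code; the function names and the default values for the unknown-word prior and its strength are assumptions for illustration.

```python
def check_counts(nham, nspam, hamcount, spamcount):
    """Raise ValueError if the counts break the invariants described above."""
    for name, value in (("nham", nham), ("nspam", nspam)):
        if not isinstance(value, int):
            raise ValueError("%s must be an int, got %r" % (name, value))
    if hamcount > nham or spamcount > nspam:
        raise ValueError("a word count exceeds the number of trained messages")


def word_probability(hamcount, spamcount, nham, nspam,
                     unknown_prob=0.5, strength=0.45):
    """Spam probability for one word, pulled towards unknown_prob when rare."""
    check_counts(nham, nspam, hamcount, spamcount)
    n = hamcount + spamcount
    if n == 0:
        return unknown_prob
    # Each ratio is a fraction of the messages trained, so a wrong nham or
    # nspam shifts every word's probability.
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    prob = spamratio / (hamratio + spamratio)
    return (strength * unknown_prob + n * prob) / (strength + n)


print(word_probability(5, 5, 100, 100))   # about 0.5
print(word_probability(5, 5, 50, 100))    # about 0.34: same word, smaller nham
```

With the same word counts, halving nham moves the result from neutral to clearly hammy, which is why lost or wrong totals matter for every message scored.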


From noreply at sourceforge.net  Thu Mar 20 17:34:14 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 20 20:40:43 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur
	in maildir
Message-ID: <E18wBQE-0001mB-00@sc8-sf-web1.sourceforge.net>

Feature Requests item #703283, was opened at 2003-03-13 16:57
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: Matthew Cowles (mdcowles)
>Assigned to: Tim Stone (timstone4)
Summary: mboxtrain only trains on cur in maildir

Initial Comment:
When training on a maildir, mboxtrain trains only on
the messages in the subdirectory cur. It ignores
messages in the subdirectory new. Since new is for
messages that haven't been seen, I think it's worth
looking there since at least some spam will have been
filed unseen.

This is the same as bug 699174 which Tim Stone closed
saying, "This is a feature request.  If this remains as
a requirement, please
resubmit as such."

The patch attached to that bug report fixes the
behavior which I still consider a bug.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

From noreply at sourceforge.net  Thu Mar 20 17:47:53 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 20 20:40:50 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-703283 ] mboxtrain only trains on cur
	in maildir
Message-ID: <E18wBdR-00023K-00@sc8-sf-web3.sourceforge.net>

Feature Requests item #703283, was opened at 2003-03-13 16:57
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Matthew Cowles (mdcowles)
Assigned to: Tim Stone (timstone4)
Summary: mboxtrain only trains on cur in maildir

Initial Comment:
When training on a maildir, mboxtrain trains only on
the messages in the subdirectory cur. It ignores
messages in the subdirectory new. Since new is for
messages that haven't been seen, I think it's worth
looking there since at least some spam will have been
filed unseen.

This is the same as bug 699174 which Tim Stone closed
saying, "This is a feature request.  If this remains as
a requirement, please
resubmit as such."

The patch attached to that bug report fixes the
behavior which I still consider a bug.

----------------------------------------------------------------------

>Comment By: Tim Stone (timstone4)
Date: 2003-03-20 19:47

Message:
Logged In: YES 
user_id=645698

Added -n option to train mail in "new".  This leaves the current behavior of 
training only "cur" unaltered.
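
A sketch of the behaviour described, not the mboxtrain code itself: walk "cur" by default and add "new" only when asked. The function and parameter names are made up for the example.

```python
import os


def maildir_messages(maildir, include_new=False):
    """Yield message file paths from a maildir's cur (and optionally new)."""
    subdirs = ["cur"]
    if include_new:
        subdirs.append("new")      # what an option like -n would switch on
    for sub in subdirs:
        folder = os.path.join(maildir, sub)
        if not os.path.isdir(folder):
            continue
        for name in sorted(os.listdir(folder)):
            if name.startswith("."):
                continue           # skip dotfiles
            yield os.path.join(folder, name)
```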

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=703283&group_id=61702

From noreply at sourceforge.net  Thu Mar 20 17:49:02 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu Mar 20 20:40:57 2003
Subject: [Spambayes] 
	[ spambayes-Feature Requests-695059 ] wildcard support for mboxtrain
Message-ID: <E18wBeY-0007MB-00@sc8-sf-web2.sourceforge.net>

Feature Requests item #695059, was opened at 2003-02-28 07:54
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=695059&group_id=61702

Category: None
Group: None
Status: Open
Priority: 5
Submitted By: bill parducci (humantypo)
>Assigned to: Tim Stone (timstone4)
Summary: wildcard support for mboxtrain

Initial Comment:
i have about 40 folders that i use to keep track of
numerous e-mail lists, projects, scraps of digital
dementia, etc.

it would be very helpful if mboxtrain would accept
wildcards for mail folder identification.  yes, i could
have 40 command line params, but that adds a YAM (Yet
Another Maintenance) task to make sure that the folders
match the command line parameters. 

what would really be useful is if mboxtrain would keep
track of folders that it has read in that session
already. that way one could use the following syntax:

mboxtrain -d [db] -s [dir]/spam -g [dir]/*

and not have the ham process read the spam folder
(since it is likely that there will be only 1 spam
folder and multiple ham folders). i suppose you could
just hard code the ham flag parser to ignore folders
named 'spam' but that would kinda be horky...
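
One way the request could be sketched, without hard-coding a folder name: expand the spam patterns first, remember every folder seen, and let the ham wildcards skip anything already claimed. This is an illustration using the standard glob module, not mboxtrain code; expand_folders is a made-up name.

```python
import glob
import os


def expand_folders(spam_patterns, ham_patterns):
    """Expand wildcard patterns, never returning the same folder twice."""
    seen = set()

    def expand(patterns):
        result = []
        for pattern in patterns:
            for path in sorted(glob.glob(pattern)):
                real = os.path.realpath(path)
                if real not in seen:   # already read in this session?
                    seen.add(real)
                    result.append(path)
        return result

    spam = expand(spam_patterns)   # spam first, so ham wildcards skip it
    ham = expand(ham_patterns)
    return spam, ham
```

With folders spam, lists and work under one directory, passing [dir]/spam as the spam pattern and [dir]/* as the ham pattern yields spam for training as spam and only lists and work for training as ham.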

anyway, i think this would help in the move towards more 'set
& forget' operation.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=695059&group_id=61702

From T.A.Meyer at massey.ac.nz  Fri Mar 21 15:30:34 2003
From: T.A.Meyer at massey.ac.nz (Meyer, Tony)
Date: Thu Mar 20 22:31:16 2003
Subject: [Spambayes] Beta status checklist
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13C8C933@its-xchg4.massey.ac.nz>

> Nope, they enter into every probability calculation, via              
> Classifier.probability(). More, they have to.                         

I don't know how I missed that. I was even looking at that section of
the code; I remember reading those lines. Go figure.

> I expect a real bug got hacked over instead of solved at the          
> time these int() calls got added to classifier.add_msg():             
[...]
> That is, the database was hosed if these things were ever strings, or 
> someone hacked around a bad database integration in the wrong place.  

Really we need to solve the problem that's causing the incorrect counts,
rather than try and restore 'corrupt' db's. What we need, of course, is
someone who regularly sees this problem so that we can track it down.
Anyone out there?

=Tony Meyer

From tdickenson at devmail.geminidataloggers.co.uk  Fri Mar 21 09:33:12 2003
From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri Mar 21 04:33:16 2003
Subject: [Spambayes]  [ spambayes-Feature Requests-695059 ] wildcard
	support for mboxtrain
In-Reply-To: <E18wBeY-0007MB-00@sc8-sf-web2.sourceforge.net>
References: <E18wBeY-0007MB-00@sc8-sf-web2.sourceforge.net>
Message-ID: <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 21 March 2003 1:49 am, SourceForge.net wrote:

It sounds like you are aiming the same direction as me.

> it would be very helpful if mboxtrain would accept
> wildcards for mail folder identification.  yes, i could
> have 40 command line params, but that adds a YAM (Yet
> Another Maintenance) task to make sure that the folders
> match the command line parameters.

I am currently using a script that extracts all my mail folder names from a 
kmail configuration file, then builds up a long hammie command line and 
executes it. (I'm happy to contribute this if anyone is interested)

This is working well for me. 

Every day I perform a full train using hammie, not mboxtrain's incremental 
approach. This means I can use the mail reader to expire old messages, and 
have them removed from the spambayes database.

> and not have the ham process read the spam folder
>(since it is likely that there will be only 1 spam
> folder and multiple ham folders). 

I started with one folder, but am now using two. Filters put new spam in a 
spam folder, and at the end of the week I review it for hams, and move all 
the spams into a spam/archive folder.

> i suppose you could
> just hard code the ham flag parser to ignore folders
> named 'spam' but that would kinda be horky...

I assume that any folder named spam and its subfolders contain spam.
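
A minimal sketch of the wildcard idea discussed in this thread, assuming mail folders are plain files in a directory tree and that anything named 'spam', or stored below a directory named 'spam', holds spam. The function name `expand_folders` and its parameters are hypothetical; this only builds the two folder lists and does not call mboxtrain or hammie.

```python
import fnmatch
import os


def expand_folders(root, patterns, spam_name="spam"):
    """Return (ham_folders, spam_folders) for files under root matching a pattern.

    Patterns are shell-style wildcards matched against the path relative
    to root. Note that fnmatch's '*' also matches path separators.
    """
    ham, spam = [], []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if not any(fnmatch.fnmatch(rel, pat) for pat in patterns):
                continue
            # Assumption: a path component equal to spam_name marks spam,
            # which covers the spam folder itself and its subfolders.
            if spam_name in rel.split(os.sep):
                spam.append(path)
            else:
                ham.append(path)
    return ham, spam
```

With a layout such as inbox, lists/python, spam/new and spam/archive, the pattern `*` would put the first two in the ham list and the last two in the spam list, so no folder has to be named on the command line by hand.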


From Paul.Moore at atosorigin.com  Fri Mar 21 13:13:04 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Fri Mar 21 08:13:35 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>

I'm trying to get a friend set up with Spambayes using Outlook
Express. To get some initial training sorted, it would be nice to
get a mbox file of some of his existing messages which he could
train on. But I can't find a way of getting OE to save a mbox
file. Is there a way? Any OE victims around here...?

Thanks,
Paul.

From tim at fourstonesExpressions.com  Fri Mar 21 07:32:49 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar 21 08:32:56 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
Message-ID: <WR1TE0LIE0SPWQD874GEYWQMBAIEA0FD.3e7b1481@myst>

>But I can't find a way of getting OE to save a mbox
>file. Is there a way?

You're sore out of luck on that one, dude.

Outlook-Express-Victim-ly yours - TimS

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim at fourstonesExpressions.com  Fri Mar 21 08:45:49 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar 21 09:45:55 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <WR1TE0LIE0SPWQD874GEYWQMBAIEA0FD.3e7b1481@myst>
Message-ID: <43F0MG1VVSTR95LJMJQ072IHCB98WT.3e7b259d@myst>

3/21/2003 7:32:49 AM, Tim Stone - Four Stones Expressions 
<tim@fourstonesExpressions.com> wrote:

>>But I can't find a way of getting OE to save a mbox
>>file. Is there a way?
>
>You're sore out of luck on that one, dude.

Well, it appears as if I've spoken a bit too soon on this one.  I did some 
digging, and found a program called MailNavigator 
(http://www.mailnavigator.com/mailnavigator.html), that can read OE mailboxes 
and export them as an mbox.  I've downloaded it, tried it, and it works. 

When you start it up, do File->Load External Mailbox...  Point the browser 
window at the OE inbox.dbx file, normally in

  Documents and Settings\currentuser\Local Settings\Application Data\Identities\{bunchaglorp}\Microsoft\Outlook Express

You should see your inbox (or whatever folder you loaded) contents in 
MailNavigator.  Then do Message->Select All, then Message->Save As..., pick 
a file name and location, and select file type RFC822-text file... et voila, 
you have an mbox!

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From noreply at sourceforge.net  Fri Mar 21 05:35:52 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri Mar 21 09:48:15 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows
	doesn't work...
Message-ID: <E18wMga-00084F-00@sc8-sf-web3.sourceforge.net>

Bugs item #707491, was opened at 2003-03-21 13:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Moore (pmoore)
Assigned to: Nobody/Anonymous (nobody)
Summary: Pop3 proxy service code for Windows doesn't work...

Initial Comment:
The pop3proxy_service.py program doesn't seem to 
work with Python 2.2.2. The problem is that a main 
program doesn't have a __file__ variable defined. (This 
works in Python 2.3, which I guess is why this got 
missed...)

I've attached a "quick fix" patch, which uses a helper 
module "findme.py".
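
The attached patch is not shown in this report, so the following is only a sketch of one way to cope with a main program that has no __file__, assuming that sys.argv[0] names the script in that case. The function name `script_directory` is hypothetical and is not the contents of findme.py.

```python
import os
import sys


def script_directory(module_globals):
    """Best-effort directory of the running script.

    Pass globals() from the main program. If __file__ is missing
    (as reported for Python 2.2.2 main programs), fall back to
    sys.argv[0], which is an assumption about how the script was started.
    """
    path = module_globals.get("__file__")
    if path is None:
        path = sys.argv[0]
    return os.path.dirname(os.path.abspath(path))
```

A helper module imported by the main program is another option, since an imported module loaded from a file does have __file__ set; the fallback above simply avoids the extra file.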

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

From noreply at sourceforge.net  Fri Mar 21 05:36:47 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri Mar 21 09:48:22 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows
	doesn't work...
Message-ID: <E18wMhT-0006RY-00@sc8-sf-web2.sourceforge.net>

Bugs item #707491, was opened at 2003-03-21 13:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Moore (pmoore)
>Assigned to: Mark Hammond (mhammond)
Summary: Pop3 proxy service code for Windows doesn't work...

Initial Comment:
The pop3proxy_service.py program doesn't seem to 
work with Python 2.2.2. The problem is that a main 
program doesn't have a __file__ variable defined. (This 
works in Python 2.3, which I guess is why this got 
missed...)

I've attached a "quick fix" patch, which uses a helper 
module "findme.py".

----------------------------------------------------------------------

>Comment By: Paul Moore (pmoore)
Date: 2003-03-21 13:36

Message:
Logged In: YES 
user_id=113328

File attachment didn't work :-(

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

From tim at fourstonesExpressions.com  Fri Mar 21 08:53:17 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar 21 09:53:23 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <43F0MG1VVSTR95LJMJQ072IHCB98WT.3e7b259d@myst>
Message-ID: <EAUOC8YSWRE0C7FAGBLIWUBAB4Y6TQ.3e7b275d@myst>

3/21/2003 8:45:49 AM, Tim Stone - Four Stones Expressions 
<tim@fourstonesExpressions.com> wrote:

>3/21/2003 7:32:49 AM, Tim Stone - Four Stones Expressions 
><tim@fourstonesExpressions.com> wrote:
>
>>>But I can't find a way of getting OE to save a mbox
>>>file. Is there a way?
>>
>>You're sore out of luck on that one, dude.
>
>Well, it appears as if I've spoken a bit too soon on this one.  I did some 
>digging

More digging.  There's a sourceforge project called mbx2mbox, at 
http://mbx2mbox.sourceforge.net/  I haven't tried this, but it looks like it 
will do what you want, as well.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From Paul.Moore at atosorigin.com  Fri Mar 21 15:07:37 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Fri Mar 21 10:08:09 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D99F@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
>> Well, it appears as if I've spoken a bit too soon on this one.
>> I did some digging
[...]
> More digging.

Thanks for these! I'll pass the info on...

Paul


From popiel at wolfskeep.com  Fri Mar 21 08:24:30 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar 21 11:24:34 2003
Subject: [Spambayes] [ spambayes-Feature Requests-695059 ] wildcard
	support for mboxtrain 
In-Reply-To: Message from Toby Dickenson
	<tdickenson@devmail.geminidataloggers.co.uk> 
	<200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> 
References: <E18wBeY-0007MB-00@sc8-sf-web2.sourceforge.net>
	<200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk> 
Message-ID: <20030321162430.C6D952DE2F@cashew.wolfskeep.com>

In message:  <200303210933.12940.tdickenson@devmail.geminidataloggers.co.uk>
             Toby Dickenson <tdickenson@devmail.geminidataloggers.co.uk> writes:
>On Friday 21 March 2003 1:49 am, SourceForge.net wrote:
>
>It sounds like you are aiming the same direction as me.
>
>> it would be very helpful if mboxtrain would accept
>> wildcards for mail folder identification. [...]

>I am currently using a script that extracts all my mail folder names from a 
>kmail configuration file, then builds up a long hammie command line and 
>executes it. (Im happy to contribute this if anyone is interested)
>
>This is working well for me. 

My approach to this problem is that I make two copies of every mail;
one copy goes into an 'everything' folder, and the other copy gets
delivered into 'inbox' or 'newspam' as appropriate.  As I review spam
(or find it as false negatives), I move it into a 'spam' folder.

For training, ham = everything - spam - newspam.  Naming three folders
doesn't seem to be a big deal, whereas naming all the innumerable
folders that my inbox gets sorted into would be.

The code to do this is checked in under contrib as bulktrain.sh and
bulkgraph.py, and described in BULK.txt.

>Every day I perform a full train using hammie, not mboxtrain's incremental 
>approach. This means I can use the mail reader to expire old messages, and 
>have them removed from the spambayes database.

I just ignore everything more than 120 days old, personally... and
that's just to keep the database around 20 meg.  Tests show that it
hurts accuracy by less than 1%.  Of course, ignoring everything over
a week old hurts less than 5%...
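
A rough sketch of the selection rule described in this message (ham = everything - spam - newspam, ignoring old mail), assuming messages can be matched across folders by their Message-ID header. This is not the contrib bulktrain.sh code; the function name `select_ham` and its parameters are hypothetical.

```python
import email.utils
import time


def select_ham(everything, spam, newspam, max_age_days=120, now=None):
    """Return messages from `everything` that are in neither spam folder
    and are not older than max_age_days.

    Each argument is an iterable of email.message.Message objects.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    spam_ids = {m["Message-ID"] for m in list(spam) + list(newspam)}
    spam_ids.discard(None)  # messages without an ID cannot be matched
    ham = []
    for msg in everything:
        if msg["Message-ID"] in spam_ids:
            continue
        date = msg["Date"]
        try:
            sent = email.utils.parsedate_to_datetime(date).timestamp() if date else None
        except (TypeError, ValueError):
            sent = None  # unparseable date: keep the message
        if sent is not None and sent < cutoff:
            continue  # too old to train on
        ham.append(msg)
    return ham
```

Only three folders need to be named, however many folders the inbox is later sorted into, and the age cutoff keeps the training set (and so the database) bounded.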

- Alex

From skip at pobox.com  Fri Mar 21 15:58:35 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar 21 16:58:44 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks
Message-ID: <15995.35595.805586.553059@montanaro.dyndns.org>


Has anybody thought about how any of the Spambayes tools would perform in
the face of disk quotas or full disk partitions?  Here at Northwestern they
are going to start supporting IMAP (against their better wishes, but the
customer is always right).  Because they have roughly 30,000 active email
accounts and IMAP allows (requires?) mail to be stored on the server, they
are going to institute disk quotas on the mail servers for the first time.
Procmail+SpamAssassin seems to be breaking in some situations and SA is
(incorrectly, I believe) getting egg on its face as a result.  I'd like to
make sure Spambayes has these various problems addressed.

Skip


From popiel at wolfskeep.com  Fri Mar 21 17:34:19 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Mar 21 20:34:24 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15995.35595.805586.553059@montanaro.dyndns.org> 
References: <15995.35595.805586.553059@montanaro.dyndns.org> 
Message-ID: <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>

In message:  <15995.35595.805586.553059@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>Has anybody thought about how any of the Spambayes tools would perform in
>the face of disk quotas or full disk partitions?

Very poorly.  I think that'd send it straight into DB corruption.
In general, spambayes is likely to require a bit more disk space
than any fixed-pattern classifier like SpamAssassin... my database
is about 20 megs, for instance.  I don't think that SpamAssassin
requires more than a few K of personal storage, unless you turn on
its bayesian stuff...

- Alex

From tshumway at jdiworks.net  Fri Mar 21 18:19:30 2003
From: tshumway at jdiworks.net (Terrel Shumway)
Date: Fri Mar 21 21:14:38 2003
Subject: [Spambayes] Binaries for MSwin
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIELBOHAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPIELBOHAA.mhammond@skippinet.com.au>
Message-ID: <200303211819.30500.tshumway@jdiworks.net>

On Wednesday 19 March 2003 15:22, Mark Hammond wrote:
> > > Add the Outlook binary installer as part of the release?  Absolutely!
>
> Sounds fine to me - except I would raise the bar a little - why not make a
> pop3proxy *binary* release for Windows too - then the problem becomes
> moot- on Windows you get a binary.

one more reason to publish binaries for mswin: ZoneAlarm.
popfile, written in perl, forces the average[1] user to allow all perl 
programs to access the internet -- a gaping hole in your firewall. (I 
consider this a defect in ZoneAlarm's design, but I don't think it is going 
away anytime soon.)


---
[1] a sophisticated user could create a private copy of perl.exe and call it 
popfile.exe

From tim at fourstonesExpressions.com  Fri Mar 21 22:23:43 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri Mar 21 23:23:51 2003
Subject: [Spambayes] Binaries for MSwin
In-Reply-To: <200303211819.30500.tshumway@jdiworks.net>
Message-ID: <TRID09NLPLHURQPFBURTOWQEDFERMOI.3e7be54f@myst>

3/21/2003 8:19:30 PM, Terrel Shumway <tshumway@jdiworks.net> wrote:

>one more reason to publish binaries for mswin: ZoneAlarm.
>popfile, written in perl, forces the average[1] user to allow all perl 
>programs to access the internet -- a gaping hole in your firewall.

Excellent point.


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From skip at pobox.com  Sat Mar 22 00:22:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar 22 01:22:20 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks 
In-Reply-To: <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
References: <15995.35595.805586.553059@montanaro.dyndns.org>
        <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
Message-ID: <15996.275.455989.47435@montanaro.dyndns.org>


    >> Has anybody thought about how any of the Spambayes tools would
    >> perform in the face of disk quotas or full disk partitions?

    Alex> Very poorly.  I think that'd send it straight into DB corruption.

I'm less concerned with database corruption than loss of email.  For stuff
like hammiefilter, the database is opened read-only anyway.

Skip


From popiel at wolfskeep.com  Fri Mar 21 22:28:44 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Mar 22 01:28:48 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15996.275.455989.47435@montanaro.dyndns.org> 
References: <15995.35595.805586.553059@montanaro.dyndns.org>
	<20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
	<15996.275.455989.47435@montanaro.dyndns.org> 
Message-ID: <20030322062844.D04932DE2F@cashew.wolfskeep.com>

In message:  <15996.275.455989.47435@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    >> Has anybody thought about how any of the Spambayes tools would
>    >> perform in the face of disk quotas or full disk partitions?
>
>    Alex> Very poorly.  I think that'd send it straight into DB corruption.
>
>I'm less concerned with database corruption than loss of email.  For stuff
>like hammiefilter, the database is opened read-only anyway.

Eh, in that case, it's not spambayes's problem.  Mail delivery is
outside the scope of a classifier.  At most, pop3proxy's private
cache would be affected... but I don't think that's how you'd be
using the system in a server-based environment.

- Alex

From skip at pobox.com  Sat Mar 22 00:42:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar 22 01:42:30 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks 
In-Reply-To: <20030322062844.D04932DE2F@cashew.wolfskeep.com>
References: <15995.35595.805586.553059@montanaro.dyndns.org>
        <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
        <15996.275.455989.47435@montanaro.dyndns.org>
        <20030322062844.D04932DE2F@cashew.wolfskeep.com>
Message-ID: <15996.1488.414014.793768@montanaro.dyndns.org>

    >> I'm less concerned with database corruption than loss of email.  For
    >> stuff like hammiefilter, the database is opened read-only anyway.

    Alex> Eh, in that case, it's not spambayes's problem.  Mail delivery is
    Alex> outside the scope of a classifier.  At most, pop3proxy's private
    Alex> cache would be affected... but I don't think that's how you'd be
    Alex> using the system in a server-based environment.

I agree, but hammiefilter (for example), has to respond appropriately (no
tracebacks, proper exit code so callers like procmail can do the right
thing) if it encounters an IOError.  Similarly, pop3proxy has to not lose
messages if it finds it can't write the message to the disk.
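
One way to read the requirement above, as a sketch only: a filter wrapper that never lets a traceback escape, passes the original message through untouched when classification fails, and reports the failure through its return value. The names `run_filter` and `classify` are hypothetical, and the choice of exit status 75 (EX_TEMPFAIL from sysexits.h) is an assumption about what the calling delivery agent expects.

```python
import sys

# Assumed status for "temporary failure"; adjust to what the caller expects.
EX_TEMPFAIL = 75


def run_filter(classify, raw_message, out):
    """Write the classified message to `out` and return an exit status.

    `classify` is any callable that takes the raw message text and
    returns the tagged text. If it fails with an I/O error (for example
    a full disk), the original message is written unchanged so that
    nothing is lost, and a non-zero status is returned.
    """
    try:
        tagged = classify(raw_message)
    except OSError as exc:  # IOError is an alias of OSError in Python 3
        sys.stderr.write("filter: classification failed: %s\n" % exc)
        out.write(raw_message)
        return EX_TEMPFAIL
    out.write(tagged)
    return 0
```

A main program would finish with something like `sys.exit(run_filter(my_classifier, sys.stdin.read(), sys.stdout))`; what a caller such as procmail then does with the status depends on its recipe.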

Skip

From spambayes at djl.freeuk.com  Sat Mar 22 11:43:44 2003
From: spambayes at djl.freeuk.com (David Leftley)
Date: Sat Mar 22 06:43:48 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
Message-ID: <pvio7v853t6tuvp4ks8bvdufruo4qa1dht@4ax.com>

On Fri, 21 Mar 2003 13:13:04 -0000, "Moore, Paul"
<Paul.Moore@atosorigin.com> wrote:

>I'm trying to get a friend set up with Spambayes using Outlook
>Express. To get some initial training sorted, it would be nice to
>get a mbox file of some of his existing messages which he could
>train on. But I can't find a way of getting OE to save a mbox
>file. Is there a way? Any OE victims around here...?
>
Possibly the simplest way to approach this is to install a copy of
Eudora, and tell it to import the messages from OE. I believe Eudora
uses standard mbox files for its storage.

David.


From francois.granger at free.fr  Sat Mar 22 14:20:54 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sat Mar 22 08:21:02 2003
Subject: [Spambayes] Getting a mbox file from Outlook Express
In-Reply-To: <pvio7v853t6tuvp4ks8bvdufruo4qa1dht@4ax.com>
References: 
 <16E1010E4581B049ABC51D4975CEDB880113D99B@UKDCX001.uk.int.atosorigin.com>
 <pvio7v853t6tuvp4ks8bvdufruo4qa1dht@4ax.com>
Message-ID: <a05200f3fbaa2138c6569@[192.168.1.20]>

At 11:43 +0000 22/03/2003, in message Re: [Spambayes] Getting a mbox 
file from Outlook Expres, David Leftley wrote:
>On Fri, 21 Mar 2003 13:13:04 -0000, "Moore, Paul"
><Paul.Moore@atosorigin.com> wrote:
>
>>I'm trying to get a friend set up with Spambayes using Outlook
>>Express. To get some initial training sorted, it would be nice to
>>get a mbox file of some of his existing messages which he could
>>train on. But I can't find a way of getting OE to save a mbox
>>file. Is there a way? Any OE victims around here...?
>>
>Possibly the simplest way to approach this is to install a copy of
>Eudora, and tell it to import the messages from OE. I believe Eudora
>uses standard mbox files for its storage.

Not exactly standard, because it extracts the enclosures.

-- 
Hofstadter's Law :
It always takes longer than you expect, even when you take into 
account Hofstadter's Law.

From bill at parducci.net  Sat Mar 22 06:44:35 2003
From: bill at parducci.net (bill parducci)
Date: Sat Mar 22 09:44:40 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks
In-Reply-To: <15996.1488.414014.793768@montanaro.dyndns.org>
References: <15995.35595.805586.553059@montanaro.dyndns.org>
	<20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
	<15996.275.455989.47435@montanaro.dyndns.org>
	<20030322062844.D04932DE2F@cashew.wolfskeep.com>
	<15996.1488.414014.793768@montanaro.dyndns.org>
Message-ID: <3E7C76D3.9000206@parducci.net>

this issue, in combination with some of the manual processes posted to the list to maintain db size and relevancy has made me wonder if spambayes shouldn't incorporate the ability to FIFO token/training info. 

it seems that the most straightforward way to do this would be to time stamp each entry into the db and then have a configurable param indicating how long the db should keep information before pruning it (ostensibly during the training process). 
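
A toy sketch of the time-stamped pruning idea, assuming an in-memory dictionary in place of the real database and a single "last seen" time per token. The class `AgingTokenStore` is hypothetical and simplified: it expires whole tokens rather than untraining individual messages, and it does not track the overall spam and ham message counts a real classifier needs.

```python
import time


class AgingTokenStore:
    """Token store mapping token -> (spam_count, ham_count, last_seen)."""

    def __init__(self, max_age_seconds):
        self.max_age = max_age_seconds  # configurable lifetime of an entry
        self.tokens = {}

    def train(self, tokens, is_spam, now=None):
        """Count each distinct token once, stamp it, then prune."""
        now = time.time() if now is None else now
        for tok in set(tokens):
            spam, ham, _ = self.tokens.get(tok, (0, 0, now))
            if is_spam:
                spam += 1
            else:
                ham += 1
            self.tokens[tok] = (spam, ham, now)
        self.prune(now)

    def prune(self, now=None):
        """Drop tokens not seen within max_age; return how many were dropped."""
        now = time.time() if now is None else now
        cutoff = now - self.max_age
        stale = [t for t, (_, _, seen) in self.tokens.items() if seen < cutoff]
        for tok in stale:
            del self.tokens[tok]
        return len(stale)
```

The extra timestamp makes every entry larger, as noted in this message, but the number of entries is then bounded by how many distinct tokens appear within the configured window.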

this would fundamentally increase the size of the db in order to store this info, but should make it much more predictable in terms of size. given the results of some of the notes that i have seen on the list, it seems that mail more than a couple of months old doesn't add to the accuracy of the system (and in some cases can decrease it) so i don't see this as a detriment to the system's behavior (as long as the data life span is reasonable).

just thinking out loud, but this seems like a move forward in creating a 'set & forget' system.

b


From skip at pobox.com  Sat Mar 22 09:52:33 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar 22 10:53:30 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks
In-Reply-To: <3E7C76D3.9000206@parducci.net>
References: <15995.35595.805586.553059@montanaro.dyndns.org>
        <20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
        <15996.275.455989.47435@montanaro.dyndns.org>
        <20030322062844.D04932DE2F@cashew.wolfskeep.com>
        <15996.1488.414014.793768@montanaro.dyndns.org>
        <3E7C76D3.9000206@parducci.net>
Message-ID: <15996.34497.75006.235906@montanaro.dyndns.org>


    bill> this issue, in combination with some of the manual processes
    bill> posted to the list to maintain db size and relevancy has made me
    bill> wonder if spambayes shouldn't incorporate the ability to FIFO
    bill> token/training info.

This is also not what I'm worried about.  While we need to provide means to
manage the size of the database, that is essentially an offline activity.
I'm worried simply about the situation where a mail message arrives and
there's no disk space left to process it properly.

You really can't control the way the database file size grows.  Since it's
implementing a hash, once the key density gets too high, it expands the
database dramatically and shuffles things all around.  In between these
striking leaps in size, the database grows little, if at all, for each new
key added.

Let me restate the problem: I just don't want Spambayes to be accused,
rightly or wrongly, of losing mail because a disk quota was exceeded or a
disk partition filled up.  Everything else is merely an inconvenience.  Lost
mail can't be recovered.  What motivated this was an (incorrect, in my
opinion) assumption by a sys admin where I work that because there was a
failure in a mail setup using procmail and SpamAssassin when the disk quota
was exceeded that it was obviously a SpamAssassin problem.  

Skip

From wsy at merl.com  Sat Mar 22 06:20:47 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Sat Mar 22 12:21:37 2003
Subject: [Spambayes] Binaries for MSwin
Message-ID: <200303221120.h2MBKlQ01327@localhost.localdomain>


   From: Terrel Shumway <tshumway@jdiworks.net>

   > Sounds fine to me - except I would raise the bar a little - why not make a
   > pop3proxy *binary* release for Windows too - then the problem becomes
   > moot- on Windows you get a binary.

   one more reason to publish binaries for mswin: ZoneAlarm.
   popfile, written in perl, forces the average[1] user to allow all perl 
   programs to access the internet -- a gaping hole in your firewall. (I 
   consider this a defect in ZoneAlarm's design, but I don't think it is going 
   away anytime soon.)

   ---

   [1] a sophisticated user could create a private copy of perl.exe
   and call it popfile.exe

Or a sophisticated _installer_ program could make that copy (or symlink)
of perl.exe itself, name it popfile.exe, and all would be well.

   -Bill Y.

From bill at parducci.net  Sat Mar 22 10:27:45 2003
From: bill at parducci.net (bill parducci)
Date: Sat Mar 22 13:27:49 2003
Subject: [Spambayes] filtering in the face of disk quotas or full disks
In-Reply-To: <15996.34497.75006.235906@montanaro.dyndns.org>
References: <15995.35595.805586.553059@montanaro.dyndns.org>
	<20030322013419.5BF1F2DE2F@cashew.wolfskeep.com>
	<15996.275.455989.47435@montanaro.dyndns.org>
	<20030322062844.D04932DE2F@cashew.wolfskeep.com>
	<15996.1488.414014.793768@montanaro.dyndns.org>
	<3E7C76D3.9000206@parducci.net>
	<15996.34497.75006.235906@montanaro.dyndns.org>
Message-ID: <3E7CAB21.3000604@parducci.net>


Skip Montanaro wrote:
> This is also not what I'm worried about.  While we need to provide means to
> manage the size of the database, that is essentially an offline activity.
> I'm worried simply about the situation where a mail message arrives and
> there's no disk space left to process it properly.

ok, but to date, this is a *manual* 'offline activity' involving any number of homegrown solutions to resolve. while this is operationally acceptable to advanced users such as those that mind this list, i believe that it is impractical for the vast majority of those who could benefit from this solution (but are unable/unwilling to keep multiple copies of mail in numerous files, etc.)

> You really can't control the way the database file size grows.  Since it's
> implementing a hash, once the key density gets too high, it expands the
> database dramatically and shuffles things all around.  In between these
> striking leaps in size, the database grows little, if at all, for each new
> key added.

perhaps using the current hash architecture, but if you have the ability to maintain the size of the input pool (possibly via a secondary data store that handles raw tokens), then it seems illogical that the size of the db cannot be managed within reason.

> Let me restate the problem: I just don't want Spambayes to be accused,
> rightly or wrongly, of losing mail because a disk quota was exceeded or a
> disk partition filled up.  Everything else is merely an inconvenience.  Lost
> mail can't be recovered.  What motivated this was an (incorrect, in my
> opinion) assumption by a sys admin where I work that because there was a
> failure in a mail setup using procmail and SpamAssassin when the disk quota
> was exceeded that it was obviously a SpamAssassin problem.  

good luck preventing misplaced accusations! :o) 

b


From noreply at sourceforge.net  Sat Mar 22 23:35:31 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun Mar 23 03:31:20 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows
	doesn't work...
Message-ID: <E18x00x-0007Rw-00@sc8-sf-web1.sourceforge.net>

Bugs item #707491, was opened at 2003-03-22 00:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

Category: pop3proxy
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Moore (pmoore)
Assigned to: Mark Hammond (mhammond)
Summary: Pop3 proxy service code for Windows doesn't work...

Initial Comment:
The pop3proxy_service.py program doesn't seem to 
work with Python 2.2.2. The problem is that a main 
program doesn't have a __file__ variable defined. (This 
works in Python 2.3, which I guess is why this got 
missed...)

I've attached a "quick fix" patch, which uses a helper 
module "findme.py".

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-23 18:35

Message:
Logged In: YES 
user_id=14198

Fixed in r1.3 - thanks.

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2003-03-22 00:36

Message:
Logged In: YES 
user_id=113328

File attachment didn't work :-(

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

From noreply at sourceforge.net  Sat Mar 22 23:35:48 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sun Mar 23 03:31:28 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-707491 ] Pop3 proxy service code for Windows
	doesn't work...
Message-ID: <E18x01E-0007SD-00@sc8-sf-web1.sourceforge.net>

Bugs item #707491, was opened at 2003-03-22 00:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

Category: pop3proxy
Group: None
>Status: Closed
>Resolution: Fixed
Priority: 5
Submitted By: Paul Moore (pmoore)
Assigned to: Mark Hammond (mhammond)
Summary: Pop3 proxy service code for Windows doesn't work...

Initial Comment:
The pop3proxy_service.py program doesn't seem to 
work with Python 2.2.2. The problem is that a main 
program doesn't have a __file__ variable defined. (This 
works in Python 2.3, which I guess is why this got 
missed...)

I've attached a "quick fix" patch, which uses a helper 
module "findme.py".

----------------------------------------------------------------------

Comment By: Mark Hammond (mhammond)
Date: 2003-03-23 18:35

Message:
Logged In: YES 
user_id=14198

Fixed in r1.3 - thanks.

----------------------------------------------------------------------

Comment By: Paul Moore (pmoore)
Date: 2003-03-22 00:36

Message:
Logged In: YES 
user_id=113328

File attachment didn't work :-(

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=707491&group_id=61702

From Eugeny.Sattler at RU.NESTLE.com  Mon Mar 24 17:25:23 2003
From: Eugeny.Sattler at RU.NESTLE.com (Eugeny.Sattler@RU.NESTLE.com)
Date: Mon Mar 24 09:53:05 2003
Subject: [Spambayes] SpamBayes and Outlook 2000
Message-ID: <5D7D85C4DFC1D411BD8700B0D07810E00174A272@KUFMXS04>

Hi,
I would like to try your Outlook 2000 add-in.
Pls tell me, is it for POP3 connection only or suitable also for MS Exchange
Server 5.5 environment ?
Thanks.


-- 
Eugeny


From mhammond at skippinet.com.au  Tue Mar 25 08:04:13 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Mar 24 16:04:57 2003
Subject: [Spambayes] SpamBayes and Outlook 2000
In-Reply-To: <5D7D85C4DFC1D411BD8700B0D07810E00174A272@KUFMXS04>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEKIOIAA.mhammond@skippinet.com.au>

> Hi,
> I would like to try your Outlook 2000 add-in.
> Pls tell me, is it for POP3 connection only or suitable also for 
> MS Exchange
> Server 5.5 environment ?
> Thanks.

It is suitable for both.

Regards,

Mark.


From skip at pobox.com  Mon Mar 24 15:16:20 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 24 16:16:30 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
Message-ID: <15999.30116.549922.124871@montanaro.dyndns.org>


Is there some reason the storage.DBDictClassifier class doesn't implement a
__del__ method which calls store()?  If not, I'm going to add one. 
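
A sketch of what such a __del__ could look like, using a hypothetical stand-in class rather than the real storage.DBDictClassifier. One caveat worth keeping in mind: Python does not guarantee that __del__ runs for objects still alive when the interpreter exits, so the sketch treats it as a safety net next to an explicit close().

```python
class SavingClassifier:
    """Hypothetical classifier-like object with a store() method."""

    def __init__(self, backend):
        self.backend = backend  # dict-like stand-in for the on-disk database
        self.counts = {}
        self.closed = False

    def store(self):
        """Write the in-memory counts to the backend."""
        self.backend.update(self.counts)

    def close(self):
        """Store once; later calls do nothing."""
        if not self.closed:
            self.store()
            self.closed = True

    def __del__(self):
        # Safety net only: errors here cannot be reported usefully,
        # so they are swallowed rather than raised during teardown.
        try:
            self.close()
        except Exception:
            pass

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False
```

Callers that want a guaranteed save can use `with SavingClassifier(db) as c:` or call close() themselves; __del__ then only covers the case where they forget.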

Skip

From noreply at sourceforge.net  Mon Mar 24 14:19:35 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 24 17:19:39 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-709051 ] Error loading configuration should not be
	fatal
Message-ID: <E18xaI3-0004vi-00@sc8-sf-web1.sourceforge.net>

Bugs item #709051, was opened at 2003-03-25 09:19
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Error loading configuration should not be fatal

Initial Comment:
There was a report of this error using the second
binary release:

SpamAddin - Connecting to Outlook
pythoncom error: Failed to call the universal dispatcher
Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 152, in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 355, in LoadConfig
exceptions.EOFError: 

While there is another problem that caused this error,
we should not die completely when loading the config pickle
should it get screwed up.  However, as this means
spambayes will be unconfigured, we do need a scheme to
let the user know this (as we do in the few other
places where we disable spambayes due to config errors)
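
A minimal sketch of the non-fatal load described above, in current Python
rather than the Python 2.2 the add-in ran on.  The function name
load_config and the defaults dictionary are hypothetical, not the real
manager.py code; the only point is that a truncated pickle yields defaults
plus an error the caller can show to the user.

```python
import pickle


def load_config(path, defaults):
    """Return (config, error); error is None when the pickle loaded cleanly."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f), None
    except (OSError, EOFError, pickle.UnpicklingError) as exc:
        # A missing, truncated or unreadable pickle falls back to the
        # defaults; the exception is handed back so the caller can warn
        # the user that the add-in is running unconfigured.
        return dict(defaults), exc
```

A zero-length file, such as one left behind by a power failure during a
write, takes the EOFError branch here.  A damaged pickle can raise other
exception types as well, so the tuple above is a starting point only.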

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702

From noreply at sourceforge.net  Mon Mar 24 14:56:42 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 24 17:46:04 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-709051 ] Error loading configuration should not be
	fatal
Message-ID: <E18xary-0005ck-00@sc8-sf-web1.sourceforge.net>

Bugs item #709051, was opened at 2003-03-25 09:19
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Error loading configuration should not be fatal

----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-03-25 09:56

Message:
Logged In: YES 
user_id=14198

The reporter just let me know that the problem was caused by
about 20 power failures over a short period.  So I don't think
we can cure the cause here, just the symptoms.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=709051&group_id=61702

From tim at fourstonesExpressions.com  Mon Mar 24 19:34:18 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 24 20:34:25 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To: <15999.30116.549922.124871@montanaro.dyndns.org>
Message-ID: <TP621XCAD72SMZD9YWUR32UQVSVRTR.3e7fb21a@myst>

3/24/2003 3:16:20 PM, Skip Montanaro <skip@pobox.com> wrote:

>
>Is there some reason the storage.DBDictClassifier class doesn't implement a
>__del__ method which calls store()?  If not, I'm going to add one. 

Yup.  There is no guarantee that the __del__ method is called, so we (Richie 
and I) felt like rather than give the impression that store would always be 
called, it would be better to make it explicit.  You know.. the old "dumb 
beats smart" thing.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From skip at pobox.com  Mon Mar 24 22:22:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 24 23:22:34 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To: <TP621XCAD72SMZD9YWUR32UQVSVRTR.3e7fb21a@myst>
References: <15999.30116.549922.124871@montanaro.dyndns.org>
        <TP621XCAD72SMZD9YWUR32UQVSVRTR.3e7fb21a@myst>
Message-ID: <15999.55680.895881.768181@montanaro.dyndns.org>


    >> Is there some reason the storage.DBDictClassifier class doesn't
    >> implement a __del__ method which calls store()?

    Tim> Yup.  There is no guarantee that the __del__ method is called, 

You're suggesting that there's a good chance a DBDictClassifier instance
will be involved in a cycle?  Looking at the code briefly I didn't see any
instance attributes which looked like they would refer to other objects
which would (possibly indirectly) refer back to the instance.  It's a common
Python idiom to call an object's close() method in its __del__ method. 

Skip


From tim at fourstonesExpressions.com  Tue Mar 25 07:46:45 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue Mar 25 08:46:57 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To: <15999.55680.895881.768181@montanaro.dyndns.org>
Message-ID: <WUPMHS75NHVTV2212MLVUQPWSR.3e805dc5@myst>

3/24/2003 10:22:24 PM, Skip Montanaro <skip@pobox.com> wrote:

>You're suggesting that there's a good chance a DBDictClassifier instance
>will be involved in a cycle?  Looking at the code briefly I didn't see any
>instance attributes which looked like they would refer to other objects
>which would (possibly indirectly) refer back to the instance.  It's a common
>Python idiom to call an object's close() method in its __del__ method. 

Quoting your mail of 11/14/2002:

From: Skip Montanaro <skip@pobox.com>
Date: Thu, 14 Nov 2002 10:49:28 -0600
To: spambayes@python.org
Subject: [Spambayes] read-only DBDict in hammie?

I'd like to share the anydbm file between several accounts on my machine.
Before I fiddle hammie.py so it opens the file in read-only mode, is there
any reason when classifying (not training) it actually needs to update the
file?  There's a __del__ method in PersistentBayes which does this:

    def __del__(self):
        #super.__del__(self)
        self.save_state()

    def save_state(self):
        self.wordinfo[self.statekey] = (self.nham, self.nspam)

When classifying there's no reason that nham or nspam would change, right?

Skip

Quoting an exchange between Neale and Richie dated 11/18/2002:

From: Richie Hindle <richie@entrian.com>
To: Neale Pickett <neale@woozle.org>
Subject: Re: [Spambayes] Hammiefilter doesn't write out the pickle
Date: Mon, 18 Nov 2002 18:02:07 +0000
Cc: spambayes@python.org

Hi Neale,

> Neale thinks this is the right way to do it.  If the Bayes.* classes
> write out their state on destruction, we can treat them all the same.
> That's easy enough, just have them call self.store() in the __del__
> method.

Richie thinks this is a bad move.  Here's a minor rant I sent to Tim Stone
when he did exactly this in his Bayes module:

--------------------------------------------------------------------------

PersistentBayes.__del__() calls store() - this seems like a bad thing for
three reasons.  One is that I might not want to save my changes to the
database - pop3proxy has explicit "Save & Shutdown" and "Shutdown"
buttons to give the user control over whether the database is saved or not
(to let you do speculative training and discard the results, for instance).
[This is the least important of the three reasons.  Four, four reasons!]
Also, the pop3proxy self-test uses an in-memory bayes instance that it
never wants to write to disk.  Secondly, it's unpredictable when __del__
will be called, or even *whether* it will be called - this:

class A:
    def __del__(self):
        print "A.__del__"

class B:
    def __del__(self):
        print "B.__del__"

a = A()
b = B()
a.b = b
b.a = a
print "Exiting..."

won't call either __del__ method in the current CPython implementation.

Thirdly, if users of PersistentBayes explicitly call store() - which seems
like the right thing to do - the database will be written out twice.  [And
that can take *a long time*.]

[snip]

I've found another reason why PersistentBayes.__del__() is a bad thing -
self.db_name isn't set in the case where a PickledBayes is created using a
filename that doesn't exist (which is done by the pop3proxy self-test) -
that was leading to exceptions being thrown from __del__, which is a
notoriously hard problem to track down.

--------------------------------------------------------------------------

I'd much rather have an explicit store() method and document the fact that
storage may be pre-empted by certain implementations.  Relying on __del__
is nasty.

--
Richie Hindle
richie@entrian.com


As you can tell, I had coded the __del__ originally, and it was removed for 
the objections that you and Richie raised.
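
For anyone skimming the thread, the explicit approach can be sketched
roughly as follows.  ToyClassifier and train_and_save are hypothetical
stand-ins, not the storage.py classes; the sketch only shows the caller
deciding whether store() runs, and that it runs once.

```python
class ToyClassifier:
    """Hypothetical stand-in for a persistent classifier."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.stored = 0  # counts how many times store() ran

    def learn(self, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def store(self):
        # A real implementation would write the database to disk here.
        self.stored += 1


def train_and_save(classifier, labels, save=True):
    """Train on each label, then store exactly once if the caller asks."""
    for is_spam in labels:
        classifier.learn(is_spam)
    if save:
        classifier.store()
    return classifier
```

Passing save=False covers the speculative-training case Richie mentions:
the in-memory counts change but nothing is written out.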

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From skip at pobox.com  Tue Mar 25 08:12:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Mar 25 09:12:33 2003
Subject: [Spambayes] __del__ in DBDictClassifier?
In-Reply-To: <WUPMHS75NHVTV2212MLVUQPWSR.3e805dc5@myst>
References: <15999.55680.895881.768181@montanaro.dyndns.org>
        <WUPMHS75NHVTV2212MLVUQPWSR.3e805dc5@myst>
Message-ID: <16000.25542.670078.393940@montanaro.dyndns.org>


    Tim> Richie thinks this is a bad move.  Here's a minor rant I sent to
    Tim> Tim Stone when he did exactly this in his Bayes module:

    Tim> ... - pop3proxy has an explicit "Save & Shutdown" and "Shutdown"
    Tim> buttons to give the user control over whether the database is saved
    Tim> or not ...

Good enough for me.

Skip

From skip at pobox.com  Wed Mar 26 09:19:54 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 26 10:20:02 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
Message-ID: <16001.50458.808956.759671@montanaro.dyndns.org>

I've been asked to take a look at installing Spambayes for a user in one of
the departments.  She's running Win2k/XP and uses Eudora as her email
client.  Sounds like I will need to install Python+pop3proxy for her.  I
seem to recall something odd about Eudora and different POP servers.  Is
that only when using multiple POP servers?

Thanks,

Skip


From tim at fourstonesExpressions.com  Wed Mar 26 09:56:51 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 26 10:56:57 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <16001.50458.808956.759671@montanaro.dyndns.org>
Message-ID: <QK3W7262NLGBED98FB1T972YB9KHWQ6.3e81cdc3@myst>

3/26/2003 9:19:54 AM, Skip Montanaro <skip@pobox.com> wrote:

> Is that only when using multiple POP servers?

Yup.  Apparently Eudora can only access one pop server.  Papadoc checked in an 
html document about configuring various pop3 clients, but I can't seem to find 
it at the moment.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim at fourstonesExpressions.com  Wed Mar 26 10:07:36 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 26 11:07:40 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <3E81CFCE.30605@videotron.ca>
Message-ID: <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst>

3/26/2003 10:05:34 AM, papaDoc <papaDoc@videotron.ca> wrote:

>Hi Tim,
>
>The document was not checked in since I don't have check-in access,
>but the document is attached to one of the old mails on this list.
>

Ah... no wonder I can't find it!  I have the old mail.  Thanks for 
the tip.

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From skip at pobox.com  Wed Mar 26 10:45:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 26 11:45:12 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst>
References: <3E81CFCE.30605@videotron.ca>
        <1Z97LKLE0B6A6OC8KG62GFQMGAJGUP.3e81d048@myst>
Message-ID: <16001.55567.948358.769105@montanaro.dyndns.org>


    >> The document was not checked in since I don't have check-in access,
    >> but the document is attached to one of the old mails on this list.

    Tim> Ah... no wonder I can't find it!  I have the old mail.  Thanks for
    Tim> the tip.

Tim,

If you can check it in, please do.  Otherwise, forward it to me and I'll see
that it gets stitched into the spambayes website.

Thx,

Skip

From tony-bayes at lownds.com  Wed Mar 26 09:15:46 2003
From: tony-bayes at lownds.com (Tony Lownds)
Date: Wed Mar 26 12:37:08 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <16001.50458.808956.759671@montanaro.dyndns.org>
References: <16001.50458.808956.759671@montanaro.dyndns.org>
Message-ID: <a05200f37baa78ef4faa4@[204.162.121.84]>

At 9:19 AM -0600 3/26/03, Skip Montanaro wrote:
>I've been asked to take a look at installing Spambayes for a user in one of
>the departments.  She's running Win2k/XP and uses Eudora as her email
>client.  Sounds like I will need to install Python+pop3proxy for her.  I
>seem to recall something odd about Eudora and different POP servers.  Is
>that only when using multiple POP servers?
>

Eudora can't use a different port for different accounts; they all 
have to use port 110. With a plugin, a port other than 110 can be 
used - but it is still used across accounts.

-Tony

From francois.granger at free.fr  Wed Mar 26 19:14:01 2003
From: francois.granger at free.fr (Francois Granger)
Date: Wed Mar 26 13:14:07 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <QK3W7262NLGBED98FB1T972YB9KHWQ6.3e81cdc3@myst>
References: <QK3W7262NLGBED98FB1T972YB9KHWQ6.3e81cdc3@myst>
Message-ID: <a05200f0ebaa7997ce146@[192.168.1.20]>

At 09:56 -0600 on 26/03/2003, in message Re: [Spambayes] Win 2k/XP + 
Eudora?, Tim Stone - Four Stones Expressions wrote:
>3/26/2003 9:19:54 AM, Skip Montanaro <skip@pobox.com> wrote:
>
>>  Is that only when using multiple POP servers?
>
>Yup.  Apparently Eudora can only access one pop server.

Eudora can access multiple POP servers, but all must have the same port number.

Somebody (I don't remember who and can't find the message in the archive)
gave a trick for Mac OS X, available for Unixes, which creates multiple
localhost addresses.

======= in a shell script
sudo ifconfig lo0 inet 127.0.0.2 add
sudo ifconfig lo0 inet 127.0.0.3 add
sudo ifconfig lo0 inet 127.0.0.4 add

======= in bayescustomize.ini
[pop3proxy]
pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110, 
altern.org:110, pop.laposte.net:110
pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110,127.0.0.3:110, 127.0.0.4:110

=======

This may be portable to W2000 ?


Ref:
http://mail.python.org/pipermail/spambayes/2003-January/002659.html

-- 
Hofstadter's Law :
It always takes longer than you expect, even when you take into 
account Hofstadter's Law.

From tim at fourstonesExpressions.com  Wed Mar 26 13:09:25 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed Mar 26 14:09:33 2003
Subject: [Spambayes] Win 2k/XP + Eudora?
In-Reply-To: <a05200f0ebaa7997ce146@[192.168.1.20]>
Message-ID: <76IESRTO9676YJIQNTOJ3YTR1TKVR.3e81fae5@myst>

3/26/2003 12:14:01 PM, Francois Granger <francois.granger@free.fr> wrote:

>
>======= in a shell script
>sudo ifconfig lo0 inet 127.0.0.2 add
>sudo ifconfig lo0 inet 127.0.0.3 add
>sudo ifconfig lo0 inet 127.0.0.4 add
>
>======= in bayescustomize.ini
>[pop3proxy]
>pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110, 
>altern.org:110, pop.laposte.net:110
>pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110,127.0.0.3:110, 127.0.0.4:110
>
>=======
>
>This may be portable to W2000 ?
>

The alternate IP addresses that you created in your shell script can be added 
in the c:\winnt\system32\drivers\etc\hosts file.  Simply add lines like:

127.0.0.1 localhost  (this line should already be there...)
127.0.0.2 localhost2
127.0.0.3 localhost3
127.0.0.4 localhost4

and then use the same trick in bayescustomize.ini.
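
To make the pairing explicit: each entry in pop3proxy_servers lines up with
the entry at the same position in pop3proxy_ports.  Here is a small sketch
of that one-to-one mapping.  The helper name pair_proxies is made up, and
this is not how SpamBayes itself parses the options.

```python
def pair_proxies(servers, ports):
    """Pair each remote POP3 server with the local address that proxies it."""
    remote = [s.strip() for s in servers.split(",") if s.strip()]
    local = [p.strip() for p in ports.split(",") if p.strip()]
    if len(remote) != len(local):
        raise ValueError("need one local address per remote server")
    # Position i in the ports list fronts position i in the servers list.
    return list(zip(local, remote))
```

Each account in Eudora would then point at its own local address, all on
port 110.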

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From skip at pobox.com  Wed Mar 26 14:19:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 26 15:19:17 2003
Subject: [Spambayes] Any ideas about this one?
Message-ID: <16002.2878.68928.814803@montanaro.dyndns.org>


The message at

    http://manatee.mojam.com/~skip/junk.msg

scored squarely in the ham zone for me, mostly because the scoring was
swamped by all those normally good address clues (aahz, aleax,
cosc.canterbury.ac.nz, etc).  I could obviously remove "to" from my
address_headers option.  I tried doing that, which moved it up near 0.5,
however I noticed no skip: tokens were generated:

    X-Spambayes-Classification: unsure; 0.46
    X-Spambayes-Debug: '*H*': 0.89; '*S*': 0.80;
            'x-mailer:microsoft outlook imo, build 9.0.2416
            (9.0.2911.0)': 0.01; 'subject:pack': 0.09; 'subject:: ': 0.19;
            'header:Message-ID:1': 0.35; 'subject:Watch': 0.75;
            'content-type:application/x-msdownload': 0.97;
            'filename:fname piece:exe': 0.97

Is that related to the structure of the message (causing the attachment to
be skipped altogether)?

Skip

P.S. I couldn't send the message itself to the list because the virus
detector rejected it, hence the URL above.  Should we allow stuff like that
to squeeze through to this list?

S

From tim.one at comcast.net  Wed Mar 26 17:04:08 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Mar 26 17:06:23 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16002.2878.68928.814803@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEEBEDAB.tim.one@comcast.net>

[Skip Montanaro]
> The message at
>
>     http://manatee.mojam.com/~skip/junk.msg
>
> scored squarely in the ham zone for me, mostly because the scoring was
> swamped by all those normally good address clues (aahz, aleax
> cosc.canterbury.ac.nz, etc).  I could obviously remove "to" from my
> address_headers option.  I tried doing that, which moved it up near 0.5,
> however I noticed no skip: tokens were generated:
>
> ...
>
> Is that related to the structure of the message (causing the
> attachment to be skipped altogether)?

I think so -- the MIME type was application/x-msdownload, and the tokenizer
doesn't even bother to decode non- text/* portions.
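
Roughly, the behaviour amounts to something like the sketch below.  It is
written against the standard email package in current Python and is not
the tokenizer's actual code; text_parts is a name invented for the example.

```python
import email


def text_parts(raw_message):
    """Return the decoded bodies of text/* parts, skipping everything else."""
    msg = email.message_from_string(raw_message)
    bodies = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers have no body of their own
        if part.get_content_maintype() != "text":
            continue  # e.g. application/x-msdownload is never decoded
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "ascii"
        bodies.append(payload.decode(charset, "replace"))
    return bodies
```

With a message like Skip's, the base64 attachment contributes no body
tokens at all, which leaves only header-derived clues in the debug output.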

> ...
> P.S. I couldn't send the message itself to the list because the virus
> detector rejected it, hence the URL above.  Should we allow stuff
> like that to squeeze through to this list?

It would have been held for moderator approval regardless, due to sheer
size, and I would have rejected it (people on this list should be able to
find quarter-meg examples of viruses on their own <wink>).  The salient
points in this message were the headers, + a comment of the form "and the
body is a quarter megabyte of base64".


From francois.granger at free.fr  Wed Mar 26 23:35:57 2003
From: francois.granger at free.fr (Francois Granger)
Date: Wed Mar 26 17:36:04 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16002.2878.68928.814803@montanaro.dyndns.org>
References: <16002.2878.68928.814803@montanaro.dyndns.org>
Message-ID: <a05200f0fbaa7db5c5205@[192.168.1.20]>

At 14:19 -0600 on 26/03/2003, in message [Spambayes] Any ideas about 
this one?, Skip Montanaro wrote:
>The message at
>
>     http://manatee.mojam.com/~skip/junk.msg
>
>scored squarely in the ham zone for me, mostly because the scoring was
>swamped by all those normally good address clues (aahz, aleax
>cosc.canterbury.ac.nz, etc).  I could obviously remove "to" from my
>address_headers option.  I tried doing that, which moved it up near 0.5,
>however I noticed no skip: tokens were generated:
>
>     X-Spambayes-Classification: unsure; 0.46
>     X-Spambayes-Debug: '*H*': 0.89; '*S*': 0.80;
>             'x-mailer:microsoft outlook imo, build 9.0.2416
>             (9.0.2911.0)': 0.01; 'subject:pack': 0.09; 'subject:: ': 0.19;
>             'header:Message-ID:1': 0.35; 'subject:Watch': 0.75;
>             'content-type:application/x-msdownload': 0.97;
>             'filename:fname piece:exe': 0.97
>
>Is that related to the structure of the message (causing the attachment to
>be skipped altogether)?

Not easy to classify...
My database "thinks":

Spam probability: 0.810594681692
Clues:

*H* 0.313039016579
*S* 0.934228379963
header:Received:5 0.0854354380187
subject:: 0.110737860364
subject:. 0.744834167131
header:Importance:1 0.781318555354
to:2**6 0.844827586207
subject:this 0.898823641021
subject:Watch 0.983271375465


well, funny !

-- 
Hofstadter's Law :
It always takes longer than you expect, even when you take into 
account Hofstadter's Law.

From skip at pobox.com  Wed Mar 26 18:26:34 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Mar 26 19:26:38 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <a05200f0fbaa7db5c5205@[192.168.1.20]>
References: <16002.2878.68928.814803@montanaro.dyndns.org>
        <a05200f0fbaa7db5c5205@[192.168.1.20]>
Message-ID: <16002.17722.470645.635722@montanaro.dyndns.org>

    Francois> to:2**6 0.844827586207

Odd, I don't see that at all in my clues.

As long as someone's database is snagging that message, I won't worry about
it, though I am kind of curious about the missing to:2**6 clue in the debug
results.

Skip


From popiel at wolfskeep.com  Wed Mar 26 19:29:31 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Mar 26 22:29:36 2003
Subject: [Spambayes] Any ideas about this one? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<16002.17722.470645.635722@montanaro.dyndns.org> 
References: <16002.2878.68928.814803@montanaro.dyndns.org>
	<a05200f0fbaa7db5c5205@[192.168.1.20]>
	<16002.17722.470645.635722@montanaro.dyndns.org> 
Message-ID: <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>

In message:  <16002.17722.470645.635722@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>    Francois> to:2**6 0.844827586207
>
>Odd, I don't see that at all in my clues.
>
>As long as someone's database is snagging that message, I won't worry about
>it, though I am kind of curious about the missing to:2**6 clue in the debug
>results.

It probably was in the midrange zone to be ignored (.4 to .6 by default).
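
As a sketch of what I mean (the function and parameter names are made up
for illustration, and the 0.1 default is just the ".4 to .6" band above):

```python
def strong_clues(clues, minimum_strength=0.1):
    """Keep clues whose probability is at least minimum_strength from 0.5."""
    return [(word, prob) for word, prob in clues
            if abs(prob - 0.5) >= minimum_strength]
```

A clue at 0.45 or 0.55 would be dropped before scoring, so it would never
show up in the debug header either.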

- Alex

From skip at pobox.com  Thu Mar 27 08:03:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 27 09:03:50 2003
Subject: [Spambayes] Any ideas about this one? 
In-Reply-To: <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>
References: <16002.2878.68928.814803@montanaro.dyndns.org>
        <a05200f0fbaa7db5c5205@[192.168.1.20]>
        <16002.17722.470645.635722@montanaro.dyndns.org>
        <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>
Message-ID: <16003.1216.922317.235294@montanaro.dyndns.org>


    >> As long as someone's database is snagging that message, I won't worry
    >> about it, though I am kind of curious about the missing to:2**6 clue
    >> in the debug results.

    Alex> It probably was in the midrange zone to be ignored (.4 to .6 by
    Alex> default).

The default is 0.5 (meaning show everything):

    # The range of clues that are added to the "debug" header in the E-mail
    # All clues that have their probability smaller than this number, or
    # larger than one minus this number are added to the header such that
    # you can see why spambayes thinks this is ham/spam or why it is
    # unsure. The default is to show all clues, but you can reduce that by
    # setting showclue to a lower value, such as 0.1
    clue_mailheader_cutoff: 0.5

and I didn't change that, so everything should be shown.  Just for
completeness, here's my options file, in case I'm missing something:

    [Hammie]
    hammie_debug_header: True

    [Tokenizer]
    summarize_email_prefixes: True
    summarize_email_suffixes: True
    address_headers: from

    [Categorization]
    ham_cutoff: 0.20
    spam_cutoff: 0.88

    [hammiefilter]
    hammiefilter_persistent_storage_file: ~/hammie.db

    [globals]
    dbm_type: dbhash
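
For reference, the cutoff rule in the option comment quoted above reads to
me like the following sketch.  It follows the comment text literally; the
function name is invented and I haven't checked it against the hammie
code.

```python
def clues_for_header(clues, cutoff=0.5):
    """Select clues with probability below cutoff or above 1 - cutoff."""
    return [(word, prob) for word, prob in clues
            if prob < cutoff or prob > 1.0 - cutoff]
```

With cutoff=0.5 every clue passes unless it sits at exactly 0.5; with
cutoff=0.1 only clues below 0.1 or above 0.9 are kept.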

Skip

From francois.granger at free.fr  Thu Mar 27 15:12:44 2003
From: francois.granger at free.fr (Francois Granger)
Date: Thu Mar 27 09:12:50 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16003.1216.922317.235294@montanaro.dyndns.org>
References: <16002.2878.68928.814803@montanaro.dyndns.org>       
 <a05200f0fbaa7db5c5205@[192.168.1.20]>       
 <16002.17722.470645.635722@montanaro.dyndns.org>       
 <20030327032931.9EEE92DDC7@cashew.wolfskeep.com>
 <16003.1216.922317.235294@montanaro.dyndns.org>
Message-ID: <a05200f16baa8b71dd373@[192.168.1.20]>

At 08:03 -0600 on 27/03/2003, in message Re: [Spambayes] Any ideas 
about this one?, Skip Montanaro wrote:
>completeness, here's my options file, in case I'm missing something:
>
>     [Hammie]
>     hammie_debug_header: True
>
>     [Tokenizer]
>     summarize_email_prefixes: True
>     summarize_email_suffixes: True
>     address_headers: from
>
>     [Categorization]
>     ham_cutoff: 0.20
>     spam_cutoff: 0.88
>
>     [hammiefilter]
>     hammiefilter_persistent_storage_file: ~/hammie.db
>
>     [globals]
>     dbm_type: dbhash

To be able to compare, here is mine:

[Categorization]
ham_cutoff = 0.10
spam_cutoff = 0.95

[pop3proxy]
pop3proxy_persistent_storage_file = hammie.db
pop3proxy_servers = pop.nerim.net:110, pop.free.fr:110, 
altern.org:110, pop.laposte.net
pop3proxy_ports = 127.0.0.1:110, 127.0.0.2:110,127.0.0.3:110, 127.0.0.4:110


-- 
Hofstadter's Law :
It always takes longer than you expect, even when you take into 
account Hofstadter's Law.

From skip at pobox.com  Thu Mar 27 21:10:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Mar 27 22:11:06 2003
Subject: [Spambayes] Non-email use of the spambayes project
Message-ID: <16003.48447.482874.642781@montanaro.dyndns.org>


I've successfully applied the Spambayes code (http://spambayes.sf.net/) to a
non-email application today and thought I'd pass the concept along to
others.  Many of you on c.l.py probably are aware of the Spambayes project
which relies on user segregation of a set of email messages into spam and
ham, then combines the resulting clues they contain to predict the hamminess
or spamminess of email messages it hasn't seen before.  It works extremely
well for this, but the basic concept is applicable to other classification
problems.

I've operated the Mojam and Musi-Cal websites for several years.  Over that
time we've accumulated a sizable venue database.  Unfortunately, many
entries in the database have become stale and don't contribute anything to
the system other than to slow down queries.  Venue names get misspelled,
venues go out of business, non-venue stuff slips into the database, or other
errors occur.  As a result, I had a venue database containing roughly 35,000
entries, only about half of which were referenced by concert items in the
database.  The database as it sat couldn't be licensed to potential
customers because of all the errors it contained.  I could simply delete all
of those entries, but that would delete a lot of useful content from the
database.  Many of those currently unreferenced venue entries *are* correct
and will eventually be associated with other concerts, or will be useful as
corollary information for people using our websites or as an extra database
we can license to content consumers.

I wrote a trivial little application today which allowed me to rummage
through the unreferenced records in the database.  I could delete entries
which I felt were incorrect, but it was a one-at-a-time process.  With
15,000+ entries to scan, one-by-one wasn't going to cut it.

Then I got the idea to use the Spambayes classifier to watch what I was
doing and train on my actions.  I was viewing the records in chunks of 20
items at a time, sorted alphabetically.  I could choose to delete one or
more items or move onto the next chunk of 20 entries.  A deletion caused the
classifier to be trained on the entry as "spam".  Moving onto the next chunk
caused the classifier to be trained on the remaining undeleted entries as
"ham".  Over a short period of time, it got reasonably good at identifying
"spam".  I then started sorting each chunk of 20 items by its spambayes
score and could specify a threshold score below which to eliminate all
entries in that chunk.

The next improvement was to sort the entire mess of records by the spambayes
classification.  I was then seeing entire chunks of records whose scores
fell below the threshold and was able to delete them 20 at a time.
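
In current Python, that sort-then-chunk step might look something like the
sketch below.  The names deletable_chunks, score and threshold are
illustrative only, and score stands in for whatever returns the
classifier's number for a record.

```python
def deletable_chunks(records, score, threshold, size=20):
    """Sort records by score; return the chunks wholly below threshold."""
    ranked = sorted(records, key=score)
    chunks = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    # Only a chunk whose every record scores under the threshold can be
    # deleted in one go; mixed chunks still need a look by hand.
    return [chunk for chunk in chunks
            if all(score(r) < threshold for r in chunk)]
```

Because the list is sorted first, the fully deletable chunks all come at
the front, which is what makes 20-at-a-time deletion practical.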

The entire Spambayes code is a single tokenizer generator function and a
small Classifier class:

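    import math
    import string

    import spambayes.storage

    # identity translation table for string.translate(); null_xlate is used
    # in tokenize() but not defined in this excerpt, so this is an assumed
    # definition
    null_xlate = string.maketrans("", "")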

    class Classifier: 
        def __init__(self): 
            self.cls = spambayes.storage.DBDictClassifier("fven.db") 

        def classify(self, d): 
            return self.cls.spamprob(tokenize(d), True) 

        def train(self, d, saved): 
            self.cls.learn(tokenize(d), saved) 

        def __del__(self): 
            self.cls.store() 

    def tokenize(d): 
        # d is a dictionary as returned by a MySQL query - tokenize the 
        # various fields, noting interesting facts 
        yield "venue length:%d" % len(d["venue"]) 
        for word in d["venue"].split(): 
            # looks like a festival - not a venue at all
            if word.lower().endswith("fest"): 
                yield "venue:<fest>" 
            yield "venue:"+word
        # most correct venue names don't contain punctuation
        if (string.translate(d["venue"], null_xlate, string.punctuation) 
            != d["venue"]): 
            yield "venue:<punctuation>"
        # no address information for this venue - less valuable
        if not d["addr1"]: 
            yield "addr1:<empty>"
        elif d["addr1"][0] not in string.digits:
            # most valid addresses in the US/Canada begin with a street number
            yield "addr1:<no number>" 
        for word in d["addr1"].split(): 
            yield "addr1:"+word 
        for word in d["addr2"].split(): 
            yield "addr2:"+word 
        yield "phone:"+d["phone"] 
        yield "city:"+d["city"].strip() 
        yield "region:"+(d["state"].strip() or d["country"].strip()) 
        yield "zip:"+d["zip"] 
        # sometimes the city gets replicated in the address, making the
        # data "dirtier" and thus less valuable
        vwords = d["venue"].lower().split() 
        for word in d["city"].lower().split(): 
            if word in vwords: 
                yield "city:<in venue>" 
                break
        # the record's id reflects its age - older records, and thus
        # smaller ids, are more likely to be outdated
        try: 
            yield "id:2**%.0f" % math.log(int(d["id"]) // 100) 
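        # math.log(0) (ids below 100) raises OverflowError on older
        # Pythons and ValueError on newer ones
        except (OverflowError, ValueError):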
            yield "id:2**0" 
        return 

    ...

    classifier = Classifier()

The input to the tokenizer, instead of being an email message, is a
dictionary representing a row returned by an SQL query.  When an item is
to be deleted, the classifier is trained on it like so:

    classifier.train(d, False)

When moving to the next chunk, the remaining (kept) records are trained like
so:

    for item in chunk:
        classifier.train(item, True)

I haven't gotten too crazy with the tokenizer (compare it with the Spambayes
tokenizer!).  I will probably collect some other clues in the tokenizer,
such as what other tables reference the venue record.  For the time being,
it's working okay.  I just need it to do a reasonably good job segregating
records so I can quickly scan a group and make a deletion decision.  So far,
it's doing a very good job.  Not bad for 15-30 minutes of work...

Skip


From tim_one at email.msn.com  Thu Mar 27 23:27:34 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Thu Mar 27 23:28:15 2003
Subject: [Spambayes] Any ideas about this one?
In-Reply-To: <16003.1216.922317.235294@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKCEDAB.tim_one@email.msn.com>

[T. Alexander Popiel]
> It probably was in the midrange zone to be ignored (.4 to .6 by
> default).

[Skip Montanaro]
> The default is 0.5 (meaning show everything):
> ...
>     clue_mailheader_cutoff: 0.5

I expect Alex had this in mind:

# When scoring a message, ignore all words with
# abs(word.spamprob - 0.5) < minimum_prob_strength.
# This may be a hack, but it has proved to reduce error rates in many
# tests.  0.1 appeared to work well across all corpora.
minimum_prob_strength: 0.1

abs(p-0.5) < 0.1  is-same-as  0.4 < p < 0.6; Classifier._getclues() doesn't
return any word with a spamprob in that range.
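
A small stand-alone illustration of that cutoff follows.  It is a sketch of the rule as described above, not the actual Classifier._getclues() code; `strong_clues` is a hypothetical helper and the words and probabilities are made up.

```python
def strong_clues(clues, minimum_prob_strength=0.1):
    """Keep (word, spamprob) pairs at least minimum_prob_strength from 0.5."""
    return [(word, prob) for word, prob in clues
            if abs(prob - 0.5) >= minimum_prob_strength]


clues = [("alpha", 0.02),    # strongly hammy: kept
         ("beta", 0.5),      # unknown word at the default prob: ignored
         ("gamma", 0.45),    # inside (0.4, 0.6): ignored
         ("delta", 0.97)]    # strongly spammy: kept
kept = strong_clues(clues)
```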


From Paul.Moore at atosorigin.com  Fri Mar 28 09:22:38 2003
From: Paul.Moore at atosorigin.com (Moore, Paul)
Date: Fri Mar 28 04:24:02 2003
Subject: [Spambayes] Non-email use of the spambayes project
Message-ID: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com>

From: Skip Montanaro [mailto:skip@pobox.com]
> I've successfully applied the Spambayes code (http://spambayes.sf.net/)
> to a non-email application today and thought I'd pass the concept along
> to others.

This is a lovely idea! Based on this description, I'm sure I can think of
a number of "data cleaning" exercises I'd like to do which might benefit
from this sort of approach.

Makes me wonder if there's a case for taking the algorithmic guts out of
spambayes, and making a standalone library module from it...

Thanks for posting this.
Paul.

From mhammond at skippinet.com.au  Fri Mar 28 23:04:48 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Mar 28 07:05:32 2003
Subject: [Spambayes] RE: error when trying spambayes addin with outlook2002
In-Reply-To: <6D9E338D57C2D411AC5800105AF41E123A1B67@mailnt.eurosys.nl>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEMLOJAA.mhammond@skippinet.com.au>

> exceptions.ValueError: invalid literal for float(): 0.20

This will almost certainly be due to not succumbing to world domination, and
not having your locale set to an English one <wink>.

Adding:

import locale
locale.setlocale(locale.LC_NUMERIC, "en")

Somewhere near the top of addin.py should fix this.  I think I will check
this in, rather than waiting for a non-Windows user to strike this problem
;)

Coalition-of-the-commas ly,

Mark.


From jon at doobla.com  Fri Mar 28 04:02:44 2003
From: jon at doobla.com (Jonathon Jones)
Date: Fri Mar 28 07:18:08 2003
Subject: [Spambayes] Using your script with sendmail on my server?
Message-ID: <003601c2f508$ccecce10$a98f59cf@doobla>

Hi,

I am somewhat new to Linux but I am learning fast.  I have a Linux server with Ensim installed where I host my own sites and a few for others.  I want to use your filter to filter out spam on the box and I was wondering how I can do it?  Ideally there would have to be a database for each domain or user and I would want it to run between the mail server and their client software, but on my box.  I don't want them to have to install any software or anything.  I was thinking about setting up training email addresses so that anything sent to spam@domain.com would be flagged as spam and anything sent to ham@domain.com would be flagged as ham.  

Any suggestions?  Am I on the right track, or is there a better way?  I'd really appreciate any help you'd be willing to give.

God Bless,

Jon
From skip at pobox.com  Fri Mar 28 06:19:39 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Mar 28 07:19:45 2003
Subject: [Spambayes] Non-email use of the spambayes project
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB880113D9CA@UKDCX001.uk.int.atosorigin.com>
Message-ID: <16004.15835.738554.638681@montanaro.dyndns.org>


    Paul> Makes me wonder if there's a case for taking the algorithmic guts
    Paul> out of spambayes, and making a standalone library module from
    Paul> it...

Given how easy it is to use as-is, I don't see a strong need.  More
important I think is to document how to use it as I did.  So much of what is
there now is so strongly tied to classifying email messages that it's easy
to lose sight of how well it can be applied to other classification problems.

Skip


From spambayes at rodland.no  Fri Mar 28 13:55:00 2003
From: spambayes at rodland.no (Fredrik Rodland)
Date: Fri Mar 28 07:56:31 2003
Subject: [Spambayes] Non-email use of the spambayes project
In-Reply-To: <16004.15835.738554.638681@montanaro.dyndns.org>
Message-ID: <OLEKJBLGLGDHBDLHGIINEEFJCPAA.spambayes@rodland.no>


> -----Original Message-----
> From: spambayes-bounces@python.org
> [mailto:spambayes-bounces@python.org]On Behalf Of Skip Montanaro
> Sent: 28. mars 2003 13:20
> To: Moore, Paul
> Cc: python-list@python.org; spambayes@python.org
> Subject: RE: [Spambayes] Non-email use of the spambayes project
>
> important I think is to document how to use it as I did.  So much
> of what is
> there now is so strongly tied to classifying email messages that it's easy
> to lose sight of how well it can be applied to other
> classification problems.


Totally agree!

Also, for those of us who aren't completely into Python, it would be great to
have some sort of cookbook/skeletons/APIs available and documented.  I tried
to read your original code, but gave up after a while.  I have a similar
situation: a database with 100.000 people in it, with quite a few rows not
being real persons.  It'd be great to try the spambayes code on this.

The concept should be general enough that one could write a script/program
in any language.

At least what I'm picturing is writing a script which loops over the dataset,
constructs some kind of concatenated string, and passes it as an argument to
one of three procedures/methods/scripts:

A. classify as spam
B. classify as ham
C. get_score
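
A skeleton with those three entry points might look like the sketch below.  It is a toy stand-in, explicitly NOT the Spambayes algorithm: `ToyClassifier`, its method names and the scoring rule (the average, over known tokens, of the fraction of training examples containing the token that were spam) are all assumptions made for illustration.

```python
from collections import defaultdict


class ToyClassifier:
    """Three-call interface: train_spam, train_ham, get_score (toy scoring)."""

    def __init__(self):
        self.spam = defaultdict(int)   # token -> number of spam examples
        self.ham = defaultdict(int)    # token -> number of ham examples

    @staticmethod
    def _tokens(text):
        return set(text.lower().split())

    def train_spam(self, text):
        for tok in self._tokens(text):
            self.spam[tok] += 1

    def train_ham(self, text):
        for tok in self._tokens(text):
            self.ham[tok] += 1

    def get_score(self, text):
        """Return a value in [0, 1]; 0.5 when no token has been seen."""
        ratios = []
        for tok in self._tokens(text):
            s, h = self.spam.get(tok, 0), self.ham.get(tok, 0)
            if s + h:
                ratios.append(s / float(s + h))
        return sum(ratios) / len(ratios) if ratios else 0.5


c = ToyClassifier()
c.train_spam("test user asdf")          # made-up "not a real person" row
c.train_ham("fredrik rodland oslo")     # made-up "real person" row
```

The point is only the shape of the interface: each database row is flattened to a string (or a token stream) and handed to one of the three calls.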


Fredrik


--
Fredrik Rodland	Technical Architect, Stocknet, Oslo, Norway
Stocknet:		http://www.stocknet.com		phone: +47 23 28 40 17
Private:		http://rodland.no			phone: +47 99 21 98 17


From tchur at optushome.com.au  Sat Mar 29 07:22:56 2003
From: tchur at optushome.com.au (Tim Churches)
Date: Fri Mar 28 15:34:05 2003
Subject: Orange (was: [Spambayes] Non-email use of the spambayes project)
In-Reply-To: <OLEKJBLGLGDHBDLHGIINEEFJCPAA.spambayes@rodland.no>
References: <OLEKJBLGLGDHBDLHGIINEEFJCPAA.spambayes@rodland.no>
Message-ID: <1048882982.1263.23.camel@emilio>

On Fri, 2003-03-28 at 23:55, Fredrik Rodland wrote:
> > important I think is to document how to use it as I did.  So much
> > of what is
> > there now is so strongly tied to classifying email messages that it's easy
> > to lose sight of how well it can be applied to other
> > classification problems.
> Totally agree!
> also, for us who're not completely into Python, it would be great with some
> sort of cookbook/skeletons/APIs available and documented.  I tried to read
> your original code, but gave up after a while.  I have a similar situation,
> having a database with 100.000 people in it, with quite a few rows not being
> real persons.  It'd be great to try to use the spambayes code on this.

The Orange project, developed at the University of Ljubljana, is well
worth a look. It is a Python framework and collection of modules (many
of them C extension modules) for learning about data mining and machine
learning techniques. It includes facilities for a number of supervised
and non-supervised classification methods apart from the naive Bayes
classifier, such as (quoting the Orange Web site) "classification trees,
k-NN, majority classifier, support vector machines, logistic regression.
Ensemble methods like boosting and bagging are also included."

It is quite well documented and now even has a GUI interface. Code is
GPLed. See http://magix.fri.uni-lj.si/orange/

-- 

Tim C

PGP/GnuPG Key 1024D/EAF993D0 available from keyservers everywhere
or at http://members.optushome.com.au/tchur/pubkey.asc
Key fingerprint = 8C22 BF76 33BA B3B5 1D5B  EB37 7891 46A9 EAF9 93D0

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : http://mail.python.org/pipermail/spambayes/attachments/20030329/6d31e4b5/attachment.bin
From noreply at sourceforge.net  Sat Mar 29 08:45:38 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Sat Mar 29 16:10:14 2003
Subject: [Spambayes] 
	[ spambayes-Patches-711845 ] mboxtrain.py in mh mode: trivial fix
Message-ID: <E18zJSc-0001uv-00@sc8-sf-web2.sourceforge.net>

Patches item #711845, was opened at 2003-03-29 11:45
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=711845&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jay Berkenbilt (jay_berkenbilt)
Assigned to: Nobody/Anonymous (nobody)
Summary: mboxtrain.py in mh mode: trivial fix

Initial Comment:
This patch relative to mboxtrain.py in the 2003-01-17
snapshot fixes two trivial problems in mhdir_train:
files are overwritten needlessly, and the count of
trained messages is not properly updated.  I just took
the logic from the maildir_train function and
duplicated it.  

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=711845&group_id=61702

From francois.granger at free.fr  Sun Mar 30 00:15:59 2003
From: francois.granger at free.fr (Francois Granger)
Date: Sat Mar 29 18:16:06 2003
Subject: [Spambayes] Back to language issue (long)
Message-ID: <a05200f3cbaabd7e8843b@[192.168.1.20]>

I got this mail (see at end) classified as ham. I first tested it with a
copy and paste into pop3proxy to check why. I then did a copy and paste
of everything but the French part. My current database is 1.1MB.  Total
emails trained: Spam: 487  Ham: 433

Database available on request.

See result below:

full message:
=============


Spam probability: 0.000751759880204
Clues:

*H* 1.0
*S* 0.00150351976007
quand 0.00850661625709
quelque 0.0136778115502
chose 0.0145631067961
sous 0.0167286245353
demande 0.0180722891566
pourrait 0.0196506550218
mon 0.0208789727319
comme 0.0221227588516
fait 0.0230376824338
façon 0.0238095238095
maintenant 0.0238095238095
donner 0.0266272189349
m'a 0.0266272189349
sait 0.0266272189349
mais 0.0313902714872
aide 0.033396550918
aller 0.0348837209302
famille 0.0348837209302
raison 0.0348837209302
pas 0.0408933100639
j'aurais 0.0412844036697
voudrais 0.0412844036697
leur 0.0498272166206
d'abord 0.0505617977528
temps. 0.0505617977528
valeurs 0.0505617977528
suis 0.0584879191463
j'ai 0.0600818444939
actuellement 0.0652173913043
mort 0.0652173913043
no. 0.0652173913043
pourraient 0.0652173913043
quant 0.0652173913043
que 0.0720802127541
savoir 0.0775927513664
courrier 0.0834026744242
dit 0.0834026744242
peux 0.0834026744242
qui 0.0880139586679
leurs 0.0900374919101
advised 0.0918367346939
d'accord 0.0918367346939
fils 0.0918367346939
garder 0.0918367346939
pays 0.0918367346939
père 0.0918367346939
qu'elles 0.0918367346939
raisons 0.0918367346939
répondre 0.0918367346939
seraient 0.0918367346939
tells 0.0918367346939
parce 0.0980935237284
faire 0.109870293122
est 0.112227878216
peut 0.121878882484
pour 0.122393105049
beaucoup 0.123296503689
depuis 0.130659320434
été 0.131386324669
dans 0.132119929488
mot 0.133327070159
les 0.144455392115
une 0.145286593037
passe 0.151233850681
puis 0.15146680803
argent 0.155172413793
changement 0.155172413793
contacter 0.155172413793
d'eux. 0.155172413793
envoyer 0.155172413793
fonds 0.155172413793
gouvernement 0.155172413793
jonas 0.155172413793
lui. 0.155172413793
lumière 0.155172413793
manque 0.155172413793
mexico 0.155172413793
monsieur, 0.155172413793
pourrais 0.155172413793
pères 0.155172413793
saisir 0.155172413793
samuel 0.155172413793
tracer 0.155172413793
tué 0.155172413793
venir 0.155172413793
sur 0.167633574543
des 0.171795256129
février 0.175326781486
cette 0.185894857561
réponse 0.195556328108
par 0.203011609628
merci 0.213163172086
avec 0.216683274074
votre 0.223048790244
possible 0.223362908057
d'un 0.233611469218
content-type:text/plain 0.237053248399
que, 0.256059940913
société 0.256059940913
trace 0.256059940913
voila 0.256059940913
son 0.261468013498
details 0.753633575315
agent 0.75736859731
surprise 0.75736859731
please 0.758980156238
looking 0.760041610618
contact 0.77287714023
out 0.779791147442
dear 0.786906048266
lettre 0.794293030271
journal 0.796004632524
skip:n 10 0.799435070973
fact 0.805426676513
internet. 0.817845487146
government 0.831484524404
reasons 0.842224555182
chamber 0.844827586207
email addr:voila.fr 0.844827586207
from:addr:voila.fr 0.844827586207
l'argent 0.844827586207
sécurité 0.844827586207
veuillez 0.844827586207
voeu 0.844827586207
paid 0.865736495188
family 0.883222448153
letter 0.890903967158
company 0.902210464433
8bit%:9 0.908163265306
ahead. 0.908163265306
assistance 0.908163265306
commerce 0.908163265306
compte 0.908163265306
détails 0.908163265306
forces 0.908163265306
l'aide 0.908163265306
transférer 0.908163265306
business 0.91145727403
country 0.913240293115
watch 0.915916543989
money 0.919669618949
expenses 0.934782608696
father 0.934782608696
african 0.949438202247
anticipated 0.949438202247
ownership 0.949438202247
funds 0.95871559633
transfer 0.95871559633
transfer. 0.95871559633
percentage 0.96511627907

all but french:
===============


Spam probability: 0.999530237124
Clues:

*H* 1.05021213948e-09
*S* 0.999060475299
advised 0.0918367346939
tells 0.0918367346939
jonas 0.155172413793
mexico 0.155172413793
samuel 0.155172413793
possible 0.223362908057
content-type:text/plain 0.237053248399
trace 0.256059940913
son 0.261468013498
worked 0.276699759479
anything 0.29049020174
keeping 0.29049020174
light 0.321159980827
killed 0.332823460338
soon 0.345187733082
running 0.354319106501
trying 0.366311400136
them. 0.395181777302
know 0.398909302873
knows 0.6044824946
the 0.604592362611
subject:- 0.605156130452
want 0.606231021087
this 0.606271581031
not 0.60636980267
make 0.606842817477
start 0.607293682795
come 0.607655753716
and 0.608871670018
going 0.611497626481
can 0.614004582662
who 0.619148006288
netherlands 0.621790545969
under 0.622731249912
agree 0.630287560804
sir, 0.630287560804
for 0.631336804391
it. 0.632419432237
they 0.635641043314
security 0.63597973579
you 0.642573051538
through 0.644897220533
because 0.648506005667
send 0.650492892877
from 0.654952588668
one 0.656989004993
phone 0.657287865339
2002 0.66388718647
ways 0.670909653021
has 0.671329724104
their 0.671472909829
name 0.673196353036
address 0.675205726154
may 0.679409037042
your 0.681455529625
all 0.690605015038
more 0.694798240263
mr. 0.704335845591
leader 0.715217636184
late 0.721105048724
request 0.732178002628
subject:. 0.735614438659
details 0.753633575315
agent 0.75736859731
surprise 0.75736859731
please 0.758980156238
looking 0.760041610618
contact 0.77287714023
out 0.779791147442
dear 0.786906048266
journal 0.796004632524
skip:n 10 0.799435070973
fact 0.805426676513
internet. 0.817845487146
government 0.831484524404
reasons 0.842224555182
chamber 0.844827586207
email addr:voila.fr 0.844827586207
from:addr:voila.fr 0.844827586207
paid 0.865736495188
family 0.883222448153
letter 0.890903967158
company 0.902210464433
ahead. 0.908163265306
assistance 0.908163265306
commerce 0.908163265306
forces 0.908163265306
business 0.91145727403
country 0.913240293115
watch 0.915916543989
money 0.919669618949
expenses 0.934782608696
father 0.934782608696
african 0.949438202247
anticipated 0.949438202247
ownership 0.949438202247
funds 0.95871559633
transfer 0.95871559633
transfer. 0.95871559633
percentage 0.96511627907
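
The numbers printed in the two runs above are consistent with the reported spam probability being (S - H + 1) / 2, where *H* and *S* are the values at the top of each clue list.  The check below is a minimal sketch of that arithmetic; `combine` is a hypothetical helper name, and the relation is read off the printed figures rather than taken from the tool's source.

```python
def combine(h, s):
    """Fold the *H* (ham) and *S* (spam) indicators into one score in [0, 1]."""
    return (s - h + 1.0) / 2.0


# (H, S, reported spam probability) copied from the two clue lists above.
runs = [
    (1.0, 0.00150351976007, 0.000751759880204),      # full message
    (1.05021213948e-09, 0.999060475299, 0.999530237124),  # all but French
]
errors = [abs(combine(h, s) - reported) for h, s, reported in runs]
```

The small residual differences come from the inputs being printed to about twelve significant digits.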

full message:
=============

Return-Path: <ask_savimbi1@voila.fr>
Delivered-To: online.fr-francois.granger@free.fr
Received: (qmail 25473 invoked from network); 29 Mar 2003 19:22:26 -0000
Received: from smtp-out.voila.wanadooportails.com (HELO 
mailsmtp5.ftmms) (193.252.117.74)
   by mrelay4-2.free.fr with SMTP; 29 Mar 2003 19:22:26 -0000
Received: from voila.fr (10.3.7.82) by mailsmtp5.ftmms (6.7.015)
         id 3E6540600058A8BC; Sat, 29 Mar 2003 20:08:38 +0100
Date: Sat, 29 Mar 2003 20:08:38 +0100
Message-Id: <HCIYIE$67A01BD652D6EB7C039C1962B12284FB@voila.fr>
Subject: anticipated co-operation.
MIME-Version: 1.0
X-Sensitivity: 3
Content-Type: text/plain
From: "ask_savimbi1" <ask_savimbi1@voila.fr>
To: "ask_savimbi1" <ask_savimbi1@voila.fr>
X-XaM3-API-Version: 3.2 R27 (B52-pl1)
X-type: 0
X-SenderIP: 81.23.193.84
X-Spambayes-Classification: ham

Dear Sir,


This letter may come to you as a surprise due to fact that we have 
not met. I got your address from the south African chamber of

commerce business journal from one of their laison officers who 
knows about what I am going through so he advised me to

contact you for assistance. My name is Samuel savimbi son of the 
late unit a rebel leader Jonas Savimbi from Angola who was

killed on the 22nd of February 2002 by the government forces in 
Mexico province. Since the death of my late father the

government has being looking for ways they could seize my fathers 
money and so it is this light that I need your assistance in

trying to keep the funds from them. The money is in the Netherlands 
with a security company who could transfer the money at

my command, so i would need you to go to Holland when the need 
arises but you have to first reply this mail so i could give you

more details as to how to go about it. Please know that all your 
expenses would be paid for and there is a percentage my family

has worked out should you agree to help in this transfer.

Sir I am presently under security watch in my country Angola and i 
would want you to help me in keeping the money in

your account because the government can't trace the money to your 
individual account in America.

Please because of security reasons I would prefer not to discuss 
much on the Internet.

I request that you send your reply to chief_teas@voila.fr  so we can 
start with the change of ownership and also I would have to

send you the password for the account so we can make the transfer as 
soon as possible because I am running out of

time.  Please contact my agent in Holland on phone no.31-61-2722388 
and anything he tells you, brief me so I can give them

the go ahead.

Thanks for your anticipated co-operation.


Regards,


Mr. Samuel savimbi


Cher Monsieur,


Cette lettre peut venir à vous comme surprise due au fait que

nous n'avons pas rencontré. J'ai obtenu votre adresse de la chambre 
de commerce sud-africaine le

journal d'affaires d'un de leurs officiers de laison qui sait ce que

j'interviens ainsi il m'a conseillé de vous contacter pour l'aide. 
Mon nom est fils de savimbi de Samuel de l'unité en retard par

Chef

rebelle Jonas Savimbi d'Angola qui a été tué sur le 22ème février

2002 par le gouvernement force dans la province du Mexique. Depuis 
la mort de mon défunt père le gouvernement a

rechercher des

manières qu'elles pourraient saisir mon argent de pères et ainsi il

est cette lumière que j'ai besoin de votre aide dans l'essai de

garder les fonds d'eux. L'argent est en Hollandes avec une société 
de valeurs mobilières

qui pourrait transférer l'argent à ma commande, ainsi j'aurais

besoin de vous pour aller en Hollande quand le besoin se fait sentir

mais vous devez d'abord répondre ce courrier ainsi je pourrais vous

donner plus de détails quant à la façon aborder lui. Veuillez savoir 
que toutes vos dépenses seraient payées pour et il y

a un pourcentage que ma famille a établi si vous êtes d'accord sur

l'aide dans ce transfert.

Monsieur I suis actuellement sous la montre de sécurité dans mon

pays Angola et je voudrais que vous m'aidiez en maintenant l'argent

dans votre compte parce que le gouvernement ne peut pas tracer

l'argent à votre compte individuel en Amérique.

Veuillez en raison des raisons de sécurité je préférerais ne pas

discuter beaucoup sur l'Internet.

Je demande que vous envoyez votre réponse au chief_teas@voila.fr 
ainsi nous peut commencer par le changement de la

propriété et

également je devrais vous envoyer le mot de passe pour le compte

ainsi nous pouvons faire le transfert aussitôt que possible parce que

je manque de temps.  Veuillez entrer en contact avec mon agent en 
Hollande sur le No. de

téléphone.31-61-2722388 et quelque chose il vous dit que, donnez- des

instructionsmoi ainsi je peux leur donner l'avancement.

Merci pour votre coopération prévue.


Respect,


M.. Savimbi de Samuel


------------------------------------------


Faites un voeu et puis Voila ! www.voila.fr


-- 
François Granger
http://francois.granger.free.fr/

From tim at fourstonesExpressions.com  Sat Mar 29 18:17:13 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar 29 19:17:48 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <a05200f3cbaabd7e8843b@[192.168.1.20]>
Message-ID: <GEQMA7PNURLKVPZV3W4ZE3VYVHYWSO.3e863789@myst>

How interesting.  I wonder if a weakness of spambayes is to include a bunch of 
gibberish tokens that would almost surely not be in someone's database, which 
would tend to drive the spamprob strongly towards unknown prob, which is .5 by 
default...  (not that French is gibberish <wink>)  - TimS

3/29/2003 5:15:59 PM, Francois Granger <francois.granger@free.fr> wrote:

>I got this mail (see at end) classified as ham. I first tested it with a
>copy and paste into pop3proxy to check why. I then did a copy and paste
>of everything but the French part. My current database is 1.1MB.  Total
>emails trained: Spam: 487  Ham: 433
>

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim at fourstonesExpressions.com  Sat Mar 29 18:52:06 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar 29 19:52:11 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <a05200f3cbaabd7e8843b@[192.168.1.20]>
Message-ID: <ZT86953OMN08JGYECZUZZWVUQK87.3e863fb6@myst>

3/29/2003 5:15:59 PM, Francois Granger <francois.granger@free.fr> wrote:

>I got this mail (see at end) classified as ham. I first tested it with a
>copy and paste into pop3proxy to check why.

Here's my pop3proxy test with the full mail, including the French part...  
Firmly spam.

Spam probability: 0.92419807897

Clues:
 
*H* 0.0145501365628
*S* 0.862946294503
assistance. 0.0412844036697
leurs 0.0505617977528
sur 0.0505617977528
les 0.0505617977528
des 0.0505617977528
courrier 0.0505617977528
chamber 0.0918367346939
tells 0.0918367346939
raison 0.0918367346939
trying 0.139001551758
chef 0.155172413793
forces 0.155172413793
ownership 0.155172413793
province 0.155172413793
force 0.205136233853
pour 0.205136233853
soon 0.21345434995
agree 0.230877558425
going 0.236121618369
individual 0.242284867767
est 0.284013639555
got 0.290993746827
22nd 0.294934298229
skip:n 10 0.30063314409
anything 0.302156320408
skip:t 20 0.303820933856
could 0.326118902066
mail 0.326308585409
knows 0.331910224838
running 0.337647867133
but 0.339140209196
should 0.340976178118
keeping 0.364093796853
no. 0.3773889689
surprise 0.3773889689
it. 0.390905206875
thanks 0.39803669893
how 0.6002750524
there 0.605416290974
what 0.609061801031
netherlands 0.616411930908
start 0.616433461787
fact 0.621424884871
would 0.624827760755
all 0.654446720984
trace 0.655810415038
help 0.656378980481
phone 0.658303044151
mexico 0.666416791604
discuss 0.666416791604
leader 0.666416791604
send 0.668378643841
under 0.677515506493
more 0.69919514058
their 0.702904517833
much 0.70583428009
plus 0.706968820414
details 0.706968820414
one 0.707042188084
come 0.708327278837
country 0.708704976074
your 0.70918861243
skip:i 10 0.71494003345
please 0.726385151004
dear 0.726868260612
address 0.730123697827
from 0.7306605061
out 0.733744809345
business 0.735181579572
looking 0.739728588612
funds 0.743289737924
contact 0.748813284618
father 0.756245996156
advised 0.756245996156
time. 0.760504822525
because 0.772202794598
south 0.782632496655
transfer 0.782632496655
give 0.791118686042
want 0.796597204303
regards, 0.811918594754
february 0.819174677074
prefer 0.821306509832
watch 0.835226655679
samuel 0.844827586207
unit 0.844827586207
aller 0.844827586207
agent 0.844827586207
company 0.848392489438
through 0.858782955016
paid 0.883475677477
light 0.916342496387
money 0.919305321038
message-id:invalid 0.934782608696
commerce 0.934782608696
chose 0.934782608696
skip:- 40 0.935713160894
government 0.95871559633
expenses 0.96511627907
rebel 0.96511627907
assistance 0.96511627907

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From matt at mondoinfo.com  Sat Mar 29 19:13:25 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Mar 29 20:17:46 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <GEQMA7PNURLKVPZV3W4ZE3VYVHYWSO.3e863789@myst>
References: <a05200f3cbaabd7e8843b@[192.168.1.20]>
	<GEQMA7PNURLKVPZV3W4ZE3VYVHYWSO.3e863789@myst>
Message-ID: <1048985376.4.887@sake.mondoinfo.com>

Dear Tim,

> How interesting.  I wonder if a weakness of spambayes is to include
> a bunch of gibberish tokens that would almost surely not be in
> someone's database, which would tend to drive the spamprob strongly
> towards unknown prob, which is .5 by default...

I don't think it is. The point of ignoring all the clues but the most
extreme ones is that bland or gibberish words are unlikely to be
counted.

I think that the problem in this case is that Francois doesn't get
much spam in French. If he did, the bland French words (which is
almost all of them listed in the clues) would likely be ignored and
the ones that are indicative of this sort of spam ("argent", "tué",
"gouvernement", etc) would be scored correctly.

I suspect that the error is just a matter of spambayes not
recognizing a sort of spam that it hasn't been trained on.

Regards,
Matt


From tim_one at email.msn.com  Sat Mar 29 20:55:36 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sat Mar 29 20:56:15 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <GEQMA7PNURLKVPZV3W4ZE3VYVHYWSO.3e863789@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAMEEAB.tim_one@email.msn.com>

[Tim Stone]
> How interesting.  I wonder if a weakness of spambayes is to
> include a bunch of gibberish tokens that would almost surely not
> be in someone's database, which would tend to drive the spamprob
> strongly towards unknown prob, which is .5 by
> default...  (not that French is gibberish <wink>)  - TimS

That won't work:  an unknown word has, as you say, spamprob 0.5 by default,
and all words with spamprob in (.4, .6) are simply ignored by default.  They
don't affect the score at all.  In Francois's case, it seems clear that he
simply hasn't gotten (trained on) many French renditions of the Nigerian
scam, but has gotten (trained on) significant numbers of French ham.  So
even vanilla French words (like quelque) have strong ham scores for him.  So
long as it remains true that he gets very few French Nigerian scams, they'll
continue to score as ham -- but then, by supposition, they are in fact rare,
so nothing to get excited about.  If French renditions of this spam become
common, the very low ham probs of common French words will approach 0.5 (and
so common French words will become ignored), and the spamprobs of telltale
French words will get much spammier, and the system will nail French spam.
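
That drift toward 0.5 can be illustrated with a per-word estimate of the following form.  This is a sketch under assumptions: the formula and the constants 0.45 (unknown-word strength) and 0.5 (unknown-word prob) are my understanding of the defaults, chosen because they reproduce values such as 0.155172413793 and 0.0918367346939 seen in Francois's clue list; `word_spamprob` is a hypothetical name, and the counts for the "common French word" are made up.

```python
def word_spamprob(spam_count, ham_count, nspam, nham,
                  unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Estimate a word's spamprob from training counts (assumed formula).

    spam_count/ham_count: trained messages of each kind containing the word.
    nspam/nham: total trained spam/ham messages (both assumed > 0).
    """
    n = spam_count + ham_count
    if n == 0:
        return unknown_word_prob
    spam_ratio = spam_count / float(nspam)
    ham_ratio = ham_count / float(nham)
    raw = spam_ratio / (spam_ratio + ham_ratio)
    # Pull the raw ratio toward unknown_word_prob when evidence is thin.
    return ((unknown_word_strength * unknown_word_prob + n * raw)
            / (unknown_word_strength + n))


NSPAM, NHAM = 487, 433   # training totals reported by Francois

once_in_ham = word_spamprob(0, 1, NSPAM, NHAM)
# Made-up counts: a common French word seen in 20 ham and no spam,
# then again after 20 French spam containing it have been trained.
before = word_spamprob(0, 20, NSPAM, NHAM)
after = word_spamprob(20, 20, NSPAM, NHAM)
```

With these made-up counts the word starts out strongly hammy and ends up inside the ignored (.4, .6) band.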


From tim at fourstonesExpressions.com  Sat Mar 29 20:03:11 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar 29 21:03:21 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEAMEEAB.tim_one@email.msn.com>
Message-ID: <42CBSR6NHGC844121QL6ZSR2X3XKIPJ.3e86505f@myst>

3/29/2003 7:55:36 PM, "Tim Peters" <tim_one@email.msn.com> wrote:

>That won't work:  an unknown word has, as you say, spamprob 0.5 by default,
>and all words with spamprob in (.4, .6) are simply ignored by default.

That, I didn't know.  Learn something new all the time...


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim_one at email.msn.com  Sat Mar 29 21:46:34 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Sat Mar 29 21:47:49 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <42CBSR6NHGC844121QL6ZSR2X3XKIPJ.3e86505f@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEBAEEAB.tim_one@email.msn.com>

[TimP]
> That won't work:  an unknown word has, as you say, spamprob 0.5
> by default, and all words with spamprob in (.4, .6) are simply
> ignored by default.

[TimS]
> That, I didn't know.  Learn something new all the time...

FYI, it's controlled by option minimum_prob_strength.  You can arrange to
ignore nothing by setting that to 0.0 (the default is 0.1), or to ignore
everything by setting it to 0.5.  Almost all testing reports said 0.1 worked
better than 0.0; one report did a little better at 0.0, but, for the reason
you gave, a setting of 0.0 would leave an exploitable hole in the scoring.
As is, gibberish words have no effect on scoring, but do have a subtler
effect:  they bloat the database size.


From tim at fourstonesExpressions.com  Sat Mar 29 21:59:21 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat Mar 29 22:59:28 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEBAEEAB.tim_one@email.msn.com>
Message-ID: <OJNI8YUUSB9HEA0E9NJKH2Z6QKXRCB.3e866b99@myst>

3/29/2003 8:46:34 PM, "Tim Peters" <tim_one@email.msn.com> wrote:

>but do have a subtler effect:  they bloat the database size.

If I recall correctly, single occurrence words are called hapaxes, right?  
We've talked about aging before, but it seems like it would be clearly a good 
thing to age hapaxes.  After a while, ALL they will do is bloat the database, 
which is arguably a bad thing.


c'est moi - TimS


From skip at pobox.com  Sat Mar 29 22:31:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Mar 29 23:31:20 2003
Subject: [Spambayes] Back to language issue (long)
In-Reply-To: <OJNI8YUUSB9HEA0E9NJKH2Z6QKXRCB.3e866b99@myst>
References: <LNBBLJKPBEHFEDALKOLCAEBAEEAB.tim_one@email.msn.com>
        <OJNI8YUUSB9HEA0E9NJKH2Z6QKXRCB.3e866b99@myst>
Message-ID: <16006.29444.29137.628295@montanaro.dyndns.org>


    TimP> but do have a subtler effect:  they bloat the database size.

    TimS> If I recall correctly, single occurrence words are called hapaxes,
    TimS> right?  We've talked about aging before, but it seems like it
    TimS> would be clearly a good thing to age hapaxes.  After a while, ALL
    TimS> they will do is bloat the database, which is arguably a bad thing.

I retrain on my entire saved email collection periodically.  After a full
retrain, I delete all hapaxes (well, I copy the database except for the
hapaxes it contains).  It cuts the database size roughly in half, and if,
after adding more messages, those tokens are no longer hapaxes, they will be
kept after the next retrain.

Seems to work for me.
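
A rough sketch of the copy-except-hapaxes step, assuming the database can be
viewed as a mapping from token to a (spamcount, hamcount) pair.  That layout
and the function name are assumptions made for illustration; the real
Spambayes storage classes are not shown here.

```python
def drop_hapaxes(db):
    """Return a copy of db without tokens seen exactly once in training."""
    return {
        token: (spamcount, hamcount)
        for token, (spamcount, hamcount) in db.items()
        if spamcount + hamcount > 1
    }


# Hypothetical training counts: (times seen in spam, times seen in ham).
db = {"free": (40, 2), "xqzzy": (1, 0), "agenda": (0, 1), "click": (1, 1)}

pruned = drop_hapaxes(db)
print(sorted(pruned))  # ['click', 'free']
```

Because the copy is rebuilt after each full retrain, a token dropped this
time comes back on its own once it has been seen more than once.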

Skip

From noreply at sourceforge.net  Sun Mar 30 21:47:32 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 31 00:44:59 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails
Message-ID: <E18zs8q-00037a-00@sc8-sf-web4.sourceforge.net>

Bugs item #712480, was opened at 2003-03-31 05:47
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Marrero (pmarrero)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook 2002 (XP) installation fails

Initial Comment:
I use Office XP with the Outlook client.  It appears that 
the registration was successful, but I cannot find any 
menu buttons.  XP clipboard does appear to have the 
Icons.  The command line train works.  Not sure where 
to go from here.  

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

From richard at jowsey.com  Mon Mar 31 21:14:56 2003
From: richard at jowsey.com (Richard Jowsey)
Date: Mon Mar 31 06:15:05 2003
Subject: [Spambayes] Latest spammer trick stymied
Message-ID: <3E88AFD0.4984.1072C98C@localhost>

Lately (as prophesied), there have been a number of very short spams 
arriving, containing only a singleton URL. My proxy's classifier was 
giving these an "unsure" rating -- too few clues. But, these buggers 
were starting to become quite annoying...

So today I added a simple web-crawler, which will venture out on 
demand and slurp the words off any site. This little hoover is only 
unleashed when the number of distinct clues/words in an email is less 
than 150, it's heading for the "unsure" bucket, and we find an http 
URL in there. The entire source HTML is then whacked through the 
tokenizer and classified.

The extra servlet processing can take a couple seconds, mostly 
network overhead, and really only noticeable when paying close 
attention to message download times, but the results are really worth 
it! It nails them dead.
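
The three trigger conditions can be sketched in Python as below.  This is
only an illustration of the decision, not the proxy's actual code: the
function name, the regular expression and the 0.20/0.90 "unsure" band are
assumptions, and the sketch stops short of fetching anything.

```python
import re

# Deliberately simple pattern for plain http URLs (an assumption, not a full parser).
URL_RE = re.compile(r'http://[^\s<>"]+', re.IGNORECASE)


def url_to_fetch(body, num_clues, score,
                 ham_cutoff=0.20, spam_cutoff=0.90, max_clues=150):
    """Return the first http URL if all three conditions hold, else None."""
    if num_clues >= max_clues:
        return None  # enough clues already
    if not (ham_cutoff <= score < spam_cutoff):
        return None  # not heading for the "unsure" bucket
    match = URL_RE.search(body)
    return match.group(0) if match else None


print(url_to_fetch("Check this out: http://example.com/offer", 12, 0.5))
# http://example.com/offer
print(url_to_fetch("Check this out: http://example.com/offer", 400, 0.5))
# None
```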

Cheers!
Richard


From pje at telecommunity.com  Mon Mar 31 07:32:36 2003
From: pje at telecommunity.com (Phillip J. Eby)
Date: Mon Mar 31 07:33:01 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To: <3E88AFD0.4984.1072C98C@localhost>
Message-ID: <5.1.0.14.0.20030331073006.01ebac50@mail.telecommunity.com>

At 09:14 PM 3/31/03 +1000, Richard Jowsey wrote:
>So today I added a simple web-crawler, which will venture out on
>demand and slurp the words off any site. This little hoover is only
>unleashed when the number of distinct clues/words in an email is less
>than 150, it's heading for the "unsure" bucket, and we find an http
>URL in there. The entire source HTML is then whacked through the
>tokenizer and classified.

Won't this just convince spammers that:

1) Their spam is "working", because "people are clicking on the link", and

2) If there's a unique ID in the URL, it will confirm that your address is 
live and that you're a sucker for whatever it is they mailed you.  :)

Of course, I also suppose it's possible that if enough people install a 
spam filter that works this way, the resulting "spambayes effect" might 
crash a few of their servers.  :)


From anthony at interlink.com.au  Mon Mar 31 22:51:03 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Mar 31 07:52:21 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: <5.1.0.14.0.20030331073006.01ebac50@mail.telecommunity.com> 
Message-ID: <200303311251.h2VCp4419496@localhost.localdomain>


>>> "Phillip J. Eby" wrote
> Won't this just convince spammers that:
> 
> 1) Their spam is "working", because "people are clicking on the link", and

So? More fool them - hopefully they'll spend more money on this useless
technique, and go broke, sooner.

> 2) If there's a unique ID in the URL, it will confirm that your address is 
> live and that you're a sucker for whatever it is they mailed you.  :)

I figure there's little or no point to trying to hide addresses from 
spammers. Unless you never ever post to a mailing list, or to anyone 
off-site, and you've got a non-obvious username, they're going to get
your address anyway.

> Of course, I also suppose it's possible that if enough people install a 
> spam filter that works this way, the resulting "spambayes effect" might 
> crash a few of their servers.  :)

Well, if nothing else, the useless load on their webserver helps push a
little of the cost of spam back towards the spammer.


Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From tim at fourstonesExpressions.com  Mon Mar 31 07:42:48 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 08:43:25 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: <200303311251.h2VCp4419496@localhost.localdomain>
Message-ID: <FEGEB8OJ1D9RGEWTNJSPEDQPLA671.3e8845d8@myst>

3/31/2003 6:51:03 AM, Anthony Baxter <anthony@interlink.com.au> wrote:

>Well, if nothing else, the useless load on their webserver helps push a
>little of the cost of spam back towards the spammer.

We have to be careful with this.  It would be relatively simple to stymie, by 
simply adding two urls, the spam one, and an unrelated innocent site.  Or 
three urls, or whatever...

We definitely should NOT crawl the site, just in case it really is an innocent 
url.  The load can crush a site, particularly if it's hosted.  BUT, if we 
don't crawl the site, then the trick is easily stymied by simply having the 
page be a linked jpg with the appropriate information, or a flash, or 
whatever... so we're darned if we do, darned if we don't.

Spambayes is superb at recognizing spam based solely upon the payload 
received.  If these mails are slipping through, then we need to examine the 
clues and see why.  Can you show us the clues for one of your mails that 
headed for unsure?  At the moment, we clue url:<chunk>, which is very likely 
to become a hapax.  Perhaps a better solution is to create a token for the 
presence of a url...

c'est moi - TimS


From noreply at sourceforge.net  Sun Mar 30 22:05:24 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 31 10:22:04 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails
Message-ID: <E18zsQ8-0003l5-00@sc8-sf-web4.sourceforge.net>

Bugs item #712480, was opened at 2003-03-31 17:47
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Marrero (pmarrero)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook 2002 (XP) installation fails

Initial Comment:
I use Office XP with the Outlook client.  It appears that 
the registration was successful, but I cannot find any 
menu buttons.  XP clipboard does appear to have the 
Icons.  The command line train works.  Not sure where 
to go from here.  

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-31 18:05

Message:
Logged In: YES 
user_id=552329

Which version of the Outlook plugin are you using?  (a) the 
latest CVS, (b) the 001 stand-alone installer, or (c) the 002 
stand-alone installer?  I know that the 001 installer has been 
known to have this problem (although it appeared to be fixed 
in 002).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

From David.Vaughan at trizetto.com  Mon Mar 31 08:22:31 2003
From: David.Vaughan at trizetto.com (Vaughan, David)
Date: Mon Mar 31 10:27:52 2003
Subject: [Spambayes] setup
Message-ID: <F0CFA9BF0DAD3B4FAAD47A49793452A756657D@s-coengl-e06>


	I finally figured out my problem.  Netscape Webmail uses imap and my
employer does not have that port opened up.  It works at home but not here
at the office.  Go figure.

	Will Spambayes work with imap or must it be pop3?

-----Original Message-----
From: Tim Stone - Four Stones Expressions
[mailto:tim@fourstonesExpressions.com]
Sent: Tuesday, March 18, 2003 2:37 PM
To: Vaughan, David; Spambayes
Subject: Re: RE: [Spambayes] setup


3/18/2003 1:29:15 PM, "Vaughan, David" <David.Vaughan@trizetto.com> wrote:

>
>	It's not supposed to be this hard :-)
>
>	I'll keep trying but presently am unable to set up POP3.  I get the
>message "Connection to server imap.mail.netcenter.com timed out." but can
>not find in the Netscape 7.02 preferences where to set the server name.

pop3proxy does not support imap servers at this time.  For that matter, there
isn't any imap support in spambayes at this point in time... :(

c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

From tim at fourstonesExpressions.com  Mon Mar 31 09:49:14 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 10:49:20 2003
Subject: [Spambayes] setup
In-Reply-To: <F0CFA9BF0DAD3B4FAAD47A49793452A756657D@s-coengl-e06>
Message-ID: <41OKWVGBUQNYTA5POJI54PC8A9MLIF.3e88637a@myst>

3/31/2003 9:22:31 AM, "Vaughan, David" <David.Vaughan@trizetto.com> wrote:

>	Will Spambayes work with imap or must it be pop3?

There currently is no imap proxy in spambayes.  It is a documented feature 
request, but nobody has picked it up yet.  I think the problem 
(certainly from my point of view) is that imap servers to test against are not 
nearly as common as pop3 servers.

c'est moi - TimS


From dave at boost-consulting.com  Mon Mar 31 11:07:52 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Mon Mar 31 11:08:12 2003
Subject: [Spambayes] Spambayes/procmail
Message-ID: <uof3r8qp3.fsf@boost-consulting.com>


I want to set up spambayes to work with procmail on my mail server.
Does anyone have experience with that?

If not, will someone please discuss it with me?  I'm particularly
interested in what the model for getting new spam/ham classifications
to procmail might be.  My last query of 24 February went completely
unanswered, which is a little discouraging.  I have quite a learning
curve to overcome, having no experience with procmail and little with
IMAP.  If someone who knows a little about SpamBayes could at least
help me figure out which questions I need to answer in order to get
started, that would be a big help.

Thanks!

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From tim at fourstonesExpressions.com  Mon Mar 31 10:17:31 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 11:17:37 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <uof3r8qp3.fsf@boost-consulting.com>
Message-ID: <SQNJA52ZVQPJ61ICQVB6A91Y82CSQ.3e886a1b@myst>

3/31/2003 10:07:52 AM, David Abrahams <dave@boost-consulting.com> wrote:

>
>I want to set up spambayes to work with procmail on my mail server.
>Does anyone have experience with that?

You should start by reading 
http://spambayes.sourceforge.net/applications.html.  There is a link to a page 
called "guide to integrating hammie with your mailer" on that page that should 
give you some good starting points.

The subject of integrating with procmail has been discussed relatively 
extensively in the mailing list.  You might check out the archives, searching 
on procmail.  Again, start at http://spambayes.sourceforge.net

If after that you're having trouble, please be sure to drop us a line!

c'est moi - TimS


From skip at pobox.com  Mon Mar 31 10:27:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 31 11:27:41 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <uof3r8qp3.fsf@boost-consulting.com>
References: <uof3r8qp3.fsf@boost-consulting.com>
Message-ID: <16008.27761.497235.617892@montanaro.dyndns.org>


    Dave> I want to set up spambayes to work with procmail on my mail
    Dave> server.  Does anyone have experience with that?

Dave,

I use spambayes with procmail.  The major issue is generally not one of
getting messages classified, but of getting them trained.  Here are the
relevant bits out of my procmailrc file:

    PYCKSUM=$HOME/local/bin/pycksum
    HAMMIE=$HOME/local/bin/hammiefilter.py
    BAYESCUSTOMIZE=$HOME/hammie.opt

    :0 fw:hamlock
    | $HAMMIE -d $HOME/hammie.db

    :0
    * ^X-Spambayes-Classification: spam
    {
        ### this recipe gobbles items with matching body checksums (taken
        ### loosely to try and avoid obvious tricks)
        :0 W: cksum.lock
        | $PYCKSUM -v $HOME/tmp/cksum.cache

        ### spam scores come in two flavors - equal to 1.00 and less than
        ### 1.00.  scores of 1.00 are much more likely to be real spam, so
        ### require less sifting - therefore keep them separate
        :0:
        * ^X-Spambayes-Classification: spam; 1.00
        $SPAM1

        :0:
        $SPAM
    }

    :0
    * ^X-Spambayes-Classification: unsure
    unsure

    ...

You can dispense with the PYCKSUM stuff, though I find it does delete a fair
number of duplicate spams.  I get email for a large number of aliases at the
same address however.  YMMV.  I've attached the version of the script which
I use.  It's similar to the loosecksum.py script in the Spambayes utilities
directory, but incorporates the ideas Justin Mason detailed about the
SpamAssassin checksummer.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pycksum.py
Type: application/octet-stream
Size: 3099 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20030331/c1cfd712/pycksum-0001.obj
From tim at fourstonesExpressions.com  Mon Mar 31 10:36:21 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 11:36:26 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <16008.27761.497235.617892@montanaro.dyndns.org>
Message-ID: <71A8XU5ZVSQLC9HEEAKKIZURMNH42X.3e886e85@myst>

>You can dispense with the PYCKSUM stuff, though I find it does delete a fair
>number of duplicate spams.  I get email for a large number of aliases at the
>same address however.  YMMV.  I've attached the version of the script which
>I use.  It's similar to the loosecksum.py script in the Spambayes utilities
>directory, but incorporates the ideas Justin Mason detailed about the
>SpamAssassin checksummer.

Maybe you can update integration.txt with these pertinent bits?  Also, perhaps 
check in your checksummer?

c'est moi - TimS


From dave at boost-consulting.com  Mon Mar 31 11:35:06 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Mon Mar 31 11:42:04 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <16008.27761.497235.617892@montanaro.dyndns.org> (Skip
 Montanaro's message of "Mon, 31 Mar 2003 10:27:29 -0600")
References: <uof3r8qp3.fsf@boost-consulting.com>
	<16008.27761.497235.617892@montanaro.dyndns.org>
Message-ID: <u65pz8pfp.fsf@boost-consulting.com>


Skip, thanks for replying!

Skip Montanaro <skip@pobox.com> writes:

> I use spambayes with procmail.  The major issue is generally not one of
> getting messages classified, but of getting them trained.

I figured it would be; I think that's what I meant by "classified".
I do have a folder full of accumulated spam.  What has been your
strategy for training?

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Mon Mar 31 10:53:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 31 11:53:14 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <u65pz8pfp.fsf@boost-consulting.com>
References: <uof3r8qp3.fsf@boost-consulting.com>
        <16008.27761.497235.617892@montanaro.dyndns.org>
        <u65pz8pfp.fsf@boost-consulting.com>
Message-ID: <16008.29297.744394.325654@montanaro.dyndns.org>

    >> I use spambayes with procmail.  The major issue is generally not one
    >> of getting messages classified, but of getting them trained.

    Dave> I figured it would be; I think that's what I meant by
    Dave> "classified".  I do have a folder full of accumulated spam.  What
    Dave> has been your strategy for training?

Here's what I do.  It's sensitive to my particular mail setup, so you can
probably only use this as a rough guide.

My mail reader is VM inside XEmacs.  VM has a "l"abel command prefix.  I
added two new keys to its keymap, "h" and "s" (which were fortuitously
unused) to copy messages to spam and ham folders:

  (defun copy-to-spam ()
    (interactive)
    (vm-save-message (expand-file-name "~/tmp/newspam"))
    (vm-undelete-message 1))

  (defun copy-to-nonspam ()
    (interactive)
    (vm-save-message (expand-file-name "~/tmp/newham"))
    (vm-undelete-message 1))

  (define-key vm-mode-map "ls" 'copy-to-spam)
  (define-key vm-summary-mode-map "ls" 'copy-to-spam)
  (define-key vm-mode-map "lh" 'copy-to-nonspam)
  (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)

~/tmp/new{ham,spam} are then processed using a fairly simple shell script:

    #!/bin/bash

    export BAYESCUSTOMIZE=$HOME/hammie.opt
    cd ~/tmp

    base=new
    db=hammie.db

    # touch the messages up a bit to avoid spurious "clues"
    if [ -f ${base}ham -a -f ${base}spam ] ; then
        unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}ham > ${base}ham.clean
        unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}spam > ${base}spam.clean

        # do the deed
        hammie.py -d -p $db -g ${base}ham.clean -s ${base}spam.clean

        # save the files for later retraining
        cat ${base}ham.clean >> ${base}ham.clean.save
        echo "" >> ${base}ham.clean.save
        rm ${base}ham ${base}ham.clean

        cat ${base}spam.clean >> ${base}spam.clean.save
        echo "" >> ${base}spam.clean.save
        rm ${base}spam ${base}spam.clean
    else
        echo Missing ${base}ham and/or ${base}spam files
    fi

I run the train script periodically to train on new ham and spam, then copy
the resulting hammie.db file to where it's really used:

    % train
    Training ham (newham.clean):
        12
    Training spam (newspam.clean):
        29
    % cp -p hammie.db ~

This setup works fine for me, though probably won't be as attractive for
people who aren't as addicted to the shell prompt as I am.

Skip

From jh at web.de  Mon Mar 31 20:05:18 2003
From: jh at web.de (Juergen Hermann)
Date: Mon Mar 31 13:06:06 2003
Subject: [Spambayes] Added headers and no newline
Message-ID: <E1903ex-0002n1-00@smtp.web.de>

Hi!

I did not check whether this was fixed yet, I get a lot of this

X-Spambayes-Classification: ham
X-Spambayes-MailId: 1049128650-2
X-Spambayes-Spam-Probability: 8.78190881126e-009
X-Spambayes-Evidence: '*H*': 0.00; '*S*': 0.00; 'subject:] ': 0.00; 
'url:listinfo': 0.00; 'url:mailman': 0.00; 'skip:_ 40': 0.00; 
'url:python': 0.00; 'email addr:python.org': 0.00; 'subject:[': 0.00; 
'europython': 0.00; 'email name:europython': 0.00; 
'subject:EuroPython': 0.00; 'url:europython': 0.00; 
'header:Received:7': 0.00; 'idea.': 0.00; 'url:mail': 0.00; 'url:org': 
0.00; 'header:Errors-To:1': 0.00; 'think': 0.00; 'good': 0.00; 'space': 
0.00; 'list': 0.00; 'mailing': 0.00; 'some': 0.00; 'big': 0.00; 'will': 
0.00; 'url:html': 0.00; 'url:www': 0.00; 'lives': 0.00; 'url:index': 
0.00
Open Space was a big success at PyCon.
I think having some will be a good idea.
The OpenSpace manifesto lives here:'
http://www.openspaceworld.org/english/index.html

with a few weeks old spambayes. The newline before the body is missing, 
thus the first part of the message is not shown normally in the email 
client.

I'll update anyway.


Ciao, Jürgen


From skip at pobox.com  Mon Mar 31 12:33:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 31 13:34:00 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <uwuif75zh.fsf@boost-consulting.com>
References: <uof3r8qp3.fsf@boost-consulting.com>
        <16008.27761.497235.617892@montanaro.dyndns.org>
        <u65pz8pfp.fsf@boost-consulting.com>
        <16008.29297.744394.325654@montanaro.dyndns.org>
        <uwuif75zh.fsf@boost-consulting.com>
Message-ID: <16008.35343.710709.566185@montanaro.dyndns.org>


    Dave> Here's what I've never understood about this system: shouldn't it
    Dave> be enough to label spam?  GNUs gives me a key to label a message
    Dave> as spam.  If I collect all of those, shouldn't I be able to tell
    Dave> spambayes that everything in my INBOX that's been read and isn't
    Dave> in my SpamBox is ham?

I suspect you can use or adapt Neil Schemenauer's mboxtrain.py script to do
what you want.  I started doing things this way before that was an option
though.
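
For what it's worth, the "read and not in the spam box means ham" idea can
be sketched with the standard-library mailbox module.  This is not
mboxtrain.py and says nothing about how that script works; the function name
is made up, and it assumes mbox-format folders, a Status header carrying the
"R" (read) flag, and Message-ID headers that match between the two folders.

```python
import mailbox


def presumed_ham(inbox_path, spambox_path):
    """Yield read inbox messages whose Message-ID is not in the spam box."""
    spam_ids = {msg["Message-ID"] for msg in mailbox.mbox(spambox_path)}
    for msg in mailbox.mbox(inbox_path):
        if "R" in msg.get_flags() and msg["Message-ID"] not in spam_ids:
            yield msg
```

The messages it yields could then be written to a ham mbox and handed to
whatever training script is in use.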

    >> This setup works fine for me, though probably won't be as attractive
    >> for people who aren't as addicted to the shell prompt as I am.

    Dave> Well, I'm not sure I understand it yet, but I think I'll get
    Dave> there.

Yeah, it will probably take awhile.  If you fetch your email via POP you
might find the pop3proxy a better fit.  It provides a web-based training
interface.

Skip

From bill at parducci.net  Mon Mar 31 10:34:41 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 13:38:15 2003
Subject: [Spambayes] Spambayes/procmail
References: <uof3r8qp3.fsf@boost-consulting.com>
	<16008.27761.497235.617892@montanaro.dyndns.org>
	<u65pz8pfp.fsf@boost-consulting.com>
	<16008.29297.744394.325654@montanaro.dyndns.org>
Message-ID: <3E888A41.4090104@parducci.net>

i think this speaks well to the point that training is an individual and 
manual process! :o)

b

Skip Montanaro wrote:

> Here's what I do.  It's sensitive to my particular mail setup, so you can
> probably only use this as a rough guide.


From dave at boost-consulting.com  Mon Mar 31 13:20:34 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Mon Mar 31 13:54:58 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <16008.29297.744394.325654@montanaro.dyndns.org> (Skip
 Montanaro's message of "Mon, 31 Mar 2003 10:53:05 -0600")
References: <uof3r8qp3.fsf@boost-consulting.com>
	<16008.27761.497235.617892@montanaro.dyndns.org>
	<u65pz8pfp.fsf@boost-consulting.com>
	<16008.29297.744394.325654@montanaro.dyndns.org>
Message-ID: <uwuif75zh.fsf@boost-consulting.com>

Skip Montanaro <skip@pobox.com> writes:

>     >> I use spambayes with procmail.  The major issue is generally not one
>     >> of getting messages classified, but of getting them trained.
>
>     Dave> I figured it would be; I think that's what I meant by
>     Dave> "classified".  I do have a folder full of accumulated spam.  What
>     Dave> has been your strategy for training?
>
> Here's what I do.  It's sensitive to my particular mail setup, so you can
> probably only use this as a rough guide.
>
> My mail reader is VM inside XEmacs.  

I'm using GNUs, FWIW.

> VM has a "l"abel command prefix.  I added two new keys to its
> keymap, "h" and "s" (which were fortuitously unused) to copy
> messages to spam and ham folders:

Here's what I've never understood about this system: shouldn't it be
enough to label spam?  GNUs gives me a key to label a message as spam.
If I collect all of those, shouldn't I be able to tell spambayes that
everything in my INBOX that's been read and isn't in my SpamBox is
ham?

>   (defun copy-to-spam ()
>     (interactive)
>     (vm-save-message (expand-file-name "~/tmp/newspam"))
>     (vm-undelete-message 1))
>
>   (defun copy-to-nonspam ()
>     (interactive)
>     (vm-save-message (expand-file-name "~/tmp/newham"))
>     (vm-undelete-message 1))
>
>   (define-key vm-mode-map "ls" 'copy-to-spam)
>   (define-key vm-summary-mode-map "ls" 'copy-to-spam)
>   (define-key vm-mode-map "lh" 'copy-to-nonspam)
>   (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)
>
> ~/tmp/new{ham,spam} are then processed using a fairly simple shell script:
>
>     #!/bin/bash
>
>     export BAYESCUSTOMIZE=$HOME/hammie.opt
>     cd ~/tmp
>
>     base=new
>     db=hammie.db
>
>     # touch the messages up a bit to avoid spurious "clues"
>     if [ -f ${base}ham -a -f ${base}spam ] ; then
>         unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}ham > ${base}ham.clean
>         unheader.py -p 'X-VM|X-Hammie|X-Spam' ${base}spam > ${base}spam.clean
>
>         # do the deed
>         hammie.py -d -p $db -g ${base}ham.clean -s ${base}spam.clean
>
>         # save the files for later retraining
>         cat ${base}ham.clean >> ${base}ham.clean.save
>         echo "" >> ${base}ham.clean.save
>         rm ${base}ham ${base}ham.clean
>
>         cat ${base}spam.clean >> ${base}spam.clean.save
>         echo "" >> ${base}spam.clean.save
>         rm ${base}spam ${base}spam.clean
>     else
>         echo Missing ${base}ham and/or ${base}spam files
>     fi
>
> I run the train script periodically to train on new ham and spam, then copy
> the resulting hammie.db file to where it's really used:
>
>     % train
>     Training ham (newham.clean):
>         12
>     Training spam (newspam.clean):
>         29
>     % cp -p hammie.db ~
>
> This setup works fine for me, though probably won't be as attractive for
> people who aren't as addicted to the shell prompt as I am.

Well, I'm not sure I understand it yet, but I think I'll get there.
Thanks!

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From dave at boost-consulting.com  Mon Mar 31 13:52:21 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Mon Mar 31 13:55:08 2003
Subject: [Spambayes] Spambayes/procmail
In-Reply-To: <16008.35343.710709.566185@montanaro.dyndns.org> (Skip
 Montanaro's message of "Mon, 31 Mar 2003 12:33:51 -0600")
References: <uof3r8qp3.fsf@boost-consulting.com>
	<16008.27761.497235.617892@montanaro.dyndns.org>
	<u65pz8pfp.fsf@boost-consulting.com>
	<16008.29297.744394.325654@montanaro.dyndns.org>
	<uwuif75zh.fsf@boost-consulting.com>
	<16008.35343.710709.566185@montanaro.dyndns.org>
Message-ID: <uhe9j74ii.fsf@boost-consulting.com>

Skip Montanaro <skip@pobox.com> writes:

>     Dave> Here's what I've never understood about this system: shouldn't it
>     Dave> be enough to label spam?  GNUs gives me a key to label a message
>     Dave> as spam.  If I collect all of those, shouldn't I be able to tell
>     Dave> spambayes that everything in my INBOX that's been read and isn't
>     Dave> in my SpamBox is ham?
>
> I suspect you can use or adapt Neil Schemenauer's mboxtrain.py script to do
> what you want.  I started doing things this way before that was an option
> though.

Excellent!  Thank you.

>     >> This setup works fine for me, though probably won't be as attractive
>     >> for people who aren't as addicted to the shell prompt as I am.
>
>     Dave> Well, I'm not sure I understand it yet, but I think I'll get
>     Dave> there.
>
> Yeah, it will probably take awhile.  If you fetch your email via POP you
> might find the pop3proxy a better fit.  It provides a web-based training
> interface.

Nope; I'm using IMAP.

Thanks,
-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Mon Mar 31 15:46:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Mar 31 16:46:45 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: <3E893DA7.31420.20D35DB@localhost>
References: <200303311251.h2VCp4419496@localhost.localdomain>
        <3E893DA7.31420.20D35DB@localhost>
Message-ID: <16008.46908.795498.412561@montanaro.dyndns.org>


    >> We definitely should NOT crawl the site, just in case it really is an
    >> innocent url.  The load can crush a site, particularly if it's
    >> hosted.

    Richard> Nah. You need to throw thousands of requests at a half-decent
    Richard> web server before it gives up the ghost. And if they're sending
    Richard> out 10 million mail pieces, they should expect their http
    Richard> server to take some load. These are definitely NOT innocent
    Richard> emails. They come from bogus senders, have minimal headers
    Richard> (deliberately), and contain *nothing* but a url. Which points,
    Richard> via redirect naturally, to an incest porn or get-a-huge-penis
    Richard> site, etc.

You can't make that judgement beforehand.  If the site you are poking is a
valid site and the email received was not spam, none of what you said holds.
If I remember correctly, you said this was only to be performed in
circumstances where certain criteria were met, none of which included a
conclusion the mail was spam.

Skip

From popiel at wolfskeep.com  Mon Mar 31 14:15:29 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Mar 31 17:15:36 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: Message from "Richard Jowsey" <richard@jowsey.com> 
   of "Tue, 01 Apr 2003 07:20:07 +1000." <3E893DA7.31420.20D35DB@localhost> 
References: <200303311251.h2VCp4419496@localhost.localdomain>
	<3E893DA7.31420.20D35DB@localhost> 
Message-ID: <20030331221529.8A6592DDF2@cashew.wolfskeep.com>

In message:  <3E893DA7.31420.20D35DB@localhost>
             "Richard Jowsey" <richard@jowsey.com> writes:
>> We have to be careful with this.  It would be relatively simple to
>> stymie, by simply adding two urls, the spam one, and an unrelated
>> innocent site.  Or three urls, or whatever...
>
>Spammers are simple folk. They won't be putting no innocent url's in 
>these spams...

Spammers might be simple folk, but serious crackers (not the script
kiddies) certainly are not.  If there comes to be a widely deployed
tool with this sort of fetch-what-I-tell-you-to behaviour, then it
will get exploited by people wanting to do a denial of service
attack or similar.  Why bother sending out your own IRC-controlled
worm, when there's already remote-controllable spamfilters ready
and waiting to pound a site into the ground?  After all, writing
(and releasing) a worm is already recognized as a crime, but the
legality of just sending out a not-as-innocent-as-it-looks email
blast is still in contention...

- Alex

From tshumway at jdiworks.net  Mon Mar 31 14:40:42 2003
From: tshumway at jdiworks.net (tshumway@jdiworks.net)
Date: Mon Mar 31 17:37:34 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: <16008.46908.795498.412561@montanaro.dyndns.org>
References: <200303311251.h2VCp4419496@localhost.localdomain>
	<3E893DA7.31420.20D35DB@localhost>
	<16008.46908.795498.412561@montanaro.dyndns.org>
Message-ID: <1049150442.3e88c3ea2d4d9@jdiworks.net>

Quoting Skip Montanaro <skip@pobox.com>:
> 
>     >> We definitely should NOT crawl the site, just in case it really is an
>     >> innocent url.  The load can crush a site, particularly if it's
>     >> hosted.
> 
>     Richard> Nah. You need to throw thousands of requests at a half-decent
>     Richard> web server before it gives up the ghost. And if they're sending
>     Richard> out 10 million mail pieces, they should expect their http
>     Richard> server to take some load. These are definitely NOT innocent
>     Richard> emails. They come from bogus senders, have minimal headers
>     Richard> (deliberately), and contain *nothing* but a url. Which points,
> 
> You can't make that judgement beforehand.  If the site you are poking is a
> valid site and the email received was not spam, none of what you said holds.
> If I remember correctly, you said this was only to be performed in
> circumstances where certain criteria were met, none of which included a
> conclusion the mail was spam.

Anyone who includes a URL in a mail message will probably be prepared for some
load based on the number of people receiving the message. If I send a message to
a client asking him to look at a web site on a staging server, I expect a dozen
or so hits, followed by a phone call.  If I send a message to my family mailing
list, I expect a couple hundred hits (followed by a complaint from my brother
that his picture looks ugly (What can I do? 8-) ).  If an evil spammer sends a
URL to 50 million addresses, it might expect (hope for) a decent slashdot spike.

Interpreting the results of the http request opens a new can of worms.  All of 
the tricks we use to mangle addresses (javascript, formmail honeypots,
user-agent based web-pages, funky encodings, etc.) can now be used by the
spammer against us. hmmm. I think it will take a while for that to become a
major problem.

In a server-side deployment where the same spam is likely to reach many hosted
mailboxes, a specialized proxy server might be able to reduce the perceived
response rate and the wasted bandwidth.
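
A rough sketch of that proxy idea, assuming the filter is handed some fetch function (the `fetch` callable and the `FetchCache` name are hypothetical, not part of spambayes): each distinct URL is requested once, however many mailboxes receive it.

```python
class FetchCache:
    """Remember the result for each URL so it is requested only once."""

    def __init__(self, fetch):
        self._fetch = fetch   # hypothetical callable: url -> page text
        self._seen = {}       # url -> cached result
        self.requests = 0     # number of real fetches performed

    def get(self, url):
        if url not in self._seen:
            self.requests += 1
            self._seen[url] = self._fetch(url)
        return self._seen[url]


# Stand-in fetcher for illustration; no network access happens here.
cache = FetchCache(lambda url: "body of " + url)
for _ in range(1000):             # same spam delivered to 1000 mailboxes
    cache.get("http://example.invalid/ad")
print(cache.requests)
```

This only limits the load from one server; it does nothing about the interpretation problems mentioned above.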


 -- Terrel


From tim at fourstonesExpressions.com  Mon Mar 31 16:05:23 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 18:01:56 2003
Subject: [Spambayes] Latest spammer trick stymied 
In-Reply-To: <16008.46908.795498.412561@montanaro.dyndns.org>
Message-ID: <SMQKB7042UF0A0TDAC0A7VUB9UQ05LK.3e88bba3@myst>

3/31/2003 3:46:36 PM, Skip Montanaro <skip@pobox.com> wrote:

>
>    >> We definitely should NOT crawl the site, just in case it really is an
>    >> innocent url.  The load can crush a site, particularly if it's
>    >> hosted.
>
>    Richard> Nah. You need to throw thousands of requests at a half-decent
>    Richard> web server before it gives up the ghost. And if they're sending
>    Richard> out 10 million mail pieces, they should expect their http
>    Richard> server to take some load. These are definitely NOT innocent
>    Richard> emails. They come from bogus senders, have minimal headers
>    Richard> (deliberately), and contain *nothing* but a url. Which points,
>    Richard> via redirect naturally, to an incest porn or get-a-huge-penis
>    Richard> site, etc.
>
>You can't make that judgement beforehand.  If the site you are poking is a
>valid site and the email received was not spam, none of what you said holds.
>If I remember correctly, you said this was only to be performed in
>circumstances where certain criteria were met, none of which included a
>conclusion the mail was spam.

That's right.  We really should try to solve this problem with tokenization.

>
>Skip
>
>


c'est moi - TimS
http://www.fourstonesExpressions.com
http://wecanstopspam.org

There are 10 kinds of people in the world:
  those who understand binary,
  and those who don't.


From tim at fourstonesExpressions.com  Mon Mar 31 17:04:33 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 18:04:39 2003
Subject: [Spambayes] Latest spammer trick stymied 
Message-ID: <1U071TPJPNNL41DC4WXS95KHOMCALKPK.3e88c981@myst>

3/31/2003 4:15:29 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <3E893DA7.31420.20D35DB@localhost>
>             "Richard Jowsey" <richard@jowsey.com> writes:
>>> We have to be careful with this.  It would be relatively simple to
>>> stymie, by simply adding two urls, the spam one, and an unrelated
>>> innocent site.  Or three urls, or whatever...
>>
>>Spammers are simple folk. They won't be putting no innocent url's in 
>>these spams...
>
>Spammers might be simple folk, but serious crackers (not the script
>kiddies) certainly are not.  If there comes to be a widely deployed
>tool with this sort of fetch-what-I-tell-you-to behaviour, then it
>will get exploited by people wanting to do a denial of service
>attack or similar.  Why bother sending out your own IRC-controlled
>worm, when there's already remote-controllable spamfilters ready
>and waiting to pound a site into the ground?  After all, writing
>(and releasing) a worm is already recognized as a crime, but the
>legality of just sending out a not-as-innocent-as-it-looks email
>blast is still in contention...

EXCELLENT point, Alex.  Case closed.

>
>- Alex




From bill at parducci.net  Mon Mar 31 15:21:24 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 18:24:58 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
References: <200303311251.h2VCp4419496@localhost.localdomain>
	<3E893DA7.31420.20D35DB@localhost>
	<16008.46908.795498.412561@montanaro.dyndns.org>
	<1049150442.3e88c3ea2d4d9@jdiworks.net>
Message-ID: <3E88CD74.4050405@parducci.net>

currently, does spambayes treat a URL as a single token or is it parsed 
somehow?

it would seem that if URLs were parsed you would be able to train 
spambayes to detect mail for odious content based on components of the link.

take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj

it would seem that the most accurate way to evaluate this would be to 
parse using '/' (starting after 'http://'). that would allow spambayes 
to evaluate the domain (check.myspam.com) while giving it the ability to 
differentiate between directories (which may map to users on ISP 
systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer).


b


From popiel at wolfskeep.com  Mon Mar 31 16:06:06 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Mar 31 19:06:10 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
   of "Mon, 31 Mar 2003 15:21:24 PST." <3E88CD74.4050405@parducci.net> 
References: <200303311251.h2VCp4419496@localhost.localdomain>
	<3E893DA7.31420.20D35DB@localhost>
	<16008.46908.795498.412561@montanaro.dyndns.org>
	<1049150442.3e88c3ea2d4d9@jdiworks.net>  <3E88CD74.4050405@parducci.net> 
Message-ID: <20030401000606.33A3A2DDF2@cashew.wolfskeep.com>

In message:  <3E88CD74.4050405@parducci.net>
             bill parducci <bill@parducci.net> writes:

>currently, does spambayes treat a URL as a single token or is it parsed 
>somehow?

URLs are parsed with the following code:

| urlsep_re = re.compile(r"[;?:@&=+,$.]")
| 
| class URLStripper(Stripper):
|     def __init__(self):
|         # The empty regexp matches anything at once.
|         Stripper.__init__(self, url_re.search, re.compile("").search)
| 
|     def tokenize(self, m):
|         proto, guts = m.groups()
|         tokens = ["proto:" + proto]
|         pushclue = tokens.append
| 
|         # Lose the trailing punctuation for casual embedding, like:
|         #     The code is at http://mystuff.org/here?  Didn't resolve.
|         # or
|         #     I found it at http://mystuff.org/there/.  Thanks!
|         assert guts
|         while guts and guts[-1] in '.:?!/':
|             guts = guts[:-1]
|         for piece in guts.split('/'):
|             for chunk in urlsep_re.split(piece):
|                 pushclue("url:" + chunk)
|         return tokens

>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj

That example would yield the tokens:

  proto:http
  url:check
  url:myspam
  url:com
  url:ad
  url:junk
  url:random
  url:fsldkjflksj
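
For anyone who wants to try this outside the tokenizer, here is a standalone sketch of the same splitting logic. The `url_re` pattern below is a simplified stand-in I made up for illustration (the real one is not shown in this message), and `tokenize_urls` is a hypothetical helper name.

```python
import re

# Assumption: simplified stand-in for the tokenizer's url_re.
url_re = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
urlsep_re = re.compile(r"[;?:@&=+,$.]")


def tokenize_urls(text):
    """Return proto:/url: tokens for every URL found in text."""
    tokens = []
    for m in url_re.finditer(text):
        proto, guts = m.groups()
        tokens.append("proto:" + proto)
        # Lose trailing punctuation from casual embedding.
        while guts and guts[-1] in '.:?!/':
            guts = guts[:-1]
        # Split on '/' first, then on the URL separator characters.
        for piece in guts.split('/'):
            for chunk in urlsep_re.split(piece):
                tokens.append("url:" + chunk)
    return tokens


for tok in tokenize_urls("http://check.myspam.com/ad/junk?random=fsldkjflksj"):
    print(tok)
```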

>it would seem that the most accurate way to evaluate this would be to 
>parse using '/' (starting after 'http://'). that would allow spambayes 
>to evaluate the domain (check.myspam.com) while giving it the ability to 
>differentiate between directories (which may map to users on ISP 
>systems: http://user.aol.com/niceguy vs. http://user.aol.com/spammer).

This already happens to some extent, though I think there could
be better handling of the composite hostname and directory path...
to wit, I suspect that adding the following tokens would help:

  url:myspam.com
  url:check.myspam.com
  url:check.myspam.com/ad
  url:check.myspam.com/ad/junk

I haven't tested this yet, but I further suspect that I will have
Tim Peters' problem: my results are already good enough that I won't
be able to say anything conclusive about it.
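
A minimal, untested-against-real-mail sketch of how those composite tokens could be generated. The function name `composite_url_tokens` is hypothetical, and it assumes the protocol has already been stripped off, as `guts` is in the code above.

```python
def composite_url_tokens(guts):
    """Build host-suffix and host+path-prefix tokens from 'host/path?query'."""
    guts = guts.split('?', 1)[0]          # ignore any query string
    host, _, path = guts.partition('/')
    labels = host.split('.')
    tokens = []
    # Host suffixes, shortest first, skipping the bare top-level domain.
    for i in range(len(labels) - 2, -1, -1):
        tokens.append("url:" + ".".join(labels[i:]))
    # Full host followed by each successively longer path prefix.
    prefix = host
    for segment in path.split('/'):
        if segment:
            prefix += "/" + segment
            tokens.append("url:" + prefix)
    return tokens


for tok in composite_url_tokens("check.myspam.com/ad/junk?random=fsldkjflksj"):
    print(tok)
```

Whether these extra tokens help or hurt is exactly the part that needs measuring.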

- Alex

From bill at parducci.net  Mon Mar 31 16:36:48 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 19:41:42 2003
Subject: [Spambayes] Latest spammer trick stymied
References: <LCEPIIGDJPKCOIHOBJEPCELIOKAA.mhammond@skippinet.com.au>
Message-ID: <3E88DF20.9080204@parducci.net>

Mark Hammond wrote:

> Could you not do the same thing today, by sending out a HTML email
> referencing some images from the server you want to attack?  Given the
> number of mail clients out there that will fetch these images (using their
> mailers default settings), I would expect this to remain a far more
> effective attack than the one you propose.

yes, that would DoS the [http] target, but one could DoS the [mail] 
recipient's system by sending multiple messages linking to a site that 
is overloaded (or intentionally slow) so that the [blocking] 'slurp' 
event clogs up the mail processing flow.

it's just a matter of whom you wish to annoy. :o)

b


From tim.one at comcast.net  Mon Mar 31 19:37:16 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 31 19:44:19 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
In-Reply-To: <20030401000606.33A3A2DDF2@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEDCECAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> This already happens to some extent, though I think there could
> be better handling of the composite hostname and directory path...
> to wit, I suspect that adding the following tokens would help:
>
>   url:myspam.com
>   url:check.myspam.com
>   url:check.myspam.com/ad
>   url:check.myspam.com/ad/junk
>
> I haven't tested this yet, but I further suspect that I will have
> Tim Peters' problem: my results are already good enough that I won't
> be able to say anything conclusive about it.

Mining embedded URLs was the first tokenization enhancement added to the
project, and it instantly cut the false negative rate in half -- that
remains the single biggest win we ever got.  At first, it was fancier than
it is now.  The scheme got simpler over time, as testing showed no
significant difference in results as more gimmicks got thrown out.

Note that we actually generate more tokens than meet the eye for spam like:

"""
X-Message-Info: JGTYoYF78jEHjJx36Oi8+Q1OJDRSDidP
Received: from wildlife.com ([4.40.47.205]) by mc9-f10.bay6.hotmail.com with
	Microsoft SMTPSVC(5.0.2195.5600);	 Sun, 30 Mar 2003 23:44:18 -0800
Date: Sun, 30 Mar 2003 01:37:18 -0300
From: "Ella Schotte" <skoocea@wildlife.com>
To: <tim_one@email.msn.com>
Message-ID: <20030330013718.9ltGDlkp5jmJ@wildlife.com>
Content-Type: text/plain
Subject: with Daughter
Return-Path: skoocea@wildlife.com
X-OriginalArrivalTime: 31 Mar 2003 07:44:18.0807 (UTC)
	FILETIME=[56139870:01C2F759]


http://jeajeeceap.lewdmother.com
"""


The complete list of tokens generated by the Outlook client by default for
that is:

'cc:none'
'content-type:text/plain'
'from:addr:skoocea'
'from:addr:wildlife.com'
'from:name:ella schotte'
'header:Date:1'
'header:From:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'message-id:@wildlife.com'
'noheader:abuse-reports-to'
'noheader:errors-to'
'noheader:importance'
'noheader:in-reply-to'
'noheader:mime-version'
'noheader:organization'
'noheader:reply-to'
'noheader:user-agent'
'noheader:x-abuse-info'
'noheader:x-complaints-to'
'noheader:x-face'
'proto:http'
'reply-to:none'
'sender:none'
'subject: '
'subject:Daughter'
'subject:with'
'to:2**0'
'to:addr:email.msn.com'
'to:addr:tim_one'
'to:no real name:2**0'
'url:com'
'url:jeajeeceap'
'url:lewdmother'
'x-mailer:none'

Currently, in my home classifier, only 7 of those have spamprobs outside of
(.4, .6), so 31 tokens are ignored.  If "minimal headers" becomes a popular
spam gimmick, that will boost the spamprobs of the assorted "noheader:xyz"
and "xyz:none" tokens, to the point where they're no longer ignored.
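
To make the "ignored" part concrete, here is a small sketch of that filtering step. The spamprob values are made up for illustration (a real run would read them from a trained database), and `significant_tokens` is a hypothetical name, not the classifier's own function.

```python
def significant_tokens(spamprobs, low=0.4, high=0.6):
    """Return tokens whose spamprob lies outside the open interval (low, high)."""
    return sorted(tok for tok, p in spamprobs.items() if not (low < p < high))


# Illustrative values only; not taken from any real classifier database.
example = {
    "url:lewdmother": 0.99,
    "noheader:mime-version": 0.71,
    "proto:http": 0.56,
    "url:com": 0.52,
    "subject:with": 0.45,
}

# Tokens near 0.5 carry little evidence either way and are dropped.
print(significant_tokens(example))
```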


From tim at fourstonesExpressions.com  Mon Mar 31 18:44:53 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 19:45:03 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCELIOKAA.mhammond@skippinet.com.au>
Message-ID: <QP04PNB06KIU9885RQYTJNKB0RPSO.3e88e105@myst>

3/31/2003 6:24:36 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:


>[Tim S again]
>> EXCELLENT point, Alex.  Case closed.
>
>I'm not sure who you are speaking for here <wink>.  But yeah, fetching the
>URL does seem the wrong long-term approach.  I'm very impressed with the
>creativity of the idea though - I see lots of these spams and did wonder WTF
>we could do about it.

Speaking for myself, of course... 

We currently do not provide a token for the *presence* of a url.  I'm not sure if this
would have pushed it toward spamminess or not, but it bears researching.



From tim.one at comcast.net  Mon Mar 31 19:59:56 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 31 20:01:18 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To: <QP04PNB06KIU9885RQYTJNKB0RPSO.3e88e105@myst>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEDEECAB.tim.one@comcast.net>

[Tim Stone]
> Speaking for myself, of course...
>
> We currently do not provide a token for the *presence* of a url.

We already generate one of

    proto:http
    proto:https
    proto:ftp

depending on what's appropriate.

> I'm not sure if this would have pushed it toward spamminess or not, but it
> bears researching.

Look in your database for the spamprob on 'proto:http'.  My bet is that it's
near neutral; it's reasonable to expect that a "found a URL" token would
have the same spamprob.


From bill at parducci.net  Mon Mar 31 17:15:02 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 20:18:37 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
References: <200303311251.h2VCp4419496@localhost.localdomain>
	<3E893DA7.31420.20D35DB@localhost>
	<16008.46908.795498.412561@montanaro.dyndns.org>
	<1049150442.3e88c3ea2d4d9@jdiworks.net>  <3E88CD74.4050405@parducci.net>
	<20030401000606.33A3A2DDF2@cashew.wolfskeep.com>
Message-ID: <3E88E816.4060003@parducci.net>

T. Alexander Popiel wrote:
>>take the example: http://check.myspam.com/ad/junk?random=fsldkjflksj
> That example would yield the tokens:
> 
>   proto:http
>   url:check
>   url:myspam
>   url:com
>   url:ad
>   url:junk
>   url:random
>   url:fsldkjflksj

<bayesian ignorance shields up>
doesn't the degree of granularity here dilute the information? in other 
words, 'com' and 'junk' are extremely common, while 'myspam.com' less so 
and 'check.myspam.com' completely unique. since neutral tokens are 
ignored, words like these may not be considered, while the following 
most likely would be considered:

>   url:myspam.com
>   url:check.myspam.com
>   url:check.myspam.com/ad
>   url:check.myspam.com/ad/junk

therefore, in the case of url parsing, it would seem that less 
[granularity] is more [accuracy].

</shields>

b


From tim.one at comcast.net  Mon Mar 31 20:28:43 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 31 20:40:03 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
In-Reply-To: <3E88E816.4060003@parducci.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEDIECAB.tim.one@comcast.net>

[bill parducci]
> <bayesian ignorance shields up>
> doesn't the degree of granularity here dilute the information? in other
> words, 'com' and 'junk' are extremely common, while 'myspam.com' less so
> and 'check.myspam.com' completely unique. since neutral tokens are
> ignored, words like these may not be considered, while the following
> most likely would be considered:
>
>>   url:myspam.com

That's decent, but likely no better than url:myspam.

>>   url:check.myspam.com
>>   url:check.myspam.com/ad
>>   url:check.myspam.com/ad/junk

Those are probably one-shot hapaxes (i.e., worthless, except for catching
copies of the same spam).  If you own a domain xyz.com, then you can make up
all the ABC.xyz.com targets you like, and spammers generally do.  ABC
doesn't repeat often except in copies of the same spam.

> therefore, in the case of url parsing, it would seem that less
> [granularity] is more [accuracy].

Test and measure.


From tim.one at comcast.net  Mon Mar 31 20:36:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Mar 31 20:44:44 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCELIOKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEDJECAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> But yeah, fetching the URL does seem the wrong long-term approach.

Hard to say.

> I'm very impressed with the creativity of the idea though - I see lots of
these
> spams and did wonder WTF we could do about it.

I suggest you wait <wink>.  I saw a lot of these last week, but a lot less
this week so far.  As advertising, sending a single URL has got to suck:
who would click on it, and why, especially after the novelty wears off?  For
reasons explained earlier, if this is combined with the minimal-header
gimmick, positive tokens generated for the absence of assorted header lines
will eventually get high spamprobs too.


From bill at parducci.net  Mon Mar 31 17:55:40 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 20:59:15 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
References: <LNBBLJKPBEHFEDALKOLCEEDIECAB.tim.one@comcast.net>
Message-ID: <3E88F19C.9040008@parducci.net>

Tim Peters wrote:
>>>  url:check.myspam.com
>>>  url:check.myspam.com/ad
>>>  url:check.myspam.com/ad/junk
> 
> Those are probably one-shot hapaxes (i.e., worthless, except for catching
> copies of the same spam).  If you own a domain xyz.com, then you can make up
> all the ABC.xyz.com targets you like, and spammers generally do.  ABC
> doesn't repeat often except in copies of the same spam.

empirically i am not so sure. below are links that have been arriving 
daily in my trolling account (each listed twice per note, one supposedly 
in case you are having problems with the other):

http://www.nudesletter.com/schoolgirl-FEB/index.html
http://www.nudesletter.com/auditions-SC/index.html
http://www.nudesletter.com/8thstreet-ND/index.html
http://www.nudesletter.com/multi-FEB/index.html
http://www.nudesletter.com/russians-WG/index.html

while the goal is the same (traffic to www.nudesletter.com), each day 
the url changes. there are a number of other spam threads that work 
similarly.
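
as a quick illustration (a sketch only, using a simplified splitter modelled on the code posted earlier in the thread; `url_tokens` is a made-up helper name), counting tokens across those five links shows which pieces repeat every day and which appear only once:

```python
import re
from collections import Counter

urlsep_re = re.compile(r"[;?:@&=+,$.]")


def url_tokens(url):
    """Split one URL into proto:/url: tokens (simplified sketch)."""
    proto, _, guts = url.partition("://")
    tokens = ["proto:" + proto]
    guts = guts.rstrip(".:?!/")
    for piece in guts.split("/"):
        for chunk in urlsep_re.split(piece):
            tokens.append("url:" + chunk)
    return tokens


urls = [
    "http://www.nudesletter.com/schoolgirl-FEB/index.html",
    "http://www.nudesletter.com/auditions-SC/index.html",
    "http://www.nudesletter.com/8thstreet-ND/index.html",
    "http://www.nudesletter.com/multi-FEB/index.html",
    "http://www.nudesletter.com/russians-WG/index.html",
]

counts = Counter(tok for u in urls for tok in url_tokens(u))
repeated = sorted(t for t, n in counts.items() if n == len(urls))
hapaxes = sorted(t for t, n in counts.items() if n == 1)
print("in every link:", repeated)
print("seen once:", hapaxes)
```

the domain piece repeats in every link while the directory piece changes daily, so on this sample it is the directory names that behave like hapaxes.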

>>therefore, in the case of url parsing, it would seem that less
>>[granularity] is more [accuracy].
> 
> Test and measure.

you left off 'write code' before 'test and measure'. i am still coming 
up to speed there so for me this will have to stay in the theoretical 
for the time being.

b


From neale at woozle.org  Mon Mar 31 17:57:24 2003
From: neale at woozle.org (Neale Pickett)
Date: Mon Mar 31 21:03:33 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEDCECAB.tim.one@comcast.net> (Tim
 Peters's message of "Mon, 31 Mar 2003 19:37:16 -0500")
References: <LNBBLJKPBEHFEDALKOLCOEDCECAB.tim.one@comcast.net>
Message-ID: <w534r5jt1x7.fsf@woozle.org>

Tim Peters <tim.one@comcast.net> writes:

> The scheme got simpler over time, as testing showed no significant
> difference in results as more gimmicks got thrown out.

Hi gang.  I'm not supposed to be working on this project anymore but I
just can't help following up to this one.  I see Tim answering a lot of
"I've got a cool tokenizing idea" questions.  So many, in fact, that I
think there ought to be a FAQ on the web page somewhere, to the tune of:

Q: Hey!  Why don't you implement cool tokenizer trick X?  I think it
   would really foil those spammers!

A: Have you run your tokenizer trick against a set of messages to see if
   it actually works?  Many times what seems like a good idea turns out
   not to help much, and sometimes even hurts.  If you have a good idea,
   you've run it against a batch of messages and can prove that it
   helps, paste the code for your technique and the proof to the mailing
   list.  Otherwise, you will likely get a message from Tim Peters about
   why you need to test your idea :)

Just an idea.

Neale

From bill at parducci.net  Mon Mar 31 18:22:31 2003
From: bill at parducci.net (bill parducci)
Date: Mon Mar 31 21:26:05 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
References: <1ED4ECF91CDED24C8D012BCF2B034F13010B4430@its-xchg4.massey.ac.nz>
Message-ID: <3E88F7E7.3070708@parducci.net>

 > Many times what seems like a good idea turns
 > out not to help much, and sometimes even hurts.

this very thread started with such an approach [build and show] and was 
predominantly dismissed. this may not have an effect on the 
implementer's use of the modification, but i would hate to think that 
this would be the only 'allowable' method by which ideas can be posted.

...and sometimes someone else has tried it and it didn't help. why would 
you want to force people to reinvent the wheel before discussing an idea?

b


From noreply at sourceforge.net  Mon Mar 31 18:48:48 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon Mar 31 21:34:49 2003
Subject: [Spambayes] 
	[ spambayes-Bugs-712480 ] Outlook 2002 (XP) installation fails
Message-ID: <E190BpQ-0003O9-00@sc8-sf-web2.sourceforge.net>

Bugs item #712480, was opened at 2003-03-31 17:47
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paul Marrero (pmarrero)
Assigned to: Mark Hammond (mhammond)
Summary: Outlook 2002 (XP) installation fails

Initial Comment:
I use office XP with the Outlook client.  It appears that 
the registration was successful but I cannot find any 
menu buttons.  XP clipboard does appear to have the 
Icons.  The command line train works.  Not sure where 
to go from here.  

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2003-04-01 14:48

Message:
Logged In: YES 
user_id=552329

Actually, I get this too.  I've just switched to Outlook XP, so 
I'm not sure if this is the reason, or just that I'm doing a fresh 
install.  The log includes the following traces:

SpamAddin - Connecting to Outlook
Failed to load bayes database
Traceback (most recent call last):
  File "E:\src\spambayes\Outlook2000\manager.py", line 310, 
in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, 
in open_bayes
AttributeError: 'module' object has no 
attribute 'DBDictClassifier'
Loaded message database from 'C:\Documents and 
Settings\tameyer\Application 
Data\SpamBayes\default_message_database.db'
Either bayes database or message database is missing - 
creating new
pythoncom error: Failed to call the universal dispatcher
Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, 
in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 
322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 
601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 
541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in 
OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, 
in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 165, 
in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 329, 
in LoadBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 378, 
in InitNewBayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 94, 
in new_bayes
  File "E:\src\spambayes\Outlook2000\manager.py", line 118, 
in open_bayes
exceptions.AttributeError: 'module' object has no 
attribute 'DBDictClassifier'


----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-03-31 18:05

Message:
Logged In: YES 
user_id=552329

Which version of the Outlook plugin are you using?  (a) the 
latest CVS, (b) the 001 stand-alone installer, or (c) the 002 
stand-alone installer?  I know that the 001 installer has been 
known to have this problem (although it appeared to be fixed 
in 002).

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=712480&group_id=61702

From tim at fourstonesExpressions.com  Mon Mar 31 19:05:03 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 23:43:53 2003
Subject: [Spambayes] Latest spammer trick stymied
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEDEECAB.tim.one@comcast.net>
Message-ID: <SN61ZWD98YWNKA0ZUZWQKMITPJEQM2X.3e88e5bf@myst>

3/31/2003 6:59:56 PM, Tim Peters <tim.one@comcast.net> wrote:


>
>Look in your database for the spamprob on 'proto:http'.  My bet is that it's
>near neutral; it's reasonable to expect that a "found a URL" token would
>have the same spamprob.

Ok. I missed that one.  Yeah, it's .56 or so.  So that idea's a dumb one.  ;)  
So what's your take on the slurping thing, Tim?




From tim at fourstonesExpressions.com  Mon Mar 31 22:47:10 2003
From: tim at fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon Mar 31 23:47:21 2003
Subject: [Spambayes] Latest spammer trick stymied - QUESTION
In-Reply-To: <w534r5jt1x7.fsf@woozle.org>
Message-ID: <TSNIWT1TKIIEKJIWQDCYW97KHWVMJZ.3e8919ce@myst>

3/31/2003 7:57:24 PM, Neale Pickett <neale@woozle.org> wrote:

>Tim Peters <tim.one@comcast.net> writes:
>
>> The scheme got simpler over time, as testing showed no significant
>> difference in results as more gimmicks got thrown out.
>
>Hi gang.  I'm not supposed to be working on this project anymore but I
>just can't help following up to this one.  I see Tim answering a lot of
>"I've got a cool tokenizing idea" questions.  So many, in fact, that I
>think there ought to be a FAQ on the web page somewhere, to the tune of:
>

Problem there is, that it seems like the spambayes site is the last place 
people look for information.  ;)
