From kennypitt at hotmail.com  Mon Dec  1 09:40:59 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Dec  1 09:41:35 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz>
Message-ID: <Law11-OE49U3CGG792B00007a61@hotmail.com>

Tony Meyer wrote:
> Well, yes.  My only worry is that the trunk contains various code
> that the branch doesn't which is definitely not bug-fix and does
> change sb_server.py, UserInterface.py and ProxyUI.py a fair bit.  Of
> course, if I did it right, I didn't introduce any bugs adding that
> code anyway, but ... 

I've been using the updated version for quite a while now, both from
source and with the new py2exe Windows binary.  We wrung out a couple of
minor bugs early on, but I haven't had any problems with it in several
weeks.

> OTOH, it would be nice to have some of those features in a release
> sooner than May 04 :)  (Especially the one that lets people submit a
> decent bug report).

Haven't tried the bug report feature, so can't give you a read on the
stability there.

>> 3) Put together a binary from my current py2exe setup script,
>> which includes CVS and a number of sb_ programs.
> 
> Does one need a special version of py2exe for this?  If so, is it one
> that there's a binary available for?  (i.e. can I do this without
> VC++?) 

It's the version in the sandbox subdirectory in py2exe CVS.  There
currently isn't a binary install that I know of, but I'm sure someone
could throw one together if needed.  I've been meaning to do it myself
so I can put it on my computer at home, but just haven't gotten around
to it.

On an at best partially related aside, if/when we redo the release
branching could we possibly do something with the version numbering in
Version.py?  It seems a bit confusing to have a completely different
version number for every app, especially when they appear to be totally
unrelated to the "1.0a7" type release numbering.

-- 
Kenny Pitt


From skip at pobox.com  Mon Dec  1 10:01:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec  1 10:01:24 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <027801c3b472$2b4c06a0$0200a8c0@eden>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>
	<027801c3b472$2b4c06a0$0200a8c0@eden>
Message-ID: <16331.22457.680172.948562@montanaro.dyndns.org>


    Mark> Should we abandon the branch, merging everything back to the
    Mark> trunk?  

Yes.  Let's reserve branches for the brief time leading up to a release and
for maintenance of old releases (if we decide that's worth doing, which I
doubt we will).

Skip

From Hugo.Duncan at alcan.com  Mon Dec  1 11:44:13 2003
From: Hugo.Duncan at alcan.com (Hugo.Duncan@alcan.com)
Date: Mon Dec  1 11:41:00 2003
Subject: [spambayes-dev] sb_notesfilter.py changes
Message-ID: <OF36FD87ED.AEC13A96-ON85256DEF.005B05F9-85256DEF.005BA471@alcan.com>


Hi,

I downloaded spambayes a few days ago, saw that you had some notes support,
and tried to integrate the script to run on receipt of mail in my notes
client.

These were the changes that I had to make (attached diff file):

add -P password option to specify the notes password.  Not terribly secure
I guess, but you don't have to use it if you don't want to.

make it so that the pathname of the mail database is used both on server
and on local machine.

make the replication occur only if running on the server fails.

allow redirection of stdout and stderr to file (-R filename)

allow logging to a notes database.

add field "Spam" to processed mail, to record spam probability.


I then added an "agent" (lotus speak for script) that processes new mail,
and some menu options for manually marking as spam,  for unmarking
falsely classified spam and for training as ham.

Regards,
Hugo

(See attached file: sb_notesfilter.diff)

Notice:
This message and any attachments are the property of Alcan and are intended
solely for the named recipients or entity to whom this message is
addressed. If you have received this message in error please inform the
sender via e-mail and destroy the message. If you are not the intended
recipient you are not allowed to use, copy or disclose the contents or
attachments in whole or in part.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_notesfilter.diff
Type: application/octet-stream
Size: 6216 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031201/55cee7d2/sb_notesfilter-0001.obj
From tim at fourstonesExpressions.com  Mon Dec  1 11:52:23 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec  1 11:52:55 2003
Subject: [spambayes-dev] sb_notesfilter.py changes
In-Reply-To: <OF36FD87ED.AEC13A96-ON85256DEF.005B05F9-85256DEF.005BA471@alcan.com>
References: <OF36FD87ED.AEC13A96-ON85256DEF.005B05F9-85256DEF.005BA471@alcan.com>
Message-ID: <oprzig5lnnit6vze@mail.fourstonesExpressions.com>

> Hi,
>
> I downloaded spambayes a few days ago, saw that you had some notes
> support, and tried to integrate the script to run on receipt of mail in
> my notes client.

Very primitive support, as you have seen.  Glad to see that these were the 
only changes you required <wink>.  Would you let me know how it works for 
you as time goes on?  One thing that I noticed was that it became less and 
less able to accurately classify, and I think it's related to a couple of 
things.  One is that notes does not give you headers in the rfc822 sense.  
Thus the "no headers found" token becomes the deciding factor if your 
spam/ham ratio gets way out of whack, which mine was... like 3000:250 or 
something like that.  It started deciding that almost everything was 
spam... let me know what your experience is.  If your s:h ratio remains 
more reasonable, i would expect it to behave much more reasonably.

I'd love to have a notes integration, similar to the outlook integration, 
that doesn't rely on an external program using the com interface... but 
that is beyond my abilities with notes.

>
> These were the changes that I had to make (attached diff file):
>
> add -P password option to specify the notes password.  Not terribly
> secure I guess, but you don't have to use it if you don't want to.
>
> make it so that the pathname of the mail database is used both on server
> and on local machine.
>
> make the replication occur only if running on the server fails.
>
> allow redirection of stdout and stderr to file (-R filename)
>
> allow logging to a notes database.
>
> add file "Spam" to processed mail, to record spam probability.

I'll check 'em out!  Thanks.

>
>
>
> I then added an "agent" (lotus speak for script) that processes new mail,
> and some menu options for manually marking as spam,  for unmarking
> falsely classified spam and for training as ham.

I tried to do this, but just couldn't figure it out... if you can give me 
the source for the agent, I can include it in the documentation... again, 
thanks for the input.
>
> Regards,
> Hugo
>
> (See attached file: sb_notesfilter.diff)


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From Hugo.Duncan at alcan.com  Mon Dec  1 12:23:43 2003
From: Hugo.Duncan at alcan.com (Hugo.Duncan@alcan.com)
Date: Mon Dec  1 12:20:28 2003
Subject: [spambayes-dev] Re: sb_notesfilter.py changes
Message-ID: <OF4D37D21E.76820919-ON85256DEF.005E9DE5-85256DEF.005F4204@alcan.com>


Tim,

Thanks for getting any sort of notes support!

> it became less and less able to accurately classify

I'll keep an eye on this.  Thanks for the warning.

>One is that notes does not give you headers in the rfc822 sense.

Although you can access them in the document fields.  I just wrote
an agent to extract these so that I could send stuff to SpamCop.

> I'd love to have a notes integration, similar to the outlook integration,
> that doesn't rely on an external program using the com
> interface... but that is beyond my abilities with notes.

Presumably the outlook integration uses some sort of dll? does
it still require a python interpreter ?

> if you can give me the source for the agent

This is the "After new mail arrives" agent, in LotusScript.  Not very
pretty, but it works for me.

Sub Initialize
  Err=0
  res%=Shell("c:/usr/Python23/pythonw.exe " & _
    "c:/usr/Python23/scripts/sb_notesfilter.py -t -c -r your_server_name " & _
    "-l your_db_name -f Spambayes -d notesbayes -i index_name " & _
    "-P your_password_here -R c:/tmp/bayes.log -L SpamBayesLog",1)
  If (Err<>0) Then
    Messagebox  Error$
  End If
End Sub


The others are "SimpleAction"'s to move the mail to the appropriate
folders.




From tim at fourstonesExpressions.com  Mon Dec  1 12:52:23 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec  1 12:52:43 2003
Subject: [spambayes-dev] Re: sb_notesfilter.py changes
In-Reply-To: <OF4D37D21E.76820919-ON85256DEF.005E9DE5-85256DEF.005F4204@alcan.com>
References: <OF4D37D21E.76820919-ON85256DEF.005E9DE5-85256DEF.005F4204@alcan.com>
Message-ID: <oprzijxlh0it6vze@mail.fourstonesExpressions.com>

> Tim,
>
> Thanks for getting any sort of notes support!

My pleasure.  Was born out of necessity <wink>.

>> One is that notes does not give you headers in the rfc822 sense.
>
> Although you can access them in the document fields.  I just wrote
> an agent to extract these so that I could send stuff to SpamCop.
>

Hmmmm.... don't know if that's available to the com interface.

>> I'd love to have a notes integration, similar to the outlook
>> integration, that doesn't rely on an external program using the com
>> interface... but that is beyond my abilities with notes.
>
> Presumably the outlook integration uses some sort of dll? does
> it still require a python interpreter ?

It is a straight python program, that uses Mark Hammond's Windows
interfacing DLL to access the COM interface that notes sports.  That
interface is not particularly rich...


What version of notes are you using?  V5, I presume...

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From richie at entrian.com  Mon Dec  1 15:22:39 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec  1 15:22:49 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <Law11-OE49U3CGG792B00007a61@hotmail.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz>
	<Law11-OE49U3CGG792B00007a61@hotmail.com>
Message-ID: <b08nsv8he0r45hh0f4r2b9q508nq8ua7vu@4ax.com>


[Kenny]
> On an at best partially related aside, if/when we redo the release
> branching could we possibly do something with the version numbering in
> Version.py?  It seems a bit confusing to have a completely different
> version number for every app, especially when they appear to be totally
> unrelated to the "1.0a7" type release numbering.

+1

I wasn't around when that was introduced, but I have to say it's never
made much sense to me.

-- 
Richie Hindle
richie@entrian.com


From tameyer at ihug.co.nz  Mon Dec  1 18:53:20 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec  1 18:53:28 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477C04@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>

[Kenny]
> I've been using the updated version for quite a while now,
> both from source and with the new py2exe Windows binary.  We
> wrung out a couple of minor bugs early on, but I haven't had
> any problems with it in several weeks.

Good to know.

> Haven't tried the bug report feature, so can't give you a 
> read on the stability there.

It's fairly straightforward, and worked for me, so hopefully goes ok.  At
the very least it should manage to give people an understanding of what
information we need.

> It's the version in the sandbox subdirectory in py2exe CVS.  
> There currently isn't a binary install that I know of, but 
> I'm sure someone could throw one together if needed.  I've 
> been meaning to do it myself so I can put it on my computer 
> at home, but just haven't gotten around to it.

If you could make me a copy, that would be fantastic.  At the moment I'm
stuck either testing whatever Mark throws my way or running from source
only.  I suppose I could just install VC++ (we have some sort of site
license here, I gather), but I really can't be bothered <wink>.

[Kenny]
> On an at best partially related aside, if/when we redo the 
> release branching could we possibly do something with the 
> version numbering in Version.py?  It seems a bit confusing to 
> have a completely different version number for every app, 
> especially when they appear to be totally unrelated to the 
> "1.0a7" type release numbering.

[Richie]
> I wasn't around when that was introduced, but
> I have to say it's never made much sense to me.

The thing is that the various apps do change at different rates -
mboxtrain/filter, for example, tend to change much more slowly (I imagine
because, being simpler in concept, they're more stable to begin with), and
the core of the system hasn't changed for a long time (since 1.0a1?).  As
more apps start getting released separately, I think this will become more
important.

I agree that it could be improved, though.  One thing I think would help
(I've made this change locally (ages ago), but haven't checked it in) is
removing the 'interface version' from pop3proxy/imapfilter and having a
separate 'interface version' (my fault it was there).  This, for example,
lets you see that the interface in this release has had significant changes,
but pop3proxy itself has not changed at all.  Another thing is that it's
somewhat out of date in terms of names, which is easily fixed.

Mark could probably explain the reasoning better than me, though, since he
came up with it <wink>.

=Tony Meyer


From kennypitt at hotmail.com  Tue Dec  2 09:39:35 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec  2 09:40:12 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>
Message-ID: <Law11-OE55BerkSgCot00000a14@hotmail.com>

Tony Meyer wrote:
> [Kenny]
>> I've been using the updated version for quite a while now,
>> both from source and with the new py2exe Windows binary.  We
>> wrung out a couple of minor bugs early on, but I haven't had
>> any problems with it in several weeks.
> 
> Good to know.

Oops, may have spoken too soon.  Just noticed I'm getting the following
error whenever I save configuration in the binary version.  This doesn't
happen when I run from source.  Note that the only thing I changed in
config was the spam cutoff value, so it shouldn't have anything to do
with the close/reopen of the training database.

"""
500 Server error

Traceback (most recent call last):

  File "spambayes\Dibbler.pyc", line 457, in found_terminator

  File "spambayes\UserInterface.pyc", line 801, in onChangeopts

  File "spambayes\ProxyUI.pyc", line 691, in reReadOptions

ImportError: No module named Options
"""

-- 
Kenny Pitt


From dave at boost-consulting.com  Tue Dec  2 11:37:14 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Tue Dec  2 12:30:19 2003
Subject: [spambayes-dev] Serious problem
Message-ID: <uad6bfa6d.fsf@boost-consulting.com>


Something curious started happening to all the email I receive after I
upgraded from the "old" (pre-reorganization) Spambayes to the new one
(e.g. that includes "sb_filter.py").  When mail is run through
sb_filter.py, any line which begins with a period ("."), and all lines
thereafter, are stripped from the email.  For example, the following
paragraph begins with "./configure".  If you don't see it, you're
seeing the bug.

./configure is the command most people use to run a configure script.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Tue Dec  2 12:35:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec  2 12:35:48 2003
Subject: [spambayes-dev] Serious problem
In-Reply-To: <uad6bfa6d.fsf@boost-consulting.com>
References: <uad6bfa6d.fsf@boost-consulting.com>
Message-ID: <16332.52584.801341.395335@montanaro.dyndns.org>


    Dave> When mail is run through sb_filter.py, any line which begins with
    Dave> a period ("."), and all lines thereafter, are stripped from the
    Dave> email.  For example, the following paragraph begins with
    Dave> "./configure".  If you don't see it, you're seeing the bug.

    Dave> ./configure is the command most people use to run a configure script.

I think the problem is elsewhere in your mail tool chain.  I saw the dot
.at
.the
.beginning
.of
.the
.line
.just
.fine.

Also saw several other lines after it, including your electronic signature.

Skip

From pclarke at dynapower.com  Tue Dec  2 12:49:40 2003
From: pclarke at dynapower.com (Clarke, Peter)
Date: Tue Dec  2 12:49:45 2003
Subject: [spambayes-dev] new Spambayes "feature"??
Message-ID: <769511B15AAE874591E117534314507704F00D@dppexchange.dynapower.com>

	THE BEST spam-filter I know about (and I know "just a little!" -
tried several, (even PAID for some!)).  I believe I have only had one e-mail
falsely labeled as spam, and NONE the other way, in over two months of heavy
attack by the spam-generating community!!
	Suggestion:
	It would be neat if the "you've got mail" icon in the "system tray"
(at the bottom of the screen) would be deleted whenever Spambayes sends any
spam e-mail off to the assigned "junk" folder.
	If the software already has this feature, then I'm obviously
ignorant on it, and how to activate it - if so, please enlighten.
Thanks a million!

Peter W Clarke
Chief Engineer - ULTRACAST* Products

DYNAPOWER CORPORATION
Specialists in AC and DC  Power Conversion Systems
ULTRACAST* Cast Coil Transformers
85 Meadowland Drive (05403)
PO Box 9210
S. Burlington, VT 05407-9210

Phone: (802) 652-1354  Fax: (802) 652-1371
E-mail:  pclarke@dynapower.com   Web: www.dynapower.com


IMPORTANT: The information contained in this communication is confidential
and/or proprietary business or technical data.  It is intended for receipt
only by the individual or entity to which it is addressed.  If the reader of
this message is not the intended recipient, you are hereby notified that any
dissemination, copying, or distribution of this communication is strictly
prohibited.  If you have received this communication in error, please
immediately notify us by telephone 802-860-7200 or electronically by return
message, and delete or destroy all copies of this communication. 


From dave at boost-consulting.com  Tue Dec  2 13:06:52 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Tue Dec  2 13:07:17 2003
Subject: [spambayes-dev] Serious problem
In-Reply-To: <16332.52584.801341.395335@montanaro.dyndns.org> (Skip
	Montanaro's message of "Tue, 2 Dec 2003 11:35:36 -0600")
References: <uad6bfa6d.fsf@boost-consulting.com>
	<16332.52584.801341.395335@montanaro.dyndns.org>
Message-ID: <uekvndrgj.fsf@boost-consulting.com>

Skip Montanaro <skip@pobox.com> writes:

>     Dave> When mail is run through sb_filter.py, any line which begins with
>     Dave> a period ("."), and all lines thereafter, are stripped from the
>     Dave> email.  For example, the following paragraph begins with
>     Dave> "./configure".  If you don't see it, you're seeing the bug.
>
>     Dave> ./configure is the command most people use to run a configure script.
>
> I think the problem is elsewhere in your mail tool chain.  I saw the dot

I thought I'd ruled that out, but on further investigation, you're
quite right.  Sorry for the noise.

Here's an excerpt from what I sent my sysadmin:

---

Not long ago I started seeing a problem with my incoming emails.  Any
line beginning with a period ("."), and all following lines, are
stripped from the message.  If I turn off the procmail processing
altogether, the effect goes away.

I'm running all mail through procmail to filter spam and re-forwarding
them to myself using the following .procmailrc:


  LOGFILE=$HOME/.procmaillog
  PYTHONPATH=/usr/home/dave/src/spambayes:/usr/home/dave/src/email-2.5

  # Pass everything through the Spambayes filter
  :0 fw
  |/usr/local/bin/python /usr/home/dave/src/spambayes/scripts/sb_filter.py -d $HOME/h
  # Forward the mail back to myself
  :0
  ! dave

I can verify that the problem has nothing to do with Spambayes by
replacing that line with:

  | (echo "X-Spambayes-Classification: ham; 0.00" ; cat)


which basically force-classifies the message as ham to prevent an
infinite mail-rule loop.

procmail is still seeing the right email contents up until the point
the mail is forwarded back to me, which I can verify by adding:

  :0 c
  |cat >> bayeslog2

and inspecting bayeslog2.  

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From kennypitt at hotmail.com  Tue Dec  2 13:10:12 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec  2 13:10:44 2003
Subject: [spambayes-dev] new Spambayes "feature"??
In-Reply-To: <769511B15AAE874591E117534314507704F00D@dppexchange.dynapower.com>
Message-ID: <Law11-OE44ofObDpbeV00000cd2@hotmail.com>

Clarke, Peter wrote:
> 	Suggestion:
> 	It would be neat if the "you've got mail" icon in the "system tray"
> (at the bottom of the screen) would be deleted whenever Spambayes
> sends any spam e-mail off to the assigned "junk" folder.

See FAQ 3.8:
http://spambayes.sourceforge.net/faq.html#how-can-i-get-rid-of-the-envelope-tray-icon-for-spam

-- 
Kenny Pitt


From kennypitt at hotmail.com  Tue Dec  2 14:30:44 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec  2 14:31:27 2003
Subject: [spambayes-dev] Bug in timcv.py
Message-ID: <Law11-OE71VMum03yMJ00000dba@hotmail.com>

It looks like there is a bug in timcv.  I tried to run a test of
training on only a small number of messages, and I got the following
output.

"""
C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain 10 --SpamTrain 10 --HamTest 150 --SpamTest 400  timcv_10-10.txt
Traceback (most recent call last):
  File "timcv.py", line 170, in ?
    main()
  File "timcv.py", line 167, in main
    drive(nsets)
  File "timcv.py", line 115, in drive
    d.test(hamstream, spamstream)
  File "C:\src\python\spambayes_exp\spambayes\TestDriver.py", line 266, in test
    t.predict(spam, True, new_spam)
  File "C:\src\python\spambayes_exp\spambayes\Tester.py", line 92, in predict
    prob = guess(example)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 158, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 395, in _getclues
    prob = self.probability(record)
  File "C:\src\python\spambayes\spambayes\classifier.py", line 242, in probability
    assert hamcount <= nham
AssertionError
"""

I took a quick look at timcv.py, and I think I know what is happening.
The ham and spam streams for initial training are created with
"train=1", but the untrain() for the set being tested is done using
streams that are created with "train=0".  If the HamTrain/SpamTrain
counts are different from the HamTest/SpamTest counts then the untrain()
does not use the same set of messages.  I can, of course, work around
this by setting build_each_classifier_from_scratch, but just wanted to
let everyone know about the mismatch.
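
A toy model of the mismatch described above, as a sketch only:
ToyClassifier and its methods are hypothetical stand-ins, not the
SpamBayes classifier API.  It keeps a ham message count and per-word ham
counts, and shows that untraining on messages that were never trained can
leave a word count larger than the message count, which is the condition
the failing assert checks.

```python
class ToyClassifier:
    """Hypothetical stand-in that tracks only ham counts."""

    def __init__(self):
        self.nham = 0          # number of ham messages trained
        self.hamcount = {}     # word -> number of ham messages containing it

    def learn(self, words):
        self.nham += 1
        for word in set(words):
            self.hamcount[word] = self.hamcount.get(word, 0) + 1

    def unlearn(self, words):
        self.nham -= 1
        for word in set(words):
            self.hamcount[word] = self.hamcount.get(word, 0) - 1

    def consistent(self):
        # Mirrors the spirit of "assert hamcount <= nham" for every word.
        return all(0 <= count <= self.nham for count in self.hamcount.values())


trained = [["free", "money"], ["meeting", "agenda"]]
never_trained = [["lunch"], ["holiday", "photos"]]

# Untraining exactly what was trained keeps the counts consistent.
same_set = ToyClassifier()
for message in trained:
    same_set.learn(message)
same_set.unlearn(trained[0])

# Untraining a different set of messages does not.
other_set = ToyClassifier()
for message in trained:
    other_set.learn(message)
for message in never_trained:
    other_set.unlearn(message)
```

In this model same_set.consistent() holds, while other_set ends with
nham at zero and the trained words still counted once each.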

I noticed another curiosity in the traceback:  I ran the test from
inside directory "C:\src\python\spambayes_exp", which contains my
modified version of SpamBayes.  When the traceback gets to
classifier.py, however, you can see that classifier.py was loaded from
"C:\src\python\spambayes" instead, which is where I have my original CVS
version of SpamBayes.  I don't have any PYTHONPATH environment variable
set, and I don't know what else might cause it to jump paths like that.
Can one of you more experienced python'ers explain this?
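
One way to see which copy of a module Python actually imported is to look
at the module's __file__ attribute and at the front of sys.path.  A
minimal sketch in present-day Python (the interpreter in the traceback
above predates importlib and would need __import__ instead);
where_loaded and head_of_search_path are illustrative names, not part of
SpamBayes:

```python
import importlib
import sys


def where_loaded(module_name):
    """Return the file a module was loaded from, or None for built-ins."""
    module = importlib.import_module(module_name)
    return getattr(module, "__file__", None)


def head_of_search_path(count=5):
    """Return the first few sys.path entries, which are searched first."""
    return sys.path[:count]


if __name__ == "__main__":
    # Replace "json" with the module in question, e.g. a package module.
    print(where_loaded("json"))
    for entry in head_of_search_path():
        print(entry)
```

If the printed file lives in an unexpected directory, that directory
appears earlier in sys.path than the expected one, or something has
modified sys.path at runtime.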

-- 
Kenny Pitt


From hugoduncan at users.sf.net  Tue Dec  2 14:50:40 2003
From: hugoduncan at users.sf.net (Hugo Duncan)
Date: Tue Dec  2 15:40:26 2003
Subject: [spambayes-dev] Re: Re: sb_notesfilter.py changes
References: <OF4D37D21E.76820919-ON85256DEF.005E9DE5-85256DEF.005F4204@alcan.com>
	<oprzijxlh0it6vze@mail.fourstonesExpressions.com>
Message-ID: <oprzkj2qrd264dan@localhost>


>>> One is that notes does not give you headers in the rfc822 sense.
>>
>> Although you can access them in the document fields.
>
> Hmmmm.... don't know if that's available to the com interface.

Attached an updated diff that adds From, Sender, Received and ReplyTo
fields in a new getMessage function.

> What version of notes are you using?  V5, I presume...
6.02


Hugo
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20031202/c23a5004/sb_notesfilter.html
From papaDoc at videotron.ca  Tue Dec  2 18:56:31 2003
From: papaDoc at videotron.ca (Remi Ricard)
Date: Tue Dec  2 18:52:56 2003
Subject: [spambayes-dev] Feature of allow_remote_connections ?
Message-ID: <1070409391.14340.6.camel@porsche.hq.simlog.com>

Hi,

I was testing something and found a strange behavior

It looks like the option allow_remote_connections needs two items
separated by a comma.

This won't work:
[html_ui]
xxx.xxx.xxx.xxx  (N.B. xxx.xxx.xxx.xxx is a real IP address)
The error is:
Attempted to set [html_ui] allow_remote_connections with invalid value
xxx.xxx.xxx.xxx (<type 'str'>)
Traceback (most recent call last):
  File "/gmc/logiciels/spambayes/scripts/sb_server.py", line 106, in ?
    from spambayes.UserInterface import UserInterfaceServer
  File "/gmc/logiciels/spambayes/spambayes/UserInterface.py", line 46, in ?
    """

If I use
[html_ui]
localhost,xxx.xxx.xxx.xxx

then there is no error.

This is the function in UserInterface.py:

    def onIncomingConnection(self, clientSocket):
        """Checks the security settings."""
        remoteIP = clientSocket.getpeername()[0]
        trustedIPs = options["html_ui", "allow_remote_connections"]

        if trustedIPs == "*" or remoteIP == clientSocket.getsockname()[0]:
            return True

        trustedIPs = trustedIPs.replace('.', '\.').replace('*', '([01]?\d\d?|2[04]\d|25[0-5])')
        for trusted in trustedIPs.split(','):
            if re.search("^" + trusted + "$", remoteIP):
                return True

        return False

If I read the python code correctly you need to have a "," in
the trustedIPs string!
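
A standalone sketch of the matching logic quoted above, to check how it
behaves on a value with no comma.  Assumptions: is_trusted is a
hypothetical helper, not SpamBayes code; the octet pattern reads the
quoted "2[04]" as "2[0-4]"; and entries are stripped of surrounding
whitespace.  Since str.split(',') on a string with no comma returns a
one-element list, the loop still runs once, which suggests the "invalid
value" error is raised when the option is set rather than in this
function.

```python
import re

# Assumption: "2[0-4]" is what the quoted "2[04]" was meant to be.
OCTET = r'([01]?\d\d?|2[0-4]\d|25[0-5])'


def is_trusted(remote_ip, trusted_ips):
    """Return True if remote_ip matches any comma-separated pattern."""
    if trusted_ips == "*":
        return True
    # Escape literal dots first, then expand "*" into an octet pattern.
    pattern_text = trusted_ips.replace('.', r'\.').replace('*', OCTET)
    for trusted in pattern_text.split(','):
        if re.search("^" + trusted.strip() + "$", remote_ip):
            return True
    return False


if __name__ == "__main__":
    print(is_trusted("192.168.1.5", "192.168.1.5"))        # single entry
    print(is_trusted("10.0.0.1", "localhost,10.0.0.1"))    # two entries
    print(is_trusted("192.168.230.1", "192.168.*.1"))      # wildcard octet
```

A single address with no comma matches in this sketch, so the comma does
not appear to be required by the loop itself.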

-- 
Remi Ricard <papaDoc@videotron.ca>


From mhammond at skippinet.com.au  Tue Dec  2 19:49:22 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec  2 19:49:36 2003
Subject: [spambayes-dev] Branch merge
Message-ID: <038f01c3b937$4b738e70$2c00a8c0@eden>

I had a bash at merging the branch back onto the trunk.  There were a fairly
large number of conflicts, but after examining them all, it appears Tony has
been doing an excellent job at keeping all his patches on both the branch
and the trunk - so most of the conflicts were resolved in favour of the
trunk.

I have attached 2 patches.  sb_docs.patch is patches to the various doc
files - README, README-DEVEL, etc etc etc.  Most of these are typos fixed on
the branch, and the most recent release notes.  A quick look by people who
edit these files would be great!

sb_code.patch contains the changes to code files required to merge the trunk
and the branch.  ImapUI.py has a number of reasonable looking changes which
check if currently logged on, and that the server name is valid.
spambayes/__init__.py bumps the version number.  This makes a grand total of
2 .py files that are affected by the merge.

Given the trivial nature of the patch required to do the merge, the question
appears to be "what is *missing*"!

I also attempted to upgrade the test suite in the hope of catching any
errors.  I have already checked these in.

Any comments, or +1s on me checking this in?

After we get past this, I will update README-DEVEL with our current 1.0
plan, and re-visit Version.py ;)

Thanks,

Mark.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_docs.patch
Type: application/octet-stream
Size: 12498 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031203/c8862d0b/sb_docs.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb_code.patch
Type: application/octet-stream
Size: 4249 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031203/c8862d0b/sb_code.obj
From mhammond at skippinet.com.au  Tue Dec  2 19:54:42 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec  2 19:54:57 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F5@its-xchg4.massey.ac.nz>
Message-ID: <039a01c3b938$0a5ab610$2c00a8c0@eden>

> If you could make me a copy, that would be fantastic.  At the moment
> I'm stuck either testing whatever Mark throws my way or running from
> source only.  I suppose I could just install VC++ (we have some sort
> of site license here, I gather), but I really can't be bothered <wink>.

I threw together a py2exe binary for Tony yesterday.  Let me know if anyone
else wants it.

[As an aside, but useful for people who want to build *everything* from
sources, it is now possible to build win32all itself via a distutils script.
This should mean that given the source to SpamBayes, win32all and py2exe, a
single automated build procedure to churn out all distribution types should
be possible.]

Mark.


From tim.one at comcast.net  Tue Dec  2 22:59:52 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec  2 22:59:56 2003
Subject: [spambayes-dev] Bug in timcv.py
In-Reply-To: <Law11-OE71VMum03yMJ00000dba@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEHKHIAB.tim.one@comcast.net>

[Kenny Pitt]
> It looks like there is a bug in timcv.  I tried to run a test of
> training on only a small number of messages, and I got the following
> output.
>
> """
> C:\src\python\spambayes_exp\testtools>python timcv.py -n 5 --HamTrain
> 10 --SpamTrain 10 --HamTest 150 --SpamTest 400  timcv_10-10.txt

Wow -- I didn't even know those options ({Ham,Spam}{Train,Test}) existed.
They warp the meaning of "cross validation" beyond my recognition, so I wish
they had been added to a new "cv-like" test driver instead.  Oh well.

> Traceback (most recent call last):
>    ...
>   File "C:\src\python\spambayes_exp\spambayes\TestDriver.py",
         line 266, in test
>     t.predict(spam, True, new_spam)
> ...
>   File "C:\src\python\spambayes\spambayes\classifier.py", line 242,
         in probability
>     assert hamcount <= nham
> AssertionError
> """

Ouch.

> I took a quick look at timcv.py, and I think I know what is happening.
> The ham and spam streams for initial training are created with
> "train=1",

Right.

> but the untrain() for the set being tested is done using streams that
> are created with "train=0".

Right.

> If the HamTrain/SpamTrain counts are different from the
> HamTest/SpamTest counts then the untrain() does not use the same
> set of messages.

This isn't cross-validation testing, so the optimizations in timcv.py *for*
true cv testing stopped making sense when these other options were added.
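
The mismatch can be reproduced with a toy model.  This is only a minimal
sketch, not the SpamBayes classifier: ToyHamCounts and its methods are
hypothetical names, and the only thing borrowed from the traceback is the
invariant hamcount <= nham.  Training on one set of messages and then
untraining a message that was never trained leaves a per-word count larger
than the number of trained messages.

```python
from collections import Counter


class ToyHamCounts:
    """Hypothetical stand-in for a classifier's ham bookkeeping."""

    def __init__(self):
        self.nham = 0              # number of ham messages trained on
        self.hamcount = Counter()  # per word: ham messages containing it

    def learn(self, words):
        self.nham += 1
        for word in set(words):
            self.hamcount[word] += 1

    def unlearn(self, words):
        self.nham -= 1
        for word in set(words):
            # Only decrement words that were actually counted.
            if self.hamcount[word] > 0:
                self.hamcount[word] -= 1

    def consistent(self):
        # The invariant from the traceback's assertion, for every word.
        return all(count <= self.nham for count in self.hamcount.values())


train_set = [["cheap", "meeting"], ["meeting", "lunch"]]
other_set = [["invoice"], ["report"]]  # never trained on

matched = ToyHamCounts()
for msg in train_set:
    matched.learn(msg)
matched.unlearn(train_set[0])      # untrain a message that was trained

mismatched = ToyHamCounts()
for msg in train_set:
    mismatched.learn(msg)
mismatched.unlearn(other_set[0])   # untrain a message that never was

print("matched untrain consistent:   ", matched.consistent())
print("mismatched untrain consistent:", mismatched.consistent())
```

In the mismatched case "meeting" is still counted twice while nham has
dropped to 1, which is the kind of state the assertion rejects.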

> I can, of course, work around this by setting
> build_each_classifier_from_scratch, but just wanted to let everyone
> know about the mismatch.

I'd rather see these options moved into a different test driver, leaving
timcv.py unsurprising again.  Since timcv.py is the primary driver for
serious testing, it should be kept as simple and bulletproof as possible.  I
regret that the build_each_classifier_from_scratch option was added to it
for the same reason (as the comments for that option say, there was a need
for that option at one time, when evaluating some since-rejected combining
schemes where *incremental* training and untraining were impossible; those
schemes went away, but the option stayed behind to muddy the waters).

> I noticed another curiosity in the traceback:  I ran the test from
> inside directory "C:\src\python\spambayes_exp", which contains my
> modified version of SpamBayes.  When the traceback gets to
> classifier.py, however, you can see that classifier.py was loaded from
> "C:\src\python\spambayes" instead, which is where I have my original
> CVS version of SpamBayes.  I don't have any PYTHONPATH environment
> variable set, and I don't know what else might cause it to jump paths
> like that. Can one of you more experienced python'ers explain this?

Run Python with -v to get a report of how every import got satisfied.  Then
stare until your eyes bleed <0.9 wink>.  I notice that a lot of the scripts
these days muck around with sys.path directly, thus changing Python's search
path dynamically, at runtime.  That's *usually* a Bad Idea.  If I were you,
I'd take a critical look at the fix_sys_path() function in
sb_test_support.py.  I don't know how this got so convoluted, but gobs of
dynamic code trying to "fix" what should be statically known (or at worst
fiddled once in a config file) is a pretty sure recipe for confusion.
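
A lighter-weight check than reading the full -v output is to ask the import
machinery where it would load a module from.  The sketch below assumes a
current Python 3 (importlib.util did not exist in 2003), and where_from is a
hypothetical helper name.

```python
import importlib.util
import sys


def where_from(module_name):
    """Return the file a top-level module would be loaded from, or None."""
    spec = importlib.util.find_spec(module_name)
    return None if spec is None else spec.origin


# The search path is consulted in order, so print it alongside the answer.
for entry in sys.path:
    print("search path entry:", entry)
print("json would come from:", where_from("json"))
```

Running this from each working directory shows which copy of a module wins
before any test script gets a chance to alter sys.path.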


From mhammond at skippinet.com.au  Tue Dec  2 23:26:35 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec  2 23:26:50 2003
Subject: [spambayes-dev] Bug in timcv.py
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEHKHIAB.tim.one@comcast.net>
Message-ID: <042a01c3b955$a3cc9bc0$2c00a8c0@eden>

> > I noticed another curiosity in the traceback:  I ran the test from
> > inside directory "C:\src\python\spambayes_exp", which contains my
> > modified version of SpamBayes.  When the traceback gets to
> > classifier.py, however, you can see that classifier.py was loaded from
> > "C:\src\python\spambayes" instead, which is where I have my original
> > CVS version of SpamBayes.  I don't have any PYTHONPATH environment
> > variable set, and I don't know what else might cause it to jump paths
> > like that. Can one of you more experienced python'ers explain this?
>
> Run Python with -v to get a report of how every import got
> satisfied.  Then stare until your eyes bleed <0.9 wink>.

I guess that both the spambayes directory itself, *and* the spambayes
parent, are on sys.path (and probably the different versions of each).  Thus
'import Options' may be resolved as either 'spambayes.Options' or simply
'Options'.

But as Tim said, you can confirm it yourself if for some strange reason you
really care <wink>
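
For what it's worth, the effect can be shown in isolation.  This is a
minimal sketch under assumed names (demo_pkg and demo_options are
hypothetical, created in a temporary directory), written for a current
Python 3: with both a package's parent directory and the package directory
itself on sys.path, one source file is importable under two names and ends
up as two separate module objects.

```python
import importlib
import os
import sys
import tempfile


def demo_double_import():
    """Import one source file under two names via two sys.path entries."""
    root = tempfile.mkdtemp()
    pkg_dir = os.path.join(root, "demo_pkg")  # hypothetical package
    os.mkdir(pkg_dir)
    with open(os.path.join(pkg_dir, "__init__.py"), "w"):
        pass
    with open(os.path.join(pkg_dir, "demo_options.py"), "w") as f:
        f.write("value = 1\n")

    # Put both the parent directory and the package directory on the path.
    sys.path[:0] = [root, pkg_dir]
    importlib.invalidate_caches()
    try:
        as_package = importlib.import_module("demo_pkg.demo_options")
        as_top_level = importlib.import_module("demo_options")
    finally:
        sys.path.remove(root)
        sys.path.remove(pkg_dir)
    return as_package, as_top_level


as_package, as_top_level = demo_double_import()
print(as_package.__name__, as_package.__file__)
print(as_top_level.__name__, as_top_level.__file__)
print("same module object?", as_package is as_top_level)
```

With two different checkouts on the path the same mechanism would pick up
files from different trees, which would be consistent with the traceback
above, though that is a guess until the -v output is checked.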

> I notice that a lot of the scripts these days muck around with
> sys.path directly, thus changing Python's search path dynamically,
> at runtime.  That's *usually* a Bad Idea.

Yeah, I'd like to fix these, as I am responsible for some.  IMO, the
"package-ness" of SpamBayes isn't that well defined - mainly as the concept
was created after the core code.  The Outlook2000 directory isn't a package,
but arguably should be.

Another reason is that for Outlook, I have never insisted that a user do a
"setup.py install" before using the addin.  I attempt to use the code
directly from the source-tree, including the core spambayes package.  If we
do move towards forcing source-code users to use distutils to install the
package, we may be able to drop even more.

Is this a good thing?  Tim - I assume you tend to use SpamBayes directly
from the CVS tree - is that correct?  If so, you manage sys.path manually?

> If I were you, I'd take a critical look at the fix_sys_path()
> function in sb_test_support.py.  I don't know how this got so
> convoluted, but gobs of dynamic code trying to "fix" what should be
> statically known (or at worst fiddled once in a config file) is a
> pretty sure recipe for confusion.

Well, sb_test_support just got created *today*, so poor Kenny would not have
seen it when he sent the mail <wink>.  Also, this file is used *only* by the
'unit test style tests' rather than the 'validation style tests' that timcv
exists in.  The hacks in sb_test_support were a small step towards reducing
the sys.path hacking, but only for that single directory.

Mark.


From skip at pobox.com  Wed Dec  3 10:28:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec  3 10:28:52 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] website faq.txt,1.51,1.52
In-Reply-To: <E1ARNhg-00009L-00@sc8-pr-cvs1.sourceforge.net>
References: <E1ARNhg-00009L-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <16334.299.607894.605629@montanaro.dyndns.org>


    Tony> Missing mail: it's odd that this is very suddenly a FAQ, and
    Tony> nowhere near a release. 

Proof positive that SpamBayes is moving further down the food chain. <wink>

Skip

From tameyer at ihug.co.nz  Wed Dec  3 21:15:09 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 21:15:16 2003
Subject: [spambayes-dev] Branch merge
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130447814C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FB@its-xchg4.massey.ac.nz>

> I had a bash at merging the branch back onto the trunk.  
[...]
> I have attached 2 patches.  sb_docs.patch is patches to the 
> various doc files - README, README-DEVEL, etc etc etc.

It makes no difference if the WHAT_IS_NEW file doesn't get updated, or what
the contents end up looking like - this gets rewritten with each release so
will need to be redone for 1.0a8 or whatever it is that the scheme has us
releasing next.  (Even a binary-only 1.0a75 would need to have it modified
from the 1.0a7 version *).  The rest looks fine to me.

* Actually, we'll have to figure out what happens here anyway, since it's
probably not even included with the binary, although it does form the basis
of the sourceforge 'release notes'.

> sb_code.patch contains the changes to code files required to 
> merge the trunk and the branch.  ImapUI.py has a number of 
> reasonable looking changes which check if currently logged 
> on, and that the server name is valid.

Yeah, these should be on the trunk, too.  Not sure how I missed that,
although I was a bit preoccupied at the time <wink>.  Actually, the code
needs to be improved a bit, since I think at the moment it'll give a half
completed page if not logged in; I'll fix that once it's on the trunk since
that seems easiest (and it's hardly urgent).

(This does mean that I lose my bet, though.  I'd also forgotten about the
version number bump and the README-DEVEL tidying).

> I also attempted to upgrade the test suite in the hope of 
> catching any errors.  I have already checked these in.

When you're updating README-DEVEL, you could put something in saying that we
expect that all new code will come with an appropriate unittest.  You never
know, we might fool newcomers into thinking that that's the actual
situation, and get some movement along those lines <wink>.

> Any comments, or +1s on me checking this in?

+1 here.

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec  3 21:37:57 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 21:38:04 2003
Subject: [spambayes-dev] Feature of allow_remote_connections ?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130447813D@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FC@its-xchg4.massey.ac.nz>

> It is looking like the options allow_remote_connections needs 
> two items separated by a comma.
> 
> This won't work
> [html_ui]
> xxx.xxx.xxx.xxx  N.B xxx.xxx.xxx.xxx is a real IP address
> the error is  
> Attempted to set [html_ui] allow_remote_connections with invalid value
> xxx.xxx.xxx.xxx (<type 'str'>)

I tried this:
"""
[html_ui]
allow_remote_connections:123.123.123.123
"""

And it worked fine here.  "allow_remote_connections:123.123.123.123,
123.123.123.132" didn't, but that's because of the space after the comma, I
think ("allow_remote_connections:123.123.123.123,123.123.123.132" works).

That said, the 'correct' way for that option to be set up would really be to
expect a tuple and have the regex only allow *one* of the possibilities to
match.  The options code would then take care of single/multiple values.
It's a pretty simple fix, I think, but I'm a bit wary of checking it in
since I don't use this, and didn't write it.  Any volunteers?  (I'll produce
the code).

> This is the function in UserInterface.py
[...]
>         for trusted in trustedIPs.split(','):
[...]
> If I read the python code correctly you need to have a "," in
> the trustedIPs string !

Doing split(',') should return a single item (the whole string) if there
aren't any commas at all, so this should work.  That said, if the options
code was taking care of the multiple values as it should be, then the split
wouldn't be necessary at all (and they could be separated by spaces, or
whatever, like the other options).
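
To illustrate the split(',') behaviour, here is a small sketch.  The helper
parse_trusted_ips is a hypothetical name, not the code in UserInterface.py;
it also strips whitespace, which is one assumed way of tolerating the space
after the comma mentioned above.

```python
def parse_trusted_ips(raw):
    """Split a comma-separated string, ignoring surrounding whitespace."""
    return [part.strip() for part in raw.split(",") if part.strip()]


# No comma at all: split(',') still returns the whole string as one item.
print("123.123.123.123".split(","))

# A space after the comma survives a plain split but not the helper.
print("123.123.123.123, 123.123.123.132".split(","))
print(parse_trusted_ips("123.123.123.123, 123.123.123.132"))
```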

This doesn't explain why you had troubles, though.  Maybe the regex failed
for some other reason?  It's certainly very complicated looking (I presume
it checks the IP is valid).  For our purposes "((?:\d{1,3}\.){3}\d{1,3})"
would probably do.  If you're comfortable mucking about with regexs you
could take the IP_LIST one out of OptionsClass.py and see why it failed the
IP you gave it (I like using Kodos for this sort of thing).

<http://kodos.sf.net>
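
The simpler pattern suggested above can also be tried directly from Python.
A minimal sketch, assuming a current Python 3 (re.fullmatch is newer than
this thread); looks_like_ip is a hypothetical helper, and this is not the
IP_LIST regex from OptionsClass.py.

```python
import re

# The simplified pattern from the message above: four groups of 1-3 digits.
SIMPLE_IP = re.compile(r"((?:\d{1,3}\.){3}\d{1,3})")


def looks_like_ip(text):
    """True if the whole string is four dot-separated groups of digits."""
    return SIMPLE_IP.fullmatch(text) is not None


for candidate in ["123.123.123.123", " 123.123.123.132",
                  "123.123.123", "999.999.999.999"]:
    print(repr(candidate), looks_like_ip(candidate))
```

Note that this pattern does not check that each group is at most 255, and a
leading space makes the match fail, which fits the space-after-comma
behaviour described earlier.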

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec  3 22:09:33 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 22:10:18 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477EFB@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FD@its-xchg4.massey.ac.nz>

> Here is a py2exe.  It *should* work.

I had a quick go at building it: I got a 'can't find the gen_py file' error,
which I fixed by changing the versions from 0,9,0 to 0,9,1 and 0,2,0 to
0,2,2 which is what I have.  Does this hurt?  Is there a better way to do
this?

After that it all appeared to work (from 2 minutes testing).  I'll swap my
wife's system over to this to test it out there (win98) and do some more
next week.

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec  3 22:25:02 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 22:25:08 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477F8F@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FE@its-xchg4.massey.ac.nz>

> Unzip the attached file into your site-packages directory and 
> see if it does the trick.  I don't think py2exe setup makes 
> any other changes, but I'm not certain.  This was built for 
> Python 2.3.2 using VC++ 6.0.

Thanks!  (Mark gave me a copy, too, so I'm well supplied).

Building the installer seems ok, but I get these errors:

Traceback (most recent call last):
  File "pop3proxy_tray.py", line 407, in _ProxyThread
  File "sb_server.pyc", line 869, in start
  File "sb_server.pyc", line 847, in main
  File "spambayes\ProxyUI.pyc", line 156, in __init__
  File "spambayes\UserInterface.pyc", line 254, in __init__
  File "spambayes\UserInterface.pyc", line 122, in __init__
  File "spambayes\UserInterface.pyc", line 240, in readUIResources
  File "spambayes\resources\__init__.pyc", line 30, in ?
  File "resourcepackage\package.pyc", line 100, in scan
WindowsError: [Errno 3] The system cannot find the path specified:
'C:\\Program Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*'
Loading database... SMTP Listener on port 25 is proxying
smtp.massey.ac.nz:25
Traceback (most recent call last):
  File "pop3proxy_tray.py", line 389, in OnCommand
  File "pop3proxy_tray.py", line 431, in Start
  File "sb_server.pyc", line 863, in prepare
  File "sb_server.pyc", line 688, in buildServerStrings
TypeError: iteration over non-sequence

Have you seen those?  Neither occurs when I use the source.  (The zip is
there, and has the resources directory in it).

If you haven't, then I'll try and rummage around and figure out what's
causing them next week.

=Tony Meyer


From ta-meyer at ihug.co.nz  Wed Dec  3 22:42:07 2003
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 22:42:14 2003
Subject: [spambayes-dev] Testing the binary installer (Was More CVS
	branch/tags questions)
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz>

[me, just now]
> Building the installer seems ok, but I get these errors:
> 
> Traceback (most recent call last):
>   File "pop3proxy_tray.py", line 407, in _ProxyThread
>   File "sb_server.pyc", line 869, in start
>   File "sb_server.pyc", line 847, in main
>   File "spambayes\ProxyUI.pyc", line 156, in __init__
>   File "spambayes\UserInterface.pyc", line 254, in __init__
>   File "spambayes\UserInterface.pyc", line 122, in __init__
>   File "spambayes\UserInterface.pyc", line 240, in readUIResources
>   File "spambayes\resources\__init__.pyc", line 30, in ?
>   File "resourcepackage\package.pyc", line 100, in scan
> WindowsError: [Errno 3] The system cannot find the path specified:
> 'C:\\Program Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*'

I figured this one out (and probably why Kenny and Mark don't see it).  This
happens if you have the resourcepackage __init__.py in the
spambayes/resources directory (which is necessary to get the files in there
to automatically update).  Resource package isn't able to find the files
inside the zip (and even if it could, would have to alter the zip to change
the files).

The easy solution is to just have the cvs __init__.py; the correct solution
is probably to add something in that stops the check if we're frozen.  Does
that sound right to you, Richie?
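
Something along these lines is what I have in mind.  It is only a sketch:
it relies on py2exe setting sys.frozen in the built executable, and the
function names are hypothetical rather than anything in resourcepackage.

```python
import sys


def running_frozen():
    """True when running from a frozen (e.g. py2exe) binary."""
    return bool(getattr(sys, "frozen", False))


def scan_resources_if_possible(scan):
    """Call scan() unless frozen; return whether it was called."""
    if running_frozen():
        # Inside the zip there are no loose resource files to scan.
        return False
    scan()
    return True


called = scan_resources_if_possible(lambda: print("scanning resources"))
print("scan ran:", called)
```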

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec  3 22:46:18 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 22:46:24 2003
Subject: [spambayes-dev] Testing the binary installer (Was More
	CVSbranch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F616@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1ED@its-xchg4.massey.ac.nz>

[Me, making another mistake]
> The easy solution is to just have the cvs __init__.py;

Oops.  I meant a blank __init__.py, or to not have resourcepackage
installed.

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec  3 23:11:21 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec  3 23:11:26 2003
Subject: [spambayes-dev] Using reload() with modules from zips (Was More CVS
	branch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304477FA0@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FF@its-xchg4.massey.ac.nz>

> Oops, may have spoken too soon.  Just noticed I'm getting the 
> following error whenever I save configuration in the binary 
> version.
[...]
> ImportError: No module named Options

I remember this from the last binary (from Mark) I tried, and I get it too.

IMO, this is a Python bug.  Try this:
"""
>set PYTHONPATH=path/to/spambayes.zip
>python

>>> from spambayes import Options
>>> Options
<module 'spambayes.Options' from
'c:\spambayes\windows\py2exe\dist\lib\spambayes
.zip\spambayes\Options.pyc'>
>>> reload(Options)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ImportError: No module named Options
"""

Looks to me like reload() doesn't work with modules from zips, which I
presume it should.

Could one of the Python experts correct me if I'm wrong here?  Otherwise I
presume I should open a (python) sf bug about this.

=Tony Meyer


From richie at entrian.com  Thu Dec  4 04:02:26 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Dec  4 04:02:37 2003
Subject: [spambayes-dev] Testing the binary installer (Was More CVS
	branch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz>
Message-ID: <fmttsvsq06brcddltu1jfc5f74gc779mvq@4ax.com>


[Tony]
> the correct solution
> is probably to add something in that stops the check if we're frozen.  Does
> that sound right to you, Richie?

Sounds good.  I'm sure Mike Fletcher would appreciate a patch.  8-)

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Thu Dec  4 04:02:29 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Dec  4 04:02:39 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/windows
	pop3proxy_tray.py, 1.15, 1.16
In-Reply-To: <E1ARjgX-0000JA-00@sc8-pr-cvs1.sourceforge.net>
References: <E1ARjgX-0000JA-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <umttsvcvvnh9n7kq1etganuck0bgdoeh1b@4ax.com>


[Tony]
> Since 15/10/03, SetDefaultItem has been available for menus in win32all, so use that
> as we should.  (It appears to set the font of the item correctly, but not have any effect
> in terms of action, so still capture the double-click ourselves.  Someone correct me if
> I've done this wrongly).

I've just had a look at a tray app I wrote years ago, and it does the same
thing.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Thu Dec  4 04:02:30 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Dec  4 04:02:41 2003
Subject: [spambayes-dev] Branch merge
In-Reply-To: <038f01c3b937$4b738e70$2c00a8c0@eden>
References: <038f01c3b937$4b738e70$2c00a8c0@eden>
Message-ID: <1nttsvk6smcgunmjup66glucnjioqbfo81@4ax.com>


[Mark]
> Any comments, or +1s on me checking this in?

I have no time to read it, but I trust you.

+0.99

(so if it all goes wrong I can blame the 0.01 8-)

-- 
Richie Hindle
richie@entrian.com


From kennypitt at hotmail.com  Thu Dec  4 11:11:33 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 11:12:06 2003
Subject: [spambayes-dev] Testing the binary installer (Was More
	CVSbranch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1EC@its-xchg4.massey.ac.nz>
Message-ID: <LAW11-OE30XfT433nD0000025f4@hotmail.com>

Tony Meyer wrote:
> [me, just now]
>> Building the installer seems ok, but I get these errors:
>> 
>> Traceback (most recent call last):
>>   File "pop3proxy_tray.py", line 407, in _ProxyThread
>>   File "sb_server.pyc", line 869, in start
>>   File "sb_server.pyc", line 847, in main
>>   File "spambayes\ProxyUI.pyc", line 156, in __init__
>>   File "spambayes\UserInterface.pyc", line 254, in __init__
>>   File "spambayes\UserInterface.pyc", line 122, in __init__
>>   File "spambayes\UserInterface.pyc", line 240, in readUIResources
>>   File "spambayes\resources\__init__.pyc", line 30, in ?
>>   File "resourcepackage\package.pyc", line 100, in scan
>> WindowsError: [Errno 3] The system cannot find the path specified:
>> 'C:\\Program Files\\SpamBayes\\lib\\spambayes.zip\\spambayes\\resources/*.*'
> 
> I figured this one out (and probably why Kenny and Mark don't see
> it).  This happens if you have the resourcepackage __init__.py in the
> spambayes/resources directory (which is necessary to get the files in
> there to automatically update).  Resource package isn't able to find
> the files inside the zip (and even if it could, would have to alter
> the zip to change the files).

I just checked and the last binary I built was done with the
resourcepackage __init__.py still in place, yet I don't get the error.
I think you are probably still correct on the problem, though.  There is
a good chance that I ran from source at least once since I last modified
any of the resources, so the .py files would have gotten updated then.
Would that cause resourcepackage to not try to regenerate?  Try running
from source first and then rebuilding the binary to see if you still get
the error.

> The easy solution is to just have the cvs __init__.py; the correct
> solution is probably to add something in that stops the check if
> we're frozen.  Does that sound right to you, Richie?

This definitely should be done for a release version.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Thu Dec  4 11:27:18 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 11:27:49 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/windows
	spambayes.iss, 1.2, 1.3
In-Reply-To: <E1ARjv2-0000oh-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <Law11-OE71gn5yIVqVW00002643@hotmail.com>

Tony Meyer wrote:
> Update of /cvsroot/spambayes/spambayes/windows
> In directory sc8-pr-cvs1:/tmp/cvs-serv3138/windows
> 
> Modified Files:
> 	spambayes.iss
> Log Message:
> These dlls end up in the lib directory here, and I'm pretty sure that
> that's where they're meant to be these days.  An old .iss, maybe?
> 
> Index: spambayes.iss
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/windows/spambayes.iss,v
> retrieving revision 1.2
> retrieving revision 1.3
> diff -C2 -d -r1.2 -r1.3
> *** spambayes.iss	23 Oct 2003 22:54:09 -0000	1.2
> --- spambayes.iss	4 Dec 2003 03:12:41 -0000	1.3
> ***************
> *** 19,24 ****
>   Source: "py2exe\dist\lib\*.*"; DestDir: "{app}\lib"; Flags: ignoreversion
>   Source: "py2exe\dist\bin\python23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> ! Source: "py2exe\dist\bin\pythoncom23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> ! Source: "py2exe\dist\bin\PyWinTypes23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> 
>   Source: "py2exe\dist\bin\outlook_addin.dll"; DestDir: "{app}\bin"; Check: InstallingOutlook; Flags: ignoreversion regserver
> --- 19,24 ----
>   Source: "py2exe\dist\lib\*.*"; DestDir: "{app}\lib"; Flags: ignoreversion
>   Source: "py2exe\dist\bin\python23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> ! Source: "py2exe\dist\lib\pythoncom23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> ! Source: "py2exe\dist\lib\PyWinTypes23.dll"; DestDir: "{app}\bin"; Flags: ignoreversion
> 
>   Source: "py2exe\dist\bin\outlook_addin.dll"; DestDir: "{app}\bin"; Check: InstallingOutlook; Flags: ignoreversion regserver

Actually, those two DLLs only go in the lib directory now, not in the
bin directory.  The "py2exe\dist\lib\*.*" line already takes care of
that.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Thu Dec  4 11:46:17 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 11:46:48 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins]
	spambayes/windowsspambayes.iss, 1.2, 1.3
In-Reply-To: <Law11-OE71gn5yIVqVW00002643@hotmail.com>
Message-ID: <LAW11-OE47p4b0NnHIX000026a3@hotmail.com>

[Me, before realizing I had modified this file locally]
> Tony Meyer wrote:
>> Update of /cvsroot/spambayes/spambayes/windows
>> In directory sc8-pr-cvs1:/tmp/cvs-serv3138/windows
>> 
>> Modified Files:
>> 	spambayes.iss
>> Log Message:
>> These dlls end up in the lib directory here, and I'm pretty sure
>> that that's where they're meant to be these days.  An old .iss,
>> maybe? 
> 
> Actually, those two DLLs only go in the lib directory now, not in the
> bin directory.  The "py2exe\dist\lib\*.*" line already takes care of
> that.

And now for the completion of that partial thought:

So, the correct fix is to delete those two DLLs from the .iss entirely.

Some background:  A little while back, I noticed that these DLLs weren't
found in some cases when they were in the bin directory (e.g. if you
tried to register the plugin DLL when you weren't sitting in the bin
directory).  Mark and I determined that if the DLLs are in the python
path, they will be found correctly in all cases, so that's why setup_all
was changed to copy them to the dist\lib directory instead of the
dist\bin directory.

-- 
Kenny Pitt


From dbulgrien at vcsd.com  Thu Dec  4 12:06:03 2003
From: dbulgrien at vcsd.com (Dennis W. Bulgrien)
Date: Thu Dec  4 12:20:26 2003
Subject: [spambayes-dev] Re: Outlook Envelope Tray Icon
References: <IEEDKAHMEBPPLILCLFKFCEECCJAA.bob@jellyvision.com>
	<E1AIB7K-0001Ie-C3@mail.python.org>
Message-ID: <bqnpi5$d86$1@sea.gmane.org>

One place where I have noticed this would be nice is when the "Delete as Spam"
button is pressed.  In SpamBayes Manager (Training tab, Incremental Training
frame), clicking Delete as Spam is set to "mark the message as read", yet the
icon is not cleared even though the message is marked as read.  This is
unexpected because the "Mark spam as read" check-box (Filtering tab, Certain
Spam frame) keeps the icon from appearing when spam comes in and is ushered to
the spam folder (Advanced tab set to Enabled background filtering, default
delays).  Maybe the latter works because SpamBayes marks it read even BEFORE
Outlook displays the icon.

"Kenny Pitt" <kennypitt@hotmail.com> wrote...
...
Thanks for the link.  I created the following code to implement this in
the Outlook plugin and attached it to a menu item for testing.  It was,
in fact, successful in removing the new mail envelope from the taskbar.
Now, the *really* tricky part is figuring out when to remove the icon.
...


From kennypitt at hotmail.com  Thu Dec  4 13:47:59 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 13:48:34 2003
Subject: [spambayes-dev] pop3proxy_tray error
Message-ID: <LAW11-OE28AuMroXsAy000027f5@hotmail.com>

I decided it was probably time to do a little more thorough testing of
the proxy tray than just my average daily usage.  I tried to stop
SpamBayes from the right-click menu, and then start it again.  Here's
the output I got when it tried to restart SpamBayes.

"""
Loading database...
Traceback (most recent call last):
  File "pop3proxy_tray.py", line 389, in OnCommand
    function()
  File "pop3proxy_tray.py", line 431, in Start
    sb_server.prepare(state=sb_server.state)
  File "C:\src\python\spambayes_exp\scripts\sb_server.py", line 861, in
prepare
    state.buildServerStrings()
  File "C:\src\python\spambayes_exp\scripts\sb_server.py", line 685, in
buildServerStrings
    serverStrings = ["%s:%s" % (s, p) for s, p in self.servers]
TypeError: iteration over non-sequence
"""

At this point, the tray icon indicates that SpamBayes is running (I had
to hover my mouse over the icon to get it to update, maybe a different
problem), but none of the ports have been opened so any attempt to
connect to the mail server, review messages in the ui, etc. fails.

This happens whether I am running from source or the py2exe binary.
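
I haven't tracked down the cause yet, but "iteration over non-sequence" is
what you get when the thing being looped over is None, so my guess (an
assumption only) is that the server list is not restored after a stop.  In
a current Python the same mistake reads "'NoneType' object is not
iterable".  A defensive sketch, with build_server_strings as a hypothetical
standalone version of the failing line:

```python
def build_server_strings(servers):
    """Format (host, port) pairs; treat a missing list as empty."""
    return ["%s:%s" % (host, port) for host, port in (servers or [])]


# Iterating over None is what raises the TypeError in the first place.
try:
    ["%s:%s" % (host, port) for host, port in None]
except TypeError as exc:
    print("TypeError:", exc)

print(build_server_strings(None))
print(build_server_strings([("pop.example.com", 110)]))
```

This only hides the symptom; the ports still would not be opened if the
server list really is missing after a restart.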

-- 
Kenny Pitt


From kennypitt at hotmail.com  Thu Dec  4 14:16:44 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 14:17:16 2003
Subject: [spambayes-dev] pop3proxy_tray icons
Message-ID: <Law11-OE48VKW447p9N0000289d@hotmail.com>

I don't know if anyone else has noticed this or not, but on my Windows
2000 system the green and red circles in the current pop3proxy_tray
icons are very difficult to make out.  I created the attached icons as a
possible alternative.  They are basic 16-color icons and show up quite
nicely on both Windows 2000 and Windows XP.

The attached patch is also required because the LoadImage calls pass 0,0
for the icon size.  That loads the icon using the default 32x32 size,
scaling a 16x16 icon up to 32x32 if necessary.  Since icons in the tray
are only 16x16, they then get scaled back down when displayed and still
end up looking bad.

I also attached an alternate sbicon that I created in the spirit of the
icons in the Web UI.  It uses the envelope icon from the Wingdings font
with the same blue outline color used in the UI icons.  I modified my
py2exe\setup_all.py to use this as the icon for all the generated exe's.

-- 
Kenny Pitt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb-stopped.ico
Type: image/x-icon
Size: 318 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sb-stopped.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb-started.ico
Type: image/x-icon
Size: 318 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sb-started.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pop3proxy_tray.diff
Type: application/octet-stream
Size: 1992 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/pop3proxy_tray.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sbicon.ico
Type: image/x-icon
Size: 4710 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031204/490f76fe/sbicon.bin
From sanjaydarisi at cox.net  Thu Dec  4 14:29:29 2003
From: sanjaydarisi at cox.net (Sanjay Darisi)
Date: Thu Dec  4 14:29:34 2003
Subject: [spambayes-dev] Closing Manager window...
Message-ID: <3FCF8B19.5030108@cox.net>


I realized that every time the SpamBayes Manager dialog is closed it
saves to the config file. I found that the close button just closes the
dialog. So, is there any event associated with the closing of this
dialog that invokes the SaveConfig function in the Manager.py file? Oops...I
didn't tell you, I am running SpamBayes Outlook addin 0.81.

Thanks in advance,
Sanjay.
From mhammond at skippinet.com.au  Thu Dec  4 17:35:38 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Dec  4 17:35:55 2003
Subject: [spambayes-dev] Using reload() with modules from zips (Was More
	CVSbranch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29FF@its-xchg4.massey.ac.nz>
Message-ID: <07dd01c3bab6$f2ff7cf0$2c00a8c0@eden>

> Looks to me like reload() doesn't work with module from zip's, which I
> presume it should.
>
> Could one of the Python experts correct me if I'm wrong here?
>  Otherwise I
> presume I should open a (python) sf bug about this.

I agree this is a bug.  While .zip files are generally logically "readonly",
there is no reason that a .zip file could not be updated dynamically while
an app is running.

I'm not so sure it will see quick attention though, so we should consider
handling this in our code.
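
One way of handling it in our code would be to treat a failed reload as
"keep what we already have".  The sketch below is an assumption about the
shape of such a guard, not existing SpamBayes code; it uses Python 3's
importlib.reload (the builtin reload() in this thread is its 2003
ancestor), and reload_or_keep is a hypothetical name.

```python
import colorsys
import importlib
import types


def reload_or_keep(module):
    """Try to reload a module; on ImportError keep the loaded copy."""
    try:
        return importlib.reload(module)
    except ImportError:
        return module


# A normal module reloads in place and the same object comes back.
fresh = reload_or_keep(colorsys)

# A module the import system cannot find again raises ImportError on
# reload; the guard returns the original object instead of failing.
orphan = types.ModuleType("hypothetical_orphan")
kept = reload_or_keep(orphan)

print(fresh is colorsys, kept is orphan)
```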

I'm also not sure exactly *why* we are doing a reload - saving user options
should not require us to reload the Options module, and I'm fairly sure no
code exists that updates the .zip with a new Options file even if it did :)

Mark.


From kennypitt at hotmail.com  Thu Dec  4 17:55:40 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec  4 17:56:12 2003
Subject: [spambayes-dev] Using reload() with modules from zips (Was
	MoreCVSbranch/tags questions)
In-Reply-To: <07dd01c3bab6$f2ff7cf0$2c00a8c0@eden>
Message-ID: <LAW11-OE40hhIOp37Q600002b23@hotmail.com>

Mark Hammond wrote:
> I'm also not sure exactly *why* we are doing a reload - saving user
> options should not require us to reload the Options module, and I'm
> fairly sure no code exists that updates the .zip with a new Options
> file even if it did :) 

I agree.  The Python manual says of the reload() function:  "This is
useful if you have edited the module source file using an external
editor and want to try out the new version without leaving the Python
interpreter."  In the binary, we have no source to reload.  Even when
running from source, it doesn't seem useful to recompile Options.py.  It
isn't done for any of the other modules.

I assume the reload has the side-effect of rerunning any initialization
that occurs when the module is first loaded, but what good would that do
during a save?  We already have all the options that the user specified
or we wouldn't be able to save them, so why throw them all out and then
immediately reload them again?  As far as I can tell, neither Options.py
nor OptionsClass.py does anything except read in the values, so it
doesn't seem like there should be any side-effects.

Maybe we should just comment out the reload and see what happens (he
said while opening his text editor).

-- 
Kenny Pitt


From mhammond at skippinet.com.au  Thu Dec  4 23:47:35 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Dec  4 23:47:46 2003
Subject: [spambayes-dev] release_1_0 branch is dead
Message-ID: <083701c3baea$e7695760$2c00a8c0@eden>

I've checked in my merge of the branch.  As per my previous mail, the
changes were pretty trivial, so I expect no problems.  But consider it
official - the branch is dead.

Mark.


From kennypitt at hotmail.com  Fri Dec  5 10:21:02 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Dec  5 10:21:35 2003
Subject: [spambayes-dev] Using reload() with modules from zips
	(WasMoreCVSbranch/tags questions)
In-Reply-To: <LAW11-OE40hhIOp37Q600002b23@hotmail.com>
Message-ID: <Law11-OE522hgdhsKMZ000033b2@hotmail.com>

[Me, yesterday evening]
> Mark Hammond wrote:
>> I'm also not sure exactly *why* we are doing a reload - saving user
>> options should not require us to reload the Options module, and I'm
>> fairly sure no code exists that updates the .zip with a new Options
>> file even if it did :)
> 
> [snip meandering commentary by me]
> 
> Maybe we should just comment out the reload and see what happens (he
> said while opening his text editor).

I commented out the 4 lines that do the importing and reloading of the
Options module, and then rebuilt the binary.  Brief initial testing
showed no problems with these lines taken out.

I changed the listening port for one of my POP servers.  I was able to
save the configuration change successfully with no traceback, and by
monitoring with TCPView I saw the listening port change immediately.  I
then also changed the spam cutoff and again saved successfully.
Finally, I exited sb_tray and reloaded it, and both changes were still
present in the config file.

-- 
Kenny Pitt


From skip at pobox.com  Fri Dec  5 16:43:54 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec  5 16:44:01 2003
Subject: [spambayes-dev] More on training - eyeballs and edits appreciated
Message-ID: <16336.64538.768060.984524@montanaro.dyndns.org>


I added a bunch of text to the SpamBayes Wiki about training today (several
doses of caffeine later).  I apologize for the long delay.  Thanks to Seth
and Ryan for stepping up to the plate in my virtual absence.  I also added
the training aphorisms I posted a couple weeks ago.  Have a look:

    http://www.entrian.com/sbwiki/TrainingIdeas

Feel free to comment on anything or edit the page using the link at the
bottom...

Skip

From sl6dt at cc.usu.edu  Sun Dec  7 15:46:48 2003
From: sl6dt at cc.usu.edu (sl6dt)
Date: Sun Dec  7 15:49:30 2003
Subject: [spambayes-dev] Web filtering
Message-ID: <3FD367C7@webster.usu.edu>

Hello everyone,

I am the new guy to the list.  I was wondering today if it were possible to 
make a bayesian filter for web pages to block undesired content?  This way we 
could provide a free plugin to everybody who doesn't want porn sites to show 
up in their web browser.  What does everyone think?

John Mulholland


From skip at pobox.com  Sun Dec  7 18:13:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Dec  7 18:13:03 2003
Subject: [spambayes-dev] Web filtering
In-Reply-To: <3FD367C7@webster.usu.edu>
References: <3FD367C7@webster.usu.edu>
Message-ID: <16339.46077.318890.127461@montanaro.dyndns.org>


    John> I was wondering today if it were possible to make a bayesian
    John> filter for web pages to block undesired content?  

Check out mod_spambayes.py in the contrib directory.  It's a SpamBayes
plugin for Amit Patel's proxy web server.

Skip

From sl6dt at cc.usu.edu  Sun Dec  7 18:32:30 2003
From: sl6dt at cc.usu.edu (sl6dt)
Date: Sun Dec  7 18:34:46 2003
Subject: [spambayes-dev] Web filtering
Message-ID: <3FD3BEEC@webster.usu.edu>

Thank you for the information.  I am not familiar with Amit Patel's proxy web 
server.  Who is working on this plugin?  Can I help?

John Mulholland


>===== Original Message From skip@pobox.com =====
>    John> I was wondering today if it were possible to make a bayesian
>    John> filter for web pages to block undesired content?
>
>Check out mod_spambayes.py in the contrib directory.  It's a SpamBayes
>plugin for Amit Patel's proxy web server.
>
>Skip


From skip at pobox.com  Sun Dec  7 19:48:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Dec  7 19:48:21 2003
Subject: [spambayes-dev] Web filtering
In-Reply-To: <3FD3BEEC@webster.usu.edu>
References: <3FD3BEEC@webster.usu.edu>
Message-ID: <16339.51797.668396.312753@montanaro.dyndns.org>


    John> Thank you for the information.  I am not familiar with Amit
    John> Patel's proxy web server.  Who is working on this plugin?  Can I
    John> help?

The URL for the proxy server is at the top of the mod_spambayes.py file.  I
wrote the plugin, though as you can see, it's pretty minimal.  Nobody's
working on it at the moment.  You're more than welcome to enhance it.  I
only wrote it as an exercise.

Skip

From tameyer at ihug.co.nz  Mon Dec  8 03:13:56 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec  8 03:14:07 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/windowspop3proxy_tray.py, 1.15, 1.16
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6D2@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz>

[Tony]
> Since 15/10/03, SetDefaultItem has been available for menus in 
> win32all, so use that as we should.  (It appears to set the font of 
> the item correctly, but not have any effect in terms of action, so 
> still capture the double-click ourselves.  Someone correct me if I've 
> done this wrongly).

[Richie]
> I've just had a look at a tray app I wrote years ago, and it 
> does the same thing.

Good enough for me.  (I presume this was not in Python, or you had your own
extension to get access to the SetDefaultItem function?)

=Tony Meyer


From tameyer at ihug.co.nz  Mon Dec  8 03:17:29 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec  8 03:17:35 2003
Subject: [spambayes-dev] Using reload() with modules from zips (Was More
	CVSbranch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6FA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F2@its-xchg4.massey.ac.nz>

[Tony]
> Looks to me like reload() doesn't work with module from 
> zip's, which I presume it should.

[Mark]
> I agree this is a bug.  While .zip files are generally 
> logically "readonly", there is no reason that a .zip file 
> could not be updated dynamically while an app is running.

At the least a more accurate ("can't reload from zip") message would be
good, I would think.  I'll submit this as a bug for Python.

> I'm not so sure it will see quick attention though, so we 
> should consider handling this in our code.

I'm not sure that it deserves quick attention either, since I don't imagine
this is a high use feature.

> I'm also not sure exactly *why* we are doing a reload - 
> saving user options should not require us to reload the 
> Options module, and I'm fairly sure no code exists that 
> updates the .zip with a new Options file even if it did :)

I *think* (before my time, IIRC) the reason is to generate a new
Options.options object, that has all the new values.  I had figured that the
correct behaviour for us would be to remove the reload and explicitly
recreate/update the options object.

=Tony Meyer


From tameyer at ihug.co.nz  Mon Dec  8 03:24:29 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec  8 03:24:35 2003
Subject: [spambayes-dev] RE: [Spambayes] Hotmail Confusion
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458FE32@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F3@its-xchg4.massey.ac.nz>

[Mark, more than once]
> I am starting to believe 
> that the 'background filtering' option should be the default.

+1.

=Tony Meyer


From richie at entrian.com  Mon Dec  8 03:32:14 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec  8 03:32:21 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/windowspop3proxy_tray.py, 1.15, 1.16
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130458F6D2@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F130212B1F1@its-xchg4.massey.ac.nz>
Message-ID: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com>


[Tony]
> Since 15/10/03, SetDefaultItem has been available for menus in 
> win32all, so use that as we should.  (It appears to set the font of 
> the item correctly, but not have any effect in terms of action, so 
> still capture the double-click ourselves.  Someone correct me if I've 
> done this wrongly).
> 
> [Richie]
> I've just had a look at a tray app I wrote years ago, and it 
> does the same thing.

[Tony]
> Good enough for me.  (I presume this was not in Python, or you had your own
> extension to get access to the SetDefaultItem function?)

It was in C.

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Mon Dec  8 07:03:16 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Dec  8 07:03:31 2003
Subject: [spambayes-dev] Using reload() with modules from zips (Was
	MoreCVSbranch/tags questions)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B1F2@its-xchg4.massey.ac.nz>
Message-ID: <002d01c3bd83$43afb560$2c00a8c0@eden>

[Tony, quoting me]
> > I'm also not sure exactly *why* we are doing a reload -
> > saving user options should not require us to reload the
> > Options module, and I'm fairly sure no code exists that
> > updates the .zip with a new Options file even if it did :)
>
> I *think* (before my time, IIRC) the reason is to generate a new
> Options.options object, that has all the new values.  I had figured
> that the correct behaviour for us would be to remove the reload and
> explicitly recreate/update the options object.

Yes - the 'Options' module's mainline code actually reads the config file -
so I can see why a reload is needed if you want to re-read an options file
that may have been externally modified (now that I think about it <wink>)

The solution seems pretty simple: the top-level Options code gets moved into
a function, an 'if __name__' block is added which calls it, and all
occurrences of reload(Options) are also replaced similarly.
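
A minimal sketch of that refactoring, not the actual Options.py code.  The
names load_options and options here are hypothetical stand-ins, and the
standard configparser module is assumed in place of the real option classes:

```python
import configparser

# Hypothetical stand-in for the module-level Options.options object.
options = configparser.ConfigParser()


def load_options(filenames):
    """Re-read the config files into the existing options object.

    The object is updated in place rather than rebound, so code that
    already holds a reference to it sees the new values.
    """
    for section in options.sections():
        options.remove_section(section)
    options.read(filenames)
    return options
```

Callers that used to do reload(Options) would call load_options() with the
same file names instead, which works whether or not the module came from a
zip.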

I'll do it, unless someone beats me to it (fingers crossed - I've lost the
context, such as any existing bugs etc), or I forget <0.1-wink>

Mark.


From mhammond at skippinet.com.au  Mon Dec  8 07:07:42 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Dec  8 07:07:52 2003
Subject: [spambayes-dev] Re:
	[Spambayes-checkins]spambayes/windowspop3proxy_tray.py, 1.15, 1.16
In-Reply-To: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com>
Message-ID: <002e01c3bd83$e29de570$2c00a8c0@eden>

> [Tony]
> > Good enough for me.  (I presume this was not in Python, or
> > you had your own extension to get access to the SetDefaultItem
> > function?)
>
> It was in C.

Isn't it great to have moved on from those bad old days?  I know it is for
me :)

Good-enough-to-keep-persisting-with-win32all <wink>, ly

Mark.


From richie at entrian.com  Mon Dec  8 15:15:30 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec  8 15:15:37 2003
Subject: [spambayes-dev] Re:
	[Spambayes-checkins]spambayes/windowspop3proxy_tray.py, 1.15, 1.16
In-Reply-To: <002e01c3bd83$e29de570$2c00a8c0@eden>
References: <8nd8tv46t6cdsi1vnil5mmv45bacq2m636@4ax.com>
	<002e01c3bd83$e29de570$2c00a8c0@eden>
Message-ID: <anm9tv09h7brra21o8u3l39t73gqk2265k@4ax.com>


[Richie]
> It was in C.

[Mark]
> Isn't it great to have moved on from those bad old days?  I know it is for
> me :)

Certainly is.  The day I next write a C program from scratch will be the
day I need to write a device driver.  Or maybe not even then:

http://groups.google.com/groups?dq=&hl=en&lr=&ie=UTF-8&oe=UTF-8&th=a5f5f93827f5d230&seekm=mailman.226.1070896324.16879.python-list%40python.org&frame=off

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Tue Dec  9 11:03:18 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec  9 11:03:45 2003
Subject: [spambayes-dev] RE: [Spambayes] Strip Subject of Non-alpha
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEJFHKAB.tim.one@comcast.net>
References: <16340.55172.171511.255475@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEJFHKAB.tim.one@comcast.net>
Message-ID: <16341.62022.48555.624970@montanaro.dyndns.org>


    >> I never got overwhelming encouragement for my ideas about how to add
    >> experimental extensions to the CVS repository.

    Tim> Probably because it came attached to such a weak change <wink>.

Okay, ignore the bit about a specific "enhancement".  We all know most of
them don't work anyway.  Still, suppose someone comes up with an idea (we
get them all the time in the spambayes mailing list): "I know, how about
using the new header transmogrification feature of RFC-4822?", but doesn't
have the programming cojones to implement it.  Someone else comes along,
realizes it wouldn't be such a big deal to implement, does so and posts,
"Okay, try the version in CVS.  SpamBayes now has a "Headers:X-transmogrify"
option.  Let us know whether it helps or not."

People can then experiment with RFC-4822 transmogrification.  If it proves
not to be a worthy addition, the code can be ripped out.  The key is
tweaking the options parser to not care if there is no
"Tokenizer:X-transmogrify" option (because the code was ripped out later) or
to map "Tokenizer:X-transmogrify" to "Tokenizer:transmogrify" if it gains
acceptance and moves out of the trial stage.  (In fact, perhaps it should
work the other way as well, so we can rip stuff out that's not useful
without breaking people's options files.  See below.)

I just checked in a change to spambayes/OptionsClass.py which implements an
experimental/deprecated option feature.  It works like this:

    * Option is "foo", user sets "foo".  status quo.
    * Option is "X-foo", user sets "X-foo".  status quo.
    * Option is "foo", user sets "X-foo".  "foo" is set silently.
    * Option is "X-foo", user sets "foo".  "X-foo" is set and a warning
      emitted.

The third case covers experimental options.  The fourth case covers
deprecated options.  (The description for deprecated options in Options.py
should start with "(DEPRECATED) ".)
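
The four cases above can be sketched roughly as follows.  This is only an
illustration of the name mapping, not the code checked in to
OptionsClass.py; resolve_option_name is a hypothetical helper and the
warning text is made up:

```python
import warnings


def resolve_option_name(defined_names, requested):
    """Map the name a user wrote to the name the program defines."""
    if requested in defined_names:
        # Cases 1 and 2: exact match, status quo.
        return requested
    if requested.startswith("X-") and requested[2:] in defined_names:
        # Case 3: an experimental option was accepted; map silently.
        return requested[2:]
    if "X-" + requested in defined_names:
        # Case 4: the option was deprecated; map it and warn.
        warnings.warn("option %r is deprecated" % requested)
        return "X-" + requested
    raise KeyError(requested)
```

For example, resolve_option_name(set(["foo"]), "X-foo") returns "foo"
without complaint, while resolve_option_name(set(["X-foo"]), "foo") returns
"X-foo" and emits a warning.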

    Tim> Really, a few people tested it and it didn't seem to matter either
    Tim> way.

Granted.  One thing I wonder about is how "current" people's training
databases are.  New techniques like c?mm?nt ?cc?nt??t??n or em.bed-ed
punc#tua_tion aren't likely to turn up much in older training databases.  I
canned my old training database recently and have been working on rebuilding
it from scratch.  I think it's important that our training databases evolve
as spam does.

Another change I have locally is the remove_punctuation tokenizer gimmick I
alluded to above.  It also doesn't seem to change fp/fn results at the level
of pushing messages clearly out of one category into another, however it
seems to pretty consistently spread the ham/spam means apart a bit and
reduce their standard deviations.  I'm more interested in a framework for
making such experimental changes easier for non-programmers to try out.

    Tim> Experimental extensions are fine by me, and you proposed a decent
    Tim> scheme for putting them in.  The downside is that every piece of
    Tim> code complicates the whole, and I really don't know why you'd
    Tim> *want* to check in a gimmick that made no real difference to anyone
    Tim> who tried it (if I remember all the reports correctly -- maybe
    Tim> not).

The point isn't sticking code in, it's being able to easily yank it back
out.  (I think my checkin should make that easier.)  You mentioned
generate_time_buckets and extract_dow.  I'll turn the screws in a moment to
deprecate them.  If this idea doesn't fly with people, or these options are
deemed crucial for enough people we can just un-deprecate them.

(BTW, has anyone on a Unix-ish system tried out testtools/Makefile when
running timcv?  If so, does it help or am I the only person who finds it
useful?)

Skip

From skip at pobox.com  Tue Dec  9 11:17:50 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec  9 11:17:57 2003
Subject: [spambayes-dev] mboxutils.DirOfTxtFileMailbox - broaden its scope?
Message-ID: <16341.62894.540999.159231@montanaro.dyndns.org>


In mboxutils.getmbox it creates a DirOfTxtFileMailbox() object in certain
situations.  Looking at the code, it ignores any hierarchy within the given
directory, and only considers files ending in ".txt" or ".lorien".  Would
anyone object if I broadened this class's mandate to recursively traverse
subdirectories and consider all other files it encounters as message files?
This would (for example), allow you to call

    spambayes.mboxutils.getmbox("Data/Ham")

in your test directory and walk through all the ham in your training
database.  I've been using this for a month or so with no ill effects,
though I have to admit I have no idea what a ".lorien" file is, so I have no
directories like that to break.  (Also, in the world outside SpamBayes, I
often add ".txt" to files which don't contain email. <wink>)
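
Roughly, the broadened behaviour would look like the sketch below.  It is
not the DirOfTxtFileMailbox code itself; iter_message_files and
iter_messages are hypothetical names, and every file found is assumed to
hold exactly one message:

```python
import email
import os


def iter_message_files(directory):
    """Yield the path of every file under directory, recursively."""
    for dirpath, dirnames, filenames in os.walk(directory):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            yield os.path.join(dirpath, name)


def iter_messages(directory):
    """Parse each file found under directory as one email message."""
    for path in iter_message_files(directory):
        with open(path, "rb") as f:
            yield email.message_from_binary_file(f)
```

With this, iterating over iter_messages("Data/Ham") walks every file below
Data/Ham regardless of extension or nesting.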

Skip

From mhammond at skippinet.com.au  Wed Dec 10 00:50:07 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Dec 10 00:50:24 2003
Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes
Message-ID: <000201c3bee1$7a119ec0$2c00a8c0@eden>

I've just changed the way we manage UserProperties in the Outlook addin.
The way we check if an Outlook folder has a "UserProperty" has changed, and
the way we create this "UserProperty" has also changed.  This does *not*
change the way the "Spam" field is saved in the message (that uses MAPI
properties), only the way Outlook shows these values (a subtle but real
distinction).

Most people should see absolutely no change - all your folders will already
have this 'Spam' field, and this should be detected correctly.  Until now,
the 'Unsure' folder never had this property automatically created by
SpamBayes, so unless you created this field manually (via the 'Field
Chooser') the field should now be created for you - but I assume almost
everyone here has already done that.

So if a few of you would like to kill a few minutes <wink>, I would
appreciate a little test - especially by Outlook XP users.
* For at least one of your Watched, Spam and Unsure folders:
*  If the 'Spam' field is being shown for this folder, right-click the
column header, and select 'Remove this column'
*  Right-click any column header, and select 'Field Chooser'
*  Select 'User defined fields' - The 'Spam' field should appear.  Select
it.
*  Click the 'Delete' button, confirm the deletion, and close the field
chooser.

Re-start outlook.  The log (at any level) should show:
...
Folder 'Personal Folders/Inbox' has no field named 'Spam' - creating
SpamBayes: Watching for new messages in folder Personal Folders/Inbox
...

Note the first entry - each folder that you deleted the field from should
show a similar message.  Then restart Outlook again, this time checking you
do *not* see that message.

Finally, go back to your folders and bring up the 'Field Chooser'.  When you
select 'User Defined Fields', you should find the 'Spam' field magically
re-created.  Drag it back to your view, and you are back where you started.

If you are using anon-cvs, please wait until manager rev 1.91 and msgstore
rev 1.78 appear.

Thanks!

Mark


From tameyer at ihug.co.nz  Wed Dec 10 02:10:11 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 10 02:10:18 2003
Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046B4310@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677741@its-xchg4.massey.ac.nz>

> So if a few of you would like to kill a few minutes <wink>, I would
> appreciate a little test - especially by Outlook XP users.

Outlook XP SP2 here.

> * For at least one of your Watched, Spam and Unsure folders:
[...]
> Re-start outlook.  The log (at any level) should show:

Sorry, I got:
"""
Error adding field to 'Unsure' folder
('0000000038A1BB1005E5101AA1BB08002B2A56C20000454D534D44422E444C4C0000000000
0000001B55FA20AA6611CD9BC800AA002FC45A0C0000004954532D5843484734002F6F3D4D61
7373657920556E69766572736974792F6F753D4D41535345592F636E3D526563697069656E74
732F636E3D542E412E4D6579657200',
'000000002CFF45187C119D4295E615A8AD7B7676010098B01D2717B9D411B38F0008C784093
1000010DF96B20000')
NameError: global name 'PR_USERFIELDS' is not defined
"""

I get this every time I start Outlook now, and don't have the field
available in the field chooser.

=Tony Meyer


From mhammond at skippinet.com.au  Wed Dec 10 02:19:32 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Dec 10 02:19:44 2003
Subject: [spambayes-dev] Outlook CVS 'UserProperty' changes
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677741@its-xchg4.massey.ac.nz>
Message-ID: <000001c3beed$f5c80980$2c00a8c0@eden>

> > So if a few of you would like to kill a few minutes <wink>, I would
> > appreciate a little test - especially by Outlook XP users.
> 
> Outlook XP SP2 here.
> 
> > * For at least one of your Watched, Spam and Unsure folders:
> [...]
> > Re-start outlook.  The log (at any level) should show:
> 
> Sorry, I got:

Thanks!  Fixed - please try again.

Mark


From papaDoc at videotron.ca  Wed Dec 10 10:01:10 2003
From: papaDoc at videotron.ca (papaDoc)
Date: Wed Dec 10 10:01:23 2003
Subject: [spambayes-dev] Command line options
Message-ID: <3FD73536.3080101@videotron.ca>

Hi,

I submitted on sourceforge a patch to have more consistent command line 
options across all the scripts. (-d for dbm and -D for pickle)

Remi


From papaDoc at videotron.ca  Wed Dec 10 18:55:25 2003
From: papaDoc at videotron.ca (papaDoc)
Date: Wed Dec 10 18:55:28 2003
Subject: [spambayes-dev] Patch to make sb_mboxtrain to work on windows
Message-ID: <3FD7B26D.8000304@videotron.ca>

Hi,

I submitted a patch on sourceforge to make the sb_mboxtrain script 
work on Windows.


Remi


From skip at pobox.com  Wed Dec 10 21:22:59 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 10 21:23:14 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677744@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13046B4478@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F1304677744@its-xchg4.massey.ac.nz>
Message-ID: <16343.54531.802087.451246@montanaro.dyndns.org>

    >> Big mistake. Stuff started getting wacky real fast.... Guess what?
    >> One of the messages in the digest was an obvious spam.

    Tony> This is perhaps a drawback of the minimalist database size
    Tony> training strategy.  I'm guessing that if you had a larger
    Tony> database, the effect wouldn't have been as pronounced?  

Maybe.  At the moment, I have 9768 tokens in my database and 7731 of them
are hapaxes.  As you suggest, it would appear mistakes can throw things off
more dramatically, but it is also easier to detect.

I'd be interested to see what others' hapax fractions are:

    >>> import shelve
    >>> db = shelve.open(".hammiedb")
    >>> n = 0
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
    7731
    >>> len(db)
    9769
    >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
    0.79146191646191644

(The -1 is to eliminate the 'saved state' token.  I'm just being
pedantic. ;-)
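
The same measurement as a small function, for anyone who would rather not
retype the session.  It is only a sketch: it assumes each token maps to a
pair of counts, as in the session above, and that the bookkeeping entry is
stored under the key 'saved state' (pass a different state_key if your
database differs):

```python
def is_hapax(value):
    """True if the token was seen exactly once in training."""
    return value in ((0, 1), (1, 0))


def hapax_fraction(db, state_key="saved state"):
    """Fraction of tokens in db that are hapaxes, ignoring state_key."""
    tokens = [k for k in db if k != state_key]
    if not tokens:
        return 0.0
    hapaxes = sum(1 for k in tokens if is_hapax(db[k]))
    return hapaxes / float(len(tokens))
```

Any mapping works as db, so hapax_fraction(shelve.open(".hammiedb")) gives
the same kind of number as the last expression in the session.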

Another interesting thing (I think) might be to investigate the importance
of synthetic tokens (e.g.: 'url:eweek' or 'received:168.10.156') vs. natural
tokens (e.g., 'highlight' or 'dot') for smaller vs larger databases.  I
think one of the reasons training a single unsure has a dramatic effect on a
bunch of other unsure spams is because of all the synthetic tokens they have
in common due to similar delivery mechanisms (gotta use that account before
it gets shut down...).  If a spammer spews a bunch of messages from ISP A,
then gets booted, his next spew will be from somewhere else.  I suspect many
of the ISP-related synthetic tokens generated will only ever be hapaxes, and
thus be much more important with a small database than with a large one.

It's just a theory.  Hey, maybe that's another master's thesis idea for
Brett Cannon... ;-)

Skip

From kennypitt at hotmail.com  Thu Dec 11 09:58:45 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 11 09:59:25 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <16343.54531.802087.451246@montanaro.dyndns.org>
Message-ID: <Law11-OE729slIo0KqJ00007a5e@hotmail.com>

Skip Montanaro wrote:
> I'd be interested to see what others' hapax fractions are:
> 
>     >>> import shelve
>     >>> db = shelve.open(".hammiedb")
>     >>> n = 0
>     >>> len([k for k in db if db[k] in [(0,1),(1,0)]])
>     7731
>     >>> len(db)
>     9769
>     >>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
>     0.79146191646191644

My current Outlook training database has 40 good and 59 spam.  Here are
my results:

>>> len([k for k in db if db[k] in [(0,1),(1,0)]])
8158
>>> len(db)
11274
>>> len([k for k in db if db[k] in [(0,1),(1,0)]])/float(len(db)-1)
0.72367604009580411

-- 
Kenny Pitt


From tim.one at comcast.net  Thu Dec 11 11:17:47 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 11 11:17:48 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <16343.54531.802087.451246@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIDHLAB.tim.one@comcast.net>

[Tony]
> This is perhaps a drawback of the minimalist database size
> training strategy.

I think it's a consequence of mistake-based training (and minimal database
size is another consequence of *that*).

> I'm guessing that if you had a larger database, the effect wouldn't
> have been as pronounced?

A mistake in training has smaller effect under TOE (train-on-everything).
The other side of that is that a correctly-trained example also has smaller
effect under TOE.

[Skip]
> Maybe.  At the moment, I have 9768 tokens in my database and 7731 of
> them are hapaxes.  As you suggest, it would appear mistakes can throw
> things off more dramatically,

We're rediscovering the bases for these old mantras:

    Mistake-based training leads to hapax-driven scoring.

    Hapax-driven scoring is brittle.

"brittle" is an antonym of "robust" <wink>.  But in my personal email life,
I've been very happy with mistake-based training despite its drawbacks.

> but it is also easier to detect.

Heh -- isn't that *because* it throws things off so dramatically <wink>?

> I'd be interested to see what others' hapax fractions are:

I don't think that's the right thing to measure.  There's really nothing in
a database that's interesting on its own, the only thing that matters to
performance is what gets used during *scoring* (everything else just sits
there, passively, the same as if it didn't exist (except for its effect on
database size)).  A message score mostly derived from hapaxes is brittle
because a single contrary training example can change the classifier's view
of a hapax from "hammy" or "spammy" to "neither", and two contrary training
examples can swing it to the other classification.
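
A rough worked example of that swing.  It assumes the usual Robinson-style
adjustment with a prior of 0.5 and a prior strength of 0.45 (the defaults as
I recall them, so treat both numbers as assumptions), and equal numbers of
trained ham and spam; token_spamprob is a hypothetical helper, not the
classifier's own method:

```python
def token_spamprob(spamcount, hamcount, nspam, nham,
                   unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Spam probability of one token, pulled toward the prior."""
    spamratio = spamcount / float(nspam)
    hamratio = hamcount / float(nham)
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    return ((unknown_word_strength * unknown_word_prob + n * prob)
            / (unknown_word_strength + n))


# One spam sighting, then one and two contrary ham sightings.
for counts in [(1, 0), (1, 1), (1, 2)]:
    print(counts, round(token_spamprob(counts[0], counts[1], 100, 100), 2))
```

Under those assumptions the three lines print 0.84, 0.5 and 0.36: spammy,
then neither, then leaning hammy.  A ham-only hapax comes out at 0.16.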

In the early days, the database kept track of the last time a token was used
in scoring, and the test framework kept track of how often each token got used
in scoring.  There isn't an out-of-the-box way to get at that info anymore,
so it's much harder to investigate how mistake-based training leads to
hapax-driven scoring now.

It's not *all* bad, or mistake-based training wouldn't be so effective for
so many of us.  Maybe the clearest example is that the hapaxes found in a
new spam campaign are precisely what let us get away with training one
sample and thereafter catch others from that campaign; in effect, hapaxes
act like a pretty large set of lexical fingerprints in that case.

> ...
> Another interesting thing (I think) might be to investigate the
> importance of synthetic tokens (e.g.: 'url:eweek' or
> 'received:168.10.156') vs. natural tokens (e.g., 'highlight' or
> 'dot') for smaller vs larger databases.  I think one of the reasons
> training a single unsure has a dramatic effect on a bunch of other
> unsure spams is because of all the synthetic tokens they have in
> common due to similar delivery mechanisms (gotta use that account
> before it gets shut down...).  If a spammer spews a bunch of messages
> from ISP A, then gets booted, his next spew will be from somewhere
> else.  I suspect many of the ISP-related synthetic tokens generated
> will only ever be hapaxes, and thus be much more important with a
> small database than with a large one.

It was established before that hapaxes are vital in mistake-based training.
If you want to test that quickly but informally, modify a copy of your
database to throw away all the hapaxes, then live with that reduced database
for a while.  It will probably have a hard time even with the messages it
was originally trained with.
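
One way to build such a reduced copy, as a sketch only.  It assumes the
database behaves like a mapping from token to a pair of counts, and
without_hapaxes is a hypothetical name; run it against a copy, never the
live database:

```python
def without_hapaxes(db):
    """Return a plain dict copy of db with every hapax entry removed."""
    return dict((k, v) for k, v in db.items()
                if v not in ((0, 1), (1, 0)))
```

Entries that are not count pairs, such as a saved-state record, are kept
because they never compare equal to (0, 1) or (1, 0).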


From skip at pobox.com  Thu Dec 11 11:38:28 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 11 11:38:47 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEIDHLAB.tim.one@comcast.net>
References: <16343.54531.802087.451246@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCAEIDHLAB.tim.one@comcast.net>
Message-ID: <16344.40324.210107.698842@montanaro.dyndns.org>


    >> I'd be interested to see what others' hapax fractions are:

    Tim> I don't think that's the right thing to measure.  There's really
    Tim> nothing in a database that's interesting on its own, the only thing
    Tim> that matters to performance is what gets used during *scoring*
    Tim> (everything else just sits there, passively, the same as if it
    Tim> didn't exist (except for its effect on database size)).  

Yes, you're correct, of course.  So what we might want to look at is the
relative occurrence of 0.84 and 0.16 scores in message clues?

    Tim> It's not *all* bad, or mistake-based training wouldn't be so
    Tim> effective for so many of us.  Maybe the clearest example is that
    Tim> the hapaxes found in a new spam campaign are precisely what let us
    Tim> get away with training one sample and thereafter catch others from
    Tim> that campaign; in effect, hapaxes act like a pretty large set of
    Tim> lexical fingerprints in that case.

This is where I think the synthetic vs. natural tokens thing would be
interesting.  I get lots of Viagra spam, most of which is caught, but in my
current database, 'viagra' is a hapax.  In fact, it appears I only added it
very recently.  Here's the evidence header from a message with the subject:

    Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped Overnight

which was scored around 12:25 AM today:

    X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
            'subject:Free': 0.16; 'store': 0.23; 'next': 0.25; 'list,': 0.30;
            'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
            'header:Reply-To:1': 0.64; 'enter': 0.67;
            'content-type:multipart/alternative': 0.68;
            'content-type:text/html': 0.74; 'doctors': 0.84;
            'prescription': 0.84; 'received:103]': 0.84;
            'received:165.175': 0.84; 'received:175': 0.84;
            'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84;
            'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98

Most of the spammy clues are synthetic tokens related to delivery (and are
mostly hapaxes), not content.  My 'train an unsure or false negative, check
for spams' method suggests this is the case, since training on a single
message often pushes several other spams about completely different topics
into the spam category.
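
One way to put a number on that for a single message is to count the clues
in its evidence header that score exactly 0.84 or 0.16.  The sketch below
does so for the header above.  The regular expression and the function
name are made up for illustration, it assumes no token contains an
apostrophe, and a 0.84 or 0.16 score only suggests a hapax:

```python
import re
from collections import Counter

EVIDENCE = """'*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
        'subject:Free': 0.16; 'store': 0.23; 'next': 0.25; 'list,': 0.30;
        'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
        'header:Reply-To:1': 0.64; 'enter': 0.67;
        'content-type:multipart/alternative': 0.68;
        'content-type:text/html': 0.74; 'doctors': 0.84;
        'prescription': 0.84; 'received:103]': 0.84;
        'received:165.175': 0.84; 'received:175': 0.84;
        'received:199.249.165.175': 0.84; 'received:249.165.175': 0.84;
        'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98"""

# Matches one "'token': score" pair (hypothetical pattern, see above).
CLUE_RE = re.compile(r"'([^']*)': (\d\.\d+)")

def hapax_like_counts(evidence):
    """Return (total clues, clues at 0.84, clues at 0.16)."""
    clues = [(tok, float(score)) for tok, score in CLUE_RE.findall(evidence)
             if tok not in ("*H*", "*S*")]      # skip the two summary scores
    counts = Counter(score for tok, score in clues)
    return len(clues), counts[0.84], counts[0.16]

print(hapax_like_counts(EVIDENCE))   # prints: (21, 7, 2)
```

So 9 of the 21 clues in this message sit at a hapax-like score, and 5 of
the 7 spammy ones are received: tokens.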

This suggests a couple other downsides to minimalist training.  One,
spammers have to move, so hapaxes related to delivery are likely to only be
useful for a short period while the spammer is abusing a single account.
Two, if a delivery token pushes a bunch of other messages into the spam
category which are then never used as inputs to training, the opportunity to
reinforce that token's quality is lost, even though it might actually appear
fairly frequently in spam.

Skip

From richie at entrian.com  Thu Dec 11 12:38:33 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Dec 11 12:38:37 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
Message-ID: <hfahtvkejos62a0d8jfmc2mkvghpvgvkih@4ax.com>


In response to Skip's question about hapax ratios, I ran his script and
received an error.  I boiled the problem down to this:

>>> print [db[k] for k in db]
Traceback (most recent call last):
  File "hapaxes.py", line 3, in ?
    print [db[k] for k in db]
  File "C:\Python23\lib\shelve.py", line 118, in __getitem__
    f = StringIO(self.dict[key])
  File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
    return self.db[key]
KeyError: 'pics'

Excuse me?  Er, so how many of these things are there?

>>> len([k for k in db if db.get(k, None) is None])
306

And what do they look like?

>>> from pprint import pprint as p
>>> p([k for i, k in enumerate(db) if db.get(k, None) is None and i % 50 == 0])
['magnetism',
 'url:mlqnuvs',
 'from:addr:wi872u',
 'autograph.',
 'url:ff-programs',
 'motels,']

So they have nothing obvious in common.  Looking through the full list
it's obvious that they don't all come from one message.  Some are
obviously ham clues and some are obviously spam.

I'm probably winging my way towards a DBRunRecovery error, unless someone
can explain what's going on?
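
The check itself is generic: collect every key that enumeration reports
but that direct lookup then rejects.  Here is a minimal sketch that works
on any mapping.  BrokenMapping is a made-up stand-in that misbehaves on
purpose so the example runs without a damaged database; it is not a model
of what bsddb is actually doing:

```python
def phantom_keys(db):
    """Return the keys that db.keys() lists but db[key] cannot fetch."""
    missing = []
    for key in list(db.keys()):
        try:
            db[key]
        except KeyError:
            missing.append(key)
    return missing


class BrokenMapping(dict):
    """Hypothetical damaged database: lists 'pics' but cannot return it."""
    def __getitem__(self, key):
        if key == "pics":
            raise KeyError(key)
        return dict.__getitem__(self, key)


db = BrokenMapping(pics=1, magnetism=2, motels=3)
print(phantom_keys(db))   # prints: ['pics']
```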

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Thu Dec 11 17:00:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 11 17:00:14 2003
Subject: [spambayes-dev] Saving last set and get times in database
Message-ID: <16344.59627.337541.794331@montanaro.dyndns.org>

Here's an initial patch which maintains last set and get times for tokens:

    https://sourceforge.net/tracker/index.php?func=detail&aid=858564&group_id=61702&atid=498105

Very experimental...  Caveat emptor...

Skip

From kennypitt at hotmail.com  Fri Dec 12 09:36:49 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Dec 12 09:37:23 2003
Subject: [spambayes-dev] FW: [Spambayes] feature request
Message-ID: <LAW11-OE12KzahUlbBh000088fd@hotmail.com>

Mark Hammond wrote:
> [Rayfes]
>> Is there
>> any way to have SpamBayes play a sound every time it marks
>> a message as spam and maybe a different sound for possible spam?
>> That way if I hear a new message come in I could
>> just wait a few seconds to hear whether it was marked spam.
>> If I happen to get multiple messages at a time I may miss realizing
>> that some Ham messages came in with some spam but that's ok with me.
> 
> That is a pretty good idea.  If someone can nail down the exact
> feature request, I think we could add it.

I just submitted patch #858925 that implements a first stab at this.
The file notify_sound_patch.txt in the attached ZIP describes the
approach I took to answer Mark's original issues.  It borrows heavily
from Mark's background filtering timer code to implement a "message
batch accumulation" delay timer.
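
To give a feel for what a batch accumulation delay means, here is a
minimal sketch.  It is not the code in the patch.  The class name, the
method names, and the choice to restart the delay on every arrival are all
assumptions made for illustration, and the clock is passed in explicitly
so the example runs without a real timer:

```python
class BatchNotifier(object):
    """Collect classifications until no message arrives for `delay` seconds."""

    def __init__(self, delay):
        self.delay = delay
        self.pending = []       # classifications seen in the current batch
        self.deadline = None    # time at which the batch is considered done

    def message_arrived(self, classification, now):
        self.pending.append(classification)
        # Assumption: each new arrival restarts the quiet period.
        self.deadline = now + self.delay

    def poll(self, now):
        """Return the finished batch, or None if still accumulating."""
        if self.deadline is None or now < self.deadline:
            return None
        batch, self.pending, self.deadline = self.pending, [], None
        return batch


notifier = BatchNotifier(delay=5)
notifier.message_arrived("spam", now=0)
notifier.message_arrived("ham", now=3)
print(notifier.poll(now=6))   # prints: None
print(notifier.poll(now=8))   # prints: ['spam', 'ham']
```

A caller would then pick one sound for the whole batch instead of playing
one per message.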

If you think there is sufficient interest in having this feature, then
please try this out and comment on what you do or don't like about the
approach.  It's been working well for the way I use Outlook, but then
that's what I designed it for so others might prefer it to work
differently.  If we decide to add it to the product, I'll put together
an update to SpamBayes Manager to configure it.

-- 
Kenny Pitt


From mhammond at skippinet.com.au  Sat Dec 13 22:19:15 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat Dec 13 22:19:37 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
Message-ID: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden>

As far as I understand it, experimental_ham_spam_imbalance_adjustment has
been found to be ineffective, and that all default options have now set this
to False.

However, there is an issue regarding existing users of the addin.  For
Outlook in particular, if an old copy of "default_bayes_customize.ini"
exists, we do not copy our new version over it.  As
experimental_ham_spam_imbalance_adjustment was set to True in early
versions, this option will remain in effect, even when these users upgrade
to newer versions.

I don't think a similar issue exists with the other apps.

Short term, the solution seems to be to nuke
experimental_ham_spam_imbalance_adjustment from classifier.py - unless of
course, there is some good reason to leave it for continued experiments (in
which case I would just force it False in the Outlook init code)

Longer term, I think the way we copy this file to the user's data directory
was a mistake, and I am likely to fix it (there is a bug on it from a
confused user).

Any thoughts?

Mark.


From tim.one at comcast.net  Sat Dec 13 23:18:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 13 23:18:43 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEDOHMAB.tim.one@comcast.net>

[Mark Hammond]
> As far as I understand it, experimental_ham_spam_imbalance_adjustment
> has been found to be ineffective,

It seemed OK so long as the data didn't get *too* unbalanced -- when there
was extreme imbalance, it was not only ineffective, it did major harm.

> and that all default options have now set this to False.

I hope so.  That was the plan.

> However, there is an issue regarding existing users of the addin.  For
> Outlook in particular, if an old copy of "default_bayes_customize.ini"
> exists, we do not copy our new version over it.  As
> experimental_ham_spam_imbalance_adjustment was set to True in early
> versions, this option will remain in effect, even when these users
> upgrade to newer versions.
>
> I don't think a similar issue exists with the other apps.
>
> Short term, the solution seems to be to nuke
> experimental_ham_spam_imbalance_adjustment from classifier.py -
> unless of course, there is some good reason to leave it for continued
> experiments (in which case I would just force it False in the Outlook
> init code)

Na, it's a proven loser.  I just deleted the code from classifier.py, and
reworded some of the docs.  Options.py still knows about it, though, to
avoid breaking any .ini file that still references it.  I'm not sure how to
get rid of it completely.

> Longer term, I think the way we copy this file to the user's data
> directory was a mistake, and I am likely to fix it (there is a bug on
> it from a confused user).
>
> Any thoughts?

Unsure what you have in mind -- but doubt it's insane <wink>.


From tim.one at comcast.net  Sun Dec 14 00:05:02 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 14 00:05:04 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <hfahtvkejos62a0d8jfmc2mkvghpvgvkih@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEEBHMAB.tim.one@comcast.net>

[Richie Hindle]
> In response to Skip's question about hapax ratios, I ran his script
> and received an error.  I boiled the problem down to this:
>
> >>> print [db[k] for k in db]
> Traceback (most recent call last):
>   File "hapaxes.py", line 3, in ?
>     print [db[k] for k in db]
>   File "C:\Python23\lib\shelve.py", line 118, in __getitem__
>     f = StringIO(self.dict[key])
>   File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
>     return self.db[key]
> KeyError: 'pics'
>
> Excuse me?  Er, so how many of these things are there?
>
> >>> len([k for k in db if db.get(k, None) is None])
> 306

Ouch.  What do you get if you open the database directly, instead of
indirecting thru a shelf?  I'm just trying to make sure it's really the
database that's hosed.  For example, here's a complete program picking on my
database:

PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db"
import bsddb
d = bsddb.hashopen(PATH, 'r')
print len(d)
print len([k for k in d if d.get(k, None) is None])

That printed 40787, then 0, when I ran it just now.

> And what do they look like?

Doesn't matter -- it should never happen!

> >>> from pprint import pprint as p
> >>> p([k for i, k in enumerate(db) if db.get(k, None) is None and i % 50 == 0])
> ['magnetism',
>  'url:mlqnuvs',
>  'from:addr:wi872u',
>  'autograph.',
>  'url:ff-programs',
>  'motels,']
>
> So they have nothing obvious in common.  Looking through the full list
> it's obvious that they don't all come from one message.  Some are
> obviously ham clues and some are obviously spam.
>
> I'm probably winging my way towards a DBRunRecovery error, unless
> someone can explain what's going on?

I've fixed miserable *similar* bugs in ZODB's BTrees (enumerating finds keys
that direct lookup doesn't believe exist), so I'm not shocked if some other
database screws up in this way too.  Gotta say, I'm half ready to declare
that ZODB is the only database anyone should ever use (the bugs in that are
long fixed <wink>).


From mhammond at skippinet.com.au  Sun Dec 14 06:06:31 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Dec 14 06:06:45 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEDOHMAB.tim.one@comcast.net>
Message-ID: <0fe101c3c232$55023ff0$2c00a8c0@eden>

> Na, it's a proven loser.  I just deleted the code from 
> classifier.py, and reworded some of the docs.  

Thanks!  That was exactly what I hoped would happen :)

> Options.py still knows about it, though, to
> avoid breaking any .ini file that still references it.  I'm 
> not sure how to get rid of it completely.

Yep - me too, and just perfect!

Thanks,

Mark.


From skip at pobox.com  Sun Dec 14 15:31:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Dec 14 15:31:35 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden>
References: <0dec01c3c1f1$0e3b6b00$2c00a8c0@eden>
Message-ID: <16348.51361.368410.400031@montanaro.dyndns.org>


    Mark> However, there is an issue regarding existing users of the addin.
    Mark> For Outlook in particular, if an old copy of
    Mark> "default_bayes_customize.ini" exists, we do not copy our new
    Mark> version over it.  As experimental_ham_spam_imbalance_adjustment
    Mark> was set to True in early versions, this option will remain in
    Mark> effect, even when these users upgrade to newer versions.

If you rip out the code in classifier.py, you should be able to simply
change its name in Options.py to

    x-experimental_ham_spam_imbalance_adjustment

That's how you deprecate an option based upon the code I added to
OptionsClass.py the other day.  If the user sets "foo" but it doesn't exist
and "x-foo" does, a message is printed to stderr, but nothing bombs.  Take a
look at the docstring for the OptionsClass module.
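
The lookup rule described above can be sketched in a few lines.  This is
not the OptionsClass code.  The function name, the DEFINED table and the
max_discriminators entry are placeholders made up for illustration:

```python
import sys

# Hypothetical table of option names the program knows about, per section.
DEFINED = {
    "Classifier": set([
        "max_discriminators",
        "x-experimental_ham_spam_imbalance_adjustment",
    ]),
}

def check_option(defined, section, name):
    """Return (status, warning) for an option a user has set.

    status is "ok", "deprecated" or "invalid"; warning is None when ok.
    """
    known = defined.get(section, set())
    if name in known:
        return "ok", None
    if "x-" + name in known:
        # "foo" is gone but "x-foo" exists: warn and carry on.
        return "deprecated", ("warning: option %s in section %s is deprecated"
                              % (name, section))
    return "invalid", ("warning: Invalid option %s in section %s"
                       % (name, section))

status, warning = check_option(
    DEFINED, "Classifier", "experimental_ham_spam_imbalance_adjustment")
if warning:
    sys.stderr.write(warning + "\n")
print(status)   # prints: deprecated
```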

    Mark> Longer term, I think the way we copy this file to the user's data
    Mark> directory was a mistake, and I am likely to fix it (there is a bug
    Mark> on it from a confused user).

What's the mistake?  I think it's correct to not obliterate the user's local
copy of the config file.  Plenty of programs either don't copy config files
during install if a copy is already present, or install it to a different
name so the user can compare local and as-distributed versions of the file.

Skip

From skip at pobox.com  Sun Dec 14 15:34:32 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Dec 14 15:34:35 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEEBHMAB.tim.one@comcast.net>
References: <hfahtvkejos62a0d8jfmc2mkvghpvgvkih@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEEBHMAB.tim.one@comcast.net>
Message-ID: <16348.51544.686632.177594@montanaro.dyndns.org>


    Tim> import bsddb
    Tim> d = bsddb.hashopen(PATH, 'r')
    Tim> print len(d)
    Tim> print len([k for k in d if d.get(k, None) is None])

    Tim> That printed 40787, then 0, when I ran it just now.

I also get N and 0 for my working database (opened with anydbm).

Skip

From tim.one at comcast.net  Sun Dec 14 19:22:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 14 19:22:46 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <16348.51361.368410.400031@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEGCHMAB.tim.one@comcast.net>

[Skip Montanaro]
> ...
> If you rip out the code in classifier.py, you should be able to simply
> change its name in Options.py to
>
>     x-experimental_ham_spam_imbalance_adjustment
>
> That's how you deprecate an option based upon the code I added to
> OptionsClass.py the other day.  If the user sets "foo" but it doesn't
> exist and "x-foo" does, a message is printed to stderr, but nothing
> bombs.  Take a look at the docstring for the OptionsClass module.

Cool!  I just did that.  There's a minor problem:  the OptionsClass module
says the magic prefix is

    X-

(uppercase), but only

    x-

(lowercase) works as intended.  With

    X-experimental_ham_spam_imbalance_adjustment

the warning is

    warning: Invalid option experimental_ham_spam_imbalance_adjustment in
    section Classifier in file
    C:\WINDOWS\Application Data\SpamBayes\default_bayes_customize.ini

and with

    x-experimental_ham_spam_imbalance_adjustment

it's the mostly <wink> hoped-for

    warning: option experimental_ham_spam_imbalance_adjustment in
    section Classifier is deprecated

I'm not sure what your intent was, but the code should match the docs one
way or the other.  The second form of message should probably include the
filename too.

Works slick, though!
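
One thing that may bear on the X- versus x- difference: the standard
library config parser lowercases option names as it reads them unless
optionxform is overridden.  Whether that is what trips up the uppercase
prefix in OptionsClass is only a guess.  The sketch below uses the
Python 3 configparser module and a made-up .ini fragment:

```python
import configparser

INI = """[Classifier]
X-experimental_ham_spam_imbalance_adjustment: True
"""

# Default behaviour: option names are folded to lowercase on read.
parser = configparser.ConfigParser()
parser.read_string(INI)
names = parser.options("Classifier")
print(names)    # prints: ['x-experimental_ham_spam_imbalance_adjustment']

# Replacing optionxform with str keeps the case exactly as written.
strict = configparser.ConfigParser()
strict.optionxform = str
strict.read_string(INI)
strict_names = strict.options("Classifier")
print(strict_names)   # prints: ['X-experimental_ham_spam_imbalance_adjustment']
```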


From tim.one at comcast.net  Sun Dec 14 20:06:03 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 14 20:06:14 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <16344.40324.210107.698842@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEGGHMAB.tim.one@comcast.net>

[Skip Montanaro]
> ...
> This is where I think the synthetic vs. natural tokens thing would be
> interesting.

I'm not sure what's being distinguished here.

> I get lots of Viagra spam, most of which is caught, but in my current
> database, 'viagra' is a hapax.  In fact, it appears I only added it
> very recently.  Here's the evidence header from a message with the
> subject:
>
>     Viagra, Soma, Fioricet, Prescribed Online for Free, Shipped
>     Overnight
>
> which was scored around 12:25 AM today:
>
>     X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
>             'subject:Free': 0.16;

"Free" in a Subject line and "drug" in the body are hammy for you?  Staring
at clues from mistake-based training can be, umm, counter-intuitive <wink>.

>             'store': 0.23; 'next': 0.25; 'list,': 0.30;
>             'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
>             'header:Reply-To:1': 0.64; 'enter': 0.67;
>             'content-type:multipart/alternative': 0.68;
>             'content-type:text/html': 0.74; 'doctors': 0.84;
>             'prescription': 0.84; 'received:103]': 0.84;
>             'received:165.175': 0.84; 'received:175': 0.84;
>             'received:199.249.165.175': 0.84; 'received:249.165.175':
>             0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
>
> Most of the spammy clues are synthetic tokens related to delivery
> (and are mostly hapaxes), not content.

I'm not sure what's synthetic about these.  Most of your spam clues come
from the email *headers*, but that's fair game.  Note that mining received
headers is disabled by default, so you're getting a pile of clues most
people aren't getting.  Maybe they should.

> My 'train an unsure or false negative, check for spams' method suggests
> this is the case, since training on a single message often pushes several
> other spams about completely different topics into the spam category.

I'm unclear on what's noteworthy about that.  The biz domain is used by lots
of spam, lots of spam has a yahoo.com return address, lots of spam is
multipart/alternative HTML, and so on.  Looks like you're generating 4
correlated clues from a single Received header, and that you got one spam
before from the same box.  Strangely, though, it looks like you're sucking
out *suffixes* of IP addrs instead of prefixes (you've got

    199.249.165.175
        249.165.175
            165.175
and
                175

but not the almost-surely more useful

    199.249.165
    199.249
and
    199
).

> This suggests a couple other downsides to minimalist training.  One,
> spammers have to move, so hapaxes related to delivery are likely to
> only be useful for a short period while the spammer is abusing a
> single account.

IP *prefixes* should be useful despite that, due to the way IP space is
handed out.  If you're a spammer with a cooperative host, you're likely to
get other IP addresses from the netblocks assigned to that host, and they'll
share a common prefix.
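
To make the prefix/suffix distinction concrete, here is a small sketch
that generates both kinds of token from one address.  The function names
are invented here and this is not the tokenizer's actual code; only the
"received:" tag is taken from the evidence header above:

```python
def ip_prefix_tokens(ip, tag="received:"):
    """Return tokens for the address and each shorter leading-octet prefix."""
    octets = ip.split(".")
    return [tag + ".".join(octets[:n]) for n in range(len(octets), 0, -1)]

def ip_suffix_tokens(ip, tag="received:"):
    """Return tokens for the address and each shorter trailing-octet suffix."""
    octets = ip.split(".")
    return [tag + ".".join(octets[n:]) for n in range(len(octets))]

print(ip_prefix_tokens("199.249.165.175"))
# prints: ['received:199.249.165.175', 'received:199.249.165',
#          'received:199.249', 'received:199']
print(ip_suffix_tokens("199.249.165.175"))
# prints: ['received:199.249.165.175', 'received:249.165.175',
#          'received:165.175', 'received:175']
```

Two hosts in the same netblock share prefix tokens but, in general, none
of the suffix tokens.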

> Two, if a delivery token pushes a bunch of other messages into the
> spam category which are then never used as inputs to training, the
> opportunity to reinforce that token's quality is lost, even though it
> might actually appear fairly frequently in spam.

I expect 'subject:Free' was a fine example of that.


From skip at pobox.com  Sun Dec 14 21:13:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Dec 14 21:13:50 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGGHMAB.tim.one@comcast.net>
References: <16344.40324.210107.698842@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEGGHMAB.tim.one@comcast.net>
Message-ID: <16349.6359.927187.517763@montanaro.dyndns.org>


    >> X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.90; 'drug': 0.16;
    >> 'subject:Free': 0.16;

    Tim> "Free" in a Subject line and "drug" in the body are hammy for you?
    Tim> Staring at clues from mistake-based training can be, umm,
    Tim> counter-intuitive <wink>.

Yeah, one of the online communities I participate in is a list of parents of
"troubled kids", hence the hammy "drug" reference.  "subject:Free" comes
from the music community:

    Subject: SFS Special Announcement (Free Guest List to Fluid this Friday)

    >> 'store': 0.23; 'next': 0.25; 'list,': 0.30;
    >> 'via': 0.34; 'subject:, ': 0.37; 'our': 0.62;
    >> 'header:Reply-To:1': 0.64; 'enter': 0.67;
    >> 'content-type:multipart/alternative': 0.68;
    >> 'content-type:text/html': 0.74; 'doctors': 0.84;
    >> 'prescription': 0.84; 'received:103]': 0.84;
    >> 'received:165.175': 0.84; 'received:175': 0.84;
    >> 'received:199.249.165.175': 0.84; 'received:249.165.175':
    >> 0.84; 'reply-to:addr:yahoo.com': 0.93; 'url:biz': 0.98
    >> 
    >> Most of the spammy clues are synthetic tokens related to delivery
    >> (and are mostly hapaxes), not content.

    Tim> I'm not sure what's synthetic about these.  

I guess my operational definitions of "synthetic" and "natural" tokens are
in order:

    "natural tokens" are those which derive simply by splitting the message
    body on whitespace boundaries.

    "synthetic tokens" are those which are not "natural tokens".

    Tim> Most of your spam clues come from the email *headers*, but that's
    Tim> fair game.  Note that mining received headers is disabled by
    Tim> default, so you're getting a pile of clues most people aren't
    Tim> getting.  Maybe they should.

Sure, email headers are fair game, but if the tokenizer didn't do anything
special with them, that "subject:Free" token would at most just be "free" or
"Free".

    >> My 'train an unsure or false negative, check for spams' method
    >> suggests this is the case, since training on a single message often
    >> pushes several other spams about completely different topics into the
    >> spam category.

    Tim> I'm unclear on what's noteworthy about that.  The biz domain is
    Tim> used by lots of spam, lots of spam has a yahoo.com return address,
    Tim> lots of spam is multipart/alternative HTML, and so on.  Looks like
    Tim> you're generating 4 correlated clues from a single Received header,
    Tim> and that you got one spam before from the same box.  Strangely,
    Tim> though, it looks like you're sucking out *suffixes* of IP addrs
    Tim> instead of prefixes (you've got

    Tim>     199.249.165.175
    Tim>         249.165.175
    Tim>             165.175
    Tim> and
    Tim>                 175

    Tim> but not the almost-surely more useful

    Tim>     199.249.165
    Tim>     199.249
    Tim> and
    Tim>     199
    Tim> ).

I don't know.  I agree those look backwards (that's my mail server, BTW).
OTOH, given the fairly random assignment of IP networks, I doubt it makes
much sense for the above IP address to be stripped of more than the last two
octets ("received:199.249.165.175", "received:199.249.165" and
"received:199.249").  "received:199", where 199 is the first octet, not the
last, almost certainly means nothing.  If it's spammy or hammy, it's just by
sheer coincidence.

    >> This suggests a couple other downsides to minimalist training.  One,
    >> spammers have to move, so hapaxes related to delivery are likely to
    >> only be useful for a short period while the spammer is abusing a
    >> single account.

    Tim> IP *prefixes* should be useful despite that, due to the way IP
    Tim> space is handed out.  If you're a spammer with a cooperative host,
    Tim> you're likely to get other IP addresses from the netblocks assigned
    Tim> to that host, and they'll share a common prefix.

Again, no more general than the first two octets (a class B network).  Class
A networks are very rare (for obvious reasons):

    http://euclid.math.brandeis.edu/turtschi/whois/neta1.html

    >> Two, if a delivery token pushes a bunch of other messages into the
    >> spam category which are then never used as inputs to training, the
    >> opportunity to reinforce that token's quality is lost, even though it
    >> might actually appear fairly frequently in spam.

    Tim> I expect 'subject:Free' was a fine example of that.

'subject:Free' is now slightly spammy, having turned up in three spams and
only one ham at this point.

Skip


From tim.one at comcast.net  Sun Dec 14 22:09:04 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 14 22:09:14 2003
Subject: [spambayes-dev] RE: [Spambayes] Watch out for digests...
In-Reply-To: <16349.6359.927187.517763@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGMHMAB.tim.one@comcast.net>

[Skip Montanaro]
> I guess my operational definitions of "synthetic" and "natural"
> tokens are in order:
>
>     "natural tokens" are those which derive simply by splitting the
>     message body on whitespace boundaries.
>
>     "synthetic tokens" are those which are not "natural tokens".

OK.  Now I've forgotten why you drew the distinction to begin with <0.9
wink>.

[about busting apart IP addrs]
> I don't know.  I agree those look backwards (that's my mail server,
> BTW). OTOH, given the fairly random assignment of IP networks, I
> doubt it makes much sense for the above IP address to be stripped of
> more than the last two octets ("received:199.249.165.175",
> "received:199.249.165" and "received:199.249").  "received:199",
> where 199 is the first octet, not the last, almost certainly means
> nothing.  If it's spammy or hammy, it's just by sheer coincidence.

In that case, the database will learn it; since it can't generate more than
126 legitimate "Class A" tokens total, it's a trivial database burden.
OTOH, for someone in the DOD, it may be valuable to know that email came
from a DOD Class A network.  On the third hand, spammers often forge
Received headers, and I doubt most do research to forge sensible IPs.  IOW,
the system learns what does and doesn't work, in both directions, provided
only that it's shown potentially interesting stuff.

> ...
> Again, no more general than the first two octets (a class B network).
> Class A networks are very rare (for obvious reasons):
>
>     http://euclid.math.brandeis.edu/turtschi/whois/neta1.html

They're rarer than that now -- that's over 4 years old, and lots of those
have been busted up.  Since current practice is to assign a range of initial
bits instead of initial bytes, maybe we should generate all *bit* prefixes
instead.  That would sure test whether correlation is our friend <wink>.
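
For the curious, generating every bit prefix is short with the Python 3
ipaddress module.  This is only a sketch of the idea: the function name
and the use of CIDR notation as the token text are assumptions, and
nothing here is in the tokenizer:

```python
import ipaddress

def bit_prefix_tokens(ip, tag="received:"):
    """Return one token per prefix length, /1 through /32, in CIDR form."""
    tokens = []
    for nbits in range(1, 33):
        # strict=False masks off the host bits instead of raising an error.
        net = ipaddress.ip_network("%s/%d" % (ip, nbits), strict=False)
        tokens.append(tag + str(net))
    return tokens

tokens = bit_prefix_tokens("199.249.165.175")
print(len(tokens))    # prints: 32
print(tokens[7])      # prints: received:199.0.0.0/8
print(tokens[15])     # prints: received:199.249.0.0/16
```

That is 32 heavily correlated tokens per address, which is exactly the
kind of test of correlation being joked about.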


From richie at entrian.com  Mon Dec 15 04:00:19 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec 15 04:00:52 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <hfahtvkejos62a0d8jfmc2mkvghpvgvkih@4ax.com>
Message-ID: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>


[Richie]
> >>> print [db[k] for k in db]
> KeyError: 'pics'

[Tim]
> Ouch.  What do you get if you open the database directly, instead of
> indirecting thru a shelf?  I'm just trying to make sure it's really the
> database that's hosed.

I think we're using different versions of bsddb - your code fails for me:

>>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db")
>>> len(d)
52331
>>> len([k for k in d if d.get(k, None) is None])
Traceback (most recent call last):
  File "<pyshell#4>", line 1, in -toplevel-
    len([k for k in d if d.get(k, None) is None])
  File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
    return self.db[key]
TypeError: Integer keys only allowed for Recno and Queue DB's

I think this is because GET_ITER is creating a list-style iterator rather
than a dict-style one.  bsddb objects don't look much like dictionaries:

>>> len([k for k in d.keys() if d.get(k, None) is None])
Traceback (most recent call last):
  File "<pyshell#11>", line 1, in -toplevel-
    len([k for k in d.keys() if d.get(k, None) is None])
AttributeError: _DBWithCursor instance has no attribute 'get'

I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit (Intel)]
on win32.  Assuming that's a red herring, here's an equivalent that works
for me:

>>> def get(d, k, default):
        try:
                return d[k]
        except KeyError:
                return default

>>> len([k for k in d.keys() if get(d, k, None) is None])
305

So yes, the underlying database is screwed.  But one token less screwed
than last time - lovely.  (I now get 305 when going through shelve as
well.)  I've done some training in between, which must have jiggled things
around.

[Tim]
> Gotta say, I'm half ready to declare
> that ZODB is the only database anyone should ever use (the bugs in that are
> long fixed <wink>).

I'm certainly underwhelmed by bsddb in single-file mode.  One day I want
to make spambayes use full transaction mode - that really ought to work.
(Does anyone know of any simple Python code I can steal that uses bsddb in
full-on multi-everything DBEnv mode?  The pybsddb docs just link to the
SleepyCat C API docs, which aren't very approachable.)

-- 
Richie Hindle
richie@entrian.com


From tim at fourstonesExpressions.com  Mon Dec 15 08:24:34 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 15 08:24:41 2003
Subject: [spambayes-dev] Fwd: [Spambayes] Won't work anymore
In-Reply-To: <000201c3c2cf$352f58a0$e09b2e04@home>
References: <000201c3c2cf$352f58a0$e09b2e04@home>
Message-ID: <oprz74u8kuit6vze@mail.fourstonesExpressions.com>

I don't know if anyone saw this on the spambayes list, but it seems 
severe, and I don't know how to respond....

------- Forwarded message -------
From: Jones Clan <jonesclan@verizon.net>
To: spambayes@python.org
Subject: [Spambayes] Won't work anymore
Date: Sun, 14 Dec 2003 21:49:29 -0800

> I loved your product with Outlook 2000. But now that I have installed
> XP, it won't work.  I don't get any errors but I click the button on the
> toolbar and nothing happens.  I have uninstalled and reinstalled
> thinking it had to be installed again after the Outlook upgrade.  Still
> nothing.  Please help because I miss using your product.
>
> McLean Jones
> NO Sugar - NO Carb Energy Drink
> www.getsomexs.com
> user: mclean
> pass: guest
> 888.870.5070
>
>


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun
From tim at fourstonesExpressions.com  Mon Dec 15 08:26:27 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 15 08:26:33 2003
Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile
In-Reply-To: <000001c3c2b8$ce2de0b0$1e02a8c0@JDi8000>
References: <000001c3c2b8$ce2de0b0$1e02a8c0@JDi8000>
Message-ID: <oprz74ydytit6vze@mail.fourstonesExpressions.com>

Oops... forwarded the wrong message.  This is the one I was thinking of.  
This seems severe, and I've not seen this problem pop up in the list 
before.  I don't know how to respond.

------- Forwarded message -------
From: My Tech <jtech@hyperionmail.com>
To: spambayes@python.org
Subject: [Spambayes] SpamBayes Corrupted My Profile
Date: Sun, 14 Dec 2003 22:09:07 -0500

> After installing SpamBayes, Outlook could only be opened in Safe Mode.
> (That is to say that when clicking on the Outlook desktop icon, a
> dialogue box popped up before the application would open, informing me
> that Outlook had encountered an error and needed to shut down.  The
> checkbox to "Restart Outlook" was already checked and I clicked on the
> "Don't Send [Error Report to Microsoft]" button.  Then, a new dialogue
> box popped up saying that Outlook failed to start correctly and asked me
> if I wanted to start in Safe Mode, "Yes" or "No."  If I select "No",
> then the first dialogue box re-appears telling me about Outlook
> encountering an error and wanting to restart.  If I select "Yes", only
> then will Outlook open.)
>
> I've come to find out that installing SpamBayes has corrupted my Windows
> Administrator profile and that is why Outlook will not open.  PLEASE HELP
> ASAP!!!  I do not want to have to reinstall my OS (and all of my
> software) because of this.
>
> FYI:  My Windows OS:  2000 Professional, 5.00.2195, Service Pack 4
>         SpamBayes installer used:  SpamBayes-Outlook-Setup-0081.exe
>         Outlook version:  2002, part of Office XP Small Business Edition
>
> If there is a way to fix this, please tell me.  Also, please send
> detailed instructions for installing SpamBayes, as it appears that I did
> not do it correctly (even though I followed the instructions per the
> SpamBayes website.)
>
> Thank you.


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun
From skip at pobox.com  Mon Dec 15 09:48:48 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 15 09:48:51 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEGCHMAB.tim.one@comcast.net>
References: <16348.51361.368410.400031@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCEEGCHMAB.tim.one@comcast.net>
Message-ID: <16349.51664.580182.339027@montanaro.dyndns.org>


    Tim> Cool!  I just did that.  There's a minor problem: the OptionsClass
    Tim> module says the magic prefix is

    Tim>     X-

    Tim> (uppercase), but only

    Tim>     x-

    Tim> (lowercase) works as intended.  With

    Tim>     X-experimental_ham_spam_imbalance_adjustment

    Tim> the warning is

    Tim>     warning: Invalid option experimental_ham_spam_imbalance_adjustment in
    Tim>     section Classifier in file
    Tim>     C:\WINDOWS\Application Data\SpamBayes\default_bayes_customize.ini

    Tim> and with

    Tim>     x-experimental_ham_spam_imbalance_adjustment

    Tim> it's the mostly <wink> hoped-for

    Tim>     warning: option experimental_ham_spam_imbalance_adjustment in
    Tim>     section Classifier is deprecated

    Tim> I'm not sure what your intent was, but the code should match the
    Tim> docs one way or the other.  The second form of message should
    Tim> probably include the filename too.

My intent was to mimic rfc-822-style experimental headers, but it appears I
don't really understand what ConfigParser does vis a vis case-sensitivity.
(I thought it was case-insensitive, and the code suggests it is, but that
seems to not quite be the case.)  In OptionsClass.merge_file() I originally
wanted to use X- (note its presence in a comment I forgot to change), but
wound up switching to x-.  A little more investigation suggests that
ConfigParser does indeed ignore case in the values it reads from the options
file, but that the code in OptionsClass.py doesn't treat the options it
stores in self._options that way.

I don't really care what hoops we as programmers have to jump through, but
I'd like users to be able to use either x- or X-.  I agree, the docstrings
should match required usage.  In any case, it appears Tim (or someone else)
has fixed things.

Skip

From tim.one at comcast.net  Mon Dec 15 11:00:13 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 11:00:47 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>

[Richie Hindle]
> I think we're using different versions of bsddb - your code fails for
> me:
>
> >>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db")
> >>> len(d)
> 52331
> >>> len([k for k in d if d.get(k, None) is None])
> Traceback (most recent call last):
>   File "<pyshell#4>", line 1, in -toplevel-
>     len([k for k in d if d.get(k, None) is None])
>   File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__
>     return self.db[key]
> TypeError: Integer keys only allowed for Recno and Queue DB's
>
> I think this is because GET_ITER is creating a list-style iterator
> rather than a dict-style one.  bsddb objects don't look much like
> dictionaries:
>
> >>> len([k for k in d.keys() if d.get(k, None) is None])
> Traceback (most recent call last):
>   File "<pyshell#11>", line 1, in -toplevel-
>     len([k for k in d.keys() if d.get(k, None) is None])
> AttributeError: _DBWithCursor instance has no attribute 'get'

Not here:

>>> PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db"
>>> import bsddb
>>> d = bsddb.hashopen(PATH, 'r')
>>> len([k for k in d.keys() if d.get(k, None) is None])
0
>>>

> I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit
> (Intel)] on win32.  Assuming that's a red herring,

I wouldn't assume that -- it may be the whole ball of wax.  I'm using
exactly the same, *except* I'm using 2.3.3c1 (also on Windows), and a number
of bsddb3 fixes have been checked in since Python 2.3.  It would help if you
tried 2.3.3c1.  If your symptoms above persist, then we've got a Major
Mystery to sort out (e.g., maybe you-- or I --aren't getting the version of
bsddb the Windows installer intended us to get).


> here's an equivalent that works for me:
>
> >>> def get(d, k, default):
>         try:
>                 return d[k]
>         except KeyError:
>                 return default
>
> >>> len([k for k in d.keys() if get(d, k, None) is None])
> 305
>
> So yes, the underlying database is screwed.  But one token less
> screwed than last time - lovely.  (I now get 305 when going through
> shelve as well.)  I've done some training in between, which must have
> jiggled things around.

...

> I'm certainly underwhelmed by bsddb in single-file mode.  One day I
> want to make spambayes use full transaction mode - that really ought
> to work. (Does anyone know of any simple Python code I can steal that
> uses bsddb in full-on multi-everything DBEnv mode?  The pybsddb docs
> just link to the SleepyCat C API docs, which aren't very
> approachable.)

Best I can suggest is studying Python's bsddb3 substantial test suite.  ZODB
has modules to build ZODB's transaction model on top of a Berkeley database,
but I don't think I'd call that simple.  I'm not a bsddb guy, though, so
those are just random things I've seen.


From tim.one at comcast.net  Mon Dec 15 11:12:52 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 11:12:57 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <16349.51664.580182.339027@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCELBHMAB.tim.one@comcast.net>

[Skip Montanaro]
> My intent was to mimic rfc-822-style experimental headers, but it
> appears I don't really understand what ConfigParser does vis a vis
> case-sensitivity. (I thought it was case-insensitive, and the code
> suggests it is, but that seems to not quite be the case.)  In
> OptionsClass.merge_file() I originally wanted to use X- (note its
> presence in a comment I forgot to change), but wound up switching to
> x-.  A little more investigation suggests that ConfigParser does
> indeed ignore case in the values it reads from the options file, but
> that the code in OptionsClass.py doesn't treat the options it stores
> in self._options that way.

I'd call that a bug in OptionsClass.py, then.  ConfigParser is modeled on
RFC 822 header fields, and supplies case-insensitive option names *because*
822 mandates case-insensitive semantics.

> I don't really care what hoops we as programmers have to jump
> through, but I'd like users to be able to use either x- or X-.  I
> agree, the docstrings should match required usage.  In any case, it
> appears Tim (or someone else) has fixed things.

Tony checked in a bunch of changes, but I suspect it's still case-sensitive.


From kennypitt at hotmail.com  Mon Dec 15 11:27:46 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Dec 15 11:28:35 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
Message-ID: <Law11-OE738sNcxqkwT0000a9a0@hotmail.com>

Tim Peters wrote:
> [Richie Hindle]
>> I think we're using different versions of bsddb - your code fails
>> for me: 
>> 
>> >>> d = bsddb.hashopen("/src/tests/spambayes/hammie.db")
>> >>> len(d)
>> 52331
>> >>> len([k for k in d if d.get(k, None) is None])
>> Traceback (most recent call last):
>>   File "<pyshell#4>", line 1, in -toplevel-
>>     len([k for k in d if d.get(k, None) is None])
>>   File "C:\Python23\lib\bsddb\__init__.py", line 86, in __getitem__ 
>>     return self.db[key]
>> TypeError: Integer keys only allowed for Recno and Queue DB's 
> 
> Not here:
> 
> >>> PATH = "/WINDOWS/Application Data/SpamBayes/default_bayes_database.db"
> >>> import bsddb
> >>> d = bsddb.hashopen(PATH, 'r') 
> >>> len([k for k in d.keys() if d.get(k, None) is None])
> 0
> >>> 
> 
>> I have Python 2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit
>> (Intel)] on win32.  Assuming that's a red herring,
> 
> I wouldn't assume that -- it may be the whole ball of wax.  I'm using
> exactly the same, *except* I'm using 2.3.3c1 (also on Windows), and a
> number of bsddb3 fixes have been checked in since Python 2.3.  It
> would help if you tried 2.3.3c1.  If your symptoms above persist,
> then we've got a Major Mystery to sort out (e.g., maybe you-- or I
> --aren't getting the version of bsddb the Windows installer intended
> us to get). 

I get the same results as Tim using the 2.3.2 final version: Python
2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on win32

In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on
line 116 of __init__.py, not line 86 as in Richie's traceback.  I could
expect some changes between Python 2.3 and 2.3.2, but 30 lines seems a
bit much between minor bugfix releases.  Is that possibly an indicator
of a bsddb version mismatch?

-- 
Kenny Pitt


From tim.one at comcast.net  Mon Dec 15 11:40:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 11:40:40 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE738sNcxqkwT0000a9a0@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAELFHMAB.tim.one@comcast.net>

[Kenny Pitt]
> I get the same results as Tim using the 2.3.2 final version: Python
> 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on
> win32
>
> In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on
> line 116 of __init__.py, not line 86 as in Richie's traceback.  I
> could expect some changes between Python 2.3 and 2.3.2, but 30 lines
> seems a bit much between minor bugfix releases.  Is that possibly an
> indicator of a bsddb version mismatch?

It's more an indicator of bugs in 2.3's bsddb support.  __init__.py was at
rev 1.5 in the 2.3 release, and is at rev 1.12(!) today:

http://cvs.sf.net/viewcvs.py/python/python/dist/src/Lib/bsddb/__init__.py

I see that support for the iterator and mapping protocols wasn't added until
rev 1.6, which is why they don't work for Richie in 2.3 final.
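
Richie's TypeError is consistent with how Python iterates an object that
defines __getitem__ but not __iter__: the for loop falls back to calling
__getitem__ with 0, 1, 2, ... until IndexError.  A minimal sketch of that
behavior follows.  OldStyleDB and NewStyleDB are hypothetical stand-ins
written for illustration, not the real bsddb wrapper classes.

```python
class OldStyleDB:
    """Hypothetical stand-in for a mapping wrapper that lacks __iter__."""

    def __init__(self, data):
        self._data = dict(data)

    def __getitem__(self, key):
        if not isinstance(key, str):
            # Mirrors the shape of the error in Richie's traceback.
            raise TypeError("Integer keys only allowed for Recno and Queue DB's")
        return self._data[key]

    def keys(self):
        return list(self._data)


class NewStyleDB(OldStyleDB):
    """The same wrapper with iterator and mapping support added."""

    def __iter__(self):
        return iter(self._data)

    def get(self, key, default=None):
        return self._data.get(key, default)


def iteration_error(db):
    """Return the exception type raised while iterating db, or None."""
    try:
        for _ in db:
            pass
    except TypeError as exc:
        return type(exc)
    return None


old = OldStyleDB({"spam": 1})
new = NewStyleDB({"spam": 1})
```

Iterating `old` makes Python call `old[0]`, which this sketch rejects the
same way the 2.3 wrapper did; iterating `new` goes through __iter__ and
yields the keys.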


From kennypitt at hotmail.com  Mon Dec 15 12:19:19 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Dec 15 12:19:57 2003
Subject: [spambayes-dev] sb_server UI error
Message-ID: <LAW11-OE23gtc0kALvE0000a822@hotmail.com>

Looks like a usage got missed when deprecating the extract_dow option:

"""
500 Server error

Traceback (most recent call last):

  File "spambayes\Dibbler.pyc", line 457, in found_terminator

  File "spambayes\UserInterface.pyc", line 629, in onAdvancedconfig

  File "spambayes\UserInterface.pyc", line 692, in _buildConfigPage

  File "spambayes\OptionsClass.pyc", line 563, in valid_input

KeyError: ('Tokenizer', 'extract_dow')
"""

This appears to come from the adv_map in ProxyUI.py.  The
generate_time_buckets option will likely generate the same error.
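
One way to catch this kind of leftover reference before a user hits it is
to compare the options a UI page lists against the options actually
defined.  The sketch below assumes a simplified shape (plain
(section, option) pairs); the real adv_map and options table are
structured differently, and every name here is hypothetical.

```python
def missing_options(ui_map, defined):
    """Return the (section, option) pairs the UI lists but the options lack."""
    return [(sect, opt) for sect, opt in ui_map if (sect, opt) not in defined]


# Hypothetical data: the option was renamed with an "x-" prefix when it was
# deprecated, but the UI map still uses the old name.
defined = {("Tokenizer", "x-extract_dow"), ("Tokenizer", "some_other_option")}
ui_map = [("Tokenizer", "extract_dow"), ("Tokenizer", "some_other_option")]

stale = missing_options(ui_map, defined)
```

A check like this could run in a test so that a stale entry shows up as a
test failure rather than a 500 error on the config page.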

-- 
Kenny Pitt


From barry at python.org  Mon Dec 15 12:53:26 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Dec 15 12:53:32 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
References: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
Message-ID: <1071510805.970.122.camel@anthem>

On Mon, 2003-12-15 at 04:00, Richie Hindle wrote:

> (Does anyone know of any simple Python code I can steal that uses bsddb in
> full-on multi-everything DBEnv mode?

Sorry, but that's too oxymoronic of a request to fulfill.

But you can look at ZODB's BerkeleyDB based storage code which is a good
working example of a full-on transactional BerkeleyDB application.  If
you can ignore the peculiarities of ZODB's storage API and all the
tables used to support it, the code should be helpful.

-Barry


From skip at pobox.com  Mon Dec 15 12:57:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 15 12:59:29 2003
Subject: [spambayes-dev] sb_server UI error
In-Reply-To: <LAW11-OE23gtc0kALvE0000a822@hotmail.com>
References: <LAW11-OE23gtc0kALvE0000a822@hotmail.com>
Message-ID: <16349.63007.770473.205555@montanaro.dyndns.org>


    Kenny> Looks like a usage got missed when deprecating the extract_dow
    Kenny> option:
    ...

Yeah, I wasn't aware these things were referenced anywhere but in the
Options.py and tokenizer files.  Try removing lines from ImapUI.py and
ProxyUI.py which contain extract_dow or generate_time_buckets, then start
again.  If that works, let me know and I'll check in the change.
(Deprecated options should probably not be offered in even the advanced
options page, right?)

Skip

From kennypitt at hotmail.com  Mon Dec 15 13:13:03 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Dec 15 13:13:39 2003
Subject: [spambayes-dev] sb_server UI error
In-Reply-To: <16349.63007.770473.205555@montanaro.dyndns.org>
Message-ID: <Law11-OE36u2uDoPM6B0000aa34@hotmail.com>

Skip Montanaro wrote:
>     Kenny> Looks like a usage got missed when deprecating the extract_dow
>     Kenny> option:
>     ...
> 
> Yeah, I wasn't aware these things were referenced anywhere but in the
> Options.py and tokenizer files.  Try removing lines from ImapUI.py and
> ProxyUI.py which contain extract_dow or generate_time_buckets, then
> start again.  If that works, let me know and I'll check in the change.

I don't have a setup to test ImapUI.py, but that works for ProxyUI.py.

> (Deprecated options should probably not be offered in even the
> advanced options page, right?)

Probably not, but FWIW adding the 'x-' in front of the option name in
ProxyUI.py also works.  I suppose you could make a case for leaving the
options on the config page for a release or two so users can see that
they have been deprecated.  Don't know if anyone is more likely to see
it there than in the logs, though.

-- 
Kenny Pitt


From tameyer at ihug.co.nz  Mon Dec 15 19:22:00 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 15 19:22:08 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0815@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz>

[Skip]
> I don't really care what hoops we as programmers have to jump through, 
> but I'd like users to be able to use either x- or X-.  I agree, the 
> docstrings should match required usage.  In any case, it appears Tim 
> (or someone else) has fixed things.

What happens at the moment is that ConfigParser lowercases all the option
names (but not section names; I don't know whether that's deliberate or not)
when it reads them from a file.  So users can happily use "X-" or "x-" and
by the time OptionsClass deals with them it'll be "x-".  I changed the
comments so that they all use "x-", but haven't added anything about this.

Us programmers *must* use "x-" when referring to the options at the moment.
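
The behavior described here can be checked in a few lines.  This sketch
uses the Python 3 spelling of the module (configparser, with
read_string); the module discussed in this thread is the older
ConfigParser, so treat the exact calls as an assumption about a modern
interpreter.  The section and option names are made up for the example.

```python
import configparser

INI_TEXT = """
[Classifier]
X-Experimental_Option = True
"""

parser = configparser.ConfigParser()
parser.read_string(INI_TEXT)

# Option names are folded to lower case when the file is read ...
option_names = parser.options("Classifier")

# ... and lookups fold the requested name the same way.
value = parser.get("Classifier", "X-EXPERIMENTAL_OPTION")

# Section names are left alone, so they stay case-sensitive.
has_lower_section = parser.has_section("classifier")
```

So anything that stores the names ConfigParser hands back will only ever
see the lowercase form, which is why code that looks options up under
"X-" misses them.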

[Tim]
> I'd call that a bug in OptionsClass.py, then.  ConfigParser 
> is modeled on RFC 822 header fields, and supplies 
> case-insensitive option names *because* 822 mandates 
> case-insensitive semantics.

So should get/set in our OptionsClass also be case insensitive in situations
other than reading in the config files?  At the moment options["Sect",
"Opt"] != options["Sect", "opt"], but it would be a simple enough change
(and we certainly don't have any options with the same name but differing
case).

> Tony checked in a bunch of changes, but I suspect it's still 
> case-sensitive.

I made the mistake of checking things in before I really had figured out
what was happening in ConfigParser, so half the check-ins are repairing the
other half.  It's case-sensitive *apart* from reading in the file.

=Tony Meyer


From tameyer at ihug.co.nz  Mon Dec 15 19:29:30 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 15 19:29:38 2003
Subject: [spambayes-dev] sb_server UI error
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C085C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677765@its-xchg4.massey.ac.nz>

[Kenny]
> I don't have a setup to test ImapUI.py, but that works for ProxyUI.py.

You can test ImapUI.py without actually having an IMAP connection.  Just run
"sb_imapfilter.py -b" and go to the config page.  It'll work, anyway.  I've
checked these in.

[Skip]
> (Deprecated options should probably not be offered in even the 
> advanced options page, right?)

+1 here.

> I suppose you could make a case for leaving the options on the
> config page for a release or two so users can see that they have
> been deprecated.  
> Don't know if anyone is more likely to see it there than in 
> the logs, though.

I think leaving it there is just asking for someone to set it, and that any
"x-" (experimental *or* deprecated) option shouldn't be exposed via the
regular config pages.  (OTOH, I have the basis of a web interface for
timcv.py which exposes *only* those options).

I think we need some other way of presenting the warnings.  Even just in the
status panel of the home web interface page would be better than only in the
logs.  If someone wanted to put in the effort, the tray app could also put
up a little window that pointed out that important messages were on that
page.

=Tony Meyer


From skip at pobox.com  Mon Dec 15 20:02:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 15 20:02:47 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13047C0815@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz>
Message-ID: <16350.22965.664272.622282@montanaro.dyndns.org>


    >> Tony checked in a bunch of changes, but I suspect it's still
    >> case-sensitive.

    Tony> I made the mistake of checking things in before I really had
    Tony> figured out what was happening in ConfigParser...

Join the club. :-)

Skip

From tim.one at comcast.net  Mon Dec 15 21:23:07 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 21:23:15 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes
	Options.py, 1.90, 1.91 UserInterface.py, 1.35,
	1.36 classifier.py, 1.11, 1.12
In-Reply-To: <E1AW4Yf-0004kH-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEOMHMAB.tim.one@comcast.net>

> *** UserInterface.py	11 Dec 2003 18:44:23 -0000	1.35
> --- UserInterface.py	16 Dec 2003 02:03:31 -0000	1.36
> ***************
> *** 306,309 ****
> --- 306,313 ----
>               for tok in tokens:
>                   clues.append((tok, None))
> +             # Need to regenerate the tokens (is there a way to
> +             # 'rewind' or copy a generator?  Would that be
> +             # more effecient?
> +             tokens = tokenizer.tokenize(message)
>               probability = self.classifier.spamprob(tokens)
>               cluesTable = self._fillCluesTable(clues)

Change the first line of the function to:

    tokens = list(tokenizer.tokenize(message))

There's no need to tokenize again, then.  The construction of clues can be
the one-liner:

    clues = [(tok, None) for tok in tokens]
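
The reason the list() call helps is that a generator can be walked only
once; a second pass over it yields nothing.  A small sketch, where
tokenize is a hypothetical stand-in for tokenizer.tokenize and not the
real tokenizer:

```python
def tokenize(message):
    """Hypothetical stand-in tokenizer: yields one lowercase token per word."""
    for word in message.split():
        yield word.lower()


def clues_and_count(message):
    # Materialize the tokens once so they can be walked more than once.
    tokens = list(tokenize(message))
    clues = [(tok, None) for tok in tokens]
    return clues, len(tokens)


gen = tokenize("Buy cheap stuff")
first_pass = list(gen)
second_pass = list(gen)  # the generator is already exhausted here
```

The cost is holding every token in memory at once, which is usually
acceptable for a single message.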


From tameyer at ihug.co.nz  Mon Dec 15 21:48:35 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 15 21:48:41 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes
	Options.py, 1.90, 1.91 UserInterface.py, 1.35,
	1.36 classifier.py, 1.11, 1.12
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0976@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0B@its-xchg4.massey.ac.nz>

[...]
> +             # Need to regenerate the tokens (is there a way to
> +             # 'rewind' or copy a generator?  Would that be
> +             # more efficient?
[...]
> Change the first line of the function to:
> 
>     tokens = list(tokenizer.tokenize(message))
> 
> There's no need to tokenize again, then.  The construction of 
> clues can be the one-liner:
> 
>     clues = [(tok, None) for tok in tokens]

Thanks.  The _getclues call later was expecting a generator rather than a
list, but I've fixed that too, and the code is nicer now, I think.  I've
checked this in.

=Tony Meyer


From tameyer at ihug.co.nz  Mon Dec 15 22:05:18 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 15 22:07:53 2003
Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C07AF@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677769@its-xchg4.massey.ac.nz>

> Oops... forwarded the wrong message.  This is the one I was 
> thinking of.  
> This seems severe, and I've not seen this problem pop up in the list 
> before.  I don't know how to respond.

I thought the same thing, although I'm not entirely convinced that it was
the SpamBayes installer that did this.  Also, whenever I've had a corrupted
profile, I've had to dump the entire profile, which this guy obviously
hasn't.

Presumably rolling back the registry would fix it.  If it's actually a
problem with Outlook, not the Windows profile (which seems more likely),
then Outlook's detect and repair should fix it.

Does everything work apart from Outlook?  If so, it seems highly unlikely
that it's the Windows profile that is corrupt.  If not, what is it that
fails?

As for the instructions to install the SpamBayes Outlook plug-in:
 1. Download the installer.
 2. Double-click the installer.
 3. Go through the installer prompts.
 4. You're done.

=Tony Meyer


From tim.one at comcast.net  Mon Dec 15 23:37:49 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 23:37:53 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A0A@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEPFHMAB.tim.one@comcast.net>

[Tony Meyer]
> ...
> I made the mistake of checking things in before I really had figured
> out what was happening in ConfigParser, so half the check-ins are
> repairing the other half.  It's case-sensitive *apart* from reading
> in the file.

I don't follow the distinctions being made here, but that's OK because
nobody should have to <wink>:  option names were intended to be
case-insensitive, regardless of context.  Whether they're read from .py
files, or from .ini files, or passed as arguments -- all should act the same
way.
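
A rough sketch of what "case-insensitive regardless of context" could
look like: fold the option name in one place and route every access
through it.  CaseInsensitiveOptions is a hypothetical class for
illustration only, not the OptionsClass code, and the option name is made
up.

```python
class CaseInsensitiveOptions:
    """Hypothetical sketch: option names are folded to lower case on access."""

    def __init__(self):
        self._options = {}

    @staticmethod
    def _key(sect, opt):
        # Only the option name is folded; section names are left untouched.
        return sect, opt.lower()

    def set(self, sect, opt, value):
        self._options[self._key(sect, opt)] = value

    def get(self, sect, opt):
        return self._options[self._key(sect, opt)]

    def __getitem__(self, key):
        sect, opt = key
        return self.get(sect, opt)


options = CaseInsensitiveOptions()
options.set("Classifier", "X-Some_Option", True)
```

Keeping the folding in a single helper means a new accessor cannot forget
it, which is the failure mode when each method lowercases the name on its
own.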

As things were when I last checked something in, I left this comment in
Options.py, because it explained the truth of it at that time:

    # XXX The "x-" prefix can't be "X-" instead, else it's considered
    # XXX an invalid option instead of a deprecated one.  That behavior
    # XXX doesn't match the OptionsClass comments.
    ("x-experimental_ham_spam_imbalance_adjustment", ...

It "shouldn't" make any difference there either (but it did make a
difference) whether that's spelled

    x-experimental_ham_spam_imbalance_adjustment

or

    X-experimental_ham_spam_imbalance_adjustment

or

    x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT

etc.


From tameyer at ihug.co.nz  Mon Dec 15 23:40:34 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Dec 15 23:40:43 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C09B4@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467776C@its-xchg4.massey.ac.nz>

> I don't follow the distinctions being made here, but that's OK because
> nobody should have to <wink>:  option names were intended to be
> case-insensitive, regardless of context.  Whether they're 
> read from .py files, or from .ini files, or passed as arguments -- all 
> should act the same way.

This is not the case at the moment, but I'll check in some changes in a
minute to make it so.

=Tony Meyer


From tim at fourstonesExpressions.com  Mon Dec 15 23:42:06 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 15 23:42:14 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEPFHMAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCCEPFHMAB.tim.one@comcast.net>
Message-ID: <oprz9bcgcjit6vze@mail.fourstonesExpressions.com>

On Mon, 15 Dec 2003 23:37:49 -0500, Tim Peters <tim.one@comcast.net> wrote:


> It "shouldn't" make any difference there either (but it did make a
> difference) whether that's spelled
>
>     x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT

I suspect that the reason this one didn't work as advertised is because 
"adjusyment" isn't an actual option <wink>


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From tim at fourstonesExpressions.com  Mon Dec 15 23:54:14 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Dec 15 23:54:23 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes
	OptionsClass.py, 1.19, 1.20
In-Reply-To: <E1AW78J-0002Se-00@sc8-pr-cvs1.sourceforge.net>
References: <E1AW78J-0002Se-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <oprz9bwopiit6vze@mail.fourstonesExpressions.com>

Watch the world reel now <wink>

On Mon, 15 Dec 2003 20:48:31 -0800, Tony Meyer 
<anadelonbrin@users.sourceforge.net> wrote:

> Update of /cvsroot/spambayes/spambayes/spambayes
> In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes
>
> Modified Files:
> 	OptionsClass.py
> Log Message:
> Option names are always case insensitive, no matter what.
>
> Index: OptionsClass.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v
> retrieving revision 1.19
> retrieving revision 1.20
> diff -C2 -d -r1.19 -r1.20
> *** OptionsClass.py	15 Dec 2003 09:20:33 -0000	1.19
> --- OptionsClass.py	16 Dec 2003 04:48:28 -0000	1.20
> ***************
> *** 552,586 ****
>       def display_name(self, sect, opt):
>           '''A name for the option suitable for display to a user.'''
> !         return self._options[sect, opt].display_name()
>       def default(self, sect, opt):
>           '''The default value for the option.'''
> !         return self._options[sect, opt].default()
>       def doc(self, sect, opt):
>           '''Documentation for the option.'''
> !         return self._options[sect, opt].doc()
>       def valid_input(self, sect, opt):
>           '''Valid values for the option.'''
> !         return self._options[sect, opt].valid_input()
>       def no_restore(self, sect, opt):
>           '''Do not restore this option when restoring to defaults.'''
> !         return self._options[sect, opt].no_restore()
>       def is_valid(self, sect, opt, value):
>           '''Check if this is a valid value for this option.'''
> !         return self._options[sect, opt].is_valid(value)
>       def multiple_values_allowed(self, sect, opt):
>           '''Multiple values are allowed for this option.'''
> !         return self._options[sect, opt].multiple_values_allowed()
>
>       def is_boolean(self, sect, opt):
>           '''The option is a boolean value. (Support for Python 2.2).'''
> !         return self._options[sect, opt].is_boolean()
>
>       def convert(self, sect, opt, value):
>           '''Convert value from a string to the appropriate type.'''
> !         return self._options[sect, opt].convert(value)
>
>       def unconvert(self, sect, opt):
>           '''Convert value from the appropriate type to a string.'''
> !         return self._options[sect, opt].unconvert()
>
>       def get_option(self, sect, opt):
> --- 552,586 ----
>       def display_name(self, sect, opt):
>           '''A name for the option suitable for display to a user.'''
> !         return self._options[sect, opt.lower()].display_name()
>       def default(self, sect, opt):
>           '''The default value for the option.'''
> !         return self._options[sect, opt.lower()].default()
>       def doc(self, sect, opt):
>           '''Documentation for the option.'''
> !         return self._options[sect, opt.lower()].doc()
>       def valid_input(self, sect, opt):
>           '''Valid values for the option.'''
> !         return self._options[sect, opt.lower()].valid_input()
>       def no_restore(self, sect, opt):
>           '''Do not restore this option when restoring to defaults.'''
> !         return self._options[sect, opt.lower()].no_restore()
>       def is_valid(self, sect, opt, value):
>           '''Check if this is a valid value for this option.'''
> !         return self._options[sect, opt.lower()].is_valid(value)
>       def multiple_values_allowed(self, sect, opt):
>           '''Multiple values are allowed for this option.'''
> !         return self._options[sect, opt.lower()].multiple_values_allowed()
>
>       def is_boolean(self, sect, opt):
>           '''The option is a boolean value. (Support for Python 2.2).'''
> !         return self._options[sect, opt.lower()].is_boolean()
>
>       def convert(self, sect, opt, value):
>           '''Convert value from a string to the appropriate type.'''
> !         return self._options[sect, opt.lower()].convert(value)
>
>       def unconvert(self, sect, opt):
>           '''Convert value from the appropriate type to a string.'''
> !         return self._options[sect, opt.lower()].unconvert()
>
>       def get_option(self, sect, opt):
> ***************
> *** 588,598 ****
>           if self.conversion_table.has_key((sect, opt)):
>               sect, opt = self.conversion_table[sect, opt]
> !         return self._options[sect, opt]
>
>       def get(self, sect, opt):
>           '''Get an option value.'''
> !         if self.conversion_table.has_key((sect, opt)):
> !             sect, opt = self.conversion_table[sect, opt]
> !         return self.get_option(sect, opt).get()
>
>       def __getitem__(self, key):
> --- 588,598 ----
>           if self.conversion_table.has_key((sect, opt)):
>               sect, opt = self.conversion_table[sect, opt]
> !         return self._options[sect, opt.lower()]
>
>       def get(self, sect, opt):
>           '''Get an option value.'''
> !         if self.conversion_table.has_key((sect, opt.lower())):
> !             sect, opt = self.conversion_table[sect, opt.lower()]
> !         return self.get_option(sect, opt.lower()).get()
>
>       def __getitem__(self, key):
> ***************
> *** 601,612 ****
>       def set(self, sect, opt, val=None):
>           '''Set an option.'''
> !         if self.conversion_table.has_key((sect, opt)):
> !             sect, opt = self.conversion_table[sect, opt]
>           if self.is_valid(sect, opt, val):
> !             self._options[sect, opt].set(val)
>           else:
>               print >> sys.stderr, ("Attempted to set [%s] %s with invalid"
>                                     " value %s (%s)" %
> !                                   (sect, opt, val, type(val)))
>
>       def set_from_cmdline(self, arg, stream=None):
> --- 601,612 ----
>       def set(self, sect, opt, val=None):
>           '''Set an option.'''
> !         if self.conversion_table.has_key((sect, opt.lower())):
> !             sect, opt = self.conversion_table[sect, opt.lower()]
>           if self.is_valid(sect, opt, val):
> !             self._options[sect, opt.lower()].set(val)
>           else:
>               print >> sys.stderr, ("Attempted to set [%s] %s with invalid"
>                                     " value %s (%s)" %
> !                                   (sect, opt.lower(), val, type(val)))
>
>       def set_from_cmdline(self, arg, stream=None):
> ***************
> *** 617,620 ****
> --- 617,621 ----
>           """
>           sect, opt, val = arg.split(':', 2)
> +         opt = opt.lower()
>           try:
>               val = self.convert(sect, opt, val)
> ***************
> *** 716,720 ****
>          if section is not None and option is not None:
>              output.write(self._options[section,
> !                                       option].as_nice_string(section))
>              return output.getvalue()
>
> --- 717,721 ----
>          if section is not None and option is not None:
>              output.write(self._options[section,
> !                                       option.lower()].as_nice_string(section))
>              return output.getvalue()
>
> ***************
> *** 724,728 ****
>              if section is not None and sect != section:
>                  continue
> !            output.write(self._options[sect, opt].as_nice_string(sect))
>          return output.getvalue()
>
> --- 725,729 ----
>              if section is not None and sect != section:
>                  continue
> !            output.write(self._options[sect, opt.lower()].as_nice_string(sect))
>          return output.getvalue()
>


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From tim.one at comcast.net  Mon Dec 15 23:55:51 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 15 23:55:55 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <oprz9bcgcjit6vze@mail.fourstonesExpressions.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEPHHMAB.tim.one@comcast.net>

>> It "shouldn't" make any difference there either (but it did make a
>> difference) whether that's spelled
>>
>>     x-ExPeRiMeTtAl_HaM_sPaM_iMbaLaNcE_aDjUsYmEnT

[Tim Stone]
> I suspect that the reason this one didn't work as advertised is
> because "adjusyment" isn't an actual option <wink>

Good eye!  As I meant to say the first time, option names were meant to be
case-insensitive regardless of context, *and* to be insensitive to any
substitutions of the letters in "Timmy".  So, for example,

    adjustment

and

    adjusiieny

are also the same option.  Generalization to Unicode is left as an exercise.


From tim at fourstonesExpressions.com  Tue Dec 16 00:05:19 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 00:05:26 2003
Subject: [spambayes-dev] Nuke experimental_ham_spam_imbalance_adjustment?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEPHHMAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCEEPHHMAB.tim.one@comcast.net>
Message-ID: <oprz9ce5q2it6vze@mail.fourstonesExpressions.com>

On Mon, 15 Dec 2003 23:55:51 -0500, Tim Peters <tim.one@comcast.net> wrote:

> Good eye!  As I meant to say the first time, option names were meant to be
> case-insensitive regardless of context, *and* to be insensitive to any
> substitutions of the letters in "Timmy".  So, for example,

Well, at least you didn't use the hated (and more redundant) "Timmie" 
form... but I suppose you suffered as much of that type of abuse as I did 
<wink>.

>
>     adjustment
>
> and
>
>     adjusiieny

I wonder if there are encryption possibilities here...

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From skip at pobox.com  Tue Dec 16 01:33:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 16 01:33:17 2003
Subject: [spambayes-dev] one bigram nit
Message-ID: <16350.42801.651892.388851@montanaro.dyndns.org>


I see one compatibility problem with the bigram stuff.  We currently have a
key in the database called 'saved state' which stores a tuple: (db version,
spamcount, hamcount).  If that is ever generated as a bigram the database
will get hosed.  If backwards compatibility is an issue you might want to
choose a different bigram connector than ' '.  If backwards compatibility
isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value
for the state key, probably a non-string object.
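
A minimal sketch of the collision described above, written in modern Python. The key name 'saved state' comes from this message; the `bigrams` helper and the alternative connector '\x00' are assumptions made for illustration, not the project's actual code.

```python
STATE_KEY = 'saved state'  # special database key named in the message above


def bigrams(tokens, connector=' '):
    """Join each adjacent pair of tokens with the connector string.

    Hypothetical helper; the real bigram code may differ.
    """
    return [a + connector + b for a, b in zip(tokens, tokens[1:])]


tokens = "we saved state before exiting".split()

# With a space connector, ordinary text can synthesize the special key.
print(STATE_KEY in bigrams(tokens))

# With a connector that cannot occur in the key, no bigram can equal it.
print(STATE_KEY in bigrams(tokens, connector='\x00'))
```

Any connector (or prefix) that never appears in the state key would avoid the clash; the choice of '\x00' here is arbitrary.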

Skip

From kennypitt at hotmail.com  Tue Dec 16 10:17:16 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 16 10:17:56 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes
	OptionsClass.py, 1.19, 1.20
In-Reply-To: <E1AW78J-0002Se-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <LAW11-OE26S6xopZz8Q0000b81a@hotmail.com>

I was looking at this same code a little bit yesterday, and one thing
struck me as odd.  The get(), get_option(), and set() functions use
self.conversion_table to translate the requested option, but many of the
other functions such as display_name() don't.  Is there a reason for
that?  If not, I was wondering if it would be easier to just fix
get_option() for case-insensitivity and then have all the other getter
functions call it instead of accessing self._options directly.

Also note that the self.conversion_table translation in get() is already
redundant, as it then calls get_option() which will do the exact same
translation.
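
A rough sketch of that idea in modern Python, assuming a much simplified stand-in for the real class: `Option`, `Options.add`, and the "Demo" section and option names are hypothetical, and the conversion table is assumed to be keyed on lowercase option names.

```python
class Option:
    """Hypothetical stand-in for a single option object."""

    def __init__(self, value, display_name):
        self._value = value
        self._display_name = display_name

    def get(self):
        return self._value

    def display_name(self):
        return self._display_name


class Options:
    """Simplified sketch: every getter funnels through get_option()."""

    def __init__(self):
        self._options = {}
        self.conversion_table = {}  # (sect, old_opt) -> (sect, new_opt)

    def add(self, sect, opt, option):
        self._options[sect, opt.lower()] = option

    def get_option(self, sect, opt):
        # The only place that lowercases and applies the conversion table.
        opt = opt.lower()
        if (sect, opt) in self.conversion_table:
            sect, opt = self.conversion_table[sect, opt]
        return self._options[sect, opt]

    def get(self, sect, opt):
        return self.get_option(sect, opt).get()

    def display_name(self, sect, opt):
        return self.get_option(sect, opt).display_name()


options = Options()
options.add("Demo", "new_name", Option(42, "New name"))
options.conversion_table["Demo", "old_name"] = ("Demo", "new_name")

print(options.get("Demo", "NEW_NAME"))
print(options.display_name("Demo", "Old_Name"))
```

With this layout, case handling and name translation live in one method, so a getter such as display_name() cannot drift out of step with get().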

Tony Meyer wrote:
> Update of /cvsroot/spambayes/spambayes/spambayes
> In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes
> 
> Modified Files:
> 	OptionsClass.py
> Log Message:
> Option names are always case insensitive, no matter what.
> 
> Index: OptionsClass.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/spambayes/OptionsClass.py,v
> retrieving revision 1.19
> retrieving revision 1.20
> diff -C2 -d -r1.19 -r1.20
> *** OptionsClass.py	15 Dec 2003 09:20:33 -0000	1.19
> --- OptionsClass.py	16 Dec 2003 04:48:28 -0000	1.20
> ***************
> *** 552,586 ****
>       def display_name(self, sect, opt):
>           '''A name for the option suitable for display to a user.'''
> !         return self._options[sect, opt].display_name()
>[snip]
>       def get_option(self, sect, opt):
> --- 552,586 ----
>       def display_name(self, sect, opt):
>           '''A name for the option suitable for display to a user.'''
> !         return self._options[sect, opt.lower()].display_name()
>[snip]
>       def get_option(self, sect, opt):
> ***************
> *** 588,598 ****
>           if self.conversion_table.has_key((sect, opt)):
>               sect, opt = self.conversion_table[sect, opt]
> !         return self._options[sect, opt]
> 
>       def get(self, sect, opt):
>           '''Get an option value.'''
> !         if self.conversion_table.has_key((sect, opt)):
> !             sect, opt = self.conversion_table[sect, opt]
> !         return self.get_option(sect, opt).get()
> 
>       def __getitem__(self, key):
> --- 588,598 ----
>           if self.conversion_table.has_key((sect, opt)):
>               sect, opt = self.conversion_table[sect, opt]
> !         return self._options[sect, opt.lower()]
> 
>       def get(self, sect, opt):
>           '''Get an option value.'''
> !         if self.conversion_table.has_key((sect, opt.lower())):
> !             sect, opt = self.conversion_table[sect, opt.lower()]
> !         return self.get_option(sect, opt.lower()).get()
> 
>       def __getitem__(self, key):
> ***************
> *** 601,612 ****
>       def set(self, sect, opt, val=None):
>           '''Set an option.'''
> !         if self.conversion_table.has_key((sect, opt)):
> !             sect, opt = self.conversion_table[sect, opt]
>           if self.is_valid(sect, opt, val):
> !             self._options[sect, opt].set(val)
>           else:
>               print >> sys.stderr, ("Attempted to set [%s] %s with invalid"
>                                     " value %s (%s)" %
> !                                   (sect, opt, val, type(val)))
> 
>       def set_from_cmdline(self, arg, stream=None):
> --- 601,612 ----
>       def set(self, sect, opt, val=None):
>           '''Set an option.'''
> !         if self.conversion_table.has_key((sect, opt.lower())):
> !             sect, opt = self.conversion_table[sect, opt.lower()]
>           if self.is_valid(sect, opt, val):
> !             self._options[sect, opt.lower()].set(val)
>           else:
>               print >> sys.stderr, ("Attempted to set [%s] %s with invalid"
>                                     " value %s (%s)" %
> !                                   (sect, opt.lower(), val, type(val)))
> 
>       def set_from_cmdline(self, arg, stream=None):

-- 
Kenny Pitt


From tim.one at comcast.net  Tue Dec 16 12:10:32 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 16 12:10:43 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk email
	folder.
In-Reply-To: <000601c3c3f3$ff64ef20$1014a8c0@station16>
Message-ID: <BIEJKCLHCIOIHAGOKOLHGEGEHHAA.tim.one@comcast.net>

[from the spambayes list]
> We use Spambayes in my company with great success, and have come
> across only one bug, which I have not found listed. Since this has
> happened to all three of us using Spambayes, I was surprised to not
> find it in the troubleshooting guide.
>
> After the user accidentally deleted the Junk email folder or the Junk
> Suspect folder, I created new ones, but Spambayes would not filter to
> them.
> ...

I wonder whether the Outlook addin should stop trying to remember Outlook's
internal folder IDs, remember the user-visible string paths instead, and
enumerate the folders to (re)discover the internal Outlook IDs "whenever
anything may have changed".  It's hard to explain that creating a folder
with the same name in the same place doesn't create a folder with the same
name in the same place <wink>.


From tim at fourstonesExpressions.com  Tue Dec 16 12:19:44 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 12:19:50 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk email
	folder.
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHGEGEHHAA.tim.one@comcast.net>
References: <BIEJKCLHCIOIHAGOKOLHGEGEHHAA.tim.one@comcast.net>
Message-ID: <opr0aae6r9it6vze@mail.fourstonesExpressions.com>

On Tue, 16 Dec 2003 12:10:32 -0500, Tim Peters <tim.one@comcast.net> wrote:

> I wonder whether the Outlook addin should stop trying to remember Outlook's
> internal folder IDs, remember the user-visible string paths instead, and
> enumerate the folders to (re)discover the internal Outlook IDs "whenever
> anything may have changed".  It's hard to explain that creating a folder
> with the same name in the same place doesn't create a folder with the same
> name in the same place <wink>.

This problem seems to crop up a LOT.  I don't know if it's possible to do 
what you say, but I think this is gonna continue to be an Achilles' heel 
for the plugin unless we do *something* about it...


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From popiel at wolfskeep.com  Tue Dec 16 12:41:48 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 16 12:41:52 2003
Subject: [spambayes-dev] one bigram nit 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Tue,
	16 Dec 2003 00:33:21 CST."
	<16350.42801.651892.388851@montanaro.dyndns.org> 
References: <16350.42801.651892.388851@montanaro.dyndns.org> 
Message-ID: <20031216174148.B737B2DF7F@cashew.wolfskeep.com>

In message:  <16350.42801.651892.388851@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>I see one compatibility problem with the bigram stuff.  We currently have a
>key in the database called 'saved state' which stores a tuple: (db version,
>spamcount, hamcount).  If that is ever generated as a bigram the database
>will get hosed.  If backwards compatibility is an issue you might want to
>choose a different bigram connector than ' '.  If backwards compatibility
>isn't a big deal, I'd bump the PICKLE_VERSION value and choose another value
>for the state key, probably a non-string object.

I'd actually take a different approach: we should prefix all "natural"
tokens (defined elsewhere as those tokens generated by the whitespace
split over the message body) with "body:", so that text in the body
cannot conflict with our synthetic tokens of any flavor.  As it stands,
I think that the words url:python and url:org would get confused with
parts of http://python.org, just because we don't have any protections
for naturals aliasing synthetics...
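
A small sketch of the aliasing and of the proposed fix, in modern Python. The two tokenizer functions are hypothetical simplifications made up for this illustration; only the "body:" and "url:" prefixes come from the discussion.

```python
def tokenize_body(text, prefix=""):
    """Whitespace-split body text, optionally namespacing each token."""
    return [prefix + word for word in text.split()]


def tokenize_url(url):
    """Crude synthetic URL tokens: 'url:' plus each dotted or slashed piece."""
    rest = url.split("://", 1)[-1]
    return ["url:" + piece for piece in rest.replace("/", ".").split(".") if piece]


synthetic = set(tokenize_url("http://python.org"))
body_text = "someone typed url:python and url:org by hand"

# Without a prefix, natural tokens alias the synthetic ones.
print(synthetic & set(tokenize_body(body_text)))

# With a "body:" prefix, the two namespaces cannot overlap.
print(synthetic & set(tokenize_body(body_text, prefix="body:")))
```

Prefixing every natural token changes every key in an existing database, which is why this approach implies retraining.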

Backwards compatibility is overrated; retraining is easy.

- Alex

From tim at fourstonesExpressions.com  Tue Dec 16 12:42:56 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 12:43:04 2003
Subject: [spambayes-dev] Fwd: Re: [Spambayes] RE: Spambayes Digest, Vol 64,
	Issue 68
In-Reply-To: <opr0aavcjjit6vze@mail.fourstonesExpressions.com>
References: <GAEAKPFGBHJFLBMBDCGFEEOPDOAA.akiva@atwood.co.il>
	<opr0aavcjjit6vze@mail.fourstonesExpressions.com>
Message-ID: <opr0abhubwit6vze@mail.fourstonesExpressions.com>

Gosh, I still get goofed up between sending to spambayes and 
spambayes-dev... I'm livin in the past...

------- Forwarded message -------
From: Tim Stone <tim@fourstonesExpressions.com>
To: akiva@atwood.co.il, spambayes@python.org
Subject: Re: [Spambayes] RE: Spambayes Digest, Vol 64, Issue 68
Date: Tue, 16 Dec 2003 11:29:26 -0600

> On Tue, 16 Dec 2003 19:20:55 +0200, Akiva Atwood <akiva@atwood.co.il> 
> wrote:
>
>>> Is anyone else having problems with these types of spams recently? Has
>>> some prolific spammer changed tactics? Most of the ones I've seen seem
>>> to originate from Australia or Asia.
>>
>> I've been getting a lot of them. I thought there was a problem with MY
>> filter, and reinstalled it.
>
> This might be well dealt with by changing the unknown word probability 
> to indicate a stronger spamminess.  By default, it's .5, iirc.  Perhaps 
> we should do some experiments with pushing it to .6 or .7.  My corpus 
> has virtually none of these spams, so I can't say what would happen, and 
> I imagine that our test corpus has relatively few of them as well.  
> Comments anyone?
>


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From tim at fourstonesExpressions.com  Tue Dec 16 14:58:17 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 14:58:29 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
Message-ID: <opr0ahrfsqit6vze@mail.fourstonesExpressions.com>

How about this for a "testing" regimen... one of us can send a known list 
of spambayes users a series of "spams," with the idea being to see how 
many of them can get through existing databases, and how long it takes 
their databases to learn to correctly classify them?  Would that be an 
interesting exercise?

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From kennypitt at hotmail.com  Tue Dec 16 15:13:11 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 16 15:13:49 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
In-Reply-To: <opr0ahrfsqit6vze@mail.fourstonesExpressions.com>
Message-ID: <Law11-OE39mzvq9eSgN0000bbbf@hotmail.com>

Tim Stone wrote:
> How about this for a "testing" regimen... one of us can send a known
> list of spambayes users a series of "spams," with the idea being to
> see how many of them can get through existing databases, and how long
> it takes their databases to learn to correctly classify them?  Would
> that be an interesting exercise?

Interesting idea, but wouldn't it be tricky to make your pseudo-spams
representative of real-world spam patterns?  For example, it seems like
whatever e-mail address and/or SMTP server you use to send the messages
would quickly become a significant spam clue.

-- 
Kenny Pitt


From tim at fourstonesExpressions.com  Tue Dec 16 15:18:47 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 16 15:19:45 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer?
In-Reply-To: <Law11-OE39mzvq9eSgN0000bbbf@hotmail.com>
References: <Law11-OE39mzvq9eSgN0000bbbf@hotmail.com>
Message-ID: <opr0aiplr5it6vze@mail.fourstonesExpressions.com>

On Tue, 16 Dec 2003 15:13:11 -0500, Kenny Pitt <kennypitt@hotmail.com> 
wrote:

> Tim Stone wrote:
>> How about this for a "testing" regimen... one of us can send a known
>> list of spambayes users a series of "spams," with the idea being to
>> see how many of them can get through existing databases, and how long
>> it takes their databases to learn to correctly classify them?  Would
>> that be an interesting exercise?
>
> Interesting idea, but wouldn't it be tricky to make your pseudo-spams
> representative of real-world spam patterns?  For example, it seems like
> whatever e-mail address and/or SMTP server you use to send the messages
> would quickly become a significant spam clue.

Yeah, those could be some challenges.  I'm not convinced of the 
usefulness of the idea, but it *could* give us a leg up on spam as it 
evolves.  I dunno, maybe it can't evolve fast enough to fool us for long, 
but...

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From richie at entrian.com  Tue Dec 16 15:28:31 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 16 15:28:39 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
References: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
Message-ID: <pcqutv0l44ta82kg04srrdbnpvdjjkab14@4ax.com>


[Tim]
> It would help if you tried 2.3.3c1.

Your code works under 2.3.3c1, but still lists 304 (yes, 304 - just under
a year until I'm clear 8-) broken tokens.

[Tim]
> Best I can suggest is studying Python's bsddb3 substantial test suite.

[Barry]
> you can look at ZODB's BerkeleyDB based storage code which is a good
> working example of a full-on transactional BerkeleyDB application.

Thanks guys - if and when I get the chance, I'll have a look.

Unless there's a Python-savvy lurker out there who'd like to take on a
smallish, fairly well-spec'd and potentially very important SpamBayes
development task?  IMHO this is the only bug that's preventing sb_server
from entering beta status.

-- 
Richie Hindle
richie@entrian.com


From barry at python.org  Tue Dec 16 15:41:44 2003
From: barry at python.org (Barry Warsaw)
Date: Tue Dec 16 15:41:46 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <pcqutv0l44ta82kg04srrdbnpvdjjkab14@4ax.com>
References: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
	<pcqutv0l44ta82kg04srrdbnpvdjjkab14@4ax.com>
Message-ID: <1071607304.7979.39.camel@geddy>

On Tue, 2003-12-16 at 15:28, Richie Hindle wrote:

> Unless there's a Python-savvy lurker out there who'd like to take on a
> smallish, fairly well-spec'd and potentially very important SpamBayes
> development task?  IMHO this is the only bug that's preventing sb_server
> from entering beta status.

I really wish I had the time.  But I'll help play "consultant" on
BerkeleyDB stuff.

-Barry


From popiel at wolfskeep.com  Tue Dec 16 15:46:08 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 16 15:46:12 2003
Subject: [spambayes-dev] Who wants to pretend to be a spammer? 
In-Reply-To: Message from Tim Stone <tim@fourstonesExpressions.com> of "Tue,
	16 Dec 2003 14:18:47 CST."
	<opr0aiplr5it6vze@mail.fourstonesExpressions.com> 
References: <Law11-OE39mzvq9eSgN0000bbbf@hotmail.com>
	<opr0aiplr5it6vze@mail.fourstonesExpressions.com> 
Message-ID: <20031216204608.C52152DF7F@cashew.wolfskeep.com>

In message:  <opr0aiplr5it6vze@mail.fourstonesExpressions.com>
             Tim Stone <tim@fourstonesExpressions.com> writes:
>On Tue, 16 Dec 2003 15:13:11 -0500, Kenny Pitt <kennypitt@hotmail.com>
>wrote:
>
>> Interesting idea, but wouldn't it be tricky to make your pseudo-spams
>> representative of real-world spam patterns?  For example, it seems like
>> whatever e-mail address and/or SMTP server you use to send the messages
>> would quickly become a significant spam clue.
>
>Yeah, those could be some challenges.  I'm not convinced of the
>usefulness of the idea, but it *could* give us a leg up on spam as it
>evolves.  I dunno, maybe it can't evolve fast enough to fool us for long,
>but...

Those would be the same challenges that the initial testing had
with the multi-source corpora (where significant spam all came
from one source and significant ham all came from a different place)...
which is why headers were almost completely ignored for the first
six months or so of development.

A good first approximation of returning to that would be to turn
off all the from/to/received/msgid header parsing.

Responding to the idea (someone emulating a spammer): wouldn't it
be easier to just distribute a corpus of spam, and have people
grab it and test it against their databases?

- Alex

From richie at entrian.com  Tue Dec 16 17:01:47 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 16 17:02:00 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071607304.7979.39.camel@geddy>
References: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
	<pcqutv0l44ta82kg04srrdbnpvdjjkab14@4ax.com>
	<1071607304.7979.39.camel@geddy>
Message-ID: <t40vtv47grl885qpd9iuu0f883nlhh02hd@4ax.com>


[Barry]
> I really wish I had the time.  But I'll help play "consultant" on
> BerkeleyDB stuff.

If I find the time for this project, I might just make you regret saying
that.  8-)

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Tue Dec 16 21:40:53 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 16 21:41:00 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEPNHMAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECPHNAB.tim.one@comcast.net>

[Tim, on the spambayes list, about x-use_bigrams in CVS]
> I see that it's a cruder approximation to the suggested scoring
> algorithm (which I implemented at one time).  For example ...

I checked in the intended implementation.  Here's the checkin comment:

    Implemented the intended "tiling" version of x-use_bigrams.  Tried
    to restore most of the speed lost when this option *isn't* in use.

    Will add comments later.

    Anyone using x-use_bigrams needs to retrain:  synthesized bigrams
    now begin with a "bi:" prefix.

Skip, that last point addresses your (good!) concern about ambiguity wrt the
special 'saved state' key.

Here's what I've found so far.  My main personal database is currently
trained on 474 ham and 489 spam, using mostly mistake-and-unsure-based
training, with a spam cutoff of 95 and a ham cutoff of 4 (yup, those are
extreme -- I've been experimenting).

Database size (a bsddb3 hash database):

    without x-use_bigrams   2,544KB
    with x-use_bigrams     10,288KB

That's a major size boost, and (of course) is expected (bigrams create fat
hapaxes at a prodigious rate).

There's no reason to suppose that the selection of training ham and spam
based on mistake-and-unsure training from a unigram-only classifier makes
much sense for a mixed uni+bi-gram classifier; to the contrary, the latter
almost certainly has different strengths and weaknesses.

An example of that is the highest scoring ham in my inbox.  Because I had
previously put copies of some of those into my ham training data, back when
my ham cutoff was 20, without x-use_bigrams no message in my inbox today
scores above 20.  These are the worst:

     6  6  6  7  7  7  7  8  8  8  9  9  9 12 13 13 14 16

After retraining on the same training sets with x-use_bigrams, then
rescoring my inbox, the highest-scoring ham in my inbox are worse:

     7  8  8  9 10 12 13 13 13 13 16 22 25 31 34 38 45 49

I'm confident that this is an artifact of using training sets based on
picking on the weakest performance of a different scoring strategy, and that
had I been using train-on-everything all along, that result would have been
very different.

There's an interesting example in the other direction too:  the last time I
started over from scratch, I left one Unsure in my Unsure folder, and have
kept it there ever since.  It's a long and chatty spam, about a topic I even
have some interest in (no, my wang already has carpet burns <wink>), and I
wanted to see how mistake-based training changed its score over time.  It
drifted slowly upward all along, from the low 40s to the low 80s.  Under
x-use_bigrams, though, the score zoomed to 95.34.

The difference is high-scoring bigrams that appeared in a few other spam:

'bi:any questions,'                 0.908163            0      2
'bi:website at:'                    0.908163            0      2
'bi:visit our'                      0.931987            1     17
'bi:create your'                    0.934783            0      3
'bi:than years'                     0.934783            0      3

"than years" is a peculiar one, eh?!  The original text was

    ... more than 30 years ago ...

and we skipped "30" because it's shorter than 3 characters.
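
A toy reconstruction of how that bigram can arise, in modern Python. Only the "bi:" prefix and the rule that words shorter than 3 characters are skipped come from this message; the `words` and `bigrams` helpers are simplified assumptions and do not model the real tokenizer or the tiling scoring.

```python
def words(text, min_len=3):
    """Keep whitespace-separated words of at least min_len characters."""
    return [w for w in text.split() if len(w) >= min_len]


def bigrams(tokens):
    """Pair each surviving token with its successor, adding the 'bi:' prefix."""
    return ["bi:%s %s" % (a, b) for a, b in zip(tokens, tokens[1:])]


print(bigrams(words("more than 30 years ago")))
# ['bi:more than', 'bi:than years', 'bi:years ago']
```

Because "30" is dropped before pairing, "than" and "years" become neighbours even though they were never adjacent in the text.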

So, conclusions for now:

+ x-use_bigrams is going to bloat your database bigtime.

+ If you use train-on-everything, and want to try it, no problem.

+ If you're doing mistake-based training and want to try it, probably
  best to start over from scratch.

+ I believe that mistake-based training under this method is likely
  to be substantially more brittle than mistake-based training under
  the (still default) unigram-only scheme, because it's even more
  hapax-driven (synthesizing bigrams creates many more hapaxes).

+ OTOH, bigrams are better at recognizing the language of advertising.
  For example, "bi:website at:" is more clearly a "call to action" than
  either "website" or "at:".


From hooft at o2w.nl  Wed Dec 17 00:35:43 2003
From: hooft at o2w.nl (hooft@o2w.nl)
Date: Wed Dec 17 00:35:48 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAELFHMAB.tim.one@comcast.net>
References: <Law11-OE738sNcxqkwT0000a9a0@hotmail.com>
	<LNBBLJKPBEHFEDALKOLCAELFHMAB.tim.one@comcast.net>
Message-ID: <43800.80.126.9.240.1071639343.squirrel@secure.o2w.nl>

> [Kenny Pitt]
>> I get the same results as Tim using the 2.3.2 final version: Python
>> 2.3.2 (#49, Oct  2 2003, 20:02:00) [MSC v.1200 32 bit (Intel)] on
>> win32
>>
>> In my 2.3.2 lib, the "return self.db[key]" line in __getitem__ is on
>> line 116 of __init__.py, not line 86 as in Richie's traceback.  I
>> could expect some changes between Python 2.3 and 2.3.2, but 30 lines
>> seems a bit much between minor bugfix releases.  Is that possibly an
>> indicator of a bsddb version mismatch?
>
> It's more an indicator of bugs in 2.3's bsddb support.  __init__.py was
> at rev 1.5 in the 2.3 release, and is at rev 1.12(!) today:
>
> http://cvs.sf.net/viewcvs.py/python/python/dist/src/Lib/bsddb/__init__.py
>
> I see that support for the iterator and mapping protocols wasn't added
> until rev 1.6, which is why they don't work for Richie in 2.3 final.

Imagine the people like me that are using Python2.2 on the systems of
their ISPs:
Python 2.2.2 (#1, Oct 26 2002, 20:34:17)
[GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-108.7.2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import shelve
>>> d=shelve.open('.hammiedb')
>>> [k for k in d]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.2/shelve.py", line 70, in __getitem__
    f = StringIO(self.dict[key])
TypeError: key type must be string
>>>
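
The traceback is consistent with the 2.2 shelf having no iterator support, so that the list comprehension falls back to calling __getitem__ with integer indexes; that reading is an inference from the traceback, not something verified against the 2.2 source. A sketch of listing keys through keys() instead, written in modern Python with a throwaway shelf (the file name and contents are made up):

```python
import os
import shelve
import tempfile


def shelf_keys(path):
    """Return the shelf's keys via keys() rather than by iterating the shelf."""
    d = shelve.open(path)
    try:
        return sorted(d.keys())
    finally:
        d.close()


# Build a small demonstration shelf in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "demo_shelf")
d = shelve.open(path)
d["spam"] = 1
d["ham"] = 2
d.close()

print(shelf_keys(path))
```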

Regards,

Rob Hooft


From jtech at hyperionmail.com  Wed Dec 17 00:57:42 2003
From: jtech at hyperionmail.com (My Tech)
Date: Wed Dec 17 00:58:13 2003
Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My
	Profile
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677769@its-xchg4.massey.ac.nz>
Message-ID: <005301c3c462$b01cc360$1e02a8c0@JDi8000>

Hi Guys -

Just to provide some more detail (but unfortunately I don't think it will
help solve the mystery of what happened)...

First, I ran Outlook's detect and repair.  No change.  Second, I uninstalled
and reinstalled my Office XP application (which includes Outlook).  No
change.  Third, I created a new profile in Outlook to see if that would make
a difference.  Nope.  Still no change.  Fourth, I created a new Windows
profile/user with administrative rights.  Doing this, Outlook opened without
a problem.  Didn't re-install SpamBayes for fear of making a bad problem
worse.

I can't roll back the registry because I'm running Windows 2000, not XP
(unless you know something about 2000 functionality that I don't.)  Also, I
was logged in as Administrator when I installed SpamBayes (and subsequently
encountered the Outlook problem), so I couldn't dump this profile.

Everything else seems to work fine, except for Outlook, so my first guess
was that it was an Outlook problem.  However, considering that I tried
Outlook detect & repair and then uninstalled/reinstalled Office XP with no
resulting change, I'm left to conclude that it's a Windows profile problem.

I'm really at a loss to know what to do, outside of reinstalling the OS and
all of my software (a long and tedious process that I'm not looking forward
to).

Any further advice would be greatly appreciated.

Thanks.

-----Original Message-----
From: Tony Meyer [mailto:tameyer@ihug.co.nz] 
Sent: Monday, December 15, 2003 10:05 PM
To: 'Tim Stone'; spambayes-dev@python.org
Cc: jtech@hyperionmail.com
Subject: RE: [spambayes-dev] Fwd: [Spambayes] SpamBayes Corrupted My Profile


> Oops... forwarded the wrong message.  This is the one I was
> thinking of.  
> This seems severe, and I've not seen this problem pop up in the list 
> before.  I don't know how to respond.

I thought the same thing, although I'm not entirely convinced that it was
the SpamBayes installer that did this.  Also, whenever I've had a corrupted
profile, I've had to dump the entire profile, which this guy obviously
hasn't.

Presumably rolling back the registry would fix it.  If it's actually a
problem with Outlook, not the Windows profile (which seems more likely),
then Outlook's detect and repair should fix it.

Does everything work apart from Outlook?  If so, it seems highly unlikely
that it's the Windows profile that is corrupt.  If not, what is it that
fails?

As for the instructions to install the SpamBayes Outlook plug-in:  1.
Download the installer.  2. Double-click the installer.  3. Go through the
installer prompts.  4. You're done.

=Tony Meyer


From tim.one at comcast.net  Wed Dec 17 01:15:50 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 01:15:52 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <43800.80.126.9.240.1071639343.squirrel@secure.o2w.nl>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEDOHNAB.tim.one@comcast.net>

[hooft@o2w.nl]
> Imagine the people like me that are using Python2.2 on the systems of
> their ISPs:
> Python 2.2.2 (#1, Oct 26 2002, 20:34:17)
> [GCC 2.96 20000731 (Red Hat Linux 7.2 2.96-108.7.2)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import shelve
> >>> d=shelve.open('.hammiedb')
> >>> [k for k in d]
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "/usr/local/lib/python2.2/shelve.py", line 70, in __getitem__
>     f = StringIO(self.dict[key])
> TypeError: key type must be string
> >>> 

OK, I did.  Now what <wink>?

From mhammond at skippinet.com.au  Wed Dec 17 01:45:41 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Dec 17 01:46:00 2003
Subject: [spambayes-dev] pop3proxy_tray icons
In-Reply-To: <Law11-OE48VKW447p9N0000289d@hotmail.com>
Message-ID: <002801c3c469$66671e30$2c00a8c0@eden>

Better late than never :)

> I don't know if anyone else has noticed this or not, but on my Windows
> 2000 system the green and red circles in the current pop3proxy_tray
> icons are very difficult to make out.  I created the attached 
> icons as a
> possible alternative.  They are basic 16-color icons and show up quite
> nicely on both Windows 2000 and Windows XP.

I agree.

> The attached patch is also required because the LoadImage 
> calls pass 0,0
> for the icon size.  That loads the icon using the default 32x32 size,
> scaling a 16x16 icon up to 32x32 if necessary.  Since icons 
> in the tray
> are only 16x16, they then get scaled back down when displayed 
> and still
> end up looking bad.

Excellent!

> I also attached an alternate sbicon that I created in the 
> spirit of the
> icons in the Web UI.  It uses the envelope icon from the 
> Wingdings font
> with the same blue outline color used in the UI icons.  I modified my
> py2exe\setup_all.py to use this as the icon for all the 
> generated exe's.

I haven't done that :)

I've checked it all in.

Thanks,

Mark.


From anthony at interlink.com.au  Wed Dec 17 01:47:13 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed Dec 17 01:47:28 2003
Subject: [spambayes-dev] Re: Auto-response for your message to the
	"Spambayes" mailing list 
In-Reply-To: <mailman.789.1071643468.9308.spambayes@python.org> 
Message-ID: <200312170647.hBH6lDCV008087@localhost.localdomain>


A whole bunch of the header lines in the spambayes autoresponse 
are being included on the line with the header. Can someone fix,
or else do whatever's necessary for me to be able to fix it?

>>> spambayes-bounces@python.org wrote
> READ THIS!  (If you want help.)
> 
> This is an automated response to an email message you sent to the
> spambayes@python.org mailing list.  Please read this message carefully
> to see if it answers your question(s).
> 
> 
> Before you do anything else: ----------------------------
> 
> Before asking a question on the list, please take a moment and check
> the frequently asked questions page:
> 
>     http://spambayes.sourceforge.net/faq.html
> 
> 
> What is Spambayes? ------------------
> 

[snip]


From tameyer at ihug.co.nz  Wed Dec 17 04:21:45 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 17 04:21:53 2003
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/testtools
	urlslurper.py, 1.6, NONE
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0D79@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677770@its-xchg4.massey.ac.nz>

Oops.  The comment window had scrolled down and I didn't notice.  Only the
last line should be there in the comments for this.

> -----Original Message-----
> From: spambayes-checkins-bounces@python.org 
> [mailto:spambayes-checkins-bounces@python.org] On Behalf Of Tony Meyer
> Sent: Wednesday, 17 December 2003 10:17 p.m.
> To: spambayes-checkins@python.org
> Subject: [Spambayes-checkins] spambayes/testtools 
> urlslurper.py,1.6,NONE
> 
> 
> Update of /cvsroot/spambayes/spambayes/testtools
> In directory sc8-pr-cvs1:/tmp/cvs-serv671/testtools
> 
> Removed Files:
> 	urlslurper.py 
> Log Message:
> Add the basis of a new experimental (and highly debatable) 
> option to 'slurp' URLs.
> 
> This is based on the urlslurper.py script in the testtools 
> directory, which in turn
> was based on Richard Jowsey's URLSlurper.java.
> 
> Basically, when the option is enabled, instead of just 
> tokenizing the URLs in a message,
> we also retrieve the content at that address (if it's not 
> text, we ignore it).
> 
> When classifying, if the message has a 'raw' score in the 
> unsure range, and if the
> number of tokens is less than max_discriminators, and adding 
> these 'slurped' tokens
> would push the message into the ham/spam range, then they are used.
> 
> This isn't necessary anymore; use the experimental 
> URLRetriever options
> instead.
> 
> --- urlslurper.py DELETED ---
> 
> 
> 


From mhammond at skippinet.com.au  Wed Dec 17 06:44:31 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Dec 17 06:44:48 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk
	emailfolder.
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHGEGEHHAA.tim.one@comcast.net>
Message-ID: <009e01c3c493$2489c0b0$2c00a8c0@eden>

> I wonder whether the Outlook addin should stop trying to
> remember Outlook's
> internal folder IDs, remember the user-visible string paths
> instead, and
> enumerate the folders to (re)discover the internal Outlook
> IDs "whenever
> anything may have changed".

I'm not sure what you had in mind for "anything may have changed", but in
general, I agree.  I always had the idea that we would also store the FQN,
and fall back to that when necessary, making the folder ID more a "cached"
value.  It just never happened.  It does get complex though - what happens
when the user renames the folder?  Before you know it, we have even more
cruft that no one really understands why it is there <wink>

Another alternative would be to change things so that most errors
re-displayed the config wizard.  Of course, 0.81 has a bug in the config
wizard that relates directly to deleted folders <frown>, but otherwise, it
seems a reasonable approach.  If the config wizard also detected "we are
probably trained OK", and allowed you to continue without retraining (really
just a checkbox and 20 LOC), that whole process should take under a minute.

Either way, I'm going for a new combined binary before this even gets a look
in <wink>

Mark.


From mhammond at skippinet.com.au  Wed Dec 17 06:47:39 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Dec 17 06:47:54 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/spambayesOptionsClass.py, 1.19, 1.20
In-Reply-To: <oprz9bwopiit6vze@mail.fourstonesExpressions.com>
Message-ID: <009f01c3c493$936963a0$2c00a8c0@eden>

> Watch the world reel now <wink>
>
> On Mon, 15 Dec 2003 20:48:31 -0800, Tony Meyer
> <anadelonbrin@users.sourceforge.net> wrote:
>
> > Update of /cvsroot/spambayes/spambayes/spambayes
> > In directory sc8-pr-cvs1:/tmp/cvs-serv9453/spambayes
> >
> > Modified Files:
> > 	OptionsClass.py
> > Log Message:
> > Option names are always case insensitive, no matter what.

Yay!  I *nearly* did that quite some time ago, but was worried I would be
(silently) accused of loosening reasonable code to handle my sloppy style.
It also means a number of .lower() calls can be removed from Outlook!

Entropy-catches-us-all <wink> ly,

Mark.


From kennypitt at hotmail.com  Wed Dec 17 10:00:15 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Dec 17 10:00:55 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted
	Junkemailfolder.
In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden>
Message-ID: <Law11-OE46g389CDALO0000c57d@hotmail.com>

Mark Hammond wrote:
>> I wonder whether the Outlook addin should stop trying to remember
>> Outlook's internal folder IDs, remember the user-visible string
>> paths instead, and enumerate the folders to (re)discover the
>> internal Outlook IDs "whenever anything may have changed".
> 
> I'm not sure what you had in mind for "anything may have changed",
> but in general, I agree.  I always had the idea that we would also
> store the FQN, and fall back to that when necessary, making the
> folder ID more a "cached" value.  It just never happened.  It does
> get complex though - what happens when the user renames the folder? 
> Before you know it, we have even more cruft that no one really
> understands why it is there <wink> 

One of the most common problems seems to be when the spam folder is
actually still sitting under Deleted Items.  The ID is unchanged so
SpamBayes keeps moving the spam there and people think the messages are
just disappearing.

As a partial interim solution, could we check for this special case,
i.e. if we successfully access the spam folder by ID but its parent
folder is Deleted Items then move it back to the top level (or store the
original FQN and move it back to there)?

Another possibility might be to attach an ItemAdd event handler to the
Deleted Items folder and check for an item with the same ID as the spam
folder.  Does ItemAdd get called for added folders, or only for added
items?

-- 
Kenny Pitt


From skip at pobox.com  Wed Dec 17 11:10:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 11:11:06 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMECPHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGEPNHMAB.tim.one@comcast.net>
	<LNBBLJKPBEHFEDALKOLCMECPHNAB.tim.one@comcast.net>
Message-ID: <16352.32783.360628.370482@montanaro.dyndns.org>


    Tim> Database size (a bsddb3 hash database):

    Tim>     without x-use_bigrams   2,544KB
    Tim>     with x-use_bigrams     10,288KB

    Tim> That's a major size boost, and (of course) is expected (bigrams
    Tim> create fat hapaxes at a prodigious rate).

I've been experimenting with the bigram stuff and like it so far.  I also
have some mods to the DBDictClassifier stuff which add timestamps (last set,
last used) to the database.  There's some interaction between the two which
keeps me from using the two together.  It may be worthwhile considering a
last used timestamp to control the number of unused (or rarely used) tokens.

The first thing I did was retrain and then score my then current unsure
mailbox.  Out of about 40 messages it scored over half of them as spam with
bigrams enabled.  I then took my entire training database (around 140 spams
and 100 hams) and tossed them into my unsure mailbox.  Using that now much
bigger mailbox (about 280 messages), I then started with a fresh round of
unsure+mistake based training.  I got to roughly the same performance as
without bigrams using a much smaller set of training messages.  I'm
currently at 97 spams and 64 hams.  I'm still getting a fair number of
unsures, but the false positive rate doesn't seem horrible (I've seen a few,
but haven't been counting).

    Tim> + I believe that mistake-based training under this method is likely
    Tim>   to be substantially more brittle than mistake-based training
    Tim>   under the (still default) unigram-only scheme, because it's even
    Tim>   more hapax-driven (synthesizing bigrams creates many more
    Tim>   hapaxes).

As I was training, I noticed some wild fluctuations in scores with bigrams
enabled, especially with small databases.

Skip


From tim at fourstonesExpressions.com  Wed Dec 17 11:18:49 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Wed Dec 17 11:18:55 2003
Subject: [spambayes-dev] Re: [Spambayes] How low can you go?
In-Reply-To: <16352.32783.360628.370482@montanaro.dyndns.org>
References: <LNBBLJKPBEHFEDALKOLCGEPNHMAB.tim.one@comcast.net>
	<LNBBLJKPBEHFEDALKOLCMECPHNAB.tim.one@comcast.net>
	<16352.32783.360628.370482@montanaro.dyndns.org>
Message-ID: <opr0b19nydit6vze@mail.fourstonesExpressions.com>

On Wed, 17 Dec 2003 10:10:55 -0600, Skip Montanaro <skip@pobox.com> wrote:

> I've been experimenting with the bigram stuff and like it so far.  I also
> have some mods to the DBDictClassifier stuff which add timestamps (last 
> set,
> last used) to the database.  There's some interaction between the two 
> which
> keeps me from using the two together.  It may be worthwhile considering a
> last used timestamp to control the number of unused (or rarely used) 
> tokens.

iirc, there was quite a bit of discussion about aging mechanisms quite a 
few months ago.  It seemed like most everyone agreed that it was a good 
idea, but nobody wanted to implement it for database size considerations.  
It still seems like a good idea...


-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From skip at pobox.com  Wed Dec 17 11:29:12 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 11:29:11 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk
	emailfolder.
In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden>
References: <BIEJKCLHCIOIHAGOKOLHGEGEHHAA.tim.one@comcast.net>
	<009e01c3c493$2489c0b0$2c00a8c0@eden>
Message-ID: <16352.33880.908212.67671@montanaro.dyndns.org>


    Mark> .... It does get complex though - what happens when the user
    Mark> renames the folder?  Before you know it, we have even more cruft
    Mark> that no one really understands why it is there <wink>

And thus a proper Windows application. <wink> <wink> <nudge> <nudge>

Skip

From skip at pobox.com  Wed Dec 17 11:45:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 11:45:42 2003
Subject: [spambayes-dev] Re: [Spambayes] How low can you go?
In-Reply-To: <opr0b19nydit6vze@mail.fourstonesExpressions.com>
References: <LNBBLJKPBEHFEDALKOLCGEPNHMAB.tim.one@comcast.net>
	<LNBBLJKPBEHFEDALKOLCMECPHNAB.tim.one@comcast.net>
	<16352.32783.360628.370482@montanaro.dyndns.org>
	<opr0b19nydit6vze@mail.fourstonesExpressions.com>
Message-ID: <16352.34858.121487.578149@montanaro.dyndns.org>


    Tim> iirc, there was quite a bit of discussion about aging mechanisms
    Tim> quite a few months ago.  It seemed like most everyone agreed that
    Tim> it was a good idea, but nobody wanted to implement it for database
    Tim> size considerations.  It still seems like a good idea...

Size definitely does matter. <wink> With both bigrams and my set/used
timestamps (datetime objects), the size of the database ballooned.  I think
the set timestamp could be dispensed with and the last used timestamp
converted to something smaller, like a YYYYMMDD string.

Skip

From tim.one at comcast.net  Wed Dec 17 12:39:54 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 12:40:01 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <16352.34858.121487.578149@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGGHNAB.tim.one@comcast.net>

[Skip Montanaro]
> Size definitely does matter. <wink> With both bigrams and my set/used
> timestamps (datetime objects), the size of the database ballooned.  I
> think the set timestamp could be dispensed with and the last used
> timestamp converted to something smaller, like a YYYYMMDD string.

A small integer should be enough for last-used, like the number of days
between the day the database was first created and the day a feature was
most recently used in scoring.  That's easily computed, easy to use *in*
computations, and consumes no more than 3 bytes in a binary pickle (proto 1
or proto 2) until about 180 years after the database was created <wink>.
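
A rough sketch of the arithmetic, for anyone who wants to check it.  The
helper names and the creation date are made up for illustration; only the
standard pickle, pickletools and datetime modules are used.  Any day count
from 256 through 65535 is carried by the 3-byte BININT2 opcode (one opcode
byte plus two data bytes), and 65535 days is a bit over 179 years.

```python
import pickle
import pickletools
from datetime import date


def days_since(created, used):
    """Whole days between database creation and the last use in scoring."""
    return (used - created).days


def int_opcode(value, proto=2):
    """Name of the pickle opcode that carries the integer itself."""
    data = pickle.dumps(value, protocol=proto)
    names = [op.name for op, arg, pos in pickletools.genops(data)]
    # Skip the framing opcodes; what remains is the integer's opcode.
    return [n for n in names if n not in ("PROTO", "STOP")][0]


created = date(2003, 12, 17)                     # hypothetical creation day
stamp = days_since(created, date(2004, 12, 17))  # spans Feb 29, 2004
print(stamp, int_opcode(stamp))
print(65535 / 365.25)                            # years until BININT2 overflows
```

Past 65535 the pickler switches to the 5-byte BININT opcode, which is where
the "about 180 years" limit comes from.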

Especially with the bigram scheme-- which creates a relatively enormous
number of hapaxes --I expect the best use for a per-feature "last used"
timestamp is to expire hapaxes that haven't been used in scoring for N days.
That should yield major size savings, actually increase resistance to
"spectacular failures" (which so far most often seem to be associated with
hitting a large number of old hapaxes from "the other" category), and
*probably* not hurt anything else.  Expiring "near hapaxes" too gets dicier,
and more so the more liberal the conception of "near".
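
To make the hapax-expiry idea concrete, here is a minimal sketch.  The
database layout (a plain dict mapping token -> (spamcount, hamcount,
last_used_day)) and the function name are assumptions for illustration, not
what the classifier actually stores.

```python
def expire_hapaxes(db, today, max_idle_days):
    """Drop tokens seen in exactly one trained message that have not been
    used in scoring for more than max_idle_days.  Returns the dropped tokens."""
    stale = [tok for tok, (spam, ham, last_used) in db.items()
             if spam + ham == 1 and today - last_used > max_idle_days]
    for tok in stale:
        del db[tok]
    return stale


# Day numbers are "days since the database was created".
db = {
    "viagra": (40, 0, 100),
    "bi:one off": (1, 0, 10),         # old hapax: expired
    "bi:recent hapax": (0, 1, 95),    # hapax, but used recently: kept
    "the": (50, 60, 5),               # not a hapax: kept regardless of age
}
dropped = expire_hapaxes(db, today=100, max_idle_days=30)
print(dropped)
```

Widening the test from `spam + ham == 1` to a small threshold is the "near
hapax" variant, with the risks noted above.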


From nobody at spamcop.net  Wed Dec 17 13:21:08 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Wed Dec 17 13:21:13 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <16352.34858.121487.578149@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAAEINGOAA.nobody@spamcop.net>

[Tim Stone]
>     Tim> iirc, there was quite a bit of discussion about aging mechanisms
>     Tim> quite a few months ago.  It seemed like most everyone agreed that
>     Tim> it was a good idea, but nobody wanted to implement it
> for database
>     Tim> size considerations.  It still seems like a good idea...
>
> [Skip Montanaro]
> Size definitely does matter. <wink> With both bigrams and my set/used
> timestamps (datetime objects), the size of the database
> ballooned.  I think
> the set timestamp could be dispensed with and the last used timestamp
> converted to something smaller, like a YYYYMMDD string.

I know this is a developer conversation, so I hope you don't mind if I offer
my two cents.  And I definitely agree that size matters, at least for
databases.  I have seen a lot of references, not just in this thread, to
ageing out individual tokens.  For a probability calculation in which one of
the variables is the number of messages of a given class that a token
appears in, it seems dangerous to remove only some tokens from a message and
not adjust the message count.  Here's my problem with it: all tokens from a
trained message *could* conceivably age out individually, but the trained
message count for the appropriate category would not change.  This would
result in wrong probabilities for *all* other tokens, since the database
is in the same state as before the message was trained but the trained message
count is now wrong.  It is even harder to conceive what the trained message
count should be if you only remove some of the tokens from a message.  Using
a token ageing scheme, the trained message counts would monotonically rise
until you started over, despite removing plenty of tokens over time.  I do
understand that most of the aged out tokens would be oddball hapaxes, but
not all of them will be.
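
A small worked example of the drift I mean.  It assumes the per-token spam
probability is the usual ratio of ratios (spamcount/nspam against
hamcount/nham) and ignores any adjustment for rare tokens, so treat it as a
simplification rather than the exact SpamBayes calculation.

```python
def token_prob(spamcount, hamcount, nspam, nham):
    """Simplified per-token spam probability (ratio of ratios)."""
    spam_ratio = spamcount / nspam
    ham_ratio = hamcount / nham
    return spam_ratio / (spam_ratio + ham_ratio)


# A token seen in 5 spam and 5 ham.  Suppose 50 of the 100 trained spam
# have had every one of their tokens aged out of the database.
stale = token_prob(5, 5, nspam=100, nham=100)    # message count never adjusted
honest = token_prob(5, 5, nspam=50, nham=100)    # count matches what is stored
print(stale, honest)
```

With the stale count the token looks neutral (0.5); with the count that
matches the stored data it leans spammy (about 0.67).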

Though I often hear "intuition is a poor guide", I would propose ageing out
whole messages rather than tokens.  This at least maintains the integrity of
your basic probability calculation.  It also has the advantage of enforcing
balanced (or unbalanced in a particular way) training set size.  This would
require adding all the tokens from a trained message to the message database
and the message entry would be timestamped rather than the individual
tokens.  When a message got too old, all its tokens would have their counts
decremented and the trained message count for that message class would also
be decremented.

I would propose going one step further to give the train on everything
approach some additional "memory" for atypical messages (of either type)
that don't occur regularly enough to always be in a fixed-size database.
This might give it some of the advantages of the train on exceptions
schemes, perhaps with less of the "brittle" behavior others have noted and I
have seen as well.  One possible mechanism to do this is as follows:

1) If the database message count is at maximum, untrain the oldest message.

2) Score the new message to be trained.

3) Move the new training message timestamp into the future by an amount
related to its "distance" from a perfect score for that message type.


More atypical messages that classify poorly would be timestamped further
into the future and would thus stick around longer than ones that classify
perfectly.  The ones that classify perfectly would have their tokens
replaced sooner, which should be no great loss.  With train on everything,
there should be lots of messages that classify very well to take their
place.  There could be a scaling constant that sets the maximum amount of
extra time that an unusual message remains in the database.  This determines
how long the database "memory" is, along with the maximum message count and
the number of messages that you train per day (depends on your training
scheme).
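
The three steps above can be sketched as follows.  Everything here is
hypothetical: the class name, the in-memory heap, and the caller-supplied
score_fn are only there to show the bookkeeping, and max_bonus is the
scaling constant that caps how far into the future a timestamp can move.

```python
import heapq


class AgedTrainingSet:
    """Fixed-size training set that ages out whole messages (a sketch)."""

    def __init__(self, max_messages, max_bonus):
        self.max_messages = max_messages
        self.max_bonus = max_bonus      # largest timestamp shift allowed
        self.heap = []                  # (stamp, msg_id, is_spam, tokens)
        self.counts = {}                # token -> [spamcount, hamcount]
        self.nspam = 0
        self.nham = 0

    def _adjust(self, tokens, is_spam, delta):
        column = 0 if is_spam else 1
        for tok in tokens:
            rec = self.counts.setdefault(tok, [0, 0])
            rec[column] += delta
            if rec == [0, 0]:
                del self.counts[tok]
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta

    def train(self, msg_id, tokens, is_spam, score_fn, now):
        tokens = frozenset(tokens)
        # 1) At maximum size, untrain the message with the oldest stamp.
        if len(self.heap) >= self.max_messages:
            _, _, old_is_spam, old_tokens = heapq.heappop(self.heap)
            self._adjust(old_tokens, old_is_spam, -1)
        # 2) Score the new message before training on it.
        score = score_fn(tokens)
        # 3) Push the stamp into the future by the distance from a perfect score.
        perfect = 1.0 if is_spam else 0.0
        stamp = now + abs(perfect - score) * self.max_bonus
        heapq.heappush(self.heap, (stamp, msg_id, is_spam, tokens))
        self._adjust(tokens, is_spam, +1)


ts = AgedTrainingSet(max_messages=2, max_bonus=10)
ts.train("m1", ["a", "b"], True, lambda toks: 1.0, now=0)   # perfect spam
ts.train("m2", ["b", "c"], False, lambda toks: 0.5, now=1)  # atypical ham
ts.train("m3", ["a"], True, lambda toks: 1.0, now=2)        # evicts m1
ts.train("m4", ["d"], True, lambda toks: 1.0, now=3)        # evicts m3, not m2
print(ts.nspam, ts.nham, ts.counts)
```

Note that the atypical ham m2 outlives the newer but perfectly scored m3,
and that the message counts always agree with the stored token counts.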

The goal of this is to allow train on everything, keep moderate database
sizes and still have a long enough memory for atypical messages that are
infrequent.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From wsy at merl.com  Wed Dec 17 13:44:48 2003
From: wsy at merl.com (Bill Yerazunis)
Date: Wed Dec 17 13:45:03 2003
Subject: [spambayes-dev] Re: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAAEINGOAA.nobody@spamcop.net>
References: <MHEGIFHMACFNNIMMBACAAEINGOAA.nobody@spamcop.net>
Message-ID: <200312171844.hBHIimx07769@localhost.localdomain>


   From: "Seth Goodman" <nobody@spamcop.net>

   [... re aging out tokens ...]

Here's a particularly cute solution I implemented in CRM114.

The problem is that if you choose to store a token's last-seen
date, you will likely consume almost as much space in the storage
of the date as you will in the token count or the token hash.

But most tokens are hapaxes anyway.  They have very low value, and you
probably will _never_ see them again.

So, when you need to clean up the database a little, go through and
decrement the "seen" count on a few (very few!) tokens.

Choose the tokens to decrement randomly.  REALLY randomly.  Don't 
pick one chain that's too long and decrement every element in it.
Decrement only every sixteenth one, or only the ones that have 
values that, when added to the system clock, have a hash with the
low order byte == 0x00, or something like that.

Sure, you're losing information- but that's a necessary consequence of
forgetting tokens.

The net result is very fast and has an acceptable level of damage to
accuracy.  Tests show that, at least for CRM114, which is HEAVILY
hapax-oriented, the damage does not increase the error rate until
you get into obscenely small databases (i.e. less than 100K slots).

Anyway, this is how <microgroom> is implemented in CRM114, and it
seems to work acceptably well.
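
For the Python folks, here is a loose sketch of the idea.  It is not the
CRM114 code (that works on hash-table slots and chains); the dict layout,
the function name and the 1-in-16 default are just illustrative assumptions.

```python
import random


def microgroom(counts, fraction=1 / 16, rng=random):
    """Decrement a random handful of token counts and forget any token whose
    count reaches zero.  Returns how many tokens were touched."""
    victims = [tok for tok in counts if rng.random() < fraction]
    for tok in victims:
        counts[tok] -= 1
        if counts[tok] <= 0:
            del counts[tok]        # hapaxes vanish after one decrement
    return len(victims)


rng = random.Random(0)             # seeded only so the demo is repeatable
counts = {"t%d" % i: 1 + (i % 3) for i in range(1000)}
before_total = sum(counts.values())
touched = microgroom(counts, rng=rng)
print(touched, len(counts))
```

No timestamp is stored at all: tokens that keep getting seen are bumped back
up by training, while the ones never seen again eventually decay away.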

      -Bill Yerazunis


From jm at jmason.org  Wed Dec 17 13:59:02 2003
From: jm at jmason.org (Justin Mason)
Date: Wed Dec 17 13:59:24 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGGHNAB.tim.one@comcast.net> 
Message-ID: <20031217185904.1E1F217076@jmason.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Tim Peters writes:
> [Skip Montanaro]
> > Size definitely does matter. <wink> With both bigrams and my set/used
> > timestamps (datetime objects), the size of the database ballooned.  I
> > think the set timestamp could be dispensed with and the last used
> > timestamp converted to something smaller, like a YYYYMMDD string.
> 
> A small integer should be enough for last-used, like the number of days
> between the day the database was first created and the day a feature was
> most recently used in scoring.  That's easily computed, easy to use *in*
> computations, and consumes no more than 3 bytes in a binary pickle (proto 1
> or proto 2) until about 180 years after the database was created <wink>.

FWIW -- in SpamAssassin, we used to use an approximate scheme that fit the
remaining UNIX epoch into 2 bytes something like you're suggesting (by dividing
time_t by several hours and starting the current epoch from 1 Jan 2000, or
something like that).

However we found that we ran into expiry problems for large dbs and busy
sites, because that just didn't give us enough precision -- having a
granularity of hours wasn't good enough.  So SpamAssassin db version 2 now
just uses a plain old long containing a time_t value, and damn the db
bloat.  A bit bigger, but expiry now works reliably ;)

However a good way we found to cut down hapax db bloat was to use a
polymorphic format for the tokens in the db; if a token has spamcount < 8
and hamcount < 8, it's marshalled so that the spamcount and hamcount are
both shoved into 1 byte as a bitmask, with the high bits set.

Here's the perl code in question:

  sub tok_pack {
    my ($self, $ts, $th, $atime) = @_;
    $ts ||= 0; $th ||= 0; $atime ||= 0;
    if ($ts < 8 && $th < 8) {
      return pack ("CV", ONE_BYTE_FORMAT | ($ts << 3) | $th, $atime);
    } else {
      return pack ("CVVV", TWO_LONGS_FORMAT, $ts, $th, $atime);
    }
  }
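
Since this is a Python list, here is roughly the same thing using the struct
module, plus an unpacker for illustration.  The two constant values are my
assumption (the Perl above doesn't show them); the layout follows the pack
templates, where "C" is an unsigned byte and "V" is a 32-bit little-endian
unsigned integer.

```python
import struct

ONE_BYTE_FORMAT = 0xC0    # assumed value: both high bits set
TWO_LONGS_FORMAT = 0x00   # assumed value


def tok_pack(ts, th, atime):
    """Pack spam count, ham count and access time; small counts share a byte."""
    if ts < 8 and th < 8:
        return struct.pack("<BI", ONE_BYTE_FORMAT | (ts << 3) | th, atime)
    return struct.pack("<BIII", TWO_LONGS_FORMAT, ts, th, atime)


def tok_unpack(blob):
    """Inverse of tok_pack: returns (ts, th, atime)."""
    flag = blob[0]
    if flag & 0xC0 == ONE_BYTE_FORMAT:
        (atime,) = struct.unpack_from("<I", blob, 1)
        return (flag >> 3) & 0x07, flag & 0x07, atime
    return struct.unpack_from("<III", blob, 1)


small = tok_pack(1, 0, 1071687542)     # a hapax: 5 bytes
large = tok_pack(12, 3, 1071687542)    # counts too big for the bitmask: 13 bytes
print(len(small), len(large))
```

So a hapax costs 5 bytes instead of 13, and most of that is the access time.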

I do like Bill Y's "sunspots expiry" scheme though ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQE/4Kd2QTcbUG5Y7woRAh/DAKC6MGlXpd1bEeR2/BzTmhtH71075ACgg21j
pJ85tiGe697R3s90bP/LRS4=
=slib
-----END PGP SIGNATURE-----


From nobody at spamcop.net  Wed Dec 17 14:00:45 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Wed Dec 17 14:00:51 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <200312171844.hBHIimx07769@localhost.localdomain>
Message-ID: <MHEGIFHMACFNNIMMBACAIEJAGOAA.nobody@spamcop.net>

[Bill Yerazunis]
> Here's a particularly cute solution I implemented in CRM114.

---------snip----------------

> Choose the tokens to decrement randomly.  REALLY randomly.  Don't

Does CRM114 use the number of trained ham and trained spam *messages* as
variables in its probability calculation?  If not, then you wouldn't expect
that deleting infrequently used tokens would do much damage.  AFAIK,
SpamBayes uses the trained message counts in the probability calculation and
those become inaccurate if you delete individual tokens.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From spambayes at whateley.com  Wed Dec 17 14:08:52 2003
From: spambayes at whateley.com (Brendon Whateley)
Date: Wed Dec 17 14:09:00 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAIEJAGOAA.nobody@spamcop.net>
References: <MHEGIFHMACFNNIMMBACAIEJAGOAA.nobody@spamcop.net>
Message-ID: <200312171108.56120.spambayes@whateley.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Seth,

Couldn't we maintain (and use) a synthetic #messages value that is generated 
using the average number of tokens/message.  This way, as tokens are removed 
from the database, the synthetic number could be adjusted.  It seems (and I 
don't have time to think about it now, have to go pay the dog license!) that 
such a scheme would work quite well along with the "remove old tokens" scheme 
that ages unused tokens?  It probably doesn't matter if the number is 
accurate, provided the DB doesn't contain far too few tokens.
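
Something like this, as a very rough sketch.  It assumes each stored count is
the number of trained messages (of one class) containing the token, and that
we track the average number of distinct tokens per trained message; both the
function name and that bookkeeping are hypothetical.

```python
def synthetic_message_count(token_counts, avg_tokens_per_message):
    """Estimate the trained-message count from what is still in the DB."""
    return sum(token_counts.values()) / avg_tokens_per_message


spam_counts = {"a": 3, "b": 2, "c": 1}
full = synthetic_message_count(spam_counts, avg_tokens_per_message=2)
del spam_counts["c"]                   # an old token ages out
aged = synthetic_message_count(spam_counts, avg_tokens_per_message=2)
print(full, aged)
```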

Brendon.

On Wednesday 17 December 2003 11:00 am, Seth Goodman wrote:
> [Bill Yerazunis]
>
> > Here's a particularly cute solution I implemented in CRM114.
>
> ---------snip----------------
>
> > Choose the tokens to decrement randomly.  REALLY randomly.  Don't
>
> Does CRM114 use the number of trained ham and trained spam *messages* as
> variables in its probability calculation?  If not, then you wouldn't expect
> that deleting infrequently used tokens would do much damage.  AFAIK,
> SpamBayes uses the trained message counts in the probability calculation
> and those become inaccurate if you delete individual tokens.
>
> --
> Seth Goodman
>
>   Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com
>
>   Spambots: disregard the above
>
>

-----BEGIN PGP SIGNATURE-----
Version: PGP 6.5.8

iQA/AwUBP+CpxJuupqACStRwEQJZEACg23t52C7CDk5ghZsRU3KsmetsPUMAoIXQ
nYVJM0QJ0tQOKT5RjZZugjRn
=ZxqV
-----END PGP SIGNATURE-----


From richie at entrian.com  Wed Dec 17 16:32:36 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 17 16:32:45 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071607304.7979.39.camel@geddy>
References: <k6rqtvguechcaclkrd6ssiam0pokbvtcn7@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEKKHMAB.tim.one@comcast.net>
	<pcqutv0l44ta82kg04srrdbnpvdjjkab14@4ax.com>
	<1071607304.7979.39.camel@geddy>
Message-ID: <i9c1uv80pgqt2gi0drj0tn3obc0f31i3st@4ax.com>


[Barry]
> I'll help play "consultant" on BerkeleyDB stuff.

[Tim]
> I'm half ready to declare that ZODB is the only database anyone
> should ever use

This is probably a hopelessly naive question, but can I have the best of
both worlds?  If I use ZODB with a BerkeleyDB back end, will that be
process- and thread-safe (without using ZEO)?

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Wed Dec 17 17:16:44 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 17:16:47 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <i9c1uv80pgqt2gi0drj0tn3obc0f31i3st@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>

[Richie Hindle]
> This is probably a hopelessly naive question, but can I have the best
> of both worlds?  If I use ZODB with a BerkeleyDB back end, will that
> be process- and thread-safe (without using ZEO)?

My understanding is that, regardless of back end, ZODB is thread-safe among
the threads in a single process, but that you cannot open a connection to a
ZODB database from more than one process simultaneously without using ZEO.
Don't consider ZEO to be such a big deal, though:  code using ZEO looks
exactly the same as code not using ZEO, except for the lines that initially
open the database.  Where a direct use of ZODB may open a FileStorage, for
example, the same code wishing to use ZEO would open a ClientStorage
instead, and that's it.  Once you are using ZEO, you get distributed access
for free (you can connect to the ZEO server via an arbitrary <hostname,
port> pair, so can access a ZODB database living anywhere your network can
reach).

Note that Jeremy already wrote code to run spambayes via ZEO, in the
project's pspam/ directory.  I don't know how much bitrot that's suffered.

Note too that in addition to getting the best of both worlds, you may also
get the worst of both worlds.  For example, if BDB really does suffer
corruption problems, then it would be something of a miracle if ZODB-on-BDB
were somehow immune.

Also note that the full ZODB back ends (like FileStorage and Berkeley)
support unlimited undo, so the physical database keeps every revision ever
made to every object.  So they need 'pack' steps from time to time to
announce that you promise never to care about revisions before a time you
specify to pack, so that the physical database can reclaim their space.

Finally, note that any form of concurrent modification can end up creating
inconsistent data.  ZODB solves this by raising ConflictError whenever
inconsistency is possible, and the app has to be prepared to catch that (the
usual response then is to try the transaction again, and on the second
attempt it will *start* with the data successfully committed by the other
transaction(s) involved in the conflict).  That could be a real problem if
many threads or processes keep modifying the same info simultaneously (like
the counts attached to, say, "the").
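
The retry pattern looks roughly like this.  To keep the sketch free of any
ZODB dependency, ConflictError here is a local stand-in and commit/abort are
passed in as plain callables; in real code you would catch ZODB's own
ConflictError and call the transaction machinery's commit and abort instead.

```python
class ConflictError(Exception):
    """Stand-in for ZODB's ConflictError, so the sketch runs without ZODB."""


def run_with_retry(work, commit, abort, attempts=3):
    """Run work() and commit; on a conflict, abort and try again from scratch."""
    for _ in range(attempts):
        try:
            result = work()    # re-reads current state on every attempt
            commit()
            return result
        except ConflictError:
            abort()            # discard our changes before retrying
    raise ConflictError("gave up after %d attempts" % attempts)


# Simulate one conflicting commit followed by a successful one.
calls = {"commit": 0, "abort": 0}


def commit():
    calls["commit"] += 1
    if calls["commit"] == 1:
        raise ConflictError("someone else updated 'the' first")


def abort():
    calls["abort"] += 1


outcome = run_with_retry(lambda: "trained", commit, abort)
print(outcome, calls)
```

The cap on attempts matters for exactly the hot-key case above: if everyone
keeps bumping the same counts, unbounded retries can spin for a long time.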


From barry at python.org  Wed Dec 17 17:32:03 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Dec 17 17:32:12 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>
Message-ID: <1071700323.27808.50.camel@anthem>

On Wed, 2003-12-17 at 17:16, Tim Peters wrote:

> Note too that in addition to getting the best of both worlds, you may also
> get the worst of both worlds.  For example, if BDB really does suffer
> corruption problems, then it would be something of a miracle if ZODB-on-BDB
> were somehow immune.

Except that the BerkeleyDB based storages use the full-blown bsddb
transactional interface, so from that side of things, they should be
thread and multiproc safe.  Assuming anyone really understands how
BerkeleyDB (and the Python wrapper around it) works <wink>, I'd feel
pretty confident storing data into it.

> Also note that the full ZODB back ends (like FileStorage and Berkeley)
> support unlimited undo, so the physical database keeps every revision ever
> made to every object.  So they need 'pack' steps from time to time to
> announce that you promise never to care about revisions before a time you
> specify to pack, so that the physical database can reclaim their space.

Note that there is a "full" BDB storage and a "minimal" storage.  The
latter doesn't retain multiple revisions.  The former can be configured
to "autopack" occasionally to cut down on space it consumes.

-Barry


From skip at pobox.com  Wed Dec 17 17:40:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 17:40:21 2003
Subject: [spambayes-dev] sb_filter experimental args
Message-ID: <16352.56151.499376.839685@montanaro.dyndns.org>

Are the sb_filter.py arguments marked [EXPERIMENTAL] (try sb_filter.py
--help) really still experimental?  They've been there a long while and as
far as I know there's no move afoot to get rid of them (I use them from my
.procmailrc file).  If not, I will update the docstring.

Skip

From skip at pobox.com  Wed Dec 17 17:59:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 17:59:43 2003
Subject: [spambayes-dev] empty urls in bigram?
Message-ID: <16352.57313.372507.14545@montanaro.dyndns.org>


I just noticed this bigram in my clues: 'bi:url: url:'.  If 'url:' would
only be presented once as a clue, does it make sense to form a bigram with
two instances of it?  What does an empty "url:" token mean?

Skip

From skip at pobox.com  Wed Dec 17 18:07:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 17 18:07:19 2003
Subject: [spambayes-dev] empty urls in bigram?
Message-ID: <16352.57769.222771.631763@montanaro.dyndns.org>


   I just noticed this bigram in my clues: 'bi:url: url:'.  If 'url:' would
   only be presented once as a clue, does it make sense to form a bigram
   with two instances of it?

More examples:

    >>> [k for k in db if re.match(r"bi:([^ ]+) \1$", k) is not None]
    ['bi:very, very,',
     'bi:charset:utf-8 charset:utf-8',
     'bi:megamek megamek',
     'bi:time time',
     'bi:[input] [input]',
     'bi:billboard billboard',
     'bi:the the',
     'bi:state state',
     'bi:prince prince',
     'bi:subject:$ subject:$',
     'bi:phpmyadmin phpmyadmin',
     'bi:amsn amsn',
     'bi:fund fund',
     'bi:against, against,',
     'bi:camera camera',
     'bi:received:mailnull@localhost) received:mailnull@localhost)',
     'bi:pago pago',
     'bi:chicago chicago',
     'bi:charset:iso-8859-1 charset:iso-8859-1',
     'bi:pdfcreator pdfcreator',
     'bi:gour gour',
     'bi:subject:. subject:.',
     'bi:received:30950 received:30950',
     'bi:subject:- subject:-',
     "bi:subject:' subject:'",
     'bi:fma fma',
     'bi:subject:.. subject:..',
     'bi:miktex miktex',
     'bi:this this',
     'bi:help help',
     'bi:url:2 url:2',
     'bi:fluid fluid',
     'bi:sell, sell,',
     'bi:$50.00 $50.00',
     'bi:forum forum',
     'bi:scummvm scummvm',
     'bi:url:com url:com',
     'bi:received:2612 received:2612',
     'bi:download download',
     'bi:hanukah hanukah',
     'bi:becomes becomes',
     'bi:men men',
     'bi:url:ami url:ami',
     'bi:subject:2003 subject:2003',
     'bi:*** ***',
     'bi:encore encore',
     'bi:virus:src="cid: virus:src="cid:',
     'bi:subject:You subject:You',
     'bi:filezilla filezilla',
     'bi:received:3948 received:3948',
     'bi:charset:windows-874 charset:windows-874',
     'bi:content-type:text/plain content-type:text/plain',
     'bi:subject:, subject:,',
     'bi:url:contactus url:contactus',
     'bi:charset:windows-1252 charset:windows-1252',
     'bi:have have',
     'bi:url:catalog url:catalog',
     'bi:or: or:',
     'bi:aid aid',
     'bi:url:sendmail url:sendmail',
     'bi:url:%s url:%s',
     'bi:url:tracking url:tracking',
     'bi:described described',
     'bi:you you',
     'bi:music music',
     'bi:springs springs',
     'bi:any any',
     'bi:charset:us-ascii charset:us-ascii',
     'bi:url:email-reports url:email-reports',
     'bi:url:cgi url:cgi',
     'bi:url:newsletter_2003_oct url:newsletter_2003_oct',
     'bi:indianapolis indianapolis',
     'bi:dev-c++ dev-c++',
     'bi:subject:* subject:*',
     'bi:url:forums url:forums',
     'bi:relix relix',
     'bi:mau mau',
     'bi:subject:: subject::',
     'bi:$$$ $$$',
     'bi:url:signup url:signup',
     'bi:#include #include',
     'bi:%s, %s,',
     'bi:speech speech',
     'bi:content-type:image/gif content-type:image/gif',
     'bi:url:news url:news',
     'bi:record, record,',
     'bi:url:3 url:3',
     'bi:subject:/ subject:/',
     'bi:gaim gaim',
     'bi:bang bang',
     'bi:&gt;&gt; &gt;&gt;',
     'bi:charset:windows-1256 charset:windows-1256',
     'bi:liberopops liberopops',
     'bi:url: url:',
     'bi:subject:spambayes subject:spambayes',
     'bi:url:complaint url:complaint',
     'bi:received:jln@localhost) received:jln@localhost)',
     'bi:free free',
     'bi:coast coast',
     'bi:received:16781 received:16781',
     'bi:following following',
     'bi:url:xdr2 url:xdr2',
     'bi:card card',
     'bi:a1> a1>',
     'bi:unsubscribe unsubscribe',
     'bi:toshiba toshiba',
     'bi:jingle jingle',
     'bi:charset:iso-2022-jp charset:iso-2022-jp',
     'bi:subject:% subject:%',
     'bi:your your']

I suppose some of them might make sense, but most are probably artifacts.
Maybe bigrams should only be generated if the current and previous tokens
differ.
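
A minimal sketch of that idea, assuming bigrams are formed from adjacent
token pairs and named 'bi:<prev> <cur>' as in the list above.  This is not
the actual tokenizer code; `bigrams` and `skip_repeats` are hypothetical
names.

```python
def bigrams(tokens, skip_repeats=True):
    """Yield 'bi:<prev> <cur>' for adjacent token pairs.

    With skip_repeats=True, pairs whose two tokens are identical
    (e.g. 'url:' followed by 'url:') are not generated.
    """
    prev = None
    for tok in tokens:
        if prev is not None and not (skip_repeats and tok == prev):
            yield "bi:%s %s" % (prev, tok)
        prev = tok


tokens = ["free", "free", "url:", "url:", "offer"]
with_repeats = list(bigrams(tokens, skip_repeats=False))
without_repeats = list(bigrams(tokens))
```

Whether dropping the repeated pairs helps or hurts would still need
testing on real data.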

Skip

From nobody at spamcop.net  Wed Dec 17 18:23:48 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Wed Dec 17 18:23:48 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAIEJAGOAA.nobody@spamcop.net>
Message-ID: <MHEGIFHMACFNNIMMBACAAEJLGOAA.nobody@spamcop.net>

An interesting aside to the message ageing proposal I made is that it would
help fight what is being discussed in the "Spam of the Future" threads.  It
would do this by keeping the token databases current with the message stream
so that it would adapt as quickly as possible to the extraneous words used
and then retire them after a time.

Another implementation suggestion for using an approach like this with a
train-on-everything scheme is to only train *after* the user has verified
all the classifications.  If we allow it to classify on-the-fly and it makes
a mistake, a whole bunch of mistakes will likely follow.  It's probably
better to allow the classifier to do the best it can do in its present
form, then after moving any mis-classified messages into their appropriate
folders, do an incremental training on all emails in a given list of
folders.  This will only train messages which are previously untrained, at
least in the Outlook plug-in version.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From richie at entrian.com  Wed Dec 17 18:25:45 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 17 18:25:54 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071700323.27808.50.camel@anthem>
References: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>
	<1071700323.27808.50.camel@anthem>
Message-ID: <e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>


[Barry, responding to Tim]
> the BerkeleyDB based storages use the full-blown bsddb
> transactional interface, so from that side of things, they should be
> thread and multiproc safe.

and:
> Note that there is a "full" BDB storage and a "minimal" storage.  The
> latter doesn't retain multiple revisions.

Fantastic.  So in theory at least...

 o All the SpamBayes programs could use BDB-backed ZODB instead of
   directly using bsddb.

 o They would automatically work nicely together with a single writer (eg.
   sb_server is training while sb_filter is classifying), and with a bit
   more work catching ConflictErrors, we could even have multiple writers.

 o The database wouldn't get significantly bigger than with direct use of
   bsddb.

 o Since BDB uses bsddb in transaction mode rather than single-file mode,
   we can say goodbye to those nasty little DBRunRecovery errors.  Yay!

Tim, did this:
> I'm half ready to declare that ZODB is the only database anyone should
> ever use

apply to BDB-backed ZODB, or only to ZODB's native storage?

Unless there's something I'm missing (licensing problems, deployment
problems, portability problems...?) it could be that we should replace our
current DBDictClassifier (which suffers from DBRunRecovery errors and
isn't multiprocess-safe) with a ZODBClassifier using a BDB back end.  From
a position of complete ignorance, I'd hazard a guess that the
implementation would end up a lot simpler than rewriting DBDictClassifier
to use bsddb in full-on transactional mode - the hassles of doing that
have already been sorted out in ZODB.

Am I in cloud cuckoo land?

-- 
Richie Hindle
richie@entrian.com


From tim at fourstonesExpressions.com  Wed Dec 17 18:37:24 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Wed Dec 17 18:37:30 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>
References: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>
	<1071700323.27808.50.camel@anthem>
	<e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>
Message-ID: <opr0cmkma1it6vze@mail.fourstonesExpressions.com>

On Wed, 17 Dec 2003 23:25:45 +0000, Richie Hindle <richie@entrian.com> 
wrote:

> Unless there's something I'm missing (licensing problems, deployment
> problems, portability problems...?)

Not insignificant issues...

> it could be that we should replace our
> current DBDictClassifier (which suffers from DBRunRecovery errors and
> isn't multiprocess-safe) with a ZODBClassifier using a BDB back end.

It certainly can't hurt to give it a try... any sample code out there?

> Am I in cloud cuckoo land?

Well... we're all cloud dwellers, you know <wink>

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From tim.one at comcast.net  Wed Dec 17 18:43:45 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 18:43:47 2003
Subject: [spambayes-dev] empty urls in bigram?
In-Reply-To: <16352.57313.372507.14545@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEIPHNAB.tim.one@comcast.net>

[Skip]
> I just noticed this bigram in my clues: 'bi:url: url:'.  If 'url:'
> would only be presented once as a clue, does it make sense to form a
> bigram with two instances of it?

Sure -- why not?  The same thing might happen to "really really" in

    The only product that makes your toes really really big!

Since repetition is a form of advertising hyperbole (FREE FREE FREE!), I
like the chance to catch it this way.  You could try removing the
possibility and running large-scale tests both ways, but I think there are
more basic questions about the unibi approach open now.  Note that we
*won't* score more than one instance of "really really" per message --
bigram clues are subjected to the same duplicate-squashing as unigram clues.


> What does an empty "url:" token mean?

It doesn't *mean* anything <wink>.  Staring at the code, looks like it's
produced if and only if a URL contains two adjacent characters from this
set:

    ;?:@&=+,$.

So 'bi:url: url:' would come from three adjacent characters in that set.
Sounds spammy to me.
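
A rough sketch of how empty pieces could arise, assuming the URL body is
simply split on that character set and every piece, empty or not, gets a
"url:" prefix.  This is not copied from the tokenizer; `URL_SPLIT` and
`url_tokens` are hypothetical names.

```python
import re

# The separator set quoted above.  Inside a character class these
# characters are all literal.
URL_SPLIT = re.compile(r"[;?:@&=+,$.]")


def url_tokens(url_body):
    """Split a URL body on the separator set, keeping empty pieces."""
    return ["url:" + piece for piece in URL_SPLIT.split(url_body)]


two_adjacent = url_tokens("example..com")
three_adjacent = url_tokens("example...com")
```

Two adjacent separators leave one empty piece between them, so one bare
'url:' token; three adjacent separators leave two empty pieces in a row,
which is what it takes to form the 'bi:url: url:' bigram.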


From tameyer at ihug.co.nz  Wed Dec 17 18:55:14 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 17 18:55:36 2003
Subject: [spambayes-dev] Re:
	[Spambayes-checkins]spambayes/spambayesOptionsClass.py, 1.19, 1.20
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0DCF@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677772@its-xchg4.massey.ac.nz>

[Tony in log message]
> Option names are always case insensitive, no matter what.

[Mark]
> Yay!  I *nearly* did that quite some time ago, but was 
> worried I would be
> (silently) accused of loosening reasonable code to handle my 
> sloppy style. It also means a number of .lower() calls can be 
> removed from Outlook!

Oh good, we have a guinea pig! <wink>

=Tony Meyer


From tim.one at comcast.net  Wed Dec 17 19:13:58 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 19:14:04 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAIEJAGOAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEJCHNAB.tim.one@comcast.net>

[Seth Goodman]
> Does CRM114 use the number of trained ham and trained spam *messages*
> as variables in its probability calculation?  If not, then you
> wouldn't expect that deleting infrequently used tokens would do much
> damage.  AFAIK, SpamBayes uses the trained message counts in the
> probability calculation

Yes.

> and those become inaccurate if you delete individual tokens.

No, it doesn't matter if that's *all* you do.  Say I've trained on 243 ham,
and 257 spam, total, and throw out the hapax 'bi:choose the'.  That doesn't
change the fact that the features I didn't throw out still came from training
on 243 ham and 257 spam, total.

The problem comes when untraining a message M.  That reduces the count of
total messages trained on, but if I threw away a hapax H from M previously,
and H reappeared again later, it would be a mistake to reduce the category
count on H during untraining M.

There's another bullet we haven't bitten yet, saving a map of message id to
an explicit list of all tokens produced by that message (Skip wants the
inverse of that mapping for diagnostic purposes too).  Given that, training
and untraining of individual messages could proceed smoothly despite
intervening changes in tokenization details; expiring entire messages would
be straightforward; and when expiring an individual feature, it would be
enough to remove that feature from each msg->[feature] list it's in (then
untraining on a msg later wouldn't *try* to decrement the per-feature count
of any feature that had previously been expired individually and appeared in
the msg at the time).

That's all easy enough to do, but the database grows ever bigger.  It would
probably need reworking to start using "feature ids" (little integers) too,
so that relatively big strings didn't have to get duplicated all over the
database.
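
A toy sketch of the msg id -> feature list bookkeeping described above,
using plain in-memory dicts.  The class and method names (`TrainingStore`,
`expire_feature`, and so on) are hypothetical, and nothing here reflects
the actual classifier code.

```python
class TrainingStore:
    """Toy model: per-feature counts plus a msg id -> feature set map."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}        # feature -> [hamcount, spamcount]
        self.msg_features = {}  # msg id -> (is_spam, set of features)

    def train(self, msgid, features, is_spam):
        features = set(features)
        self.msg_features[msgid] = (is_spam, features)
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        idx = 1 if is_spam else 0
        for f in features:
            self.counts.setdefault(f, [0, 0])[idx] += 1

    def expire_feature(self, feature):
        """Drop one feature everywhere, including each message's list."""
        self.counts.pop(feature, None)
        for _, features in self.msg_features.values():
            features.discard(feature)

    def untrain(self, msgid):
        """Undo train(), touching only features still listed for the msg."""
        is_spam, features = self.msg_features.pop(msgid)
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1
        idx = 1 if is_spam else 0
        for f in features:
            pair = self.counts[f]
            pair[idx] -= 1
            if pair == [0, 0]:
                del self.counts[f]


store = TrainingStore()
store.train("m1", ["bi:choose the", "viagra"], is_spam=True)
store.expire_feature("bi:choose the")               # hapax thrown away
store.train("m2", ["bi:choose the"], is_spam=True)  # it reappears later
store.untrain("m1")                                 # must not touch m2's count
```

After the final untrain, the count that 'bi:choose the' picked up from m2
is intact, because the feature was removed from m1's list when it was
expired.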


From tim.one at comcast.net  Wed Dec 17 20:16:30 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 20:16:31 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEJGHNAB.tim.one@comcast.net>

[Richie]
> Fantastic.  So in theory at least...
>
>  o All the SpamBayes programs could use BDB-backed ZODB instead of
>    directly using bsddb.

Yes, but then they also have to use persistent objects in ZODB's sense of
the word.  That's not scary to me, because SpamBayes was originally designed
with ZODB's BTrees in mind as the mapping data structure.  There is no
*direct* access to bsddb via ZODB, you interact with ZODB's view of the
world then, and BDB is just a (mostly) invisible, and wholly inaccessible,
implementation detail.

>  o They would automatically work nicely together with a single writer
>    (eg. sb_server is training while sb_filter is classifying),

Surprise!  Nope.  The reader will suffer a ReadConflictError if it tries to
access anything that's been modified by the writer since the reader began
its current transaction.  This protects the reader from seeing inconsistent
data.  The reader is always in *some* transaction, so you can't worm around
this.

ZODB 3.3 will support "multiversion concurrency control", which will deliver
the state of the data (to the reader) current *at* the time the transaction
began, and there are no ReadConflictErrors then.  But that hasn't been
released yet.

>    and with a bit more work catching ConflictErrors, we could even
>    have multiple writers.

ConflictErrors can only be guaranteed not to happen now if there are no
writers.

>  o The database wouldn't get significantly bigger than with direct
>    use of bsddb.

That one's hard to guess in advance.  The BDB back end creates a number of
distinct database tables to support ZODB's ideas of object identity, object
revisions, and how objects all tie together.  That's all metadata, on top of
the application data we work with directly.  But BTrees are a pretty
space-efficient structure, and there are builtin flavors of BTree that are
especially compact for mappings having integers as keys or values.

>  o Since BDB uses bsddb in transaction mode rather than single-file
>    mode, we can say goodbye to those nasty little DBRunRecovery
>    errors.  Yay!

That would be great -- although I still haven't seen one of these, despite
running 3 different Outlooks on 3 different bsddb3's for a loooong time now!

>> I'm half ready to declare that ZODB is the only database anyone
>> should ever use

> apply to BDB-backed ZODB, or only to ZODB's native storage?

ZODB's BTrees rock.  The backend storage format is just a detail.  ZODB
doesn't have a native format, BTW -- you get the kind of storage you
explicitly ask for (there is no default), and I bet there are at least 10
flavors of storage by now.  FileStorage is by far the most frequently used.

We should all be aware that BDB-backed ZODB is a pretty new thing, and isn't
yet used in production anywhere that I'm aware of.  FileStorage has been
through the wringer at sites with enormous loads for years, so is easier to
trust -- and its pragmatics are much better understood too.  Tuning BDB
appears to be a major undertaking even on a tuning-friendly platform like
Linux.

> Unless there's something I'm missing (licensing problems, deployment
> problems, portability problems...?)

ZODB is OSI-certified Open Source, like Python.  You can even piss on it and
sell the result as art, if you want to <wink>.

> it could be that we should replace our current DBDictClassifier (which
> suffers from DBRunRecovery errors and isn't multiprocess-safe) with a
> ZODBClassifier using a BDB back end.  From a position of complete
> ignorance, I'd hazard a guess that the implementation would end up a
> lot simpler than rewriting DBDictClassifier to use bsddb in full-on
> transactional mode - the hassles of doing that have already been
> sorted out in ZODB.

Having never written anything myself using bsddb3's "real" interface, I
can't say how hard that would be.  I *expect* it would actually be easy for
someone with a non-trivial understanding of BDB.  The only use we have for
BDB now is to use it as if it were a giant dict -- it probably doesn't get
any simpler than that.

> Am I in cloud cuckoo land?

Na, talk is cheap and always sane <wink>.


From barry at python.org  Wed Dec 17 20:21:53 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Dec 17 20:22:01 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>
References: <LNBBLJKPBEHFEDALKOLCAEIFHNAB.tim.one@comcast.net>
	<1071700323.27808.50.camel@anthem>
	<e3o1uvke4lrd6s0lafkr9fegh3sjeql4fu@4ax.com>
Message-ID: <1071710513.27808.62.camel@anthem>

On Wed, 2003-12-17 at 18:25, Richie Hindle wrote:

>  o The database wouldn't get significantly bigger than with direct use of
>    bsddb.

I didn't say that. :)

ZODB's storage API and object model require many ancillary tables in
order to keep house properly.  The overall disk usage of a BDB-backed
ZODB will be greater than if you could just model the data structures
you needed directly onto BerkeleyDB BTrees (most likely).  With ZODB,
it's probably likely that object pickles overwhelm the housekeeping
tables so it may not matter much, but for spambayes, I'm not sure that
would be the case (I haven't looked closely at exactly what data
spambayes wants to store).

>  o Since BDB uses bsddb in transaction mode rather than single-file mode,
>    we can say goodbye to those nasty little DBRunRecovery errors.  Yay!

That's the hope, anyway. :)

-Barry


From barry at python.org  Wed Dec 17 20:31:01 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Dec 17 20:31:11 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEJGHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCAEJGHNAB.tim.one@comcast.net>
Message-ID: <1071711060.27808.67.camel@anthem>

On Wed, 2003-12-17 at 20:16, Tim Peters wrote:

> Having never written anything myself using bsddb3's "real" interface, I
> can't say how hard that would be.  I *expect* it would actually be easy for
> someone with a non-trivial understanding of BDB.  The only use we have for
> BDB now is to use it as if it were a giant dict -- it probably doesn't get
> any simpler than that.

If you map all square-bracket setitems to .put()'s and square-bracket
getitems to .get()'s, it's fairly straightforward.  That is, provided
you can define the transaction boundaries so you can call txn begin,
abort, and commit at the Right Times.  You will want to pass the BDB txn
object into the .gets and .puts to make it all work smoothly.  Add a
little extra goo to create the environment if it doesn't exist (or join
it if it does), and viola!  or contrabasso!
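
The shape of that mapping can be sketched as below.  To keep it
self-contained, nothing here imports bsddb: `FakeEnv`, `FakeTable` and
`FakeTxn` are in-memory stand-ins that only mimic the method names
mentioned above (txn_begin, get, put, commit, abort), and `TxnDict` is a
hypothetical wrapper name.  Creating or joining a real environment is left
out.

```python
class TxnDict:
    """Dict-style wrapper routing [] access through get()/put() with a txn."""

    def __init__(self, env, table):
        self.env = env
        self.table = table
        self.txn = None

    def begin(self):
        self.txn = self.env.txn_begin()

    def commit(self):
        self.txn.commit()
        self.txn = None

    def abort(self):
        self.txn.abort()
        self.txn = None

    def __getitem__(self, key):
        value = self.table.get(key, txn=self.txn)
        if value is None:
            raise KeyError(key)
        return value

    def __setitem__(self, key, value):
        self.table.put(key, value, txn=self.txn)


class FakeTxn:
    """Stand-in transaction: buffers writes until commit()."""

    def __init__(self, table):
        self.table = table
        self.pending = {}

    def commit(self):
        self.table.committed.update(self.pending)

    def abort(self):
        self.pending.clear()


class FakeTable:
    """Stand-in table with get/put that accept an optional txn."""

    def __init__(self):
        self.committed = {}

    def get(self, key, txn=None):
        if txn is not None and key in txn.pending:
            return txn.pending[key]
        return self.committed.get(key)

    def put(self, key, value, txn=None):
        if txn is None:
            self.committed[key] = value
        else:
            txn.pending[key] = value


class FakeEnv:
    """Stand-in environment that hands out transactions."""

    def __init__(self, table):
        self.table = table

    def txn_begin(self):
        return FakeTxn(self.table)


table = FakeTable()
d = TxnDict(FakeEnv(table), table)

d.begin()
d["the"] = "3|7"
d.abort()
after_abort = table.committed.get("the")   # aborted write is not visible

d.begin()
d["the"] = "3|7"
d.commit()
after_commit = d["the"]
```

The hard part in practice is the one already noted: deciding where
begin/commit/abort go so they fall at the Right Times.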

-Barry


From tim.one at comcast.net  Wed Dec 17 20:36:48 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 20:36:49 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071710513.27808.62.camel@anthem>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEJJHNAB.tim.one@comcast.net>

[Barry]
> ...
> (I haven't looked closely at exactly what data spambayes wants to store).

The token statistics database now is a single (but large) mapping from short
8-bit strings to 2-tuples of little integers.  The strings are usually less
than 16 characters, and never a lot longer than that (the tokenizer
truncates very long strings, synthesizing short "skip" tokens as proxies).

It would be nice to have other mappings too, like forward and inverse msgid
<-> bag_of_tokens maps.  A little-integer timestamp may get added to the
2-tuples.


From nobody at spamcop.net  Wed Dec 17 20:41:30 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Wed Dec 17 20:41:30 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEJCHNAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACAAEJPGOAA.nobody@spamcop.net>

[Tim Peters]
> No, it doesn't matter if that's *all* you do.  Say I've trained on 243 ham,
> and 257 spam, total, and throw out the hapax 'bi:choose the'.  That has no
> effect on that the features I didn't throw out still came from training on
> 243 ham and 257 spam, total.

OK, but there are still a couple of potential problems.

1) Let's say the discarded bi-gram occurs in a spam at a later date.  Though
it was only a hapax, it now contributes nothing.

2) Let's say we want to train on a spam with the discarded bi-gram.  It was
originally a hapax, so it should now have an occurrence count of two.  After
training, it again shows up as a hapax.  This is a more significant problem.

3) Do we eventually reduce the occurrence count of a non-hapax token?  If we
do, we could eventually have none of the tokens from a trained message
present but its message count will still be there.  Unless we implement your
token cross-reference as explained below, the message counts will eventually
not be correct if we expire enough tokens. If we don't expire a lot of
tokens over the long run, why bother?

>
> The problem comes when untraining a message M.  That reduces the count of
> total messages trained on, but if I threw away a hapax H from M previously,
> and H reappeared again later, it would be a mistake to reduce the category
> count on H during untraining M.

Yup, and you have the solution below.

>
> There's another bullet we haven't bitten yet, saving a map of message id to
> an explicit list of all tokens produced by that message (Skip wants the
> inverse of that mapping for diagnostic purposes too).  Given that, training
> and untraining of individual messages could proceed smoothly despite
> intervening changes in tokenization details; expiring entire messages would
> be straightforward; and when expiring an individual feature, it would be
> enough to remove that feature from each msg->[feature] list it's in (then
> untraining on a msg later wouldn't *try* to decrement the per-feature count
> of any feature that had previously been expired individually and appeared in
> the msg at the time).

This definitely works.  But why bother tracking, cross-referencing and
expiring individual tokens when we can just expire whole messages, which is
a lot simpler?  It accomplishes the goal of keeping the token databases
cleaned of excessive hapaxes and gradually expires non-hapax tokens, as
well.  There is also less need for reverse indexing of tokens to messages,
since all messages and their tokens will eventually expire.  However, if
people need that feature, they need it.

>
> That's all easy enough to do, but the database grows ever bigger.  It would
> probably need reworking to start using "feature ids" (little integers) too,
> so that relatively big strings didn't have to get duplicated all over the
> database.

No argument there.  How about a 32-bit hash for any token whether unigram,
bi-gram, etc.?  The token database could then consist of an ordered list of
32-bit hashes paired with an occurrence count (16-bits would probably do
it).  That's only six bytes/token, and you could use your indexing method of
choice, if any, to speed up the lookups.  Similarly, if we implemented a
message database with this method, each token in a message would only take
up four bytes.  The hash calculation costs something, but the smaller
database size and quicker lookup time could make up for it.
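
A minimal sketch of the six-byte record, assuming CRC-32 as the 32-bit
hash (chosen only for illustration) and a little-endian layout.
`token_hash`, `pack_record` and `unpack_record` are hypothetical names.

```python
import struct
import zlib

# 32-bit unsigned hash + 16-bit unsigned count, no padding: 6 bytes.
RECORD = struct.Struct("<IH")


def token_hash(token):
    """32-bit hash of a token (CRC-32 is an assumption, not a proposal)."""
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF


def pack_record(token, count):
    """Pack one token's hash and occurrence count, saturating at 65535."""
    return RECORD.pack(token_hash(token), min(count, 0xFFFF))


def unpack_record(data):
    """Return (hash, count) from a six-byte record."""
    return RECORD.unpack(data)


record = pack_record("bi:choose the", 2)
hashed, count = unpack_record(record)
```

Two costs to keep in mind: distinct tokens can collide on the same 32-bit
value, merging their counts, and the original token string can no longer
be recovered from the database for diagnostics.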

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From tameyer at ihug.co.nz  Wed Dec 17 20:53:21 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 17 20:53:39 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0F6D@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677773@its-xchg4.massey.ac.nz>

[Tim]
> The token statistics database now is a single (but large) 
> mapping from short 8-bit strings to 2-tuples of little 
> integers.

I think part of the Japanese/Asian languages patch which I keep meaning to
look more closely into has these turn into unicode strings (how many bits is
that?  I know nothing much about unicode; English is good enough for me
<wink>).

(Just in case someone was about to implement a new spambayes db system with
only 8-bit tokens).

=Tony Meyer


From barry at python.org  Wed Dec 17 21:17:49 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Dec 17 21:18:04 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEJJHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEJJHNAB.tim.one@comcast.net>
Message-ID: <1071713869.27808.75.camel@anthem>

On Wed, 2003-12-17 at 20:36, Tim Peters wrote:

> The token statistics database now is a single (but large) mapping from short
> 8-bit strings to 2-tuples of little integers.  The strings are usually less
> than 16 characters, and never a lot longer than that (the tokenizer
> truncates very long strings, synthesizing short "skip" tokens as proxies).

The raw bsddb interface wants keys and values to be strings and for
btree access methods, the length doesn't really matter.  You could
pickle the 2-tuples or just do something easily splittable like '%s|%s'
% two_tuple.
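
Both options can be sketched in a few lines.  The (ham count, spam count)
ordering of the 2-tuple is an assumption here, and `encode_counts` and
`decode_counts` are hypothetical names.

```python
import pickle


def encode_counts(two_tuple):
    """Render a 2-tuple of integers as a splittable string like '3|7'."""
    return "%s|%s" % two_tuple


def decode_counts(text):
    """Parse a string produced by encode_counts() back into a 2-tuple."""
    first, second = text.split("|")
    return int(first), int(second)


encoded = encode_counts((3, 7))
decoded = decode_counts(encoded)

# The pickle alternative round-trips the tuple without a custom format.
pickled = pickle.dumps((3, 7))
unpickled = pickle.loads(pickled)
```

The split form is readable when poking at the database by hand; the
pickle form would absorb a third field, such as a timestamp, without any
change to the parsing code.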

Sounds like one BTree table would do the trick there.

> It would be nice to have other mappings too, like forward and inverse msgid
> <-> bag_of_tokens maps.  A little-integer timestamp may get added to the
> 2-tuples.

Each of those would be a separate table, of course.  bag_of_token maps
sounds like you'd want to pickle the data value.

-Barry


From tim.one at comcast.net  Wed Dec 17 21:47:29 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 17 21:47:33 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071713869.27808.75.camel@anthem>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEJOHNAB.tim.one@comcast.net>

[Tim]
>> The token statistics database now is a single (but large) mapping
>> from short 8-bit strings to 2-tuples of little integers.  The
>> strings are usually less than 16 characters, and never a lot longer
>> than that (the tokenizer truncates very long strings, synthesizing
>> short "skip" tokens as proxies).

[Barry Warsaw]
> The raw bsddb interface wants keys and values to be strings and for
> btree access methods, the length doesn't really matter.  You could
> pickle the 2-tuples or just do something easily splittable like
> '%s|%s' % two_tuple.

We already pickle this stuff, but it goes through the shelve module so
pretends to be transparent.  I want to get shelve out of it anyway, because
shelve adds little value at high cost (there are too many layers of
indirection through Python-level methods now -- slooooow).  There are very
few textual sites where pickle<->unpickle dances are needed (that's already
been cleanly factored out).

> Sounds like one BTree table would do the trick there.

Yup.  We're using BDB hash now.  I don't know that this was a conscious
decision.  I'd ask whether BDB hash or BDB BTree would be faster, but I
don't want to put you on the spot <wink>.

>> It would be nice to have other mappings too, like forward and
>> inverse msgid <-> bag_of_tokens maps.  A little-integer timestamp
>> may get added to the 2-tuples.

> Each of those would be a separate table, of course.  bag_of_token maps
> sounds like you'd want to pickle the data value.

They would be very much like the indices we build for full search in
ZCTextIndex.  This is easy to do with ZODB's IO and OO flavors of BTree,
because BTree values can also be BTrees (etc), and all the pieces are
automagically cut down to reasonably small storage chunks then.  I'd ask
whether BDB supports something similar, but ... <heh>.

the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
    code-ly y'rs  - tim


From barry at python.org  Wed Dec 17 22:17:12 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Dec 17 22:17:23 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJOHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGEJOHNAB.tim.one@comcast.net>
Message-ID: <1071717431.27808.122.camel@anthem>

On Wed, 2003-12-17 at 21:47, Tim Peters wrote:

> I'd ask whether BDB hash or BDB BTree would be faster, but I
> don't want to put you on the spot <wink>.

Oh, you can do better than that.  It's easy: the answer is yes!

> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
>     code-ly y'rs  - tim

But see, I'm actually doing you a favor by resolutely ducking that
responsibility.  How else are you going to be able to fix things when I
take a leave of absence to follow the Britster on her 2-year-long comeback
world tour?  You really will eventually thank me for forcing you to
write it.

anyone-going-on-such-a-tour-has-no-shame-anyway-ly y'rs,
-Barry


From anthony at interlink.com.au  Wed Dec 17 23:48:55 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed Dec 17 23:49:36 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJOHNAB.tim.one@comcast.net> 
Message-ID: <200312180448.hBI4muIJ010785@localhost.localdomain>


>>> "Tim Peters" wrote
> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
>     code-ly y'rs  - tim


Isn't that how I suckered MarkH into working on the Outlook plugin?

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From tim.one at comcast.net  Thu Dec 18 00:08:39 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 00:08:50 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAAEJPGOAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEKHHNAB.tim.one@comcast.net>

[Seth Goodman]
> OK, but there are still a couple of potential problems.

Oh, sure -- but testing is the only judge of what works here.

> 1) Let's say the discarded bi-gram occurs in a spam at a later date.
> Though it was only a hapax, it now contributes nothing.

I doubt it matters.  Most text classification systems (this field is more
than 40 years old, BTW) ignore hapaxes entirely, and also ignore tokens that
don't appear in at least *several* distinct training examples (see Paul
Graham's essay, where he carried on that tradition).  We don't ignore
anything, because testing said it worked better not to ignore anything in
this particular task.  It wasn't a killer-strong improvement to pay
attention to everything, but was a statistically significant win.  Good
enough.

Since then, use in real life, unlike our randomized cross-validation
testing, doesn't see messages "at random" at all:  it sees them ordered in
time.  That appears to make a difference, and actually helps us overall.

After some 16 months of watching this algorithm in various tests and in
practice, I've identified only two clear, repeated effects of hapaxes:

1. Good:  When a spam campaign begins, the hapaxes in its first example
   very often help to nail the upcoming variations in that campaign.
   People with small databases using mistake-based training see this
   dramatically, and it's very handy for them in real-life use.  A
   similar effect helps on the ham side, when training (e.g.) on that
   once-per-month HTML newsletter from (say) American Century Investments,
   which looks very spammy the first time around.  Because legit companies
   pay ad firms small fortunes to establish "brand identity", such
   newsletters are typically *stuffed* with hapaxes identifying the
   source.

2. Bad:  Most spam campaigns fizzle out within a month.  The hapaxes
   stick around, though.  Sooner or later an unusual ham comes across
   that just happens to hit a large number of the leftover spam hapaxes,
   then serves as a "spectacular failure" example here.  They're very
   rare, but very unsettling when they occur (well, likely *because*
   they're so rare for most people).

> 2) Let's say we want to train on a spam with the discarded bi-gram.
> It was originally a hapax, so it should now have an occurrence count
> of two.  After training, it again shows up as a hapax.  This is a
> more significant problem.

Based on what evidence?  Token spamprobs are guesses at best, and an
estimated spamprob based on only one or two examples isn't even reliable to
one significant digit.  The difference between seeing something once or
twice doesn't move a spamprob much, either.  So I have to guess that this
effect is so tiny it will be lost in estimation noise.

In early experiments, the database stored more info, and the test framework
was able to report which features were used *most* often in making a correct
decision.  Several times I took the few hundred "most valuable" features
(based on a combination of how often they contributed to a correct decision,
and their spamprob strength (distance, in either direction, from 0.5)), and
threw them out of the database.  An amazing (at the time) thing was that
this didn't hurt performance -- if the classifier was blinded to what *were*
its best clues, it found another set of clues that did just as well overall.
Performance eventually deteriorated dramatically if this was done over and
over again, but the system has already been shown to be very robust against
losing even its best features.  That's one reason I'm not worried about
throwing away its least useful features (hapaxes have weak spamprobs, and
hapaxes that haven't been *used* in scoring for N days may as well not have
existed at all for the last N days -- and most hapaxes are like that, no
matter how big N is).
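
A minimal sketch of that kind of expiry, assuming each record grows a
little-integer "last used in scoring" day next to its two counts.  The record
layout and the names expire_hapaxes and max_idle_days are illustrative only,
not anything in the SpamBayes source:

```python
def expire_hapaxes(db, today, max_idle_days):
    """Delete hapaxes not used in scoring for more than max_idle_days.

    db maps token -> (spamcount, hamcount, last_used_day); the day fields
    are little integers counting days from an arbitrary epoch.
    Returns the number of tokens removed.
    """
    stale = [
        token
        for token, (nspam, nham, last_used) in db.items()
        if nspam + nham == 1 and today - last_used > max_idle_days
    ]
    for token in stale:
        del db[token]
    return len(stale)


db = {
    "viagra": (40, 0, 100),          # not a hapax: kept regardless of age
    "zx81qq": (1, 0, 10),            # hapax idle for 90 days: expired
    "acme-newsletter": (0, 1, 95),   # hapax used 5 days ago: kept
}
removed = expire_hapaxes(db, today=100, max_idle_days=30)
print(removed, sorted(db))
```

Collecting the stale tokens first, then deleting, avoids mutating the mapping
while iterating over it.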

> 3) Do we eventually reduce the occurrence count of a non-hapax token?

There are many possible schemes.  Strongly storage-conscious schemes only
save a byte or two for a count, and periodically shift all the counts right
by 1 bit, to prevent overflow.  That seems to work very well in systems that
do it.  I've already said here that I see the primary point of expiring
hapaxes as being a means to reduce database size, and in the context of the
much more storage-intensive mixed unigram/bigram scheme.  Hapaxes can
account for the bulk of the storage all by themselves (this isn't unique to
spam filtering, btw -- across many kinds of computer text indexing systems,
hapaxes typically account for about half the content), and most hapaxes are
never seen again.
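
The count-shifting scheme can be sketched roughly like this.  It uses a single
count per token for brevity (SpamBayes keeps separate spam and ham counts), and
dropping tokens whose count reaches zero is an assumption of this sketch, not
something those systems are known to do:

```python
def halve_counts(counts):
    """Shift every count right by one bit, in place.

    Tokens whose count becomes zero are removed, so a token seen only
    once does not survive a halving pass.
    """
    for token in list(counts):
        counts[token] >>= 1
        if counts[token] == 0:
            del counts[token]


counts = {"free": 200, "meeting": 7, "zx81qq": 1}
halve_counts(counts)
print(counts)
```

A real scheme would presumably have to scale the total message counts in step
with the token counts, so the ratios fed into the spamprob guesses stay
roughly comparable.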

I'm experimenting with a mixed unigram/bigram classifier right now.  It's
been trained on (just) 94 ham and 96 spam so far, but there are already
51,378 features in the database.  45,624 of them are hapaxes -- that's 89%!
I could eliminate the rest of the database entirely, and not cut its size
enough to care about.  This is why picking specifically on hapaxes is a
high-value proposition (high potential, low risk).

> If we do, we could eventually have none of the tokens from a trained
> message present but its message count will still be there.  Unless we
> implement your token cross-reference as explained below, the message
> counts will eventually not be correct if we expire enough tokens.

I want to do expiration "correctly".  But even if all the tokens from a
message expire when the total message count is N, that doesn't change the
fact that the counts on the tokens that remain were derived from N messages,
and so N remains the best possible thing to feed into the spamprob guesses.

> If we don't expire a lot of tokens over the long run, why bother?

I expect an enormous number of hapaxes to expire, in steady state
essentially equaling the rate at which they're created by new messages.  In
the example above, 90% of the features created for me right now *are*
hapaxes.  I expect that to drop with more training, but for hapaxes to
remain both the single biggest database consumer, and the least valuable
tokens to retain.

>> ...
>> There's another bullet we haven't bitten yet, saving a map of
>> message id to an explicit list of all tokens produced by that
>> message (Skip wants the inverse of that mapping for diagnostic
>> purposes too).  Given that, training and untraining of individual
>> messages could proceed smoothly despite intervening changes in
>> tokenization details; expiring entire messages would be
>> straightforward; and when expiring an individual feature, it would
>> be enough to remove that feature from each msg->[feature] list it's
>> in (then untraining on a msg later wouldn't *try* to decrement the
>> per-feature count of any feature that had previously been expired
>> individually and appeared in the msg at the time).

> This definitely works.  But why bother tracking, cross-referencing and
> expiring individual tokens when we can just expire whole messages,
> which is a lot simpler?

I doubt that it's simpler at all, and you earlier today sketched quite an
elaborate scheme for expiring different messages at different rates.  That's
got its share of tuning parameters (aka wild-ass guesses <wink>) too, showed
every sign of being just the beginning of its brand of complication, and has
no testing or experience to support it.  We know a lot about the real-life
effects of hapaxes now.

BTW, the single worst thing you can do with a system of this type is train a
message into the wrong category.  Everyone does it eventually, and some
people can't seem to help doing it often.  Maybe that's a UI problem at
heart -- I don't know, because I seem to be unusually resistant to it.  It's
happened to me too, though, and it can be hard to recover.  One sterling use
for a feature -> msg_ids map is, as Skip noted, a way to find out *why* your
latest spam was a false negative:  look at the low-scoring features, then
look at the messages with those features that were trained on as ham.  This
has an excellent shot at pinpointing mis-trained messages.  That's difficult
at best now, and is a real problem for some people.  I've got gigabytes of
unused disk space myself <wink>.
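
A rough sketch of that diagnostic, assuming the trained messages are available
as a msg_id -> (is_spam, features) mapping.  The function names and the sample
data are made up for illustration:

```python
from collections import defaultdict


def build_feature_index(trained):
    """Invert msg_id -> (is_spam, features) into feature -> set of msg_ids."""
    index = defaultdict(set)
    for msg_id, (is_spam, features) in trained.items():
        for feature in features:
            index[feature].add(msg_id)
    return index


def suspect_ham(index, trained, low_scoring_features):
    """Rank ham-trained messages by how many of the given features they hold."""
    hits = defaultdict(int)
    for feature in low_scoring_features:
        for msg_id in index.get(feature, ()):
            if not trained[msg_id][0]:      # only messages trained as ham
                hits[msg_id] += 1
    return sorted(hits, key=lambda m: (-hits[m], m))


trained = {
    "m1": (False, {"mortgage", "rates", "refinance"}),  # possibly mis-trained
    "m2": (False, {"meeting", "rates"}),
    "m3": (True, {"mortgage", "refinance"}),
}
index = build_feature_index(trained)
suspects = suspect_ham(index, trained, ["mortgage", "refinance", "rates"])
print(suspects)
```

The messages at the front of the list are the ones worth eyeballing first when
a spam scores low.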

Evolution of this system would also be served by saving an explicit msg_id ->
features map.  When we change tokenization to get a small win, sometimes the
tokens originally added to a database by training on message M can no longer
be reconstructed by re-tokenizing M (the tokenizer has changed!  if it
always returned exactly what it returned before the change, there wasn't
much point to the change <wink>).  Blindly untraining anyway can violate
database invariants then, eventually manifesting as assertion errors and the
need to retrain from scratch.  The only clear and simple way to prevent this
is to save a map from msg_id to the tokens it originally produced.  Then
untraining simply walks that list, and nothing can go wrong as a result.
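
A minimal sketch of the idea, assuming each token is counted at most once per
message and that a message id is trained at most once.  The TokenLedger class
is hypothetical, not the SpamBayes classifier:

```python
class TokenLedger:
    """Token counts plus a record of exactly what each message contributed."""

    def __init__(self):
        self.counts = {}    # token -> [spamcount, hamcount]
        self.trained = {}   # msg_id -> (is_spam, tuple of tokens)

    def train(self, msg_id, tokens, is_spam):
        tokens = tuple(set(tokens))          # count each token once per msg
        for tok in tokens:
            rec = self.counts.setdefault(tok, [0, 0])
            rec[0 if is_spam else 1] += 1
        self.trained[msg_id] = (is_spam, tokens)

    def untrain(self, msg_id):
        # Walk the saved token list; the message is never re-tokenized, so
        # tokenizer changes made since training cannot corrupt the counts.
        is_spam, tokens = self.trained.pop(msg_id)
        for tok in tokens:
            rec = self.counts[tok]
            rec[0 if is_spam else 1] -= 1
            if rec == [0, 0]:
                del self.counts[tok]


ledger = TokenLedger()
ledger.train("m1", ["Free", "money", "Free"], is_spam=True)
ledger.train("m2", ["money", "meeting"], is_spam=False)
ledger.untrain("m1")
print(ledger.counts)
```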

That's a bit subtle, so takes some long-term experience to appreciate at a
gut level.  Of more immediate concern to most users is that only the
obsessed *want* to save their spam.  Most people want to throw spam away
ASAP.  But, if they do that, we currently have no way to expire any spam
they ever trained on.  Moving toward saving msg_ids <-> features maps solves
that too, and with suitable reuse of little integers for feature ids can
store the relevant bits about trained messages in less space than it takes
to save the original messages.  Note that hapaxes would waste the most
resource in this context too.

>> That's all easy enough to do, but the database grows ever bigger.
>> It would probably need reworking to start using "feature ids"
>> (little integers) too, so that relatively big strings didn't have to
>> get duplicated all over the database.

> No argument there.  How about a 32-bit hash for any token whether
> unigram, bi-gram, etc.?  The token database could then consist of an
> ordered list of 32-bit hashes paired with an occurrence count
> (16-bits would probably do it).  That's only six bytes/token, and you
> could use your indexing method of choice, if any, to speed up the
> lookups.

We ran experiments on that before, and results were dreadful.  32-bit hashes
have far too high a collision rate on a sizable database (don't forget the
Birthday Paradox here!), confusing ham with spam in highly entertaining ways
(provided you're just experimenting and don't really care how well it does).
An MD5 or SHA-1 hash would be fine, but then it's up to 16 or 20 bytes per
feature, and most of the strings we store in the current pure unigram scheme
are shorter than that.  A 64-bit hash would probably be OK.
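
The Birthday Paradox arithmetic can be sketched as below, assuming an ideal
uniform hash.  The feature count of 200,000 is just an illustrative database
size, not a figure from those experiments:

```python
import math


def collision_stats(n_features, hash_bits):
    """Return (expected colliding pairs, P(at least one collision)).

    Uses the standard approximation for n_features distinct strings
    hashed uniformly into 2**hash_bits buckets.
    """
    buckets = 2 ** hash_bits
    pairs = n_features * (n_features - 1) / 2
    expected = pairs / buckets
    p_any = -math.expm1(-expected)      # 1 - exp(-expected), kept accurate
    return expected, p_any


for bits in (32, 64):
    expected, p_any = collision_stats(200_000, bits)
    print(bits, expected, p_any)
```

The expected number of colliding pairs grows with the square of the number of
features, which is why adding bits to the hash helps so much more than trimming
the database does.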

Another hated (widely in this project, among the developers) consequence of
using hash codes is that mining the database for clues is useless then.
"Hey, hash code 45485448 is your strongest spam clue!"  "Oh -- no wonder,
then" <wink>.  Storing the actual feature strings as plainly as possible is
extremely helpful for development, debugging, and research.

> Similarly, if we implemented a message database with this method, each
> token in a message would only take up four bytes.  The hash calculation
> costs something, but the smaller database size and quicker lookup time
> could make up for it.

We're not going to abandon plain strings, because they're far too useful and
loved in various reports intended for human consumption.  Adding feature_id
<-> feature_string maps would allow for effective compression of message
storage.
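
A small sketch of such a pair of maps, with the plain strings kept around for
reports.  FeatureTable and its method names are invented for illustration:

```python
class FeatureTable:
    """Two-way map between feature strings and little-integer ids."""

    def __init__(self):
        self._id_of = {}        # feature string -> id
        self._string_of = []    # id -> feature string

    def intern(self, feature):
        """Return the id for feature, assigning the next free id if new."""
        fid = self._id_of.get(feature)
        if fid is None:
            fid = len(self._string_of)
            self._id_of[feature] = fid
            self._string_of.append(feature)
        return fid

    def string(self, fid):
        """Recover the original string for human-readable reports."""
        return self._string_of[fid]


table = FeatureTable()
msg_features = {
    "m1": [table.intern(f) for f in ["free", "money", "subject:offer"]],
    "m2": [table.intern(f) for f in ["free", "meeting"]],
}
print(msg_features)
print([table.string(fid) for fid in msg_features["m2"]])
```

Each message then stores a short list of integers, and every feature string is
stored exactly once no matter how many messages contain it.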


From tim.one at comcast.net  Thu Dec 18 00:38:55 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 00:38:57 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677773@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKKHNAB.tim.one@comcast.net>

[Tony Meyer]
> I think part of the Japanese/Asian languages patch which I keep
> meaning to look more closely into has these turn into unicode strings
> (how many bits is that?  I know nothing much about unicode; English
> is good enough for me <wink>).
>
> (Just in case someone was about to implement a new spambayes db
> system with only 8-bit tokens).

Overall, I'd encourage them in that vice.  I did all I could to keep
SpamBayes neutral across European "Latin-insert-your-favorite-number"
languages, except for the non-default Anglocentric replace_nonascii_chars
option.  That's why I favored split-on-whitespace as the only msg body
lexing gimmick (of course it helped a lot that s-o-w did best in tests
across all lexing schemes ever tried!); have consistently resisted attempts
to add knowledge about "punctuation" (except in header-line contexts, where
standards constrain the permitted characters); haven't voiced any support
for gimmicks like "map Latin-1 into letters that look more like the ones I'm
used to" (but as the replace_nonascii_chars perpetrator, couldn't oppose
them in good conscience as options either <wink>); and haven't written a u''
literal anywhere in the source.

My belief is that Asian languages are so different in what they would need
to do a good job that someone wanting that would be better off forking the
project.  I really don't want to see masses of deeply different algorithms
all slammed into the same codebase, not even if "the cost" were just
massively refactoring SpamBayes to add another two layers of expensive
indirection.  SpamBayes isn't required to be all things to all people.

I haven't studied the patch you're talking about, so maybe it's just a
one-liner <wink>.  Alas, I'm aware of it, and have read the patch comments,
and the panic above is a fair reflection of my first, second, and third
reactions.

As to how many bits are in a Unicode string, you don't want to know.  "It
depends."   Pickles store them in an Anglocentric format (UTF-8) that
happens to consume exactly the same number of bytes as now if the string
consists of just US ASCII characters.  The memory burden is much larger,
though (Python Unicode string objects are big beasts).
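
The "it depends" part is easy to see by encoding a few strings.  This sketch
assumes a current Python, where str is already Unicode, rather than the 2.3-era
unicode type; the sample strings are arbitrary:

```python
def utf8_size(text):
    """Number of bytes the string occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Pure ASCII, one Latin-1 accented letter, and three CJK characters.
samples = ["viagra", "caf\u00e9", "\u65e5\u672c\u8a9e"]
for s in samples:
    print(ascii(s), len(s), utf8_size(s))
```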


From tim.one at comcast.net  Thu Dec 18 00:49:04 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 00:49:05 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage 
In-Reply-To: <200312180448.hBI4muIJ010785@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKLHNAB.tim.one@comcast.net>

[Tim]
>> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
>>     code-ly y'rs  - tim

[Anthony Baxter]
> Isn't that how I suckered MarkH into working on the Outlook plugin?

Yes it is!  It's *also* how Barry suckered me into writing the spambayes
tokenizer and classifier to begin with -- although he's conveniently
forgetting the karmic reciprocal obligation now <wink>.


From tameyer at ihug.co.nz  Thu Dec 18 01:44:11 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Dec 18 01:44:20 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C0EE3@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677778@its-xchg4.massey.ac.nz>

[Tim]
> Note that Jeremy already wrote code to run spambayes via ZEO, 
> in the project's pspam/ directory.  I don't know how much 
> bitrot that's suffered.

Not as much as I had thought.  I believe the check-ins I just made get it
working again - at least the three main scripts (pop.py, scoremsg.py and
update.py) appear to do what they are meant to.  It works without
socket.AF_UNIX now, too.

It is still separate from everything else, of course, but it does work
again...

=Tony Meyer


From barry at python.org  Thu Dec 18 02:02:09 2003
From: barry at python.org (Barry Warsaw)
Date: Thu Dec 18 02:02:18 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEKLHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGEKLHNAB.tim.one@comcast.net>
Message-ID: <1071730928.17717.13.camel@anthem>

On Thu, 2003-12-18 at 00:49, Tim Peters wrote:
> [Tim]
> >> the-best-thing-to-do-with-consultants-is-shame-them-into-writing-the-
> >>     code-ly y'rs  - tim
> 
> [Anthony Baxter]
> > Isn't that how I suckered MarkH into working on the Outlook plugin?
> 
> Yes it is!  It's *also* how Barry suckered me into writing the spambayes
> tokenizer and classifier to begin with -- although he's conveniently
> forgetting the karmic reciprocal obligation now <wink>.

Nope, I'm just younger than you, and old guys are so easily duped.

-Barry


From anthony at interlink.com.au  Thu Dec 18 04:52:35 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Thu Dec 18 04:53:00 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEKLHNAB.tim.one@comcast.net> 
Message-ID: <200312180952.hBI9qZC6005810@localhost.localdomain>


>>> "Tim Peters" wrote
> Yes it is!  It's *also* how Barry suckered me into writing the spambayes
> tokenizer and classifier to begin with -- although he's conveniently
> forgetting the karmic reciprocal obligation now <wink>.

I'm sure this is some sort of standard method of getting things 
done in the opensource world. Eric Raymond's Cathedral and Bazaar
metaphor extends here, of course - in a bazaar often you end up
getting suckered.

now-who-took-my-wallet,
Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From skip at pobox.com  Thu Dec 18 08:41:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 18 08:41:11 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEKHHNAB.tim.one@comcast.net>
References: <MHEGIFHMACFNNIMMBACAAEJPGOAA.nobody@spamcop.net>
	<LNBBLJKPBEHFEDALKOLCMEKHHNAB.tim.one@comcast.net>
Message-ID: <16353.44663.34193.301968@montanaro.dyndns.org>


    Tim> I'm experimenting with a mixed unigram/bigram classifier right now.
    Tim> It's been trained on (just) 94 ham and 96 spam so far, but there
    Tim> are already 51,378 features in the database.  45,624 of them are
    Tim> hapaxes -- that's 89%!

Late yesterday afternoon I tweaked my procmailrc file to automatically train
on everything which scored as ham or spam.  I awoke this morning to a
database with 489 spam, 600 ham and 198,747 features, 158,116 of which were
hapaxes (80%).  At the same time I moved my ham/spam thresholds closer to 0
and 1 to minimize the amount of retraining necessary to counteract false
positives and false negatives.  (It's kind of a pain because I'm also saving
the messages I train on, so I have to rummage around in a Unix mbox to find
incorrectly trained messages.)  I train unsures by hand.  Still only 16
unsures overnight, but my database is up to 10.5MB, so training and scoring
time is on the rise.

Bringing it back to this topic, hapax expiration seems like both a
worthwhile step to take from space/time considerations, and even less likely
to produce problems because I'm training on everything I see.

Now if I could only test this setup easily without a huge time investment.
Perhaps a few more Emacs keybindings are in order.

    Tim> BTW, the single worst thing you can do with a system of this type
    Tim> is train a message into the wrong category.  Everyone does it
    Tim> eventually, and some people can't seem to help doing it often.

:-)

Skip

From kennypitt at hotmail.com  Thu Dec 18 10:01:42 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 18 10:02:23 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071710513.27808.62.camel@anthem>
Message-ID: <LAW11-OE16Aaaz7EjQx0000d3bf@hotmail.com>

Barry Warsaw wrote:
> On Wed, 2003-12-17 at 18:25, Richie Hindle wrote:
>>  o Since BDB uses bsddb in transaction mode rather than single-file
>>    mode, we can say goodbye to those nasty little DBRunRecovery
>> errors.  Yay! 
> 
> That's the hope, anyway. :)

Unfortunately I don't think we can say that with any confidence until we
know why they are occurring in the first place.  The following comments
apply to using bsddb directly.  I'm not familiar with ZODB, and don't
know if they are already handling all of these issues.

The BerkeleyDB docs say this:

"""
Errors can occur in the Berkeley DB library where the only solution is
to shut down the application and run recovery (for example, if Berkeley
DB is unable to allocate heap memory). When a fatal error occurs in
Berkeley DB, methods will throw a DbRunRecoveryException, at which point
all subsequent database calls will also fail in the same way. When this
occurs, recovery should be performed.
""" (http://www.sleepycat.com/docs/api_cxx/runrec_class.html)

This seems to indicate that this problem can be caused by more than just
threading problems.  It also says this:

"""
When building transactionally protected applications, there are some
special issues that must be considered. The most important one is that
if any thread of control exits for any reason while holding Berkeley DB
resources, recovery must be performed...
""" (http://www.sleepycat.com/docs/ref/transapp/app.html)

This seems very clear that using full transactional mode does not
protect you from DbRunRecovery errors.

I wonder if the only real solution is to run recovery when opening the
database.  This should be easy for the Outlook add-in, sb_server, etc.
where a single, long-running process performs all database access (just
specify the DB_RECOVER flag when opening the environment).  Running
recovery requires that only one thread in one process has access to the
database environment until recovery is complete.  This might be harder
to accomplish for apps such as sb_filter that could be run as multiple
simultaneous processes.

-- 
Kenny Pitt


From barry at python.org  Thu Dec 18 10:10:22 2003
From: barry at python.org (Barry Warsaw)
Date: Thu Dec 18 10:10:29 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LAW11-OE16Aaaz7EjQx0000d3bf@hotmail.com>
References: <LAW11-OE16Aaaz7EjQx0000d3bf@hotmail.com>
Message-ID: <1071760220.26140.3.camel@anthem>

On Thu, 2003-12-18 at 10:01, Kenny Pitt wrote:

> This seems very clear that using full transactional mode does not
> protect you from DbRunRecovery errors.

True, but it makes them rarer.

> I wonder if the only real solution is to run recovery when opening the
> database.  This should be easy for the Outlook add-in, sb_server, etc.
> where a single, long-running process performs all database access (just
> specify the DB_RECOVER flag when opening the environment).  Running
> recovery requires that only one thread in one process has access to the
> database environment until recovery is complete.  This might be harder
> to accomplish for apps such as sb_filter that could be run as multiple
> simultaneous processes.

In all the BDB apps I've written, I always pass the DB_RECOVER flag to
the open call.  Except for coordinating the above, it's harmless if
recovery doesn't need to happen.

-Barry


From kennypitt at hotmail.com  Thu Dec 18 10:16:51 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 18 10:17:27 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071711060.27808.67.camel@anthem>
Message-ID: <Law11-OE35oXeExPKpQ0000d5a4@hotmail.com>

Barry Warsaw wrote:
> On Wed, 2003-12-17 at 20:16, Tim Peters wrote:
> 
>> Having never written anything myself using bsddb3's "real"
>> interface, I can't say how hard that would be.  I *expect* it would
>> actually be easy for someone with a non-trivial understanding of
>> BDB.  The only use we have for BDB now is to use it as if it were a
>> giant dict -- it probably doesn't get any simpler than that.
> 
> If you map all square-bracket setitems to .put()'s and square-bracket
> getitems to .get()'s, it's fairly straightforward.  That is, provided
> you can define the transaction boundaries so you can call txn begin,
> abort, and commit at the Right Times.  You will want to pass the BDB
> txn object into the .gets and .puts to make it all work smoothly. 
> Add a little extra goo to create the environment if it doesn't exist
> (or join it if it does), and viola!  or contrabasso!

The bsddb package includes a dbshelve module that handles all the
required dictionary access methods to provide compatibility with
standard shelve functionality.  It also allows specifying the DB_ENV
when opening the database.  The only thing it doesn't seem to handle is
transactions, but I'm not convinced we need that.

Transactions are only really important if you are updating several
related entries, and need to be able to rollback the whole lot if any
one of them fails.  There are some points in SpamBayes that could be
reworked to use transactions (e.g. rollback all token count updates for
a single message if we can't update them all), but I don't think that
has anything to do with the DbRunRecovery errors.  The important thing
re our suspected cause would be the multi-thread and multi-process
locking, and that can be used independently of transactions.
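
For what it's worth, the all-or-nothing update for one message can be sketched
without BDB at all, by staging the new counts and writing them only if every
one succeeded.  This uses a plain dict as a stand-in for the database; the
16-bit cap and the train_message name are assumptions made up for the example:

```python
def train_message(counts, tokens, is_spam, max_count=0xFFFF):
    """Apply all per-token increments for one message, or none of them.

    counts maps token -> (spamcount, hamcount).  New values are staged in
    a separate dict, so a failure part-way through leaves counts untouched.
    """
    staged = {}
    for tok in set(tokens):
        spam, ham = counts.get(tok, (0, 0))
        if is_spam:
            spam += 1
        else:
            ham += 1
        if max(spam, ham) > max_count:
            raise OverflowError("count overflow for %r" % tok)
        staged[tok] = (spam, ham)
    counts.update(staged)       # the "commit": nothing was written before


counts = {"free": (0xFFFF, 0), "money": (1, 0)}
try:
    train_message(counts, ["free", "money", "offer"], is_spam=True)
except OverflowError:
    pass                        # counts is exactly as it was
train_message(counts, ["money", "offer"], is_spam=True)
print(counts)
```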

-- 
Kenny Pitt


From tim.one at comcast.net  Thu Dec 18 10:21:15 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 10:21:13 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <1071760220.26140.3.camel@anthem>
Message-ID: <LNBBLJKPBEHFEDALKOLCGENFHNAB.tim.one@comcast.net>

[Barry]
> ...
> In all the BDB apps I've written, I always pass the DB_RECOVER flag to
> the open call.  Except for coordinating the above, it's harmless if
> recovery doesn't need to happen.

OTOH, don't you also do some *seemingly* senseless dance (revealed to you by
a SleepyCat guy) involving back-to-back checkpoints so that the next
harmless recovery doesn't take forever not to do any harm <wink>?

That's probably all wrong, but it might jog your memory.


From barry at python.org  Thu Dec 18 10:29:32 2003
From: barry at python.org (Barry Warsaw)
Date: Thu Dec 18 10:29:40 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGENFHNAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGENFHNAB.tim.one@comcast.net>
Message-ID: <1071761371.26140.7.camel@anthem>

On Thu, 2003-12-18 at 10:21, Tim Peters wrote:
> [Barry]
> > ...
> > In all the BDB apps I've written, I always pass the DB_RECOVER flag to
> > the open call.  Except for coordinating the above, it's harmless if
> > recovery doesn't need to happen.
> 
> OTOH, don't you also do some *seemingly* senseless dance (revealed to you by
> a SleepyCat guy) involving back-to-back checkpoints so that the next
> harmless recovery doesn't take forever not to do any harm <wink>?

That's why the Sleepycat guy told me to checkpoint occasionally (the BDB
storages do that in a thread), /and/ to force a checkpoint twice before
closing the database.  I'm sure the latter is mostly voodoo, but our
faith gives us the strength of conviction.
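
Put together with the DB_RECOVER advice, the open/close dance looks roughly
like the sketch below.  The binding module is passed in as db (for example
"from bsddb3 import db") so nothing here depends on which binding is
installed; the helper names and the exact set of DB_INIT_* flags are
assumptions of this sketch, not a copy of the BDB storage code:

```python
def open_env(db, home):
    """Open (or create) a transactional environment, running recovery first.

    db is the Berkeley DB binding module; home is the environment directory.
    Only one thread in one process may touch the environment until this
    call returns.
    """
    flags = (db.DB_CREATE | db.DB_RECOVER | db.DB_INIT_MPOOL |
             db.DB_INIT_LOCK | db.DB_INIT_LOG | db.DB_INIT_TXN)
    env = db.DBEnv()
    env.open(home, flags)
    return env


def close_env(db, env):
    """Force two checkpoints, then close the environment."""
    env.txn_checkpoint(0, 0, db.DB_FORCE)
    env.txn_checkpoint(0, 0, db.DB_FORCE)
    env.close()


# Hypothetical usage, with the environment directory chosen by the caller:
#     from bsddb3 import db
#     env = open_env(db, "/path/to/envdir")
#     ...
#     close_env(db, env)
```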

-Barry


From kennypitt at hotmail.com  Thu Dec 18 12:34:32 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 18 12:35:08 2003
Subject: [spambayes-dev] Broken link on website
Message-ID: <LAW11-OE13rQ61fM3C60000d573@hotmail.com>

Just discovered a (partially) broken link on the website.  On the
Windows page (http://spambayes.sourceforge.net/windows.html) in the "Non
Outlook Solutions" section, the POP3 link goes to the correct page but
the wrong bookmark.  The link references a bookmark of "#pop3", but the
anchor tag on the destination page now uses the bookmark name
"sb_server".

-- 
Kenny Pitt


From skip at pobox.com  Thu Dec 18 12:56:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 18 12:56:50 2003
Subject: [spambayes-dev] Broken link on website
In-Reply-To: <LAW11-OE13rQ61fM3C60000d573@hotmail.com>
References: <LAW11-OE13rQ61fM3C60000d573@hotmail.com>
Message-ID: <16353.60003.407697.88323@montanaro.dyndns.org>


    Kenny> ... the POP3 link goes to the correct page but the wrong
    Kenny> bookmark.  The link references a bookmark of "#pop3", but the
    Kenny> anchor tag on the destination page now uses the bookmark name
    Kenny> "sb_server".

Thanks.  Should be fixed.

Skip

From tim.one at comcast.net  Thu Dec 18 14:09:09 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 14:09:07 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE35oXeExPKpQ0000d5a4@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEPAHNAB.tim.one@comcast.net>

[Kenny Pitt]
> The bsddb package includes a dbshelve module that handles all the
> required dictionary access methods to provide compatibility with
> standard shelve functionality.  It also allows specifying the DB_ENV
> when opening the database.

Speaking of which, 4 of the test_bsddb3.py tests fail on Win98SE with the
soon-to-be-released Python 2.3.3 (which is at least as well as that test has
ever done on that platform).  The 4 failing tests all exercise the dbshelve
module:

ERROR: test01_basics (bsddb.test.test_dbshelve.EnvBTreeShelveTestCase)
ERROR: test01_basics (bsddb.test.test_dbshelve.EnvHashShelveTestCase)
ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadBTreeShelveTestCase)
ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadHashShelveTestCase)

and all die with the same traceback and error:

Traceback (most recent call last):
  File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 75, in test01_basics
    self.do_open()
  File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 238, in do_open
    self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL | db.DB_CREATE)
DBAgainError: (11, 'Resource temporarily unavailable -- unable to join the environment')

If that isn't just an artifact of something else the test suite is doing,
it's enough to kill the idea of using dbshelve on Windows.

> The only thing it doesn't seem to handle is transactions, but I'm not
> convinced we need that.
>
> Transactions are only really important if you are updating several
> related entries, and need to be able to rollback the whole lot if any
> one of them fails.

I expect a transaction commit supplies a natural and useful boundary for
doing a database checkpoint operation (see earlier email w/ Barry; making
frequent checkpoints is probably important so that running recovery when the
database is opened runs quickly).

> ...
> The important thing re our suspected cause would be the multi-thread
> and multi-process locking, and that can be used independently of
> transactions.

Gregory Smith found and fixed several bugs in the bsddb3 use-it-like-a-dict
wrappers we've *been* using, all related to concurrent access.
Unfortunately, it doesn't look like anyone backported those fixes for the
Python 2.3 release (the last few checkins only exist on the trunk, which is
Python 2.4 development).

Given the history of bsddb3 support so far, I think we'll be best off using
the Berkeley-native APIs as directly as possible, avoiding "convenience
wrappers" like the plague.  Very little of our code interacts with the
database directly, and bugs in those wrappers have probably caused hundreds
of times more hours of bug-chasing than would have been required to write a
few extra lines of lower-level code.  Of course, using the Berkeley-native
API directly should run faster too, but I don't hold that against it
*too* much <wink>.


From tim.one at comcast.net  Thu Dec 18 14:46:16 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 14:46:16 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted Junk
	emailfolder.
In-Reply-To: <009e01c3c493$2489c0b0$2c00a8c0@eden>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPEHNAB.tim.one@comcast.net>

[Tim]
>> I wonder whether the Outlook addin should stop trying to remember
>> Outlook's internal folder IDs, remember the user-visible string
>> paths instead, and enumerate the folders to (re)discover the
>> internal Outlook IDs "whenever anything may have changed".

[Mark Hammond]
> I'm not sure what you had in mind for "anything may have changed",

In the limit, I suppose that means finding the folder object again from
scratch every time a folder object is needed.  Anything else is just
optimization <wink>.

> but in general, I agree.  I always had the idea that we would also
> store the FQN, and fall back to that when necessary, making the
> folder ID more a "cached" value.  It just never happened.  It does
> get complex though - what happens when the user renames the folder?
> Before you know it, we have even more cruft that no one really
> understands why it's there <wink>

It's a part of Outlook's model that doesn't make sense to people.  When my
sister, for example, renames or moves a folder holding Word documents, she'd
be baffled if Word *did* magically notice this.  The idea that a data object
is accessed by, and only by, its current "string path", has been beat into
her by Explorer, by all the other Office programs, and-- for that
matter --by all other programs she uses too.

Outlook is the oddball exception here, but only wrt its internal objects.  I
remember how surprised *I* was the first time I changed the name of a folder
that was the "move to" target of an Outlook rule, and the rule kept moving
things into the renamed folder; I had expected Outlook to screw up, or at
best to pop up an error box the next time the rule conditions fired, telling
me the rule no longer made sense.  That would have been fine by me.

So we can play along with Outlook's model, and have 99% of our users wonder
why SpamBayes deletes all their email <wink>, or match the mental model
everyone (except Outlook experts) has from the start.  I agree they're
deeply incompatible, but each gives a clear answer to questions like "what
happens when the user renames the folder?".
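
A minimal sketch of the "remember the string path, rediscover the ID" idea,
using a toy in-memory folder tree rather than any real Outlook API.  Folder,
FolderRef and find_by_path() are hypothetical names invented for
illustration; the only point is that the internal ID becomes a cached value
that is refreshed on every lookup:

```python
class Folder:
    """Toy stand-in for a mail folder: stable internal ID plus visible name."""

    def __init__(self, folder_id, name, children=()):
        self.folder_id = folder_id
        self.name = name
        self.children = list(children)


def find_by_path(root, path):
    """Follow an 'A/B/C' path below root; return None if any part is missing."""
    node = root
    for part in path.split("/"):
        node = next((c for c in node.children if c.name == part), None)
        if node is None:
            return None
    return node


class FolderRef:
    """Remember the user-visible path; treat the internal ID as a cache only."""

    def __init__(self, path):
        self.path = path
        self.cached_id = None

    def resolve(self, root):
        # Rediscover the folder from scratch every time it is needed.
        folder = find_by_path(root, self.path)
        self.cached_id = folder.folder_id if folder is not None else None
        return folder
```

With this model, renaming the old Spam folder and creating a new one named
"Spam" makes resolve() return the new folder, and deleting it makes
resolve() return None, which is a "hard" error the caller can report.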

> Another alternative would be to change things so that most errors
> re-displayed the config wizard.

I'm not sure how that could help.  The problem people have now is that there
*aren't* any "hard" errors after they delete their Spam or Unsure folders by
mistake -- SpamBayes plays along with Outlook's "the name and path are
irrelevant, I still know where the *object* is", and users have no idea
Outlook works that way.  What they do instead is create a new Spam folder
directly under Personal Folders, and then are baffled again because that
doesn't work either (for the same reason -- they're thinking of string name
and path, not Outlook object identity).

> ...
> Either way, I'm going for a new combined binary before this even gets
> a look in <wink>

+1


From kennypitt at hotmail.com  Thu Dec 18 14:53:08 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 18 14:53:45 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEPAHNAB.tim.one@comcast.net>
Message-ID: <LAW11-OE217ByNP29M50000d7e1@hotmail.com>

Tim Peters wrote:
> [Kenny Pitt]
>> The bsddb package includes a dbshelve module that handles all the
>> required dictionary access methods to provide compatibility with
>> standard shelve functionality.  It also allows specifying the DB_ENV
>> when opening the database.
> 
> Speaking of which, 4 of the test_bsddb3.py tests fail on Win98SE with
> the soon-to-be-released Python 2.3.3 (which is at least as well as
> that test has ever done on that platform).  The 4 failing tests all
> exercise the dbshelve module:
> 
> ERROR: test01_basics (bsddb.test.test_dbshelve.EnvBTreeShelveTestCase)
> ERROR: test01_basics (bsddb.test.test_dbshelve.EnvHashShelveTestCase)
> ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadBTreeShelveTestCase)
> ERROR: test01_basics (bsddb.test.test_dbshelve.EnvThreadHashShelveTestCase)
> 
> and all die with the same traceback and error:
> 
> Traceback (most recent call last):
>   File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 75, in
> test01_basics
>     self.do_open()
>   File "C:\CODE\23\lib\bsddb\test\test_dbshelve.py", line 238, in do_open
>     self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL | db.DB_CREATE)
> DBAgainError: (11, 'Resource temporarily unavailable -- unable to join the
> environment')
> 
> If that isn't just an artifact of something else the test suite is
> doing, it's enough to kill the idea of using dbshelve on Windows.

Notice that the exception is occurring in test_dbshelve itself, not in
the dbshelve module.  I have a sneaking suspicion that this would be a
general problem with bsddb on Win98, and not just if we used dbshelve.
dbshelve doesn't do a whole lot besides pickling and unpickling the item
values before calling the direct API.

The "Windows Notes" page in the Berkeley DB docs says:

"""
On Windows/9X, files opened by multiple processes do not share data
correctly. For this reason, the DB_SYSTEM_MEM flag is implied for any
application that does not specify the DB_PRIVATE flag, causing the
system paging file to be used for sharing data.
""" (http://www.sleepycat.com/docs/ref/build_win/notes.html)

Possibly related?  I assume DBAgainError maps to error code EAGAIN in
the C API, which means "The shared memory region was locked and
(repeatedly) unavailable."  Maybe something isn't getting released
properly from the earlier non-environment tests.

-- 
Kenny Pitt


From tim.one at comcast.net  Thu Dec 18 15:45:46 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 15:45:44 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LAW11-OE217ByNP29M50000d7e1@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPMHNAB.tim.one@comcast.net>

[Kenny Pitt]
> ...
>
> Notice that the exception is occurring in test_dbshelve itself, not in
> the dbshelve module.  I have a sneaking suspicion that this would be a
> general problem with bsddb on Win98, and not just if we used dbshelve.
> dbshelve doesn't do a whole lot besides pickling and unpickling the
> item values before calling the direct API.
>
> The "Windows Notes" page in the Berkeley DB docs says:

Sorry, I can't make more time for this.  The Berkeley docs don't mention any
Berkeley bugs on Win9x, just cautions.  The Berkeley wrappers "we" (Python)
wrote have had a miserable history on Windows, which is why I'll just repeat
that we'll be better off avoiding "our" convenience wrappers like the
plague, sticking as close to base Berkeley as possible.

This isn't getting better.  The bsddb3 tests on Win98SE on the current
Python CVS trunk suffer 6 errors and 48 failures, up from 4 errors and 0
failures under 2.3.3:  Gregory's attempts to "fix the wrappers" have
actually made things much worse on Win98SE, and this can't be pinned on
Sleepycat because the Sleepycat distro in use hasn't changed.

OTOH, the ZODB test suite exercises ZODB-on-BDB, which is also coded in
Python, and is actually more likely to fail on Linux than on Win98SE these
days (presumably due to thread-race bugs in our test setup).  Barry coded
ZODB-on-BDB using the Berkeley-native API, and *that* hasn't given us any
headaches on any flavor of Windows.


From kennypitt at hotmail.com  Thu Dec 18 16:32:15 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Dec 18 16:33:00 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEPMHNAB.tim.one@comcast.net>
Message-ID: <Law11-OE59xgSAI1jEX0000d8c4@hotmail.com>

Tim Peters wrote:
> [Kenny Pitt]
>> ... I have a sneaking suspicion that this would
>> be a general problem with bsddb on Win98, and not just if we used
>> dbshelve. dbshelve doesn't do a whole lot besides pickling and
>> unpickling the item values before calling the direct API.
> 
> Sorry, I can't make more time for this.

Understood.  Just one more quick question if you would, since I think I
may have misunderstood where you were coming from.

> ... The Berkeley
> wrappers "we" (Python) wrote have had a miserable history on Windows,
> which is why I'll just repeat that we'll be better off avoiding "our"
> convenience wrappers like the plague, sticking as close to base
> Berkeley as possible. 
> 
> [snip]
> 
> OTOH, the ZODB test suite exercises ZODB-on-BDB, which is also coded
> in Python, and is actually more likely to fail on Linux than on
> Win98SE these days (presumably due to thread-race bugs in our test
> setup).  Barry coded ZODB-on-BDB using the Berkeley-native API, and
> *that* hasn't given us any headaches on any flavor of Windows.

So are you recommending that we avoid using the whole _bsddb.pyd binary
package?  I originally thought you were referring only to the
Python-coded wrappers.

-- 
Kenny Pitt


From tameyer at ihug.co.nz  Thu Dec 18 17:30:23 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Dec 18 17:30:32 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted
	Junkemailfolder.
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C11A2@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A10@its-xchg4.massey.ac.nz>

[Mark]
> It does get complex though - what happens when the user renames
> the folder?

[Tim]
> It's a part of Outlook's model that doesn't make sense to 
> people.  When my sister, for example, renames or moves a 
> folder holding Word documents, she'd be baffled if Word *did* 
> magically notice this.  The idea that a data object is 
> accessed by, and only by, its current "string path", has been 
> beat into her by Explorer, by all the other Office programs, 
> and-- for that matter --by all other programs she uses too.

This surprised me - I've usually found people don't understand why Explorer
et al can't keep track of things when they are renamed.  I think maybe the
difference is that most of these people started out using Macs rather than
Windows, and Macs have always managed this better (compare aliases to
shortcuts, for example).

Of course, the Outlook plug-in is for Windows users, so that's not all that
relevant ;)

+1 to switching to names in a release at some point in the future.  How
often do Outlook 'experts' change the names of their spam/unsure folders
anyway?

=Tony Meyer


From tim.one at comcast.net  Thu Dec 18 18:04:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 18:04:21 2003
Subject: [spambayes-dev] RE: [Spambayes] Accidentally deleted
	Junkemailfolder.
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A10@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBBHOAB.tim.one@comcast.net>

[Tony Meyer]
> This surprised me - I've usually found people don't understand why
> Explorer et al can't keep track of things when they are renamed.  I
> think maybe the difference is that most of these people started out
> using Macs rather than Windows, and Macs have always managed this
> better (compare aliases to shortcuts, for example).

I said Windows programs "beat it into her" because it wasn't natural at
first.  But I'm not sure if anything explainable could have captured initial
expectations -- I think people quickly forget how overwhelmingly complicated
most GUIs are at first glance.  Hell, there are still things I did in Visual
Studio 5 years ago that I've never been able to find again <0.5 wink>.

> ...
> +1 to switching to names in a release at some point in the future.

I'm not voting yet -- still something to mull over.  I guess that's +0,
then.

> How often do Outlook 'experts' change the names of their spam/unsure
> folders anyway?

Not often.  I did several times at the start, while working out training
strategies I could live with.

There's an inconsistency here:  if you move an Outlook folder by dragging it
to a *different* .pst file in your Folder List display, *then* Outlook rules
(and SpamBayes) lose track of it entirely.  Any rules that reference it turn
themselves off.  So sometimes Outlook makes you reselect the folder after a
change, and sometimes it doesn't.  I can guarandamntee that my sisters will
never grow a mental model of "OK, folders in Outlook work by object
identity, not name, but object identity is relative to the containing .pst
file" <snort>.


From nobody at spamcop.net  Thu Dec 18 18:41:04 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Thu Dec 18 18:41:06 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEKHHNAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACAEENBGOAA.nobody@spamcop.net>

Tim,

Thanks for taking the time to construct such a complete set of answers.  I
learned a lot from it and I assume other list readers did as well.

> > [Seth Goodman]
> > If we do, we could eventually have none of the tokens from a trained
> > message present but its message count will still be there.  Unless we
> > implement your token cross-reference as explained below, the message
> > counts will eventually not be correct if we expire enough tokens.
>
> [Tim Peters]
> I want to do expiration "correctly".  But even if all the tokens from a
> message expire when the total message count is N, it still doesn't change
> that counts on tokens that remain were in fact derived from N messages, and
> so N remains the best possible thing to feed into the spamprob guesses.

Not really.  If you decrement all the token counts from a trained message,
the database is in the exact same state as it was before you trained on that
message (ignoring subsequent messages trained).  At that point, the trained
message count was N-1, so that is the best thing to use for the probability
calculation rather than N.  The message count will keep increasing as you
train new messages but the token database will eventually level off.  That
suggests that the trained message counts will become too large as time goes
on.

If you only expire hapaxes, perhaps the incorrect message count is a
technicality and won't have a significant effect on the spam probabilities.
But unless you expire non-hapaxes as well, the token database can't track a
changing message stream very well.  Once you start expiring non-hapax tokens
(is there a name for these?), my guess is that you can no longer ignore the
incorrect message count issue.  So how _do_ you do expiration "correctly" if
not by whole messages?

> >> [Tim Peters]
> >> ...
> >> There's another bullet we haven't bitten yet, saving a map of
> >> message id to an explicit list of all tokens produced by that
> >> message (Skip wants the inverse of that mapping for diagnostic
> >> purposes too).  Given that, training and untraining of individual
> >> messages could proceed smoothly despite intervening changes in
> >> tokenization details; expiring entire messages would be
> >> straightforward; and when expiring an individual feature, it would
> >> be enough to remove that feature from each msg->[feature] list it's
> >> in (then untraining on a msg later wouldn't *try* to decrement the
> >> per-feature count of any feature that had previously been expired
> >> individually and appeared in the msg at the time).
>
> > [Seth Goodman]
> > This definitely works.  But why bother tracking, cross-referencing and
> > expiring individual tokens when we can just expire whole messages,
> > which is a lot simpler?
>
> [Tim Peters]
> I doubt that it's simpler at all, and you earlier today sketched quite an
> elaborate scheme for expiring different messages at different rates.
> That's got its share of tuning parameters (aka wild-ass guesses <wink>)
> too, showed every sign of being just the beginning of its brand of
> complication, and has no testing or experience to support it.  We know a
> lot about the real-life effects of hapaxes now.

Offhand, adding a single timestamp per message at training time sounds
easier than tracking the last time seen for every token in the database.  As
far as the "elaborate" scheme I suggested for variable expiration times, all
that's involved is changing the message timestamp before storing it.  Since
you don't have anything like that now, you can just ignore that idea and the
extra parameter that goes with it.  BTW, that parameter value is not just a
wild-ass guess, it's a SWAG (sophisticated wild-ass guess), and I don't like
them any better than you do :)

Either way, rather than frequently searching for expired tokens (in a very
long list), you would only do token expiration when you have to train a new
message.  At that point, you find the oldest trained message (from a much
shorter list) and untrain it.  The extra complication is storing the token
list with each message ID plus its training timestamp.  That doesn't sound
big compared to cross referencing every token to every message it appeared
in.  They're certainly not mutually exclusive and you later made a good
argument for having this extra information anyway.
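
A rough sketch of the whole-message scheme described above, under stated
assumptions: each token is counted once per message, the ham/spam split is
left out for brevity, and TrainedStore and its method names are invented for
illustration (they are not SpamBayes classes).  Expiration happens only when
a new message is trained and the store is already full:

```python
from collections import Counter


class TrainedStore:
    """Toy token database that remembers what each message contributed."""

    def __init__(self, max_messages):
        self.max_messages = max_messages  # assumed to be at least 1
        self.token_counts = Counter()     # token -> trained messages containing it
        self.messages = {}                # msg_id -> (timestamp, tokens)

    @property
    def nmessages(self):
        return len(self.messages)

    def train(self, msg_id, tokens, timestamp):
        # Expire only when a new message arrives and the store is full.
        while len(self.messages) >= self.max_messages:
            self.expire_oldest()
        tokens = tuple(sorted(set(tokens)))  # count each token once per message
        self.messages[msg_id] = (timestamp, tokens)
        self.token_counts.update(tokens)

    def untrain(self, msg_id):
        # Walk the saved token list so the counts return to their earlier state.
        _, tokens = self.messages.pop(msg_id)
        for token in tokens:
            self.token_counts[token] -= 1
            if self.token_counts[token] == 0:
                del self.token_counts[token]

    def expire_oldest(self):
        # Search the (short) message list, not the (long) token list.
        oldest = min(self.messages, key=lambda m: self.messages[m][0])
        self.untrain(oldest)
        return oldest
```

Because the message count is derived from the stored messages, it drops by
one whenever a message is untrained, so it never drifts away from the token
counts.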


> [Tim Peters]
> BTW, the single worst thing you can do with a system of this type is train
> a message into the wrong category.  Everyone does it eventually, and some
> people can't seem to help but doing it often.  Maybe that's a UI problem at
> heart -- I don't know, because I seem to be unusually resistant to it.  It's

I agree completely.  This was an important motivation for expiring a whole
message at a time.  Training mistakes would eventually drop out of the
database without user intervention.  Not that a tool to help track down
training mistakes wouldn't be great, but a "casual" user could still make
occasional mistakes and the system would recover by itself.


> [Tim Peters]
> happened to me too, though, and it can be hard to recover.  One sterling
> use for a feature -> msg_ids map is, as Skip noted, a way to find out *why*
> your latest spam was a false negative:  look at the low-scoring features,
> then look at the messages with those features that were trained on as ham.
> This has an excellent shot at pinpointing mis-trained messages.  That's
> difficult at best now, and is a real problem for some people.  I've got
> gigabytes of unused disk space myself <wink>.

No argument there, it's a great feature for problem-solving.
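
A small sketch of that diagnostic, assuming the trained messages are
available as a msg_id -> (is_spam, features) mapping.  The function names
and data layout are hypothetical, chosen only to show how a
feature -> msg_ids map narrows down mis-trained ham:

```python
from collections import Counter, defaultdict


def build_feature_index(trained):
    """Invert msg_id -> (is_spam, features) into feature -> set of msg_ids."""
    index = defaultdict(set)
    for msg_id, (is_spam, features) in trained.items():
        for feature in set(features):
            index[feature].add(msg_id)
    return index


def suspect_ham(index, trained, low_scoring_features):
    """Rank ham-trained messages by how many of the given features they hold."""
    hits = Counter()
    for feature in low_scoring_features:
        for msg_id in index.get(feature, ()):
            is_spam = trained[msg_id][0]
            if not is_spam:
                hits[msg_id] += 1
    return hits.most_common()
```

Feeding in the low-scoring features of a false negative returns the
ham-trained messages that share the most of them, which are the first
candidates to inspect for a training mistake.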


> [Tim Peters]
> Evolution of this system would also be served by saving an explicit
> msg_id -> features map.  When we change tokenization to get a small win,
> sometimes the tokens originally added to a database by training on message
> M can no longer be reconstructed by re-tokenizing M (the tokenizer has
> changed!  if it always returned exactly what it returned before the change,
> there wasn't much point to the change <wink>).  Blindly untraining anyway
> can violate database invariants then, eventually manifesting as assertion
> errors and the need to retrain from scratch.  The only clear and simple way
> to prevent this is to save a map from msg_id to the tokens it originally
> produced.  Then untraining simply walks that list, and nothing can go wrong
> as a result.

I agree completely and that's why I suggested saving the token list with
each message.  Your feature_ID scheme makes it practical.
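
To make the failure mode concrete, here is a toy illustration.  The two
tokenizers are made up for the example and are not the SpamBayes tokenizer;
the point is only that re-tokenizing after a tokenizer change can yield
tokens the database never counted, while the saved list cannot:

```python
from collections import Counter


def tokenize_v1(text):
    # Original tokenizer (hypothetical): lowercase, split on whitespace.
    return set(text.lower().split())


def tokenize_v2(text):
    # "Improved" tokenizer (hypothetical): also strips surrounding punctuation.
    return {word.strip(".,!?") for word in text.lower().split()}


def train(counts, saved, msg_id, text, tokenize):
    tokens = tokenize(text)
    saved[msg_id] = tokens  # remember exactly what was counted
    counts.update(tokens)


def untrain_from_saved(counts, saved, msg_id):
    # Walk the saved list; every token in it is known to have been counted.
    for token in saved.pop(msg_id):
        counts[token] -= 1
        if counts[token] == 0:
            del counts[token]


def uncounted_tokens(counts, text, tokenize):
    # Tokens a blind re-tokenizing untrain would decrement without a count.
    return sorted(t for t in tokenize(text) if counts[t] == 0)
```

Training "Buy now!" with tokenize_v1 stores the token "now!", but
tokenize_v2 later produces "now", which has no count to decrement.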


> [Tim Peters]
> That's a bit subtle, so takes some long-term experience to appreciate at a
> gut level.  Of more immediate concern to most users is that only the
> obsessed *want* to save their spam.  Most people want to throw spam away
> ASAP.  But, if they do that, we currently have no way to expire any spam
> they ever trained on.  Moving toward saving msg_ids <-> features maps
> solves that too, and with suitable reuse of little integers for feature ids
> can store the relevant bits about trained messages in less space than it
> takes to save the original messages.  Note that hapaxes would waste the
> most resource in this context too.

Sounds like _you're_ arguing for expiration of whole messages :)  I know
you're not arguing that, but if there were bidirectional msg_id <->
feature_ID maps, it would be fairly easy to expire whole messages.  That
would obviate the need to track last time seen for every token.  In any
case, I hope you move in the direction of saving such maps as it adds so
much flexibility.


> [Tim Peters]
> We're not going to abandon plain strings, because they're far too useful
> and loved in various reports intended for human consumption.  Adding
> feature_id <-> feature_string maps would allow for effective compression of
> message storage.

All your arguments on this point make lots of sense.  I'm a little surprised
that you had significant collisions mapping perhaps 100K items (my guess)
into a 32-bit space.  I think that is rather dependent on the hash used, but
that's what you saw.  Since you need the cleartext anyway, your feature-ID
concept is far superior.  Thanks for educating me.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From rmalayter at bai.org  Thu Dec 18 18:57:11 2003
From: rmalayter at bai.org (Ryan Malayter)
Date: Thu Dec 18 18:57:19 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A01A75280@cliff.bai.org>

{Seth Goodman}
> All your arguments on this point make lots of sense. 
> I'm a little surprised that you had significant 
> collisions mapping perhaps 100K items (my guess)
> into a 32-bit space.  I think that is rather dependent 
> on the hash used, but that's what you saw. 

That's not surprising at all to me. Because of the "birthday paradox",
even very input-sensitive (random-looking) hash functions like the
160-bit SHA-1 only give 80 bits of collision resistance. With a 32 bit
perfect hash, you get just 16 bits of collision resistance. That means
there is a 50% chance of a collision if you hash just 65,536 items. Hash
more items than that, and your chances of collision go up further.

If your hash function isn't perfectly (randomly) distributed in the
32-bit space, things could be much worse with 100,000 hashes in a
collection.
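
The birthday-bound figures can be checked directly.  This is a small sketch
that assumes an ideal hash spreading items uniformly over 2**bits values;
collision_probability is a name made up here.  Under that assumption,
65,536 items in a 32-bit space give roughly a 39% chance of at least one
collision, and the 50% point falls near 77,000 items:

```python
import math


def collision_probability(n_items, bits):
    """P(at least one collision) for n_items hashed uniformly into 2**bits."""
    space = 2 ** bits
    # P(no collision) = product over i of (1 - i/space); sum logs for accuracy.
    log_no_collision = sum(math.log1p(-i / space) for i in range(n_items))
    return -math.expm1(log_no_collision)
```

For example, collision_probability(100_000, 32) comes out well above one
half, while collision_probability(100_000, 64) is vanishingly small.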

I would suggest storing at least a 64-bit hash; perhaps the first
8 bytes of an SHA-1 or MD5 hash would be appropriate. There exists good
optimized code for both algorithms in the public domain.
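
One way this suggestion might look in Python, as a sketch only: take the
first 8 bytes of the SHA-1 digest as a 64-bit integer.  The function name
token_hash64 and the choice of UTF-8 encoding and big-endian byte order are
assumptions made for the example:

```python
import hashlib


def token_hash64(token):
    """Return the first 8 bytes of SHA-1(token) as an unsigned 64-bit int."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```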

Regards,
	Ryan

From tim.one at comcast.net  Thu Dec 18 19:09:45 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 19:09:44 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE59xgSAI1jEX0000d8c4@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEBIHOAB.tim.one@comcast.net>

[Kenny Pitt]
> So are you recommending that we avoid using the whole _bsddb.pyd
> binary package?  I originally thought you were referring only to the
> Python-coded wrappers.

We need _bsddb.pyd to use a modern BDB at all from within Python on Windows.
I'm not familiar with what all is in that DLL, but there certainly aren't
any Python-coded wrappers in it.  It's about 5000 lines worth of compiled C
code, and wraps Sleepycat's C API so that it can be called from Python.
Most of it consists of short wrapper functions calling Sleepycat functions
with similar names, just converting raw C bits to and from Python objects at
the boundaries ...

Hmm.  Maybe you're not aware of this:

    http://pybsddb.sourceforge.net/bsddb3.html

That's effectively the *real* documentation for Python's modern Berkeley
module.  It hasn't been folded into the Python doc set yet.  _bsddb.pyd is
intended to provide quite direct ways of calling the native Sleepycat API
functions.  The *Python* docs never get around to explaining that, so if
that's what you've been looking at, you've been looking at old docs
describing lots of "legacy tricks" last updated for Berkeley 1.85, around
the time of the last asteroid-induced mass extinction event <wink>.

I'm suggesting avoiding the legacy tricks, avoiding the slow & buggy stuff
trying to make current BDB "act just like a magical Python dict", and
writing as directly as possible to Sleepycat's C API (which, I hasten to
add, is much easier from Python than from C).  Barry will testify it's not
that bad, and since he's actually done it in a much more difficult (wrt
database demands) project, I believe him.  Sleepycat BDB has its own idioms
and rhythms, and I'd like us to try to play along with them instead of
fighting, or trying to hide, them.


From nobody at spamcop.net  Thu Dec 18 20:25:16 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Thu Dec 18 20:25:25 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A01A75280@cliff.bai.org>
Message-ID: <MHEGIFHMACFNNIMMBACAGENFGOAA.nobody@spamcop.net>

> {Seth Goodman}
> > All your arguments on this point make lots of sense.
> > I'm a little surprised that you had significant
> > collisions mapping perhaps 100K items (my guess)
> > into a 32-bit space.  I think that is rather dependent
> > on the hash used, but that's what you saw.
>
> [Ryan Malayter]
> That's not surprising at all to me. Because of the "birthday paradox",
> even very input-sensitive (random-looking) hash functions like the
> 160-bit SHA-1 only give 80 bits of collision resistance. With a 32 bit
> perfect hash, you get just 16 bits of collision resistance. That means
> there is a 50% chance of a collision if you hash just 65,536 items. Hash
> more items than that, and your chances of collision go up further.
>
> If your hash function isn't perfectly (randomly) distributed in the
> 32-bit space, things could be much worse with 100,000 hashes in a
> collection.

As I understand it, the birthday paradox leads to the conclusion that for a
32-bit perfect hash function, after hashing around 78,000 items (just over
16-bits worth), you are likely to experience a _single_ collision.  What Tim
described sounded like they probably had multiple collisions to account for
the spectacular failures they saw.  I don't know the size of the token
databases they dealt with back then, but I doubt a single collision in a
token list of 78K items would affect the classifier.  Since most of the
tokens are hapaxes anyway (perhaps 80-90% ?), it is most probable that there
would be no visible effect.

You are of course correct that going over the 78K-item limit would give more
collisions, but it would take quite a few collisions for one of the
colliding tokens to be something other than a hapax.  I am guessing that
unless there were a lot more than 100K tokens, the 32-bit hash function used
probably didn't do as good a randomizing job as needed.

Since they ultimately had to construct a map of hash_value <-> token_string,
they could have detected collisions (check the token already stored with the
hash value) and done something about it (i.e. use next empty bucket).  Since
this would be a rare event, it wouldn't have cost much.  In any case, Tim's
idea of a mapping token_string <-> feature_ID (i.e. sequentially allocated
number with "wrap-around") sounds much simpler.  However, it is important
that the number has enough bits that previously allocated feature_ID's are
ready to be reused (their tokens expired) by the time the allocation number
"wraps around" to them.  This just means that the number should probably be
32-bits.  Assuming you generate 100K tokens per day, the wrap-around time
for a 32-bit number is 117 years.  For a 24-bit number and the same rate of
token production, the wrap-around time is 167 days (around 5.5 months).  I'd
go for the 32-bit number and not worry about pathological operating schemes
or new tokenizers.  Even at 1 million new tokens per day, the wrap-around
time for a 32-bit feature_ID is over 10 years.  Why hash when you can
sequentially allocate?  This was just a bad idea on my part.  And it won't
be the last one :)
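
A sketch of sequential allocation with wrap-around, plus the wrap-around
arithmetic quoted above.  FeatureIds and wraparound_days are hypothetical
names, and skipping IDs that are still in use after a wrap is an assumption
of this sketch rather than something settled in the thread:

```python
class FeatureIds:
    """Hand out small integers for token strings, sequentially, with wrap."""

    def __init__(self, bits=32):
        self.size = 2 ** bits
        self.next_id = 0
        self.id_of = {}     # token string -> feature id
        self.token_of = {}  # feature id -> token string

    def get(self, token):
        if token in self.id_of:
            return self.id_of[token]
        if len(self.id_of) >= self.size:
            raise RuntimeError("no free feature ids")
        # After wrapping, skip ids whose tokens have not expired yet.
        while self.next_id in self.token_of:
            self.next_id = (self.next_id + 1) % self.size
        fid = self.next_id
        self.next_id = (fid + 1) % self.size
        self.id_of[token] = fid
        self.token_of[fid] = token
        return fid

    def expire(self, token):
        # Free the id so it can be reused once allocation wraps around.
        fid = self.id_of.pop(token)
        del self.token_of[fid]


def wraparound_days(bits, tokens_per_day):
    """Days until a sequential counter of the given width wraps around."""
    return 2 ** bits / tokens_per_day
```

With 100K new tokens per day, wraparound_days(32, 100_000) / 365 is a bit
over 117 years and wraparound_days(24, 100_000) is a bit under 168 days,
matching the figures above.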

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From tim.one at comcast.net  Thu Dec 18 20:26:43 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 20:26:49 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAEENBGOAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net>

[Seth Goodman]
> Thanks for taking the time to construct such a complete set of
> answers.  I learned a lot from it and I assume other list readers did
> as well.

My pleasure, but I'm afraid it was taken out of sleep time, and I can't do
that again.  So, no offense intended, I have to be very brief here, while
wanting to do more:

> Not really.  If you decrement all the token counts from a trained
> message, the database is in the exact same state as it was before you
> trained on that message (ignoring subsequent messages trained).  At
> that point, the trained message count was N-1, so that is the best
> thing to use for the probability calculation rather than N.  The
> message count will keep increasing as you train new messages but the
> token database will eventually level off.  That suggests that the
> trained message counts will become too large as time goes on.
>
> If you only expire hapaxes, perhaps the incorrect message count is a
> technicality and won't have a significant effect on the spam
> probabilities. But unless you expire non-hapaxes as well, the token
> database can't track a changing message stream very well.  Once you
> start expiring non-hapax tokens (is there a name for these?), my
> guess is that you can no longer ignore the incorrect message count
> issue.  So how _do_ you do expiration "correctly" if not by whole
> messages?

I only intend to expire hapaxes for now, with whole-msg expiration after;
but one thing at a time, and each step will take a long time for testing.
There's no rush.  The idea that all the tokens in a message could get
expired seems too implausible to me to worry about, when only hapaxes are
expired.
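
A minimal sketch of hapax-only expiration, assuming each token record holds
a spam count, a ham count and a last-seen time.  The record layout and the
name expire_hapaxes are illustrative assumptions, not the SpamBayes database
format.  Message counts are deliberately left untouched:

```python
def expire_hapaxes(token_info, cutoff):
    """Drop hapaxes last seen before cutoff; return the removed tokens.

    token_info maps token -> (spam_count, ham_count, last_seen).
    """
    stale = sorted(
        token
        for token, (spam, ham, last_seen) in token_info.items()
        if spam + ham == 1 and last_seen < cutoff
    )
    for token in stale:
        del token_info[token]
    return stale
```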

...

> Offhand, adding a single timestamp per message at training time sounds
> easier than tracking the last time seen for every token in the
> database.  As far as the "elaborate" scheme I suggested for variable
> expiration times, all that's involved is changing the message
> timestamp before storing it.  Since you don't have anything like that
> now, you can just ignore that idea and the extra parameter that goes
> with it.  BTW, that parameter value is not just a wild-ass guess,
> it's a SWAG (sophisticated wild-ass guess), and I don't like them any
> better than you do :)
>
> Either way, rather than frequently searching for expired tokens (in a
> very long list), you would only do token expiration when you have to
> train a new message.  At that point, you find the oldest trained
> message (from a much shorter list) and untrain it.  The extra
> complication is storing the token list with each message ID plus its
> training timestamp.  That doesn't sound big compared to cross
> referencing every token to every message it appeared in.  They're
> certainly not mutually exclusive and you later made a good argument
> for having this extra information anyway.

There are messages I never want to expire.  That creates major new UI
headaches to be doable.  I believe (but don't yet know) that expiring
hapaxes can be done without need for user intervention, and without harm.

At some point, if you want to try your ideas, *try* your ideas <wink> --
that's what Open Source is all about.  Everyone is born knowing how to
program in Python, although most don't realize it until they try.

...

> I agree completely.  This was an important motivation for expiring a
> whole message at a time.  Training mistakes would eventually drop out
> of the database without user intervention.  Not that a tool to help
> track down training mistakes wouldn't be great, but a "casual" user
> could still make occasional mistakes and the system would recover by
> itself.

Without intervention, it will also expire the screaming bright-red HTML
birthday message sent by my favorite 7-year-old niece, and when she's 8 the
next one may get tagged as spam.  These are the kinds of messages I never
want to expire.  "Elaborate" before referred to untested gimmicks for
adjusting expiration date based on "how far away" a message was from its
correct classification, etc.  I don't have a feel for whether that can be
made to work well in real life, and it needs serious implementation effort
and testing to get a good feel.  In the vanishingly small time I can still
make for this project, I need to give it to things my experience suggests
will almost certainly win with no more effort or surprises than I already
know they require enduring.

...

> Sounds like _you're_ arguing for expiration of whole messages :)

Oh yes, I do want that -- eventually.  We have no experience with that in
this project, though; we have a lot of experience with the consequences of
hapaxes, and I have no fears remaining about picking on them.

> I know you're not arguing that, but if there were bidirectional msg_id
> <-> feature_ID maps, it would be fairly easy to expire whole
> messages.

Yes, and that's a real attraction.  Doing the actual expiration would be
trivially easy and fast then.  Deciding *when* to do expiration, and of
which messages, are the things we really don't know anything about yet.
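
To make the bidirectional-map idea concrete, here is a minimal sketch of what
such an index might look like.  The class name, attribute names and the single
shared count per token are all assumptions for illustration (the real
classifier keeps separate spam and ham counts); none of this is SpamBayes code.

```python
from collections import defaultdict


class MessageIndex:
    """Hypothetical bidirectional msg_id <-> token index (not SpamBayes code)."""

    def __init__(self):
        self.tokens_by_msg = {}                # msg_id -> (timestamp, set of tokens)
        self.msgs_by_token = defaultdict(set)  # token -> set of msg_ids
        self.counts = defaultdict(int)         # token -> number of trained messages

    def train(self, msg_id, timestamp, tokens):
        """Record a trained message and bump the count of each distinct token."""
        tokens = set(tokens)
        self.tokens_by_msg[msg_id] = (timestamp, tokens)
        for tok in tokens:
            self.msgs_by_token[tok].add(msg_id)
            self.counts[tok] += 1

    def expire(self, msg_id):
        """Untrain one whole message, dropping tokens whose count reaches zero."""
        _, tokens = self.tokens_by_msg.pop(msg_id)
        for tok in tokens:
            self.msgs_by_token[tok].discard(msg_id)
            self.counts[tok] -= 1
            if self.counts[tok] == 0:
                del self.counts[tok]
                del self.msgs_by_token[tok]

    def oldest(self):
        """Return the msg_id with the smallest training timestamp."""
        return min(self.tokens_by_msg, key=lambda m: self.tokens_by_msg[m][0])
```

With maps like these the expiration step itself is a short loop; the sketch
says nothing about when to call expire() or on which messages, which is the
open question.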

> That would obviate the need to track last time seen for every token.

Only if you don't want also to be able to expire tokens on their own.

> In any case, I hope you move in the direction of saving such maps as
> it adds so much flexibility.

Not to mention database size <wink>.

...

> All your arguments on this point make lots of sense.  I'm a little
> surprised that you had significant collisions mapping perhaps 100K
> items (my guess) into a 32-bit space.

That would be a very small database for the mixed unigram-bigram scheme, and
the unigram-only database I used most often in original testing (for
filtering high-volume tech mailing lists) contained about 350K tokens.  As
Ryan explained later, the Birthday Paradox can't be avoided here, and has
real consequences.

> I think that is rather dependent on the hash used, but that's what
> you saw.

I used Python's builtin 32-bit hash() function, and the observed collision
rate was indistinguishable from what a truly random 32-bit hash would have
produced (about one standard deviation lower).  The damnable thing is that
you only need one extremely unfortunate collision to start seeing results
that are incomprehensible to the human eye.
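
For anyone who wants to see collision counts for themselves, a small
experiment along these lines may help.  It is only a sketch: MD5 truncated to a
chosen number of bits stands in for "a truly random hash" (Python's own hash()
is not used here because modern Pythons randomize it per process), and the
token names are made up.

```python
import hashlib


def truncated_hash(token, bits):
    """Map a token to a `bits`-wide integer, using MD5 as a stand-in random hash."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") >> (64 - bits)


def count_collisions(tokens, bits):
    """Count tokens that landed on a hash code already taken by an earlier token."""
    seen = set()
    collisions = 0
    for tok in tokens:
        code = truncated_hash(tok, bits)
        if code in seen:
            collisions += 1
        else:
            seen.add(code)
    return collisions


# Made-up tokens and a deliberately small 16-bit hash so collisions show up.
tokens = ["token%d" % i for i in range(2000)]
observed = count_collisions(tokens, bits=16)
expected_pairs = len(tokens) * (len(tokens) - 1) / (2 * 2 ** 16)
print(observed, round(expected_pairs, 1))
```

Shrinking the hash width is the same trick as running with smaller hash codes
and projecting: the observed count should land within a few standard
deviations of the birthday-paradox estimate.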

> Since you need the cleartext anyway, your feature-ID concept is far
> superior.

We don't *need* the cleartext, really, it's just highly desirable.  I'll
certainly endure a lot to keep the cleartext.  If this isn't the smallest or
fastest spam filter possible, I don't really care.  I don't even care
whether it's popular.  What I care about most is whether it filters my damn
spam.

> Thanks for educating me.

Don't mistake a lecture for education <wink>.  I'd love to be able to afford
the luxury of *discussing* it with you instead (you've got a lot of
plausible ideas and express them well), but afraid I just can't.  With any
luck, maybe my employer will go out of business <wink>.


From listsub at wickedgrey.com  Thu Dec 18 20:43:31 2003
From: listsub at wickedgrey.com (Eli Stevens (WG.c))
Date: Thu Dec 18 20:44:10 2003
Subject: [spambayes-dev] Hapaxes? (was: How low can you go?)
References: <LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net>
Message-ID: <3FE257C3.1020800@wickedgrey.com>

Tim Peters wrote:

> 
> Oh yes, I do want that -- eventually.  We have no experience with that in
> this project, though; we have a lot of experience with the consequences of
> hapaxes, and I have no fears remaining about picking on them.

Does anyone have any gentle nudges to information explaining what 
hapaxes are in a spambayes context?  I _think_ I've canvassed the 
website, some of the docs that come with the install (enough to get it 
running under Linux), etc., but I don't recall seeing anything about 
them.  I'm not afraid to Use the Source, Luke (but not while I'm here at 
work ;), so just a pointer to a file name would probably be enough (if 
there's a hapax.py I'm gonna feel silly - I'll check when I get home ;).

Thanks!
Eli


From tameyer at ihug.co.nz  Thu Dec 18 20:52:50 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Dec 18 20:52:56 2003
Subject: [spambayes-dev] Hapaxes? (was: How low can you go?)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13047C1287@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz>

> Does anyone have any gentle nudges to information explaining what 
> hapaxes are in a spambayes context?

<http://spambayes.sourceforge.net/docs.html#glossary>

Wrt SpamBayes, 'word' is a token, and 'corpus' is the token database.

Is this enough information?

=Tony Meyer


From tim.one at comcast.net  Thu Dec 18 21:16:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 18 21:16:28 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <MHEGIFHMACFNNIMMBACAGENFGOAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAECDHOAB.tim.one@comcast.net>

[Ryan Malayter]
>> That's not surprising at all to me. Because of the "birthday
>> paradox", ...

[Seth Goodman]
> As I understand it, the birthday paradox leads to the conclusion that
> for a 32-bit perfect hash function, after hashing around 78,000 items
> (just over 16-bits worth), you are likely to experience a _single_
> collision.  What Tim described sounded like they probably had
> multiple collisions to account for the spectacular failures they saw.
> I don't know the size of the token databases they dealt with back
> then, but I doubt a single collision in a token list of 78K items
> would affect the classifier.  Since most of the tokens are hapaxes
> anyway (perhaps 80-90% ?), it is most probable that there would be no
> visible effect.
> ...

Let me clarify this:  the experiments we ran couldn't actually use a 32-bit
hash code because they used a Python dict to simulate a giant sparse array,
and the box I was using didn't have enough RAM to deal with this load.

Instead we ran with smaller hash codes and smaller training sets, projecting
results.  The results were too discouraging for anyone here to want to
continue along that line.  It's all in the archives if you want to dig back
far enough (I don't <wink>).

With a 32-bit hash code, the expected # of collisions for a truly random
hash is close to 1, with a standard deviation also close to one, at about
92,600 items, so Seth is quite close.  With 350K items (close to the # of
tokens in the pure-unigram database I was actually using at the time), the
mean # of collisions is a bit over 14 with an sdev of about 3.8.  Those
numbers aren't scary, and Python's hash() was indeed behaving as a random
hash would have.  We were considering schemes with much higher
feature-generation rates than pure-unigram at the time, though, so all those
stats don't matter to what we were really wondering about.
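
The figures above can be reproduced with the usual birthday-paradox
approximation.  This is a sketch, assuming a uniformly random hash: it counts
expected colliding pairs as n(n-1)/(2m) and treats the count as roughly
Poisson, so the standard deviation is the square root of the mean.  The
function names are invented for the example.

```python
import math


def expected_collisions(n_items, hash_bits=32):
    """Approximate mean number of colliding pairs for a uniformly random hash."""
    return n_items * (n_items - 1) / (2.0 * 2 ** hash_bits)


def collision_sdev(n_items, hash_bits=32):
    """Poisson approximation: the standard deviation is sqrt(mean)."""
    return math.sqrt(expected_collisions(n_items, hash_bits))


for n in (92600, 350000):
    print(n, round(expected_collisions(n), 2), round(collision_sdev(n), 2))
```

Under those assumptions 92,600 items give a mean very close to 1, and 350K
items give a mean a little over 14 with a standard deviation near 3.8.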

BTW, discussions like this really don't belong on the spambayes list.
They're fine on spambayes-dev, though, so I've set reply-to to that.  Anyone
who wants to follow that level of tech-talk should subscribe to
spambayes-dev.


From mhammond at skippinet.com.au  Thu Dec 18 21:33:57 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu Dec 18 21:34:14 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage 
In-Reply-To: <200312180952.hBI9qZC6005810@localhost.localdomain>
Message-ID: <004f01c3c5d8$8e28c430$2c00a8c0@eden>

> I'm sure this is some sort of standard method of getting things
> done in the opensource world. Eric Raymond's Cathedral and Bazaar
> metaphor extends here, of course - in a bazaar often you end up
> getting suckered.

The cool thing is that just like a bazaar, the seller does indeed figure the
buyer for a sucker - but still the buyer goes home thinking he got a
bargain.  Amazingly, everyone truly is happy!

But-who-is-who ly,

Mark.


From listsub at wickedgrey.com  Thu Dec 18 21:33:55 2003
From: listsub at wickedgrey.com (Eli Stevens (WG.c))
Date: Thu Dec 18 21:34:33 2003
Subject: [spambayes-dev] Hapaxes? (was: How low can you go?)
References: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz>
Message-ID: <3FE26393.8010105@wickedgrey.com>

Tony Meyer wrote:

> 
> <http://spambayes.sourceforge.net/docs.html#glossary>
> 
> Wrt SpamBayes, 'word' is a token, and 'corpus' is the token database.
> 
> Is this enough information?


Yes, thank you.  :)  Heh, glossary.  Who'd have thunk?

Another newbie Q: were hapaxes not stored at one time?  Some of the 
recent discussion implies that a recent change (storing them?) has 
increased the DB size considerably.  Was that the only heuristic, or was 
it tokens seen less than N times...?

Just trying to get up to speed.  :)
Eli


From matt at mondoinfo.com  Thu Dec 18 21:58:46 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Thu Dec 18 21:59:04 2003
Subject: [spambayes-dev] Hapaxes? (was: How low can you go?)
In-Reply-To: <3FE26393.8010105@wickedgrey.com>
References: <1ED4ECF91CDED24C8D012BCF2B034F130467777E@its-xchg4.massey.ac.nz>
	<3FE26393.8010105@wickedgrey.com>
Message-ID: <1071802446.62.680@mint-julep.mondoinfo.com>

> Another newbie Q: were hapaxes not stored at one time?  Some of the
> recent discussion implies that a recent change (storing them?) has
> increased the DB size considerably.  Was that the only heuristic,
> or was it tokens seen less than N times...?

Hapaxes have always been stored. There have been various experiments
with removing them since they seem to make up about half of an
"average" database. It turns out that if you have a well-trained
database, you can remove hapaxes with little effect on scoring. The
problem comes if you're doing ongoing training. If you remove hapaxes
every day, a strong clue that only arrives once a day will never
persist to become a strong clue.
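
A toy simulation shows the effect.  It is only an illustration under simple
assumptions: one message a day carries the same rare token, and an optional
nightly purge drops every token with a count of exactly one.  The function and
token names are made up.

```python
def simulate(days, purge_hapaxes):
    """Train one message per day containing the same rare token."""
    counts = {}
    for _ in range(days):
        counts["rare-clue"] = counts.get("rare-clue", 0) + 1
        if purge_hapaxes:
            # Nightly purge: drop every token seen exactly once so far.
            counts = {tok: n for tok, n in counts.items() if n > 1}
    return counts.get("rare-clue", 0)


print(simulate(30, purge_hapaxes=False))
print(simulate(30, purge_hapaxes=True))
```

Without the purge the count grows by one a day; with it, the token is thrown
away every night before it can reach a count of two.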

Regards,
Matt


From tim.one at comcast.net  Fri Dec 19 00:29:50 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 19 00:29:53 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage 
In-Reply-To: <200312180952.hBI9qZC6005810@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEDCHOAB.tim.one@comcast.net>

[Anthony Baxter]
> I'm sure this is some sort of standard method of getting things
> done in the opensource world. Eric Raymond's Cathedral and Bazaar
> metaphor extends here, of course - in a bazaar often you end up
> getting suckered.

Let's see.  Everyone here knows you did most of the work in setting up
the spambayes web site, and some of us know you're doing the bulk of the
work for producing today's Python 2.3.3 release.  So who thinks Anthony's a
sucker?  And who thinks he's taken on these glamorous jobs just to get
Barry's regurgitated drugs and quality sex from all the SpamBayes and Python
groupies in Australia?

Yes, it is a hard call.

if-we-didn't-appreciate-each-other-we'd-be-rich-ly y'rs  - tim


From skip at pobox.com  Fri Dec 19 07:07:28 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 19 07:07:23 2003
Subject: [spambayes-dev] a bit better received header parsing
Message-ID: <16354.59904.523711.847643@montanaro.dyndns.org>

A few days ago Tim noticed that some ip addresses in Received: headers were
being broken down in the wrong order.  For example, '[199.249.165.175]'
would yield fragments in the wrong order: '[199.249.165.175]',
'249.165.175]', '165.175]' and '175]'.  The problem was that the hostname
recognizer was catching them and fragmenting them from right-to-left, as if
they were hostnames.
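
The difference between the two fragmenting directions can be sketched as
follows.  This is not the tokenizer code from CVS; the function names are
invented, and the assumption is that the useful fragments run from most
general to most specific, which means prefixes for an ip address and suffixes
for a hostname.

```python
def ip_fragments(ip):
    """Prefixes of a dotted quad, most general first: '199', '199.249', ..."""
    parts = ip.strip("[]").split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


def host_fragments(host):
    """Suffixes of a hostname, most general first: 'com', 'example.com', ..."""
    parts = host.lower().split(".")
    return [".".join(parts[-i:]) for i in range(1, len(parts) + 1)]


print(ip_fragments("[199.249.165.175]"))
print(host_fragments("mail.example.com"))
```

Treating the address as a hostname amounts to calling the second function on
it, which produces trailing pieces such as '175' that say nothing about the
network the mail came from.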

I solved that problem and a couple others related to locating hostnames and
ip addresses in Received: headers.  I have no idea if it will help or not,
but your database will be oh-so-much-cleaner if you do a complete retrain
after a cvs up.

Cheers,

Skip


From kennypitt at hotmail.com  Fri Dec 19 09:09:37 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Dec 19 09:10:15 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEBIHOAB.tim.one@comcast.net>
Message-ID: <Law11-OE68jx2XDniY50000e32d@hotmail.com>

Tim Peters wrote:
> [Kenny Pitt]
>> So are you recommending that we avoid using the whole _bsddb.pyd
>> binary package?  I originally thought you were referring only to the
>> Python-coded wrappers.
> 
> We need _bsddb.pyd to use a modern BDB at all from within Python on
> Windows. I'm not familiar with what all is in that DLL, but there
> certainly aren't any Python-coded wrappers in it.  It's about 5000
> lines worth of compiled C code, and wraps Sleepycat's C API so that
> it can be called from Python. Most of it consists of short wrapper
> functions calling Sleepycat functions with similar names, just
> converting raw C bits to and from Python objects at the boundaries ...

OK, then you were talking about what I originally thought you were
talking about.  I was getting worried that the C wrappers themselves
were too buggy to be of use <0.5 wink>.

> Hmm.  Maybe you're not aware of this:
> 
>     http://pybsddb.sourceforge.net/bsddb3.html
> 
> That's effectively the *real* documentation for Python's modern
> Berkeley module.

Yeah, that's the documentation I started from, which basically just
summarizes what each function is for and shows the Python syntax for
calling them.  It then refers you to the Berkeley docs for any level of
detail about option flags and such, so I usually just go straight to the
Sleepycat C++ API docs.  The Python wrappers are basically a direct
mapping of the C++ class structure, and I've previously worked a little
with BDB in C++.

> I'm suggesting avoiding the legacy tricks, avoiding the slow & buggy
> stuff trying to make current BDB "act just like a magical Python
> dict", and writing as directly as possible to Sleepycat's C API
> (which, I hasten to add, is much easier from Python than from C). 

OK, I can go with that, and it should be relatively straightforward.  My
concern still stands re Win98, though.  Maybe I didn't express it
clearly.

Whenever you use direct BDB through the pybsddb/bsddb3/bsddb module in a
multi-thread/multi-user scenario, you always have to start with a call
to initialize the DB environment before you can do anything else.  You
expressed some concern over the breakage on Win98 of the tests in
test_dbshelve.py.  Unfortunately, the line that always fails is that
very first and most basic initialization call, the same one that we
would need to call for any use in SpamBayes.  It is a direct call into
the C-code wrappers, and happens *before* any "legacy tricks".

Since the test suite opens and closes a number of databases and
environments before it gets to the point that fails, there could be some
adverse effects there.  Maybe the best thing is to throw some test code
into SpamBayes and see if it will even start up on Win98.  I don't have
access to a Win98 test system, but if I can code up enough support that
we can try this out, would you be willing to give it a test?  It will
probably be after the holidays before I can get to it, but we'll see.

-- 
Kenny Pitt


From popiel at wolfskeep.com  Fri Dec 19 11:55:37 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Dec 19 11:55:41 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Thu,
	18 Dec 2003 20:26:43 EST."
	<LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net> 
Message-ID: <20031219165537.EDB162DF7F@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>> Sounds like _you're_ arguing for expiration of whole messages :)
>
>Oh yes, I do want that -- eventually.  We have no experience with that in
>this project, though; we have a lot of experience with the consequences of
>hapaxes, and I have no fears remaining about picking on them.

Actually, there have been experiments done (by me) with expiry of whole
messages.  I invite you to look at the 'expire4months' regime for my
incremental testing harness.  Performance was worse than remembering
everything, but significantly better than mistake-based training (with
the 'fpfnunsure' regime).

I have not done any experiments with just nuking hapaxes; I didn't see
any reason to do a partial job instead of a full one.

>> I know you're not arguing that, but if there were bidirectional msg_id
>> <-> feature_ID maps, it would be fairly easy to expire whole
>> messages.

>> That would obviate the need to track last time seen for every token.
>
>Only if you don't want also to be able to expire tokens on their own.

No... just find the most recent message that the token appeared in,
which would be a quick search through a few message times.  A really
quick search if you're only looking to expire hapaxes.
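
As a rough sketch of that lookup, assume two plain dicts: msgs_by_token maps
each token to the set of message ids it appeared in, and trained_at maps each
message id to its training time.  Both names and both functions are
hypothetical, and "hapax" here means a token present in exactly one trained
message.

```python
def last_seen(token, msgs_by_token, trained_at):
    """Most recent training time among the messages that contained `token`."""
    return max(trained_at[msg_id] for msg_id in msgs_by_token[token])


def expired_hapaxes(msgs_by_token, trained_at, cutoff):
    """Tokens found in exactly one message that was trained before `cutoff`."""
    return sorted(
        tok for tok, msgs in msgs_by_token.items()
        if len(msgs) == 1 and last_seen(tok, msgs_by_token, trained_at) < cutoff
    )


msgs_by_token = {"x": {"a"}, "y": {"a", "b"}, "z": {"b"}}
trained_at = {"a": 10, "b": 50}
print(last_seen("y", msgs_by_token, trained_at))
print(expired_hapaxes(msgs_by_token, trained_at, cutoff=30))
```

For a hapax the max() runs over a single message time, which is why that case
is especially cheap.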

- Alex

From skip at pobox.com  Fri Dec 19 14:57:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 19 16:01:43 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
Message-ID: <16355.22556.564839.561779@montanaro.dyndns.org>


I've been running with mine_received_headers set to True for quite a while.
I fixed a couple nits this morning with the regular expressions used to pick
out hostnames and ip addresses from Received: headers.  The hostname re was
frequently picking up ip addresses and chomping them from the wrong end.  I
am pleased with how well it seems to work at this point(*).  Looking at a
graph or table of the 'received:.*' spamprob distribution shows that (for
me, at least) the bulk of the spamprobs are at or outside of the hapax
points.  See:

    http://www.musi-cal.com/~skip/rcvd.png
    http://www.musi-cal.com/~skip/rcvd.txt

The graph plots the number of features with a given spamprob.  The two
impulses at the hapax points are 523 (0.155...) and 1047 (0.844...).  I
cropped the graph so the smaller values would be visible.
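
For readers wondering where the two hapax points come from, the sketch below
applies a Robinson-style adjusted probability.  The values s=0.45 and x=0.5
are assumed to be the unknown-word strength and unknown-word probability in
use; they are not stated in this message, but they do reproduce the 0.155...
and 0.844... figures.  The function name is made up.

```python
def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Robinson-style adjusted probability; s and x are assumed defaults.

    At least one of spamcount and hamcount must be nonzero.
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    prob = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    return (s * x + n * prob) / (s + n)


# 163 spam and 171 ham trained, as in the database described here.
print(adjusted_spamprob(1, 0, 163, 171))  # token seen once, in spam
print(adjusted_spamprob(0, 1, 163, 171))  # token seen once, in ham
```

With a zero count on one side the raw probability is 0 or 1, so the result
depends only on how many times the token was seen; the same formula gives the
values listed for tokens seen 2 or 5 times in spam and never in ham.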

Obviously, this is still strongly hapax-driven (I have a small database at
the moment - 163 spam, 171 ham), but the data suggests that the hapax values
are pretty good indicators of the direction that feature will take when the
second instance is seen.

While I was messing with the received header regular expressions today I
also noticed that Sendmail sometimes adds "may be forged" to a header.
Here's a bit from the sendmail docs in the context of an open relay
discussion:

    QAA02454: <ESCAPEFOUR@AOL.COM>... Relaying denied
    QAA02454: ruleset=check_rcpt, arg1=<ESCAPEFOUR@AOL.COM>,
            relay=some.domain [10.0.0.1] (may be forged),
        reject=550 <ESCAPEFOUR@AOL.COM>... Relaying denied
    QAA02454: from=<Anonymous@aol.com>, size=0, class=0, pri=0, nrcpts=0,
            proto=SMTP, relay=some.domain [10.0.0.1] (may be forged)

    Here the (may be forged) is the important part: it means that the DNS
    data for the host is inconsistent, and hence the name is not used for
    the relaying check but only the IP number.

This is also a very good spam indicator:

    % spamcounts -r 'may be forged'
    db: /Users/skip/.hammiedb
    token,nspam,nham,spam prob
    bi:received:may be forged received:mx,1,0,0.844827586207
    bi:received:may be forged received:biz,2,0,0.908163265306
    received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:com,1,0,0.844827586207
    bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
    bi:received:may be forged received:il,1,0,0.844827586207

I generate it within the block controlled by the mine_received_headers
option.  A quick scan of my testing databases shows this is overwhelmingly
associated with spam (shows up in 221 out of 6843 spams and only 30 out of
8395 ham).

I'm inclined to trust sendmail on this one and just add it.  It seems like a
very objective feature.  In fact, if other mail transport agents provide
similar clues about forged addresses, I think we should look for their clues
and lump them all into one 'received:may be forged' feature.
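
A minimal sketch of what generating that feature could look like is below.
The regular expression and function name are assumptions for illustration and
are not the code checked into the tokenizer; patterns for other mail transport
agents would have to be added once their wording is known.

```python
import re

# Assumed pattern: sendmail's literal "(may be forged)" annotation.
FORGED_RE = re.compile(r"\(\s*may be forged\s*\)", re.IGNORECASE)


def forged_tokens(received_headers):
    """Yield one 'received:may be forged' token per header with the annotation."""
    for header in received_headers:
        if FORGED_RE.search(header):
            yield "received:may be forged"


headers = [
    "from some.domain (some.domain [10.0.0.1] (may be forged)) by mx.example",
    "from ok.example (ok.example [10.0.0.2]) by mx.example",
]
print(list(forged_tokens(headers)))
```

Emitting the same token string no matter which agent flagged the header is
what lumps all such clues into a single feature.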

Skip

(*) Here's a quick summary of my latest setup.  I'm running from CVS
(natch).  I pushed my cutoffs out to 0.05 and 0.95 and run with bigrams
enabled.  I train on all mistakes and unsures.  I also have it automatically
training on a random 10% of the messages that score as ham or spam.  I tried
training on everything, but the database was growing way too quickly.  The
extreme cutoffs minimize the chance of an fp or fn, either of which would mean
that to untrain I'd have to go find the message and move it from one pile to
the other.  So far, no fp's, a few fn's and fewer unsures than I anticipated.
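
The training regime in this footnote can be written down as a small decision
function.  This is a sketch only: the function name, its parameters and the
treatment of scores exactly on a cutoff are assumptions, and the real setup is
driven by configuration options rather than code like this.

```python
import random


def should_train(score, is_spam, rng=random, ham_cutoff=0.05,
                 spam_cutoff=0.95, sample_rate=0.10):
    """Decide whether to train on a message, given its score and true class."""
    if ham_cutoff <= score < spam_cutoff:
        return True  # unsure: always train
    classified_spam = score >= spam_cutoff
    if classified_spam != is_spam:
        return True  # mistake (fp or fn): always train
    # Confidently and correctly classified: train on a random sample only.
    return rng.random() < sample_rate


print(should_train(0.50, is_spam=False))  # unsure
print(should_train(0.99, is_spam=False))  # false positive
```

Passing in an object with a random() method makes the 10% sampling
reproducible for testing.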

From tim.one at comcast.net  Sat Dec 20 01:06:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 20 01:06:33 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
Message-ID: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net>

As the project admins should already know, SourceForge is integrating a
donations system throughout their site (see the attached for a bit of
detail).

I'd like the people here working their spare-time asses off on SpamBayes to
give this some thought.  We don't *have* to give SpamBayes contributions to
the PSF, and I wouldn't object if the people doing the work here wanted to
split donations among themselves.  It probably wouldn't amount to much, but
even 100 bucks now and again can work wonders for morale.

I don't have a stake in this either way.  My employment contract forbids
doing compensated work on anything other than employer-assigned tasks, so my
fingers aren't allowed in the pie.  That's fine by me.  As a Director of the
PSF, I really appreciate that SpamBayes donations have been given to the PSF
so far, but objectively speaking I think the rationale for doing so is weak
(and the way things seem to be heading, we should probably start giving them
to Sleepycat instead <wink>).

Anyway, give it some thought over the holidays!  It's your project more than
mine, and has been for a long time.  I'll support whatever decision you
make.
-------------- next part --------------
An embedded message was scrubbed...
From: "SourceForge.net Team" <noreply@sourceforge.net>
Subject: SF.NET Project Donation System
Date: Fri, 19 Dec 2003 02:29:33 -0500
Size: 2713
Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031220/07e6863a/attachment.mht
From Guido.DellaBruna at meteoswiss.ch  Sat Dec 20 10:38:19 2003
From: Guido.DellaBruna at meteoswiss.ch (Guido.DellaBruna@meteoswiss.ch)
Date: Sat Dec 20 10:38:23 2003
Subject: [spambayes-dev] SpamBayes and Outlook on Metaframe
Message-ID: <27E7C74777D477408C644B95A24E6F87041045@lomex01.meteoswiss.ch>

Hello,

sorry, I'm not sure if this is the right place for such questions, but
here it goes:

I would like to install SpamBayes Outlook-Plugin on Citrix-Metaframe.
The problem seems to be the directory where to install the DB: in "My
Computer" I only have access to some network drives (no local "C:", for
example) and to a "local" folder named "My Documents". Is it possible to
instruct SpamBayes to use that folder for the Spam database (and any
other file needed by SpamBayes)? Or do I need to modify the Python code
and recompile it? I didn't find a way to change this in the GUI.

Many thanks,

-- 
Guido Della Bruna
Processo Meteo Locarno
MeteoSvizzera
Via ai Monti 146
CH-6605 Locarno 5 Monti
Svizzera

From nobody at spamcop.net  Fri Dec 19 17:58:10 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Sat Dec 20 13:48:10 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEBMHOAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACACEAGGPAA.nobody@spamcop.net>

> [Tim Peters]
> There are messages I never want to expire.  That creates major new UI
> headaches to be doable.  I believe (but don't yet know) that expiring
> hapaxes can be done without need for user intervention, and without harm.

I hope the "without harm" part is true.  See my question two sections down.


> [Tim Peters]
> At some point, if you want to try your ideas, *try* your ideas <wink> --
> that's what Open Source is all about.  Everyone is born knowing how to
> program in Python, although most don't realize it until they try.

I admit I wasn't aware that I could program in Python since birth, but I'm
willing to take your word on that.  We all have hidden potential.  So that I
don't have to re-invent that round thing with the axle in the middle, could
someone please give me some hints as to which of the mapping features we've
discussed in this thread exist or will soon exist and where I can look for
them?  I saw on spambayes-dev that there is discussion of a new database, so
I don't want to go off on a useless fork with the present db if that comes
to pass.  Search for your inner newbie when you answer this.


> > [Seth Goodman]
> > I agree completely.  This was an important motivation for expiring a
> > whole message at a time.  Training mistakes would eventually drop out
> > of the database without user intervention.  Not that a tool to help
> > track down training mistakes wouldn't be great, but a "casual" user
> > could still make occasional mistakes and the system would recover by
> > itself.
>
> [Tim Peters]
> Without intervention, it will also expire the screaming bright-red HTML
> birthday message sent by my favorite 7-year-old niece, and when
> she's 8 the
> next one may get tagged as spam.  These are the kinds of messages I never
> want to expire.  ...

Here lies my concern.  I sincerely hope that correct classification of these
infrequent, unusual messages is not hapax-driven.  If it is, the result of
pruning infrequently-used hapaxes will be as bad as deleting the whole
message.  If that is the case, the _only_ solution will be to keep either
those hapaxes or the whole message trained forever.  Either way, I agree
this is a big UI problem without an obvious intuitive solution.

It does appear from looking at the scoring of some of my "typical" messages
that hapaxes don't contribute much, as you've said before.  Could you look
at the scoring of a couple of those special messages and tell if their
scoring would be seriously affected if the hapaxes were gone?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From richie at entrian.com  Sat Dec 20 15:42:28 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Dec 20 15:42:41 2003
Subject: [spambayes-dev] Re: [Spambayes] SpamBayes and financial sponsorship
In-Reply-To: <2045949.1071845232983.JavaMail.jboss@p15135617.pureserver.info>
References: <2045949.1071845232983.JavaMail.jboss@p15135617.pureserver.info>
Message-ID: <mib9uvknmqkt1tlfhflscdsenfli9jafh5@4ax.com>

Hi Dawn,

Replying on behalf of the SpamBayes team:

> You have received this email because your project has been nominated for
> financial sponsorship by Gary Daw. [...] The panel's decision will be
> emailed to you just as soon as your nomination has been evaluated.

Great!  We're very pleased to hear it, and we look forward to hearing the
decision.

> PS. We are considering providing facilities to our members for managing their own 
> projects, such as CVS repositories, issue tracking and mailing lists.  Would that
> be something that you would be interested in?

We already use SourceForge for CVS and issue tracking, and we run our own
mailing lists.  As far as I'm aware, we're all happy with our current
setup.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Sat Dec 20 15:56:04 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Dec 20 15:56:15 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE68jx2XDniY50000e32d@hotmail.com>
References: <LNBBLJKPBEHFEDALKOLCGEBIHOAB.tim.one@comcast.net>
	<Law11-OE68jx2XDniY50000e32d@hotmail.com>
Message-ID: <ghd9uvs0dc0kipcr0b5mvar0fdf43m1lgp@4ax.com>


[Tim and Barry]
> [much ZODB and bsddb wisdom]

Thanks, guys.  I'll try to do two things over Christmas:

 o Write a script to hammer the current SpamBayes bsddb code, to try to
   reproduce the problems we've been seeing and to test the second thing:

 o Write a ZODB-on-BDB storage for SpamBayes.

[Kenny]
> Maybe the best thing is to throw some test code
> into SpamBayes and see if it will even start up on Win98.  I don't have
> access to a Win98 test system, but if I can code up enough support that
> we can try this out, would you be willing to give it a test?  It will
> probably be after the holidays before I can get to it, but we'll see.

I have a win98 environment that I'll be happy to run test code in.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Sat Dec 20 17:04:40 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Dec 20 17:04:52 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net>
Message-ID: <ire9uv830bs6j3i7eta1kjt5m0brv2ek3j@4ax.com>


[Tim]
> As the project admins should already know, SourceForge is integrating a
> donations system throughout their site (see the attached for a bit of
> detail).
> 
> I'd like the people here working their spare-time asses off on SpamBayes to
> give this some thought.  We don't *have* to give SpamBayes contributions to
> the PSF, and I wouldn't object if the people doing the work here wanted to
> split donations among themselves.  It probably wouldn't amount to much, but
> even 100 bucks now and again can work wonders for morale.

This is a great idea in principle, but devising a fair system for
distributing the donations would be difficult.  Trying to measure people's
contributions, when that means original code, bugfixes, patches,
contributions to spambayes-dev, work on the web site, providing support to
users, writing documentation, admin... it's more difficult than SpamBayes
itself.  (Anyway, we all know Mark deserves all of it for fighting Outlook
all this time.  And in Australian dollars, 100 bucks US would set him up
for life! 8-)

A couple of ideas do spring to mind, though:

Anyone who's spent real money on the project, like Rob with the
spambayes.org domain, could be reimbursed.

We could add developer links to the Donations page, so that if a user
wanted to donate to a specific developer, he could.  Though that in itself
raises fairness problems: who gets an entry on that page?  Do they get to
write their own entry (for example, could Barry put up an entry saying
"Bassist. Please help." - that would be grossly unfair to those of us
without that kind of affliction.)

I'd love to make money from SpamBayes (and my wife would *really* love it
8-) but I wouldn't want to leave others feeling short changed.

[Tim]
> Anyway, give it some thought over the holidays!  It's your project more than
> mine, and has been for a long time.  I'll support whatever decision you
> make.

And if we're ever in the same bar at the same time, you won't need to buy
any drinks.  8-)

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Sat Dec 20 17:58:54 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat Dec 20 17:59:15 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <ire9uv830bs6j3i7eta1kjt5m0brv2ek3j@4ax.com>
Message-ID: <029e01c3c74c$d84a8470$2c00a8c0@eden>

[Richie]
> This is a great idea in principle, but devising a fair system for
> distributing the donations would be difficult.  Trying to
> measure people's
> contributions, when that means original code, bugfixes, patches,
> contributions to spambayes-dev, work on the web site,
> providing support to
> users, writing documentation, admin... it's more difficult
> than SpamBayes
> itself.

Agreed.  A more practical problem is that someone would need to collect this
money, and this would have tax implications, even if they tried to say they
were just "holding" it.

> (Anyway, we all know Mark deserves all of it for
> fighting Outlook
> all this time.  And in Australian dollars, 100 bucks US would
> set him up for life! 8-)

With change :)

> A couple of ideas do spring to mind, though:
>
> Anyone who's spent real money on the project, like Rob with the
> spambayes.org domain, could be reimbursed.

I agree, but not sure how this could work in practical terms with the tax
and holding issues.

> We could add developer links to the Donations page, so that if a user
> wanted to donate to a specific developer, he could.  Though
> that in itself
> raises fairness problems: who gets an entry on that page?  Do
> they get to
> write their own entry (for example, could Barry put up an entry saying
> "Bassist. Please help." - that would be grossly unfair to those of us
> without that kind of affliction.)
>
> I'd love to make money from SpamBayes (and my wife would
> *really* love it
> 8-) but I wouldn't want to leave others feeling short changed.

I think there is something here.  One approach would be that any listed
"developers" are eligible.  Our "donations" page could list the developers,
and include a link to their personal sourceforge page.  What they say about
themselves there is their issue.  Our "donations" page makes no attempt to
guide towards individuals - all developers are considered equal.  If you
gain the respect and credibility to be listed as a developer, and opt-in
with your paypal account number, then you qualify to receive donations.  The
developers make no attempt to guide people to anyone - we just point at the
donations page, and shut up.

A risk is that this will lead to lots of people trying to become developers.
We may need a semi-formal process for new members, maybe borrowing from the
Python +-1/+-0 system - all developers must vote, and one +1 is required,
and a single -0 is a veto.

To avoid conflicts (which I doubt, but worth covering), we could adopt the
same system for the entire donations system - a single developer could
choose to veto the whole scheme (presumably as they felt it unfair).  In
that case, we drop the entire scheme, and move back to 100% going to the
PSF.

Extending this a little to handle reimbursing real costs - assuming the
person with the cost is a developer, then we *could* put a note at the top
of the 'donations' page saying 'please pay this person first, as he has real
costs to recover'.  Once these are recovered, this developer moves back to
the 'normal' list.

> [Tim]
> > Anyway, give it some thought over the holidays!  It's your
> project more than
> > mine, and has been for a long time.  I'll support whatever
> decision you
> > make.
>
> And if we're ever in the same bar at the same time, you won't
> need to buy
> any drinks.  8-)

Isn't it a shame all these fine taxation people won't let us start a
SpamBayes slush fund :)

Mark.


From mhammond at skippinet.com.au  Sat Dec 20 18:00:38 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sat Dec 20 18:00:55 2003
Subject: [spambayes-dev] SpamBayes and Outlook on Metaframe
In-Reply-To: <27E7C74777D477408C644B95A24E6F87041045@lomex01.meteoswiss.ch>
Message-ID: <029f01c3c74d$16b7cf60$2c00a8c0@eden>

The GUI does not expose this option, but if you read the 'Configuration
Guide' (available via the 'About' document after installation) you will find
information on how to manually configure the data directory SpamBayes uses.

Mark.

> Hello,
>
> sorry, I'm not sure if this is the right place for such questions, but
> here it goes:
>
> I would like to install SpamBayes Outlook-Plugin on Citrix-Metaframe.
> The problem seems to be the directory where to install the DB: in "My
> Computer" I only have access to some network drives (no local
> "C:", for
> example) and to a "local" folder named "My Documents". Is it
> possible to
> instruct SpamBayes to use that folder for the Spam database (and any
> other file needed by SpamBayes)? Or do I need to modify the
> Python code
> and recompile it? I didn't find a way to change this in the GUI.
>
> Many thanks,
>
> --
> Guido Della Bruna
> Processo Meteo Locarno
> MeteoSvizzera
> Via ai Monti 146
> CH-6605 Locarno 5 Monti
> Svizzera
>
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev@python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev


From barry at python.org  Sat Dec 20 18:40:12 2003
From: barry at python.org (Barry Warsaw)
Date: Sat Dec 20 18:40:36 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <ire9uv830bs6j3i7eta1kjt5m0brv2ek3j@4ax.com>
References: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net>
	<ire9uv830bs6j3i7eta1kjt5m0brv2ek3j@4ax.com>
Message-ID: <1071963612.17967.22.camel@anthem>

On Sat, 2003-12-20 at 17:04, Richie Hindle wrote:

> Do they get to
> write their own entry (for example, could Barry put up an entry saying
> "Bassist. Please help." - that would be grossly unfair to those of us
> without that kind of affliction.)

Plus, think of all the bummed out guitar players -- oh wait, they're
guitar players, they gets the chicks, they don't needs the money.  Of
course, since I've managed to contribute so little to the project, it's
only fair I get the lion's share of the dough.  And because I have the
same employer as Tim, you're going to have to put "for funky
accompaniment" in the memo section of the checks just to make everything
copacetic. 

ain't-quittin'-my-day-job-for-this-one-either-ly y'rs,
-Barry


From anthony at interlink.com.au  Sat Dec 20 22:18:30 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Sat Dec 20 22:18:52 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net> 
Message-ID: <200312210318.hBL3IUcW017811@localhost.localdomain>


As others have raised, trying to distribute this money would be
somewhat tricky. I'd suggest that we instead kick any money we
receive via SourceForge back to SourceForge. They run a service
we all depend on, and I'd like to see them being able to continue
to do so.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From matt at mondoinfo.com  Sat Dec 20 22:39:34 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Dec 20 22:39:41 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <200312210318.hBL3IUcW017811@localhost.localdomain>
References: <LNBBLJKPBEHFEDALKOLCMEJIHOAB.tim.one@comcast.net>
	<200312210318.hBL3IUcW017811@localhost.localdomain>
Message-ID: <1071977330.29.10858@mint-julep.mondoinfo.com>

> As others have raised, trying to distribute this money would be
> somewhat tricky. I'd suggest that we instead kick any money we
> receive via SourceForge back to SourceForge. They run a service we
> all depend on, and I'd like to see them being able to continue to
> do so.

In the absence of a more direct way to support SpamBayes or
individual developers, I'd suggest routing donations to the Python
Software Foundation. It's a properly-organized charity so it's
equipped to receive them. And I don't think there's any reason that
it can't spend money to support SpamBayes when there's an opportunity
to. I also understand that that's apt to help the PSF and,
indirectly, Python since the PSF has to show the government that it
gets donations from lots of different people.

Full disclosure: I'm a member of the PSF but don't sit on its board.

Regards,
Matt


From tim.one at comcast.net  Sat Dec 20 23:27:22 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 20 23:27:26 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16355.22556.564839.561779@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEOKHOAB.tim.one@comcast.net>

[Skip Montanaro]
> I've been running with mine_received_headers set to True for quite
> awhile. I fixed a couple nits this morning with the regular
> expressions used to pick out hostnames and ip addresses from
> Received: headers.  The hostname re was frequently picking up ip
> addresses and chomping them from the wrong end.  I am pleased with
> how well it seems to work at this point(*).  Looking at a graph or
> table of the 'received:.*' spamprob distribution shows that (for me,
> at least) the bulk of the spamprobs are at or outside of the hapax
> points.  See:
>
>     http://www.musi-cal.com/~skip/rcvd.png
>     http://www.musi-cal.com/~skip/rcvd.txt
>
> The graph plots the number of features with a given spamprob.  The two
> impulses at the hapax points are 523 (0.155...) and 1047 (0.844...).
> I cropped the graph so the smaller values would be visible.
>
> Obviously, this is still strongly hapax-driven (I have a small
> database at the moment - 163 spam, 171 ham), but the data suggests
> that the hapax values are pretty good indicators of the direction
> that feature will take when the second instance is seen.

Cool!  Thanks for the good work.  I'll give this a try too.
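
For anyone wondering where the two hapax points (0.155... and 0.844...)
come from, here is a minimal sketch of the per-token spamprob calculation.
It assumes the classifier's unknown_word_strength is 0.45 and its
unknown_word_prob is 0.5, and the function name is only illustrative; it
is not a copy of the classifier code.

```python
def spamprob(spamcount, hamcount, nspam, nham,
             unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Bayesian-adjusted spam probability for a single token.

    spamcount/hamcount: trained messages of each kind containing the token.
    nspam/nham: total trained messages of each kind.
    The two keyword defaults are assumed values, not read from a config.
    """
    # Fraction of the trained spam (and ham) that contains the token.
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    # Raw estimate from the counts alone.
    prob = spamratio / (spamratio + hamratio)
    # Pull the raw estimate toward the prior; the fewer times the token
    # has been seen (n), the stronger the pull.
    n = spamcount + hamcount
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * prob) / (s + n)


# Skip's database: 163 spam, 171 ham.
spam_hapax = spamprob(1, 0, nspam=163, nham=171)  # seen once, in spam
ham_hapax = spamprob(0, 1, nspam=163, nham=171)   # seen once, in ham
print(spam_hapax, ham_hapax)  # roughly 0.8448 and 0.1552
```

With a single sighting the raw estimate is exactly 0 or 1, so every hapax
lands on one of the same two values regardless of database size, which is
why the graph shows two tall impulses.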


> While I was messing with the received header regular expressions
> today I also noticed that Sendmail sometimes adds "may be forged" to
> a header. Here's a bit from the sendmail docs in the context of an
> open relay discussion:
>
>     QAA02454: <ESCAPEFOUR@AOL.COM>... Relaying denied
>     QAA02454: ruleset=check_rcpt, arg1=<ESCAPEFOUR@AOL.COM>,
>             relay=some.domain [10.0.0.1] (may be forged),
>         reject=550 <ESCAPEFOUR@AOL.COM>... Relaying denied
>     QAA02454: from=<Anonymous@aol.com>, size=0, class=0, pri=0,
>             nrcpts=0, proto=SMTP, relay=some.domain [10.0.0.1]
>             (may be forged)
>
>     Here the (may be forged) is the important part: it means that the
>     DNS data for the host is inconsistent, and hence the name is not
>     used for the relaying check but only the IP number.
>
> This is also a very good spam indicator:
>
>     % spamcounts -r 'may be forged'
>     db: /Users/skip/.hammiedb
>     token,nspam,nham,spam prob
>     bi:received:may be forged received:mx,1,0,0.844827586207
>     bi:received:may be forged received:biz,2,0,0.908163265306
>     received:may be forged,5,0,0.95871559633
>     bi:received:may be forged received:com,1,0,0.844827586207
>     bi:received:127.0.0.1 received:may be forged,5,0,0.95871559633
>     bi:received:may be forged received:il,1,0,0.844827586207
>
> I generate it within the block controlled by the mine_received_headers
> option.  A quick scan of my testing databases shows this is
> overwhelmingly associated with spam (shows up in 221 out of 6843
> spams and only 30 out of 8395 ham).
>
> I'm inclined to trust sendmail on this one and just add it.  It seems
> like a very objective feature.

I agree -- it's extremely unlikely to lose.  The ones to worry about are
things spammers could inject to push things in the ham direction, but
they're not gonna get far forging "may be forged" unless I have a *very*
weird idea of ham <wink>.

> In fact, if other mail transport agents provide similar clues about
> forged addresses, I think we should look for their clues and lump them
> all into one 'received:may be forged' feature.

I noticed this in the headers of a spam today:

Received: from shawmail-cg-shawcable-net
	(c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender))
	by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP
	id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000

It's the "(untrusted sender)" part that's interesting.  I'd suggest *not*
folding that in with "may be forged", though.  There probably aren't a lot
of strings of this nature, so the database burden should be trivial, and I
*bet* different strings will prove to have different spamprobs.
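
A rough sketch of what "one token per remark" could look like follows.
The function name and the list of remarks are hypothetical, and this is
not the tokenizer's actual code; it only illustrates keeping each MTA
remark as its own feature instead of lumping them together.

```python
import re

# Parenthesised remarks that MTAs add to Received: lines.  Each one gets
# its own token.  The contents of this tuple are an assumption.
MTA_REMARKS = ("may be forged", "untrusted sender")


def received_remark_tokens(received_header):
    """Yield one 'received:<remark>' token per known remark present."""
    # Collapse folded-header whitespace so a remark that was wrapped
    # across continuation lines still matches.
    text = re.sub(r"\s+", " ", received_header).lower()
    for remark in MTA_REMARKS:
        if "(" + remark + ")" in text:
            yield "received:" + remark


header = """from shawmail-cg-shawcable-net
\t(c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender))
\tby rwcrmxc11.comcast.net (rwcrmxc11) with SMTP"""
print(list(received_remark_tokens(header)))
```

Keeping the remarks separate costs one database entry per distinct string,
and lets each one earn its own spamprob.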

> (*) Here's a quick summary of my latest setup.  I'm running from CVS
> (natch).  I pushed my cutoffs out to 0.05 and 0.95 and run with
> bigrams enabled.  I train on all mistakes and unsures.  I also have
> it automatically training on a random 10% of the messages that score
> as ham or spam.  I tried training on everything, but the database was
> growing way too quickly.  The extreme cutoffs minimize the chance of
> a fp or fn which would mean to untrain I have to go find the message
> and move it from one pile to the other.  So far, no fp's, a few fn's
> and fewer unsures than I anticipated.

I'm running 0.04 and 0.95 with bigrams now, sticking to just
mistake-and-unsure training, after seeding with 50 of each, although the
seeds were the most recent trained on from my mistake-and-unsure-trained
unigram classifier.  Am at about 145 of each now.  I don't trust it yet --
it's still surprising too often.  I had disappointing results with a purely
mistake/unsure-trained unigram classifier before; the bigram one isn't
disappointing so far, it just leaves me cautious after a few days.  I expect
(without proof) that *some* random component is very helpful, at least to
get the thing started.
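
To make the two regimes concrete, here is a small sketch of the training
decision.  All names are made up for illustration, and the exact boundary
handling (strict versus inclusive comparison at the cutoffs) is an
assumption.  Setting random_fraction to 0.0 gives pure mistake-and-unsure
training; 0.10 gives the "plus a random 10%" variant.

```python
import random


def classify(score, ham_cutoff, spam_cutoff):
    """Map a score in [0, 1] onto 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def should_train(score, true_label, ham_cutoff=0.05, spam_cutoff=0.95,
                 random_fraction=0.10, rng=random):
    """Decide whether a message is fed back into training.

    true_label is 'ham' or 'spam', as judged by the user.
    """
    verdict = classify(score, ham_cutoff, spam_cutoff)
    if verdict != true_label:
        # Covers both outright mistakes and unsures.
        return True
    # Correctly classified: train only on a random sample.
    return rng.random() < random_fraction
```

For example, should_train(0.5, "ham") is always true (an unsure), while a
ham scoring 0.01 is trained on only about one time in ten.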

It's still 89% hapax.  I had expected that percentage to drop by now, but
without a random component I'm not sure that was a reasonable expectation:

 spam+ham     count         %      cumm
        1     63611     88.85     88.85
        2      4126      5.76     94.61
        3      1377      1.92     96.54
        4       680      0.95     97.49
        5       397      0.55     98.04
        6       255      0.36     98.40
        7       178      0.25     98.65
        8       134      0.19     98.83
        9       109      0.15     98.98
       10        70      0.10     99.08
 ...
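
A table like the one above can be produced with a few lines of Python.
This is only a sketch: it assumes the token database can be viewed as a
mapping from token to a (spamcount, hamcount) pair, which is a
simplification of the real storage classes.

```python
from collections import Counter


def count_distribution(token_counts):
    """Return rows of (spam+ham, count, percent, cumulative percent).

    token_counts maps token -> (spamcount, hamcount).
    """
    # How many tokens have each total number of sightings.
    hist = Counter(s + h for s, h in token_counts.values())
    ntokens = sum(hist.values())
    rows = []
    cumulative = 0.0
    for total in sorted(hist):
        pct = 100.0 * hist[total] / ntokens
        cumulative += pct
        rows.append((total, hist[total], pct, cumulative))
    return rows


toy_db = {"a": (1, 0), "b": (0, 1), "c": (1, 1), "d": (3, 0)}
print(" spam+ham     count         %      cumm")
for total, count, pct, cum in count_distribution(toy_db):
    print("%9d %9d %9.2f %9.2f" % (total, count, pct, cum))
```

The first row (total of 1) is the hapax share.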


From tim at fourstonesExpressions.com  Sat Dec 20 23:59:46 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Sat Dec 20 23:59:53 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System 
In-Reply-To: <200312210318.hBL3IUcW017811@localhost.localdomain>
References: <200312210318.hBL3IUcW017811@localhost.localdomain>
Message-ID: <opr0ilhwhcit6vze@mail.fourstonesExpressions.com>

On Sun, 21 Dec 2003 14:18:30 +1100, Anthony Baxter 
<anthony@interlink.com.au> wrote:

>
> As others have raised, trying to distribute this money would be
> somewhat tricky. I'd suggest that we instead kick any money we
> receive via SourceForge back to SourceForge. They run a service
> we all depend on, and I'd like to see them being able to continue
> to do so.
>
> Anthony
>

+1 from me.

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From tim.one at comcast.net  Sun Dec 21 00:03:07 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 21 00:03:12 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <1071977330.29.10858@mint-julep.mondoinfo.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEOMHOAB.tim.one@comcast.net>

[Matthew Dixon Cowles]
> In the absence of a more direct way to support SpamBayes or
> individual developers, I'd suggest routing donations to the Python
> Software Foundation.

Matt, we already do.  Visit

    http://spambayes.sourceforge.net/donations.html

for proof <wink>.  There's a PayPal button on that page, which contributes
directly to the PSF now.  Some users have done that.

> It's a properly-organized charity so it's equipped to receive them.
> And I don't think there's any reason that it can't spend money to
> support SpamBayes when there's an opportunity to.

Me neither, and I want the PSF to do things like that, but I think it's
still a long way off.  The PSF's work is all done by unpaid "spare time"
volunteers too, and while we're accomplishing what we need to accomplish
there, it's in very slow motion.  We're overwhelmingly still bogged down
trying to clean up legalities; e.g., after something like 2 years in
existence, we're still trying to get legally sound contribution (code, not
money) forms established.

I expect that people contributing cash to SpamBayes would like a more direct
connection, and I sympathize.

> I also understand that that's apt to help the PSF and, indirectly,
> Python since the PSF has to show the government that it gets
> donations from lots of different people.

The SpamBayes-derived contributions have helped a lot in moving the PSF
toward meeting the so-called "public support ratio" test, which the PSF must
meet to retain public charity status.  I don't think the PSF *needs* the
SpamBayes contributions for that, though.

> Full disclosure: I'm a member of the PSF but don't sit on its board.

If you want to, you probably can <wink>.

Everyone should keep in mind that we're not talking big money here.  The
total contributed to the PSF from all sources so far wouldn't pay one
person's salary, and the SpamBayes donations are a small part of that.  The
PSF's Treasurer could break out exact numbers, but I don't want to bother
him -- when I said "100 bucks now & again" earlier, that's the right
ballpark, given current contributions.  OTOH, I expect SpamBayes
contributions would increase if there were a plausibly direct connection
between giving money and getting back a better SpamBayes someday.


From tim.one at comcast.net  Sun Dec 21 01:21:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Dec 21 01:21:30 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEOKHOAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEPCHOAB.tim.one@comcast.net>

Good news and bad news on mine_received_headers in my classifier now.

The good news is that it *generally* made ham hammier and spam spammier.

The bad news is that spam leaking thru python.org mailing lists is much more
likely to score as ham than unsure as before, due to the large number of new
python.org-related clues.  The lowest-scoring spam in my training data now
is this:

"""
...
Subject: HOT OPPORTUNITY
...

    JUST CHECK OUT MY WEBSITE

http://www.webspawner.com/users/hawkk/index.html
--
http://mail.python.org/mailman/listinfo/python-list
"""

It turns out I've actually trained on two copies of that one, but despite
that it's scoring only 19 now:

Combined Score: 19% (0.193802)
Internal ham score (*H*): 0.85051
Internal spam score (*S*): 0.238114
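
For reference, the combined score is just the two internal scores folded
into a single number in [0, 1].  A minimal sketch of that last step (the
function name is illustrative):

```python
def combined_score(ham_score, spam_score):
    """Fold the internal *H* and *S* indicators into one score.

    Strong ham evidence (H near 1) pushes the result toward 0, strong
    spam evidence (S near 1) pushes it toward 1, and when both are
    equally strong they cancel out at 0.5.
    """
    return (spam_score - ham_score + 1.0) / 2.0


print(combined_score(0.85051, 0.238114))  # about 0.1938, i.e. 19%
```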

These are all the "ah, this came from a python.org mailing list" features
now, more than doubling the number of such features before:

'url:mailman'                       0.128016
'url:listinfo'                      0.130533
'url:python'                        0.135874
'bi:proto:http url:mail'            0.138712
'url:python-list'                   0.145499
'received:127'                      0.146801
'received:127.0'                    0.146801
'received:127.0.0'                  0.146801
'received:127.0.0.1'                0.146801
'bi:received:12.155.117.29 received:localdomain' 0.1549
'received:localhost.localdomain'    0.16481
'sender:addr:python-list-bounces+tim.one=comcast.net' 0.168566
'sender:addr:python.org'            0.168824
'received:12'                       0.211812
'bi:to:addr:python.org to:no real name:2**0' 0.213042
'received:12.155'                   0.214529
'received:12.155.117'               0.214529
'received:mail.python.org'          0.214529
'received:python.org'               0.214529
'url:org'                           0.221085

So it's got 11(!) new correlated clues extracted from two Received headers:

Received: from mail.python.org ([12.155.117.29])
	by sccrmxc14.comcast.net (sccrmxc14) with ESMTP
	id <20031211091604s14001ch25e>; Thu, 11 Dec 2003 09:16:04 +0000
X-Originating-IP: [12.155.117.29]
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
	by mail.python.org with esmtp (Exim 4.22) id 1AUMvU-0000lU-7N
	for tim.one@comcast.net; Thu, 11 Dec 2003 04:16:04 -0500

If I were doing train-on-everything instead of just mistakes, I'm afraid the
spamprobs on the python.org clues would approach 0 (I get a couple hundred
ham from mail.python.org every day, but typically no spam from there) --
then we'd be close to "spectacular failure" territory, for such very short
spam.

Something to be aware of, anyway!

On the other side, all the ham in my training data scores 0 now (rounded to
two digits), which I've never seen before.  That's remarkable since the only
ham in there came from mistakes and unsures (50 left over from my unigram
classifier, about 100 added since then).  Only 5 training spam don't score
100 (rounded), which are exactly the 5 training spam that came from a
python.org mailing list.  Overall, that's also better than I've seen before,
although the bit of python.org spam is doing worse than I've seen before
(for the obvious reason explained above).


From matt at mondoinfo.com  Sun Dec 21 14:08:03 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sun Dec 21 14:09:09 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEOMHOAB.tim.one@comcast.net>
References: <1071977330.29.10858@mint-julep.mondoinfo.com>
	<LNBBLJKPBEHFEDALKOLCKEOMHOAB.tim.one@comcast.net>
Message-ID: <1072030630.02.10929@mint-julep.mondoinfo.com>

[Tim, on donations to the PSF]
> Matt, we already do.  Visit

> http://spambayes.sourceforge.net/donations.html

> for proof <wink>.  There's a PayPal button on that page, which
> contributes directly to the PSF now.  Some users have done that.

Actually, I was aware of that. But since there's a new mechanism that
may be better or cheaper or at least different it might be confusing
to accept donations that go to different places.

> I expect that people contributing cash to SpamBayes would like a
> more direct connection, and I sympathize.

Agreed. But since no one has yet suggested a good way to donate
directly to SpamBayes, I think that having the money go to the PSF is
a trifle closer to supporting SpamBayes than having the money go to
SourceForge. Other people may have different thoughts.

>> Full disclosure: I'm a member of the PSF but don't sit on its
>> board.

> If you want to, you probably can <wink>.

You're not getting off the board that easily <wink>.

Regards,
Matt


From tameyer at ihug.co.nz  Sun Dec 21 20:35:09 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Dec 21 20:35:49 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7A6B@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A13@its-xchg4.massey.ac.nz>

[Tim]
> I'd like the people here working their spare-time asses off on 
> SpamBayes to give this some thought.  We don't *have* to give 
> SpamBayes contributions to the PSF, and I wouldn't object if the 
> people doing the work here wanted to split donations among 
> themselves.  

I'm happy with any contributions going to the PSF, or SF, or whatever 'good
cause' people like most (including a new yacht for Mark <wink>).  If
contributions were more likely to be 100 bucks every day or two instead of
now and again, I might think differently ;)  In addition, if I received
anything, I don't know how that would be viewed here - if it was seen as
payment for work, then I'd have to greatly reduce the amount of time I muck
about with spambayes.

That said, although that's my preference, I don't feel strongly against an
'individual' system, if that's the majority preference.

> It probably wouldn't amount to much, but even 100 bucks now 
> and again can work wonders for morale.

True, but knowing (or wondering) that others are getting 100 bucks now and
again and you're not, or knowing that people could be giving you a buck now
and then, and aren't, might be negative for morale.

[Richie]
> (Anyway, we all know Mark deserves all of it for 
> fighting Outlook all this time.  And in Australian dollars, 
> 100 bucks US would set him up for life! 8-)

Or in NZ dollars, I could buy most of Narnia & Middle Earth ;)

> Anyone who's spent real money on the project, like Rob with 
> the spambayes.org domain, could be reimbursed.

I think if the money does keep going to the PSF, then we ought to be able to
convince them to fork out for that sort of thing (is there anything apart
from the domain that costs at the moment?).  From my ignorant perspective,
that seems easier in terms of tax etc, too.

> We could add developer links to the Donations page, so that 
> if a user wanted to donate to a specific developer, he could. 

It seems to me that this would end up being more of a donate-for-support
page, which leaves out those people that support but don't develop.  My
personal suspicion is that people are more likely to want to donate for
support than development, anyway.

=Tony Meyer


From tameyer at ihug.co.nz  Sun Dec 21 23:44:58 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Dec 21 23:45:05 2003
Subject: [spambayes-dev] "X-" as a prefix for experimental options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304314CAC@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677795@its-xchg4.massey.ac.nz>

[Mark]
> A problem I see is that the users will have no way
> of measuring any changes. The binaries don't come with
> any of the test tools, and relying on lots of people
> giving subjective results doesn't seem useful.
[...]
> I think we need some kind of better, application based testing 
> framework first.  The scripts we use now predate all of the 
> applications, and I can never remember how to run them.  If I could 
> just get a test tool to run directly over Outlook folders, we would be 
> much closer (for Outlook anyway <wink>).  This needn't be too hard
> - just abstracting the test tools a little so they allow
> sub-classes to extract the actual message streams for the
> test runs.

I've made (a very rough) start to something like this and checked it in.  If
you apply the attached patch to sb_server.py and then go to
http://localhost:8880/cv, you'll be presented with a page for running the
'timcv' test (defaults against any of the experimental/deprecated options).

It's all very rough at the moment, but I'd be interested to know if people
thought that this would be user friendly enough (for advanced users, not
everyone), or ideas about other ways to go about it.
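
As one possible shape for Mark's "let sub-classes extract the message
streams" idea, here is a rough sketch.  Every class and function name in
it is hypothetical (none of this is in CVS), and the 0.20/0.90 cutoffs
are just placeholder defaults.

```python
from abc import ABC, abstractmethod


class MessageSource(ABC):
    """Where test messages come from: Outlook folders, mbox files, ..."""

    @abstractmethod
    def ham(self):
        """Return an iterable of known-ham messages."""

    @abstractmethod
    def spam(self):
        """Return an iterable of known-spam messages."""


class ListSource(MessageSource):
    """Trivial in-memory source, handy for trying out the harness."""

    def __init__(self, ham, spam):
        self._ham, self._spam = list(ham), list(spam)

    def ham(self):
        return iter(self._ham)

    def spam(self):
        return iter(self._spam)


def run_test(source, score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Score every message from the source and tally the outcomes."""
    results = {"fp": 0, "fn": 0, "unsure": 0, "total": 0}
    for is_spam, stream in ((False, source.ham()), (True, source.spam())):
        for msg in stream:
            results["total"] += 1
            s = score(msg)
            if ham_cutoff <= s < spam_cutoff:
                results["unsure"] += 1
            elif is_spam and s < ham_cutoff:
                results["fn"] += 1
            elif not is_spam and s >= spam_cutoff:
                results["fp"] += 1
    return results
```

An Outlook or sb_server front end would then only need to supply its own
MessageSource subclass and a scoring callable; the tallying and reporting
would be shared.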

[Mark]
> Ultimately, we end up with a simple way for either Outlook
> or sb_server to run tests over the training sets, and report succinct
> results.  Otherwise, I doubt anything will change in terms of the 
> number of *users* running tests (let alone developers <wink>)

I definitely agree that this is needed :)  If something like this does end
up in sb_server, then it would be extremely simple to add it to Outlook,
too.  In fact, if it was presented in a message (like "Show Clues") then the
exact html could maybe be used <wink>.

=Tony Meyer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ttui.patch
Type: application/octet-stream
Size: 1076 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031222/6a66af97/ttui.obj
From tim.one at comcast.net  Mon Dec 22 00:24:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 00:24:36 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031219165537.EDB162DF7F@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net>

[T. Alexander Popiel]
> Actually, there have been experiments done (by me) with expiry of
> whole messages.

Yes.  By "the project" having experience I mean controlled tests run by
several across their own email mix, using exactly the same strategy, with
reporting and analysis and all that good stuff.  We've done little of that
(as a group) over the last year.

> I invite you to look at the 'expire4months' regime for my incremental
> testing harness.  Performance was worse than remembering everything,
> but significantly better than mistake-based training (with the
> 'fpfnunsure' regime).
>
> I have not done any experiments with just nuking hapaxes; I didn't see
> any reason to do a partial job instead of a full one.

There may not be one.  The question arose specifically in the context of the
mixed unigram/bigram classifier, which grows the database at a much faster
rate.  I've got ~90% hapaxes after a couple days with that, and the database
is already 3x larger than after months of mistake/unsure training under the
pure-unigram classifier.   Expiring a full message doesn't seem to make
sense after two days, or even after a week; expiring unused hapaxes may;
that's for experiment to decide.

>>> I know you're not arguing that, but if there were bidirectional
>>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole
>>> messages.
>>>
>>> That would obviate the need to track last time seen for every token.

>> Only if you don't want also to be able to expire tokens on their own.

> No... just find the most recent message that the token appeared in,
> which would be a quick search through a few message times.  A really
> quick search if you're only looking to expire hapaxes.

I don't want to expire a hapax if it's been used recently in *scoring*.
Message times can't distinguish used from unused features.  If you're doing
train-on-everything (with or without whole-msg expiration), a hapax used in
scoring becomes a non-hapax the first time it's used in scoring.  For
mistake/unsure training, a hapax used in scoring remains a hapax if the
message being scored ends up correctly classified.  Hapaxes that are never
seen again also remain hapaxes.  Distinguishing used from unused requires
recording use.
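
A minimal sketch of what "recording use" could mean in practice follows.
The TokenInfo class and both helper functions are hypothetical, and the
database is modelled as a plain dict rather than the real storage layer.

```python
import time


class TokenInfo:
    """Per-token record: training counts plus a last-used timestamp."""

    def __init__(self, spamcount=0, hamcount=0, last_used=0.0):
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.last_used = last_used


def note_scoring_use(db, tokens, now=None):
    """Stamp every known token that took part in scoring a message."""
    now = time.time() if now is None else now
    for tok in tokens:
        info = db.get(tok)
        if info is not None:
            info.last_used = now


def expire_unused_hapaxes(db, max_age, now=None):
    """Delete hapaxes not used in scoring for more than max_age seconds."""
    now = time.time() if now is None else now
    doomed = [tok for tok, info in db.items()
              if info.spamcount + info.hamcount == 1
              and now - info.last_used > max_age]
    for tok in doomed:
        del db[tok]
    return doomed
```

The cost is one timestamp per token and a database write on every scoring
run, which is the price of telling a hapax that keeps getting used from
one that was never seen again.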

Followups set to spambayes-dev@python.org, as this speculative stuff really
doesn't belong on the general spambayes list.


From popiel at wolfskeep.com  Mon Dec 22 00:40:35 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 00:40:40 2003
Subject: [spambayes-dev] RE: How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Mon,
	22 Dec 2003 00:24:25 EST."
	<LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net> 
Message-ID: <20031222054035.EB7672DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> Actually, there have been experiments done (by me) with expiry of
>> whole messages.
>
>Yes.  By "the project" having experience I mean controlled tests run by
>several across their own email mix, using exactly the same strategy, with
>reporting and analysis and all that good stuff.  We've done little of that
>(as a group) over the last year.

Ah.  Yes, that hasn't happened... I've been as lax as most folks
with regard to trying to replicate other people's testing, too. :-(

>> No... just find the most recent message that the token appeared in,
>> which would be a quick search through a few message times.  A really
>> quick search if you're only looking to expire hapaxes.
>
>I don't want to expire a hapax if it's been used recently in *scoring*.

*blink* *blink*  Oh, right, you don't train on everything like I do.
Sometimes I forget. ;-)

- Alex

From gbrown at alumni.caltech.edu  Mon Dec 22 11:49:23 2003
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon Dec 22 11:49:30 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
Message-ID: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn>

I fear my email box is seeing a reliable Spam attack on Bayesian filters,
starting in the past week: the tweaking of spam tokens by repeating
characters.

If spammers use 0-3 repetitions of each letter, a spam token like
"investment" can be spelled 4^10 (a million) different ways.  I don't want
to suffer a million spam messages to train my filter for this one word.

 
A simple solution would be to eliminate character repetitions in the spam
database.  This produces 163 ambiguities out of the 25143 words in the
Solaris /usr/dict/words list of words in the English language, but probably
none of these are spam tokens.  I've appended a list of the ambiguous tokens
below.  For example, "be" represents "be" and "bee".
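
A rough sketch of the proposed normalization, assuming it is done with a regular expression that collapses any run of a repeated character to a single character. The function names are hypothetical, and nothing here is taken from the spambayes tokenizer:

```python
import re
from collections import defaultdict

# Any character followed by one or more copies of itself.
_REPEAT = re.compile(r"(.)\1+")


def collapse_repeats(token):
    """Collapse runs of a repeated character: 'siickkk' -> 'sick'."""
    return _REPEAT.sub(r"\1", token)


def ambiguous_groups(words):
    """Return collapsed forms that more than one distinct word maps to."""
    groups = defaultdict(set)
    for word in words:
        groups[collapse_repeats(word)].add(word)
    return {key: sorted(group) for key, group in groups.items()
            if len(group) > 1}


# Each of the 10 letters of "investment" written 1 to 4 times.
spellings_of_investment = 4 ** 10
```

Note that legitimate double letters collapse too ("totallly" becomes "totaly", not "totally"), which is where ambiguities such as "be"/"bee" come from.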

 
I won't be implementing this feature myself, but I would sure like to
see it in my favorite spam filter.

 
Cheers to all the SpamBayes developers,

--Glenn

 
Alan
Alison
Barnet
Bela
Burt
De
Diane
Douglas
Eliot
Eliot
Emanuel
Gary
Godwin
Greg
Haley
Herman
Kaufman
Kenan
Liget
Lilian
Marieta
Mathews
Matson
McConel
NW
Nichols
Paterson
Philip
SE
SW
Scot
Shafer
Shepard
Simons
Wals
Whitaker
ad
advise
apointe
as
bare
bat
be
bel
below
bel
below
bet
bib
bit
bled
boby
bogy
bon
both
bred
bus
but
canister
canon
canvas
carton
chery
chose
col
coma
con
con
cop
coral
cot
desert
desicate
devise
devote
discus
divorce
dragon
drol
drop
duly
el
el
escape
fed
fel
fiance
filet
fogy
fury
gable
gal
glom
god
gripe
grove
hel
his
hop
hot
i
i
in
inbred
invite
ken
knel
later
legate
lop
lose
lot
mana
marque
mate
met
milenia
mortgage
mot
ne
non
nose
of
pal
parole
pep
pepy
per
pol
pol
pop
pose
put
red
refuge
retire
rifle
robin
rod
rot
salon
sen
shot
slop
son
sped
step
stop
tapa
ten
the
til
to
todle
tol
tor
tot
very
vi
vi
we
wed
whop
willful

 
From skip at pobox.com  Mon Dec 22 11:57:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 11:57:36 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn>
References: <01ae01c3c8ab$8e3003a0$6601a8c0@Glenn>
Message-ID: <16359.8821.262470.812169@montanaro.dyndns.org>


    Glenn> I fear my email box is seeing a reliable Spam attack on Bayesian
    Glenn> filters, starting in the past week: the tweaking of spam tokens
    Glenn> by repeating characters.

Only if "deprrravved" is a hammy word for you.  If not, then it has no
effect and other clues are used to distinguish ham from spam.  Can you post
a full set of clues for such a message?

Skip

From skip at pobox.com  Mon Dec 22 12:10:13 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 12:10:29 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEOKHOAB.tim.one@comcast.net>
References: <16355.22556.564839.561779@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCMEOKHOAB.tim.one@comcast.net>
Message-ID: <16359.9589.956992.208750@montanaro.dyndns.org>


    >> While I was messing with the received header regular expressions
    >> today I also noticed that Sendmail sometimes adds "may be forged" to
    >> a header....

    >> I'm inclined to trust sendmail on this one and just add it.  It seems
    >> like a very objective feature.

    Tim> I agree -- it's extremely unlikely to lose.  The ones to worry
    Tim> about are things spammers could inject to push things in the ham
    Tim> direction, but they're not gonna get far forging "may be forged"
    Tim> unless I have a *very* weird idea of ham <wink>.

I just checked in tokenizer.py with this change.  Note that it's guarded by
options["Tokenizer", "mine_received_headers"].

Skip

    Tim> I noticed this in the headers of a spam today:

    Tim> Received: from shawmail-cg-shawcable-net
    Tim>        (c-24-9-163-244.client.comcast.net[24.9.163.244](untrusted sender))
    Tim>        by rwcrmxc11.comcast.net (rwcrmxc11) with SMTP
    Tim>        id <20031220054919r1100n4pj1e>; Sat, 20 Dec 2003 05:49:20 +0000

    Tim> It's the "(untrusted sender)" part that's interesting.  I'd suggest
    Tim> *not* folding that in with "may be forged", though.  There probably
    Tim> aren't a lot of strings of this nature, so the database burden
    Tim> should be trivial, and I *bet* different strings will prove to have
    Tim> different spamprobs.

You're probably right.  In this case it may just be that an ident lookup
failed (many servers don't run identd), so the assertion that the message is
spam would be much weaker.

Poking around Google a bit suggests "(untrusted sender)" is something
specific to Comcast.  I'm happy to add it if you would like, but in the mail
I've saved it actually seems to turn up a bit more in ham (six messages)
than in spam (one message) and not at all in my current training database.
All such lines also match "client2?\.attbi\.com".
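
As a hedged illustration of the idea discussed above (one distinct token per annotation rather than folding them together), here is a small sketch. The token names, the phrase list, and the function itself are assumptions made for illustration; this is not the code that was checked in to tokenizer.py:

```python
# Hypothetical list of annotations a mail server may add to Received lines.
_ANNOTATIONS = ("may be forged", "untrusted sender")


def received_annotation_tokens(received_header, mine_received_headers=True):
    """Yield one token per known annotation found in a Received header.

    mine_received_headers stands in for the option that guards this
    tokenizing; when it is false, nothing is generated.
    """
    if not mine_received_headers:
        return
    text = received_header.lower()
    for phrase in _ANNOTATIONS:
        if phrase in text:
            # A separate token per phrase lets each earn its own spamprob.
            yield "received:" + phrase
```

Keeping the phrases separate costs only a handful of database entries and lets training decide whether "(untrusted sender)" leans toward ham or spam.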

Skip

From gbrown at alumni.caltech.edu  Mon Dec 22 12:28:24 2003
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon Dec 22 12:28:35 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <16359.8821.262470.812169@montanaro.dyndns.org>
Message-ID: <01bb01c3c8b1$01927d50$6601a8c0@Glenn>

> Only if "deprrravved" is a hammy word for you.  If not, then it has no
> effect and other clues are used to distinguish ham from spam.

Well, after 10K+ spam messages of training, the approach gets a 1% spam
score for the following otherwise obvious spam.

> Can you post a full set of clues for such a message?

Done.

--Glenn

Spam Score: 1% (0.00944801)


word                                spamprob         #ham  #spam
'*H*'                               0.981104            -      -
'*S*'                               1.71005e-007        -      -
"i've"                              0.0338633         513     64
'guys'                              0.0546273         191     39
'wet'                               0.0661982          93     23
'stan'                              0.097452            7      2
'think'                             0.099588          845    336
'header:Received:2'                 0.106663          534    229
'amazing.'                          0.118307           18      8
'got'                               0.123214          507    256
'x-mailer:qualcomm windows eudora version 6.0.0.22' 0.125758    38     19
'it.'                               0.154291          570    374
'from:addr:aloktorvaldis.com'       0.155172            1      0
'from:addr:ger'                     0.155172            1      0
'from:name:detractors m. tinnier'   0.155172            1      0
'heyyouguys,'                       0.155172            1      0
'huunngg'                           0.155172            1      0
'message-id:@aloktorvaldis.com'     0.155172            1      0
'reply-to:addr:aloktorvaldis.com'   0.155172            1      0
'subject:giirrllss'                 0.155172            1      0
'subject:soak'                      0.155172            1      0
'subject:squiiirrrtt'               0.155172            1      0
'url:aloktorvaldis'                 0.155172            1      0
'hit'                               0.16835           138    100
'little'                            0.170749          412    305
'when'                              0.174352         1030    783
'reply-to:no real name:2**0'        0.181731         2589   2071
'well'                              0.190478          390    330
'splash'                            0.203534            5      4
'what'                              0.206076         1117   1044
'skip:o 30'                         0.206761           23     21
'they'                              0.210826         1024    985
'seen'                              0.211543          372    359
'into'                              0.220027          586    595
'tiny'                              0.220335           32     32
'very'                              0.220758          705    719
'then'                              0.234163          596    656
'that'                              0.257805         2233   2794
'have'                              0.264403         2222   2877
'has'                               0.264422         1086   1406
'just'                              0.264724         1257   1630
'over'                              0.265763          837   1091
"won't"                             0.269641          153    203
'how'                               0.273075          771   1043
"don't"                             0.284679          824   1181
'reply-to:addr:stacy'               0.290906            1      1
'faces.'                            0.292799            3      4
"it's"                              0.299524          726   1118
'totally'                           0.308229           47     75
'also'                              0.327243          665   1165
'love'                              0.329743          174    308
'will'                              0.330548         1301   2314
'right'                             0.33105           399    711
'want'                              0.336363          636   1161
'ever'                              0.341633          273    510
'see'                               0.343879          847   1599
'truly'                             0.348488           36     69
'their'                             0.356169          528   1052
'most'                              0.362206          485    992
'with'                              0.367376         1955   4090
'net.'                              0.371879           10     21
'the'                               0.378413         3332   7308
'taken'                             0.380944          112    248
'kk0kks'                            0.389062            1      2
'masssive'                          0.389062            1      2
'sluttttsss'                        0.389062            1      2
'squirting'                         0.389062            1      2
'ads'                               0.3959             34     80
'promise'                           0.397485           14     33
'are'                               0.399717         1652   3963
'dripping'                          0.606785            1      6
'simply'                            0.621827           86    510
'believe'                           0.622983          140    834
'subject:and'                       0.634866          115    721
'anymore'                           0.647478            7     47
'address'                           0.652343          124    839
'subject:the'                       0.653141          147    998
'yours'                             0.67376            23    172
'jusdt'                             0.689717            1      9
'squirt'                            0.689717            1      9
'subject:place'                     0.689717            1      9
'skip:d 30'                         0.74621            26    277
'here'                              0.748105          579   6197
'url:html'                          0.760285          284   3247
'subject:that'                      0.777933            8    103
'url:face'                          0.781771            1     15
'subject:all'                       0.78701             9    122
'girls'                             0.804866           11    166
'gushing'                           0.809961            1     18
'url:index'                         0.828674          117   2042
'to:name:gbrown'                    0.884536            1     33
'recieve'                           0.918175            4    170
Message Stream:


Received: from ptd-204-210-94-45.maine.rr.com [204.210.94.45] by
	alumniweb.ir.caltech.edu
	(SMTPD32-8.05) id ACDA3B000C4; Sat, 20 Dec 2003 23:33:46 -0800
Received: from aloktorvaldis.com (mail.aloktorvaldis.com [64.156.186.89])
	by ptd-204-210-94-45.maine.rr.com (Postfix) with ESMTP id E76C1C464E
	for <gbrown@alumni.caltech.edu>; Sun, 21 Dec 2003 02:33:36 -0500
Message-ID: <6.0.0.22.1.20031221023336.c0bc8684@aloktorvaldis.com>
X-Sender: centurions@mail.aloktorvaldis.com
X-Mailer: QUALCOMM Windows Eudora Version 6.0.0.22
Reply-To: stacy@aloktorvaldis.com
Date: Sun, 21 Dec 2003 02:33:36 -0500
To: Gbrown <gbrown@alumni.caltech.edu>
From: "Detractors M. Tinnier" <ger@aloktorvaldis.com>
Subject: giirrllss that squiiirrrtt and soak all over the place
MIME-Version: 1.0
Content-Type: text/plain; format=flowed
Content-Transfer-Encoding: 7bit
X-IMAIL-SPAM-STATISTICS: 1.0000
X-RCPT-TO: <gbrown@alumni.caltech.edu>
Status: U
X-UIDL: 371502929

heyyouguys,

   Gushing and squirting teeeenn girls that splash and squirt all over their
boyfriend's faces.  Simply the wetest most totally dripping pusssiesss you
have got to see on the net.  These girls are so soaking wet that you won't
believe how they squirt all over the masssive kk0kks of these very well
huunngg guys  into their tiny little cooochiees.  It's totally amazing.  And
these are the pretiest little young sluttttsss  that I've think I've ever
seen when it comes to this and this site has a phrreeeee triall to go with
it.
http://www.aloktorvaldis.com/dollar/face/index.html
You'll love what you see I just promise you.
DAcLBBIXKwQVHggXAksaCgkNDgYRRQAdHg== Also if you don't want to recieve
anymore of these ads from me then you jusdt have to hit on the address right
here just hit it and you will be taken offitwww.splasterastem.com/wanton/
and you will be taken off!

yours truly

stan


Message Tokens:

130 unique tokens

'address'
'ads'
'all'
'also'
'amazing.'
'and'
'anymore'
'are'
'believe'
"boyfriend's"
'cc:none'
'comes'
'content-type:text/plain'
'cooochiees.'
"don't"
'dripping'
'ever'
'faces.'
'from'
'from:addr:aloktorvaldis.com'
'from:addr:ger'
'from:name:detractors m. tinnier'
'girls'
'got'
'gushing'
'guys'
'has'
'have'
'header:Date:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:2'
'header:Reply-To:1'
'header:Subject:1'
'header:To:1'
'here'
'heyyouguys,'
'hit'
'how'
'huunngg'
"i've"
'into'
"it's"
'it.'
'jusdt'
'just'
'kk0kks'
'little'
'love'
'masssive'
'message-id:@aloktorvaldis.com'
'most'
'net.'
'off!'
'over'
'phrreeeee'
'pretiest'
'promise'
'proto:http'
'pusssiesss'
'recieve'
'reply-to:addr:aloktorvaldis.com'
'reply-to:addr:stacy'
'reply-to:no real name:2**0'
'right'
'see'
'seen'
'sender:none'
'simply'
'site'
'skip:d 30'
'skip:o 30'
'sluttttsss'
'soaking'
'splash'
'squirt'
'squirting'
'stan'
'subject: '
'subject:all'
'subject:and'
'subject:giirrllss'
'subject:over'
'subject:place'
'subject:soak'
'subject:squiiirrrtt'
'subject:that'
'subject:the'
'taken'
'teeeenn'
'that'
'the'
'their'
'then'
'these'
'they'
'think'
'this'
'tiny'
'to:2**0'
'to:addr:alumni.caltech.edu'
'to:addr:gbrown'
'to:name:gbrown'
'totally'
'triall'
'truly'
'url:aloktorvaldis'
'url:com'
'url:dollar'
'url:face'
'url:html'
'url:index'
'url:www'
'very'
'want'
'well'
'wet'
'wetest'
'what'
'when'
'will'
'with'
"won't"
'x-mailer:qualcomm windows eudora version 6.0.0.22'
'you'
"you'll"
'you.'
'young'
'yours'


From popiel at wolfskeep.com  Mon Dec 22 12:39:15 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 12:39:19 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: Message from "Glenn Brown" <gbrown@alumni.caltech.edu> of "Mon,
	22 Dec 2003 09:28:24 PST." <01bb01c3c8b1$01927d50$6601a8c0@Glenn> 
References: <01bb01c3c8b1$01927d50$6601a8c0@Glenn> 
Message-ID: <20031222173915.C01132DF61@cashew.wolfskeep.com>

In message:  <01bb01c3c8b1$01927d50$6601a8c0@Glenn>
             "Glenn Brown" <gbrown@alumni.caltech.edu> writes:

>> Only if "deprrravved" is a hammy word for you.  If not, then it has no
>> effect and other clues are used to distinguish ham from spam.

>> Can you post a full set of clues for such a message?
>
>Done.

>'from:addr:aloktorvaldis.com'       0.155172            1      0
>'from:addr:ger'                     0.155172            1      0
>'from:name:detractors m. tinnier'   0.155172            1      0
>'heyyouguys,'                       0.155172            1      0
>'huunngg'                           0.155172            1      0
>'message-id:@aloktorvaldis.com'     0.155172            1      0
>'reply-to:addr:aloktorvaldis.com'   0.155172            1      0
>'subject:giirrllss'                 0.155172            1      0
>'subject:soak'                      0.155172            1      0
>'subject:squiiirrrtt'               0.155172            1      0
>'url:aloktorvaldis'                 0.155172            1      0

Based on these clues, I'd say that you trained on one of these
messages as ham.  That'll certainly encourage a ham classification
for them.

What happens if you untrain and then retrain as spam?

- Alex

From tim.one at comcast.net  Mon Dec 22 12:50:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 12:50:35 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <20031222173915.C01132DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGHHPAB.tim.one@comcast.net>

[Glenn Brown]
>> 'from:addr:aloktorvaldis.com'       0.155172            1      0
>> 'from:addr:ger'                     0.155172            1      0
>> 'from:name:detractors m. tinnier'   0.155172            1      0
>> 'heyyouguys,'                       0.155172            1      0
>> 'huunngg'                           0.155172            1      0
>> 'message-id:@aloktorvaldis.com'     0.155172            1      0
>> 'reply-to:addr:aloktorvaldis.com'   0.155172            1      0
>> 'subject:giirrllss'                 0.155172            1      0
>> 'subject:soak'                      0.155172            1      0
>> 'subject:squiiirrrtt'               0.155172            1      0
>> 'url:aloktorvaldis'                 0.155172            1      0

[T. Alexander Popiel]
> Based on these clues, I'd say that you trained on one of these
> messages as ham.  That'll certainly encourage a ham classification
> for them.

Yup, looks certain -- or else Glenn makes some mighty fine distinctions
about which kinds of porn spam he *wants* to see <wink>.

This line:

'reply-to:addr:stacy'               0.290906            1      1

also tells us the database was trained on a lot more spam than ham (a token
appearing equally often in both ends up with a decidedly hammy spamprob).
Glenn, you should find that spambayes works better if you train on *less*
spam (or more ham -- the math works out best if you train on an
approximately equal number of each).  This database isn't wildly unbalanced,
but it's beyond the point where my classifier starts acting flaky.
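
The effect of imbalance on a token seen equally often in ham and spam can be sketched with a Robinson-style estimate. This is a simplified sketch, not the classifier's actual code: the strength s=0.45 and prior x=0.5 are assumed values, chosen because they reproduce the 0.155172 shown above for tokens seen once in ham and never in spam, and the training totals below are made up. The real classifier may apply further adjustments, so other figures in the clue table need not match exactly:

```python
def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Estimate a token's spam probability from raw training counts.

    hamcount/spamcount: messages of each kind containing the token.
    nham/nspam: total ham and spam messages trained on.
    s, x: assumed strength and prior for rarely seen tokens.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never seen: fall back to the prior
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)
    # Shrink the raw estimate toward the prior when evidence is thin.
    return (s * x + n * p) / (s + n)


balanced = spamprob(1, 1, nham=1000, nspam=1000)   # equal training
lopsided = spamprob(1, 1, nham=1000, nspam=3000)   # three times more spam
```

With equal training a one-ham, one-spam token is neutral; with three times as much spam as ham the same counts come out clearly hammy, because one appearance in spam is a smaller fraction of the spam corpus.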


From nobody at spamcop.net  Mon Dec 22 13:04:03 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Mon Dec 22 13:04:06 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net>

> >>> [Seth Goodman]
> >>> I know you're not arguing that, but if there were bidirectional
> >>> msg_id <-> feature_ID maps, it would be fairly easy to expire whole
> >>> messages.
> >>>
> >>> That would obviate the need to track last time seen for every token.
>
> >> [Tim Peters]
> >> Only if you don't want also to be able to expire tokens on their own.
>
> > [T. Alexander Popiel]
> > No... just find the most recent message that the token appeared in,
> > which would be a quick search through a few message times.  A really
> > quick search if you're only looking to expire hapaxes.
>
> [Tim Peters]
> I don't want to expire a hapax if it's been used recently in *scoring*.
> Message times can't distinguish used from unused features.  If
> you're doing
> train-on-everything (with or without whole-msg expiration), a
> hapax used in
> scoring becomes a non-hapax the first time it's used in scoring.  For

But for really unusual messages of the type you were concerned about, this
may only happen once a year, or so, which is too long for a hapax-expiration
scheme.

> mistake/unsure training, a hapax used in scoring remains a hapax if the
> message being scored ends up correctly classified.  Hapaxes that are never
> seen again also remain hapaxes.  Distinguishing used from unused requires
> recording use.


--------------------------------------

I'm reposting an earlier post that didn't receive any comments (poor
netiquette, I know) because I feel it's relevant to both comments made
subsequently in this thread and the question of expiring hapaxes not
recently used vs. whole messages.  I also asked for a little help getting
started to be able to test some of my own and/or other people's ideas and
would still like to do that, unless you folks would prefer otherwise.

I've noticed that hapaxes do seem to contribute to scoring when the training
set is small and I think I've seen others make similar comments.  This also
may be the case for really odd messages.  So please forgive me for the
repost, but here it is:

> [Tim Peters]
> There are messages I never want to expire.  That creates major new UI
> headaches to be doable.  I believe (but don't yet know) that expiring
> hapaxes can be done without need for user intervention, and without harm.

I hope the "without harm" part is true.  See my question two sections down.


> [Tim Peters]
> At some point, if you want to try your ideas, *try* your ideas <wink> --
> that's what Open Source is all about.  Everyone is born knowing how to
> program in Python, although most don't realize it until they try.

I admit I wasn't aware that I could program in Python since birth, but I'm
willing to take your word on that.  We all have hidden potential.  So that I
don't have to re-invent that round thing with the axle in the middle, could
someone please give me some hints as to which of the mapping features we've
discussed in this thread exist or will soon exist and where I can look for
them?  I saw on spambayes-dev that there is discussion of a new database, so
I don't want to go off on a useless fork with the present db if that comes
to pass.  Search for your inner newbie when you answer this.


> > [Seth Goodman]
> > I agree completely.  This was an important motivation for expiring a
> > whole message at a time.  Training mistakes would eventually drop out
> > of the database without user intervention.  Not that a tool to help
> > track down training mistakes wouldn't be great, but a "casual" user
> > could still make occasional mistakes and the system would recover by
> > itself.
>
> [Tim Peters]
> Without intervention, it will also expire the screaming bright-red HTML
> birthday message sent by my favorite 7-year-old niece, and when
> she's 8 the
> next one may get tagged as spam.  These are the kinds of messages I never
> want to expire.  ...

Here lies my concern.  I sincerely hope that correct classification of these
infrequent, unusual messages is not hapax-driven.  If it is, the result of
pruning infrequently-used hapaxes will be as bad as deleting the whole
message.  If that is the case, the _only_ solution will be to keep either
those hapaxes or the whole message trained forever.  Either way, I agree
this is a big UI problem without an obvious intuitive solution.

It does appear from looking at the scoring of some of my "typical" messages
that hapaxes don't contribute much, as you've said before.  Could you look
at the scoring of a couple of those special messages and tell if their
scoring would be seriously affected if the hapaxes were gone?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From skip at pobox.com  Mon Dec 22 13:09:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 13:09:53 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <01bb01c3c8b1$01927d50$6601a8c0@Glenn>
References: <16359.8821.262470.812169@montanaro.dyndns.org>
	<01bb01c3c8b1$01927d50$6601a8c0@Glenn>
Message-ID: <16359.13160.783191.328164@montanaro.dyndns.org>


Looks like you have a mistake in your training:

    Glenn> word                                spamprob         #ham  #spam
    ...
    Glenn> 'from:addr:aloktorvaldis.com'       0.155172            1      0
    Glenn> 'from:addr:ger'                     0.155172            1      0
    Glenn> 'from:name:detractors m. tinnier'   0.155172            1      0
    Glenn> 'heyyouguys,'                       0.155172            1      0
    Glenn> 'huunngg'                           0.155172            1      0
    Glenn> 'message-id:@aloktorvaldis.com'     0.155172            1      0
    Glenn> 'reply-to:addr:aloktorvaldis.com'   0.155172            1      0
    Glenn> 'subject:giirrllss'                 0.155172            1      0
    Glenn> 'subject:soak'                      0.155172            1      0
    Glenn> 'subject:squiiirrrtt'               0.155172            1      0
    Glenn> 'url:aloktorvaldis'                 0.155172            1      0

You said that message was spam, yet the above suggests you trained on it as
ham one time.  My guess is that if you untrained it, the outcome would be
unsure or spam.

Skip


From skip at pobox.com  Mon Dec 22 13:15:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 13:15:43 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net>
References: <LNBBLJKPBEHFEDALKOLCIEDOHPAB.tim.one@comcast.net>
	<MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net>
Message-ID: <16359.13512.453204.264142@montanaro.dyndns.org>


    >> [Tim Peters]
    >> I don't want to expire a hapax if it's been used recently in
    >> *scoring*.  Message times can't distinguish used from unused
    >> features.  If you're doing train-on-everything (with or without
    >> whole-msg expiration), a hapax used in scoring becomes a non-hapax
    >> the first time it's used in scoring.  For

    Seth> But for really unusual messages of the type you were concerned
    Seth> about, this may only happen once a year, or so, which is too long
    Seth> for a hapax-expiration scheme.

Under the heading of "practicality beats purity"...

If you know a given type of message is ham but is seen infrequently, train
on it twice.  That makes sure none of its tokens are hapaxes, so they are
never candidates for deletion.

Hmmm...  That violates my "never train on a message twice" dictum.

Skip

From tim.one at comcast.net  Mon Dec 22 13:35:30 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 13:35:34 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEGLHPAB.tim.one@comcast.net>

[Seth Goodman]
> But for really unusual messages of the type you were concerned about,
> this may only happen once a year, or so, which is too long for a
> hapax-expiration scheme.

Yes, and I'm aware of that.

> I'm reposting an earlier post that didn't receive any comments (poor
> netiquette, I know) because I feel it's relevant to both comments made
> subsequently in this thread and the question of expiring hapaxes not
> recently used vs. whole messages.  I also asked for a little help
> getting started to be able to test some of my own and/or other
> people's ideas and would still like to do that, unless you folks
> would prefer otherwise.

Sorry, I can't make time to reply now.  Your original message is still
sitting in my queue (actually, several of your msgs are -- you write a lot,
you know <wink>), and I'll get to it when I can.

Let's do the easy ones:

> could someone please give me some hints as to which of the mapping
> features we've discussed in this thread exist

None.  We map string features to pairs of little integers (ham count and
spam count) now, and that's all.
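
For readers new to the code base, the shape of that mapping can be sketched as a plain dictionary. This is an illustrative stand-in with hypothetical names, not the real storage classes:

```python
class WordCounts:
    """Hypothetical stand-in: token -> (hamcount, spamcount)."""

    def __init__(self):
        self.counts = {}
        self.nham = 0    # ham messages trained
        self.nspam = 0   # spam messages trained

    def train(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def untrain(self, tokens, is_spam):
        # Assumes the message was previously trained with the same label.
        self._update(tokens, is_spam, -1)

    def _update(self, tokens, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        for tok in set(tokens):  # each token counts once per message
            ham, spam = self.counts.get(tok, (0, 0))
            if is_spam:
                spam += delta
            else:
                ham += delta
            if ham == 0 and spam == 0:
                self.counts.pop(tok, None)  # forget tokens with no evidence
            else:
                self.counts[tok] = (ham, spam)
```

Nothing in this structure remembers which message a token came from or when it was last seen, which is why whole-message expiry and use-based expiry both need extra bookkeeping.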

> or will soon exist

Also none.

> and where I can look for them?

For now, somewhere over the rainbow.

> I saw on spambayes-dev that there is discussion of a new database,

Also just speculation at this time.  We "have problems" with the most-common
Berkeley back end now (there are several other database back ends you
*could* configure spambayes to use already), and mostly those threads are
trying to find ways to sidestep those problems.  "Problems" == error
messages from Berkeley saying that the database is corrupted.  It's very
unusual to see these in the Outlook addin, but it has happened.  For some
people on Linux, they seem downright common.

> so I don't want to go off on a useless fork with the present db if
> that comes to pass.

Try to say something more specific about what you want to investigate, and
you'll probably get a better answer.


From nobody at spamcop.net  Mon Dec 22 13:40:40 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Mon Dec 22 13:41:56 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <16359.13512.453204.264142@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAIEIAGPAA.nobody@spamcop.net>

>     >> [Tim Peters]
>     >> I don't want to expire a hapax if it's been used recently in
>     >> *scoring*.  Message times can't distinguish used from unused
>     >> features.  If you're doing train-on-everything (with or without
>     >> whole-msg expiration), a hapax used in scoring becomes a non-hapax
>     >> the first time it's used in scoring.  For
>
>     Seth> But for really unusual messages of the type you were concerned
>     Seth> about, this may only happen once a year, or so, which
>           is too long
>     Seth> for a hapax-expiration scheme.
>
> [Skip Montanaro]
> Under the heading of "practicality beats purity"...
>
> If you know a given type of message is ham but is seen infrequently, train
> on it twice.  That makes sure none of its tokens are hapaxes, so they are
> never candidates for deletion.

Great point.  That solves the problem for hapax expiration and unusual
messages.

> [Skip Montanaro]
> Hmmm...  That violates my "never train on a message twice" dictum.

Since you're thinking pragmatically, don't worry about the dictum.
Presumably, you would only do this rarely, i.e. on messages the likes of
which you only expect a couple times a year.  For the Outlook version, you
would have to make a copy of the message and train on that, but it would
still solve the problem.  Just out of curiosity, does the proxy version of
SpamBayes have the same protection as the Outlook version against training
on the same msg_id twice?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From sethg at GoodmanAssociates.com  Mon Dec 22 14:06:14 2003
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Mon Dec 22 14:06:16 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGLHPAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com>

> [Tim Peters]
> Sorry, I can't make time to reply now.  Your original message is still
> sitting in my queue (actually, several of your msgs are -- you write a
> lot, you know <wink>), and I'll get to it when I can.

Sorry about that.  I'll try to keep the noise level down.

> > [Seth Goodman]
> > so I don't want to go off on a useless fork with the present db if
> > that comes to pass.
>
> [Tim Peters]
> Try to say something more specific about what you want to investigate, and
> you'll probably get a better answer.

I would like to investigate whole message expiration with different training
and expiration schemes.  From our previous discussion, it seems that the
most flexible way to approach this is by going to a system with the several
bidirectional maps implemented in the databases:  feature_id <-> token,
msg_id (+ training timestamp) <-> feature_id  and token database w/training
timestamp per entry.  Instead of training timestamp, expiration time might
be preferable.
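
To make the proposed maps concrete, here is a rough in-memory sketch.  The
class and attribute names are hypothetical, plain dicts stand in for the
database tables, and nothing here reflects existing SpamBayes code:

```python
import time

class TrainingIndex:
    """Toy version of the maps: feature_id <-> token, msg_id <-> feature_id."""

    def __init__(self):
        self.token_to_id = {}    # token -> feature_id
        self.id_to_token = {}    # feature_id -> token
        self.msg_features = {}   # msg_id -> (training timestamp, feature_ids)
        self.feature_msgs = {}   # feature_id -> set of msg_ids

    def feature_id(self, token):
        # Assign ids sequentially the first time a token is seen.
        fid = self.token_to_id.get(token)
        if fid is None:
            fid = len(self.token_to_id)
            self.token_to_id[token] = fid
            self.id_to_token[fid] = token
        return fid

    def train(self, msg_id, tokens, when=None):
        when = time.time() if when is None else when
        fids = {self.feature_id(t) for t in tokens}
        self.msg_features[msg_id] = (when, fids)
        for fid in fids:
            self.feature_msgs.setdefault(fid, set()).add(msg_id)

    def expire_before(self, cutoff):
        # Whole-message expiration: drop every message trained before cutoff.
        expired = [m for m, (ts, _) in self.msg_features.items() if ts < cutoff]
        for msg_id in expired:
            _, fids = self.msg_features.pop(msg_id)
            for fid in fids:
                self.feature_msgs[fid].discard(msg_id)
        return expired

idx = TrainingIndex()
idx.train("m1", ["free", "money"], when=100.0)
idx.train("m2", ["money", "meeting"], when=200.0)
expired = idx.expire_before(150.0)
```

Storing an expiration time instead of the training timestamp would only
change the comparison in expire_before().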

If none of this exists, I guess I need to start there.  I was hoping that
some of this might exist since you are already experimenting with hapax
expiration.  I thought I read that there was experimental code that mapped
tokens to the message_id's they were trained from, but that may have been
wishful thinking.  In any case, these are all significant database changes,
and I was afraid to go off half-cocked if the underlying database was not
going to hang around.  Any advice as to how to proceed would be appreciated.
If this is too ambitious for a first project, please help me pare it down.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From tim.one at comcast.net  Mon Dec 22 14:20:14 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 14:20:25 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16359.9589.956992.208750@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGPHPAB.tim.one@comcast.net>

[Skip Montanaro]
> ...
> Poking around Google a bit suggests "(untrusted sender)" is something
> specific to Comcast.  I'm happy to add it if you would like, but in
> the mail I've saved it actually seems to turn up a bit more in ham
> (six messages) than in spam (one message) and not at all in my
> current training database. All such lines also match
> "client2?\.attbi\.com".

It really doesn't matter whether it looks hammy or spammy to you -- each
person's classifier learns "what works" for that person's email mix.  IOW,
I'm not looking for "spam clues" here, I'm looking for potentially
interesting raw data to throw at the classifier, be that hammy or spammy or
neutral.  It's the classifier's job to *learn* what's useful, but it can
only see what we explicitly show it.

A generalization of this gimmick finds several potentially interesting
Received comments in my current little training database:

'received:(built aug  5\n 2002)' spam: 0 ham: 1
'received:(built aug  5 2002)' spam: 0 ham: 1
'received:(built mar 18 2003)' spam: 0 ham: 2
'received:(built may\n 14 2003)' spam: 0 ham: 1
'received:(built may  7 2001)' spam: 0 ham: 1
'received:(built may 13 2002)' spam: 0 ham: 3
'received:(built may 14 2003)' spam: 0 ham: 6
'received:(built nov\n 25 2002)' spam: 0 ham: 2
'received:(built nov  6 2002)' spam: 0 ham: 2
'received:(built nov 25 2002)' spam: 0 ham: 3
'received:(built nov 6\n 2002)' spam: 0 ham: 2
'received:(built sep 23\n 2002)' spam: 0 ham: 1
'received:(built sep 23 2002)' spam: 0 ham: 2
'received:(helo bala)' spam: 0 ham: 1
'received:(helo cyb)' spam: 0 ham: 1
'received:(helo gamer)' spam: 0 ham: 1
'received:(helo hp751n)' spam: 0 ham: 1
'received:(helo mailscanner)' spam: 0 ham: 1
'received:(may\n\tbe forged)' spam: 0 ham: 1
'received:(no client certificate requested)' spam: 0 ham: 3
'received:(qmail 20043 invoked from network)' spam: 0 ham: 1
'received:(qmail 20649 invoked from network)' spam: 0 ham: 1
'received:(qmail 20705 invoked from network)' spam: 0 ham: 1
'received:(qmail 29420 invoked from network)' spam: 0 ham: 1
'received:(qmail 30856 invoked from network)' spam: 0 ham: 1
'received:(qmail 59242 invoked by uid 1002)' spam: 0 ham: 1
'received:(qmail 6276 invoked by uid 99)' spam: 0 ham: 1
'received:(qmail 6378 invoked from network)' spam: 0 ham: 1
'received:(qmail 6383 invoked from network)' spam: 0 ham: 1
'received:(qmail 76214 invoked by uid 0)' spam: 0 ham: 1
'received:(qmail 94959 invoked by uid 399)' spam: 0 ham: 1
'received:(built feb 13 2003)' spam: 1 ham: 1
'received:(helo 3sfm)' spam: 1 ham: 0
'received:(helo d1e)' spam: 1 ham: 0
'received:(helo lsi)' spam: 1 ham: 0
'received:(helo s9rr4v)' spam: 1 ham: 0
'received:(helo timslaptop)' spam: 1 ham: 0
'received:(helo xtr)' spam: 1 ham: 0
'received:(qmail 13979 invoked from network)' spam: 1 ham: 0
'received:(qmail 5950 invoked by uid 500)' spam: 1 ham: 0
'received:(sasktel mail service)' spam: 1 ham: 0
'received:(smtp server)' spam: 2 ham: 1
'received:(misconfigured sender)' spam: 12 ham: 5
'received:(may be forged)' spam: 3 ham: 1
'received:(untrusted sender)' spam: 9 ham: 3

Note that one of the "may be forged" comments there was split across lines
('(may\n\tbe forged)').

That was done via adding

received_complaints_re = re.compile(r'\(\w+(?:\s+\w+)+\)')

and replacing

               if header.lower().find('may be forged') != -1:
                   yield 'received:may be forged'

with
               for x in received_complaints_re.findall(header.lower()):
                   yield 'received:' + x
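
A standalone way to try the regexp outside the tokenizer (the wrapper
function name and the sample Received header are invented for illustration):

```python
import re

received_complaints_re = re.compile(r'\(\w+(?:\s+\w+)+\)')

def received_comment_tokens(header):
    # Emit one token per parenthesized comment made of two or more words.
    for x in received_complaints_re.findall(header.lower()):
        yield 'received:' + x

# Invented header; note the comment folded across lines at the end.
sample = ("from unknown (HELO timslaptop) (192.0.2.7) "
          "by mail.example.net with SMTP; (may\n\tbe forged)")

tokens = list(received_comment_tokens(sample))
```

The bare IP address in parentheses is skipped because '.' is not matched by
\w, while the folded comment still matches because \s+ spans the newline and
tab.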

Since these feed into bigrams too, there are a lot more combinations.  Some
are purely spammy so far:

'bi:received:(untrusted sender) received:ca' spam: 3 ham: 0
'bi:received:63.240.213.250 received:(may be forged)' spam: 3 ham: 0

and some are purely hammy so far:

'bi:received:(built may 14 2003) received:172' spam: 0 ham: 5
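
For readers unfamiliar with the 'bi:' tokens, a rough sketch of how adjacent
tokens could be paired.  The helper name is hypothetical and the real bigram
option may differ in detail; only the 'bi:X Y' shape is taken from the
output above:

```python
def with_bigrams(tokens):
    # Yield every token, plus a 'bi:' token joining it to its predecessor.
    prev = None
    for tok in tokens:
        yield tok
        if prev is not None:
            yield 'bi:%s %s' % (prev, tok)
        prev = tok

stream = ['received:63.240.213.250', 'received:(may be forged)']
result = list(with_bigrams(stream))
```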


From popiel at wolfskeep.com  Mon Dec 22 14:27:27 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 14:27:30 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Seth Goodman" <nobody@spamcop.net> of "Mon,
	22 Dec 2003 12:04:03 CST."
	<MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net> 
References: <MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net> 
Message-ID: <20031222192727.874882DF61@cashew.wolfskeep.com>

In message:  <MHEGIFHMACFNNIMMBACAMEHOGPAA.nobody@spamcop.net>
             "Seth Goodman" <nobody@spamcop.net> writes:

>could someone please give me some hints as to which of the mapping features
>we've discussed in this thread exist or will soon exist and where I can look
>for them?

If we're trying to get reproducible results, then I strongly suggest
looking at the various testing frameworks we have built.  I haven't
done any stuff with maintaining the mappings, but my expire4months
regime does keep message lists for expiry.  Building a mapping on
top of that shouldn't be too difficult...

>I saw on spambayes-dev that there is discussion of a new database, so
>I don't want to go off on a useless fork with the present db if that
>comes to pass.

Again, if we're trying to get reproducible results, then I think that
the main DB and such is the wrong place to be starting.  We shouldn't
be treating just anecdotal evidence from running changed code with our
ongoing live mail feeds as the best we can do.

While the Outlook plugin has done wonders for our popularity, it seems
to have utterly destroyed our rigor.  People now typically don't have
the slightest clue how to go from their normal usage to a testing
deployment...  or at least don't know how to extract their mail from
Outlook's clutches so that they have data to work _on_.

As I don't use Outlook in any environment where I see spam, I don't
know how to write the newbie guide to fix this... if indeed it is
fixable.  I'm spoiled by doing most of my mail handling in an
environment which encourages treating mail as data to be arbitrarily
processed, instead of just viewed through a gui.

- Alex

From skip at pobox.com  Mon Dec 22 14:55:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 14:55:59 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com>
References: <LNBBLJKPBEHFEDALKOLCOEGLHPAB.tim.one@comcast.net>
	<MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com>
Message-ID: <16359.19517.682605.648156@montanaro.dyndns.org>


    Seth> I would like to investigate whole message expiration with
    Seth> different training and expiration schemes.  From our previous
    Seth> discussion, it seems that the most flexible way to approach this
    Seth> is by going to a system with the several bidirectional maps
    Seth> implemented in the databases: feature_id <-> token, msg_id (+
    Seth> training timestamp) <-> feature_id and token database w/training
    Seth> timestamp per entry.  Instead of training timestamp, expiration
    Seth> time might be preferable.

I'll just toss out a thought with nothing really to back it up besides my
seat-of-the-pants experience.  You might find it easier to experiment with
different table layouts using SQL.  There are both MySQL and PostgreSQL
classifiers available (browse spambayes/storage.py).  You could add new
tables or new columns to existing tables without much fuss.  Also, hapax
expiration would be pretty simple.  (Add a last_used column, arrange for it
to get incremented whenever a row is fetched - fairly trivial with
PostgreSQL's triggers I think, then use it to expire hapaxes periodically.)
Finally, problems of multi-thread or multi-process access to the database
should go away.
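
A rough sketch of the last_used idea, using the standard library's sqlite3
as a stand-in for MySQL or PostgreSQL.  The table layout and function names
are assumptions for illustration, not the schema in spambayes/storage.py,
and last_used is updated explicitly in application code rather than by a
trigger:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE bayes (
        word      TEXT PRIMARY KEY,
        nspam     INTEGER NOT NULL DEFAULT 0,
        nham      INTEGER NOT NULL DEFAULT 0,
        last_used INTEGER NOT NULL
    )
""")

def train(word, is_spam, now):
    # col comes from a fixed pair of names, so formatting it in is safe.
    col = "nspam" if is_spam else "nham"
    conn.execute(
        "INSERT OR IGNORE INTO bayes (word, last_used) VALUES (?, ?)",
        (word, now))
    conn.execute(
        "UPDATE bayes SET %s = %s + 1, last_used = ? WHERE word = ?"
        % (col, col), (now, word))

def fetch_for_scoring(word, now):
    # Touch the row so that use in scoring counts as "recently used".
    conn.execute("UPDATE bayes SET last_used = ? WHERE word = ?", (now, word))
    return conn.execute(
        "SELECT nspam, nham FROM bayes WHERE word = ?", (word,)).fetchone()

def expire_hapaxes(cutoff):
    # Delete words trained exactly once and not used since the cutoff.
    cur = conn.execute(
        "DELETE FROM bayes WHERE nspam + nham = 1 AND last_used < ?",
        (cutoff,))
    return cur.rowcount

train("viagra", True, 100)
train("meeting", False, 100)
train("meeting", False, 110)
train("zoning", False, 100)
zoning_counts = fetch_for_scoring("zoning", 500)
removed = expire_hapaxes(300)
remaining = [r[0] for r in conn.execute("SELECT word FROM bayes ORDER BY word")]
```

Here "zoning" is a hapax but survives because it was used in scoring after
the cutoff, which is the behavior Tim asked for earlier in the thread.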

Skip

From tim.one at comcast.net  Mon Dec 22 15:10:38 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 15:10:46 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031222192727.874882DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> Again, if we're trying to get reproducible results, then I think that
> the main DB and such is the wrong place to be starting.

Right!

> We shouldn't be treating just anecdotal evidence from running changed
> code with our ongoing live mail feeds as the best we can do.

We're really not, Alex.  It's just a source of ideas to try, and nothing has
changed as a result of it (some experimental, non-default options have been
added, but that's it).

> While the Outlook plugin has done wonders for our popularity, it seems
> to have utterly destroyed our rigor.

I'm still comfortable with what's been checked in.  While there's been
massive refactoring of the code, very little has changed in how messages get
tokenized and scored.  Nothing material has changed in classifier.py, except
for removing experimental_ham_spam_imbalance_adjustment support, and there
was plenty of evidence that that gimmick hurt more than it helped, and more
so the more unbalanced training got.  It was a proven loser (since I wrote
it to begin with, I'm biased in its favor <wink>).

I did check in a few material changes to tokenizer.py over the last year
without full-scale testing.  These were all in the nature of untangling HTML
obfuscations, so that the classifier got a better idea of what the human
email reader sees, instead of tokenizing mountains of raw numeric character
entities, nonsense tags, and other coding tricks unique to HTML.  That was
driven by staring at low-scoring unsures, and identifying tricks that had no
purpose beyond disguising the rendered content.  Tests (on my own email and
on my original large test data) showed that de-obfuscating that stuff was a
pure win, so I was willing to risk that much.

I'm hard pressed to think of other default behavior that's changed.

> People now typically don't have the slightest clue how to go from
> their normal usage to a testing deployment...  or at least don't know
> how to extract their mail from Outlook's clutches so that they have
> data to work _on_.

That's for sure, and is one reason nothing else material *has* been checked
in.  Mark knows how to extract email from Outlook for usable testing, and
wrote some code to help do that, but I haven't yet had time to figure out
how it's done myself.  I'm sure very few Outlook users have.  I agree that
needs to change.  I've been speculating about lots of stuff lately, but I
have no intention of checking in any of that as default behavior without
full-blown, multi-corpus rigorous testing.

> As I don't use Outlook in any environment where I see spam, I don't
> know how to write the newbie guide to fix this... if indeed it is
> fixable.  I'm spoiled by doing most of my mail handling in an
> environment which encourages treating mail as data to be arbitrarily
> processed, instead of just viewed through a gui.

OTOH, Outlook users are spoiled by that GUI, deeply integrated with
spambayes.  It's truly a joy to use, day-to-day.  Training spambayes
effectively via the Outlook UI remains more than a bit of a puzzle, though,
and that extends in part to everyone who isn't prepared to retrain from
scratch at the drop of a pin.  There's a growing disconnect that way between
what developers are happy to do, and what "real users" are able to tolerate.
That's worth some thought too.


From skip at pobox.com  Mon Dec 22 15:29:58 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 15:30:13 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGPHPAB.tim.one@comcast.net>
References: <16359.9589.956992.208750@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCIEGPHPAB.tim.one@comcast.net>
Message-ID: <16359.21574.614482.990083@montanaro.dyndns.org>


    Tim> A generalization of this gimmick finds several potentially
    Tim> interesting Received comments in my current little training
    Tim> database:

    ...

Interesting scheme.  When I tried that I got swamped by '(qmail NNN ...'
stuff, where it appears that NNN is a process id.  To retain this in its
current form I suspect we'd have to either specifically eliminate such
features or implement hapax expiration.

    Tim> Note that one of the "may be forged" comments there was split
    Tim> across lines ('(may\n\tbe forged)').

Perhaps we should add

    header = re.sub(r'\s+', ' ', header)

to the "for header ..." loop in any case?  It seems that many other headers
get split that way.  If we're looking for features which include whitespace
we should probably normalize it.

I'm willing to tuck the more general received sifting into the tokenizer
controlled by a new experimental option.  Let me know if you want me to take
that step.

Skip

From igidon at resystemsgroup.com  Mon Dec 22 15:44:44 2003
From: igidon at resystemsgroup.com (Ira L. Gidon)
Date: Mon Dec 22 15:44:57 2003
Subject: [spambayes-dev] Duplicate E-Mails
Message-ID: <00d501c3c8cc$73d55430$1e14a8c0@ILGToshiba>

I am running on a laptop using Windows XP.

I am using Outlook 2002 SP-2 for e-mail.

 
I installed Spambayes and I started getting duplicate e-mails (same e-mail
with same date/time stamp).

It appears that there is a delay in notifying my exchange server that the
e-mail has been downloaded.

I can sometimes get 3 or 4 copies of the same e-mail. This definitely
started at the same time as the installation of Spambayes.

 
I was hoping someone can provide me with a solution.

 
Thanks!

 
Ira Gidon

 
From tim.one at comcast.net  Mon Dec 22 15:50:54 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 15:50:57 2003
Subject: [spambayes-dev] Duplicate E-Mails
In-Reply-To: <00d501c3c8cc$73d55430$1e14a8c0@ILGToshiba>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIEHPAB.tim.one@comcast.net>

[Ira L. Gidon]
> I am running on a laptop using Windows XP.
> I am using Outlook 2002 SP-2 for e-mail.
>
> I installed Spambayes and I started getting duplicate e-mails (same
> e-mail with same date/time stamp). It appears that there is a delay
> in notifying my exchange server that the e-mail has been downloaded.
> I can sometimes get 3 or 4 copies of the same e-mail. This definitely
> started at the same time as the installation of Spambayes.
>
> I was hoping someone can provide me with a solution.

Try SpamBayes -> SpamBayes Manager ... -> Advanced and check the "Enable
background filtering" box.  That cures a lot of strange Outlook symptoms,
and will be enabled by default the next time the Outlook addin is released.
I don't know whether it will cure your problem (I've never seen it happen
myself), but it's easy to try.


From popiel at wolfskeep.com  Mon Dec 22 16:09:18 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 16:09:21 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Seth Goodman" <sethg@GoodmanAssociates.com> 
	of "Mon, 22 Dec 2003 13:06:14 CST."
	<MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com> 
References: <MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com> 
Message-ID: <20031222210918.1EB742DF61@cashew.wolfskeep.com>

In message:  <MHEGIFHMACFNNIMMBACAAEICGPAA.sethg@GoodmanAssociates.com>
             "Seth Goodman" <sethg@GoodmanAssociates.com> writes:
>
>I would like to investigate whole message expiration with different training
>and expiration schemes.

Ah, in that case, definitely look at the incremental framework that
I built.  I have various training regimes that do train-on-everything
vs. mistake-only, as well as one which expires stuff based on time.
Making more regimes to do various other things should be very easy.

>From our previous discussion, it seems that the most flexible way to
>approach this is by going to a system with the several bidirectional
>maps implemented in the databases:  feature_id <-> token, msg_id (+
>training timestamp) <-> feature_id  and token database w/training
>timestamp per entry.  Instead of training timestamp, expiration time
>might be preferable.

Definite overkill.  Most of this won't be needed for any given
regime, and will instead just bloat the transient data requirements
during testing.  Just make each regime keep track of the data it
needs to do whatever it wants to do.

- Alex

From tim.one at comcast.net  Mon Dec 22 16:41:17 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 16:41:25 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16359.21574.614482.990083@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEIKHPAB.tim.one@comcast.net>

[Skip Montanaro]
> Interesting scheme.  When I tried that I got swamped by '(qmail NNN
> ...' stuff, where it appears that NNN is a process id.  To retain
> this in its current form I suspect we'd have to either specifically
> eliminate such features or implement hapax expiration.

Changing the regexp to use [a-z] instead of \w would weed out all that
stuff.  I didn't see any containing numbers that looked promising.  The
all-text ones looked interesting, though.
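
One way to write that change, as a sketch only: the exact pattern below is
an assumption (the \w classes swapped for [a-z], applied to lowercased
text), and the sample comments are taken from earlier in the thread:

```python
import re

# Same shape as before, but letters only, so any comment with digits fails.
received_words_re = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)')

samples = [
    '(qmail 20043 invoked from network)',
    '(built May 14 2003)',
    '(may be forged)',
    '(HELO timslaptop)',
    '(helo s9rr4v)',
    '(no client certificate requested)',
]

kept = [s for s in samples if received_words_re.search(s.lower())]
```

The qmail process ids and build dates drop out, and so do HELO names that
contain digits.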

> Perhaps we should add
>
>     header = re.sub(r'\s+', ' ', header)
>
> to the "for header ..." loop in any case?

There are many "for header" loops, and I'm not sure which one(s) you're
talking about here.  If you want to do this somewhere,

    header = ' '.join(header.split())

is faster.
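
A quick check of how the two spellings compare, on an invented folded
header.  They agree on interior whitespace but differ at the ends:
split/join drops leading and trailing whitespace, while re.sub leaves a
single space there:

```python
import re

folded = "from a.example (may\n\tbe forged)\n\tby b.example;  Mon, 22 Dec 2003"

via_sub = re.sub(r'\s+', ' ', folded)
via_split = ' '.join(folded.split())

# With padding at the ends the results differ by the surrounding spaces.
padded = "  " + folded + "\n"
padded_sub = re.sub(r'\s+', ' ', padded)
padded_split = ' '.join(padded.split())
```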

> It seems that many other headers get split that way.  If we're
> looking for features which include whitespace we should probably
> normalize it.

I doubt this is often a concern.  It's dangerous to make basic changes "in
general", so don't do it except where there's a specific need.  It should be
fine in Received lines.  As a counter-example, Subject line parsing *wants*
to know whether tab characters appear, and runs of multiple spaces are also
significant there.  It's irrelevant to parsing of multi-line address headers
(like To and Cc) because email.Utils.getaddresses() is already used for
those, and already hides the line structure.

> I'm willing to tuck the more general received sifting into the
> tokenizer controlled by a new experimental option.  Let me know if
> you want me to take that step.

No, I don't want another experimental option just for this.  It seems clear
enough already that "may be forged" is potentially interesting, and also
that "may be forged" isn't the only potentially interesting string.  We
should suck up a bunch of them, or none of them.  The classifier will learn
which are and aren't useful, and it sure looks like that will vary depending
on user (that one of my ISPs is Comcast and one of yours isn't is not a good
reason to poo-poo the clues Comcast leaves behind <wink>).


From popiel at wolfskeep.com  Mon Dec 22 16:54:35 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 16:54:39 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Mon,
	22 Dec 2003 15:10:38 EST."
	<LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one@comcast.net> 
Message-ID: <20031222215435.CE64A2DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>
>> We shouldn't be treating just anecdotal evidence from running changed
>> code with our ongoing live mail feeds as the best we can do.
>
>We're really not, Alex.  It's just a source of ideas to try, and nothing has
>changed as a result of it (some experimental, non-default options have been
>added, but that's it).

You're right, and I'm being overly emphatic.  The significant work
over the last year has almost entirely been with the Outlook integration;
the original core of the project has gone fairly dormant.  For UI stuff,
you don't need rigor (unless you're Don Norman), and I've been letting
some of that bleed over into my perception of all of the recent progress.

>I did check in a few material changes to tokenizer.py over the last year
>without full-scale testing.  These were all in the nature of untangling HTML
>obfuscations, so that the classifier got a better idea of what the human
>email reader sees, instead of tokenizing mountains of raw numeric character
>entities, nonsense tags, and other coding tricks unique to HTML.

These are definitely good changes.  The header whitespace
normalization that's been suggested in a separate thread may
also be, though I'm less certain of that one; since the vast
majority of people don't look at the headers, I suspect there's
a greater chance of something quirky but useful there that'd be
obscured by the normalization.  (I suppose it depends on whether
intermediate mailservers unwrap and rewrap the headers...)

>> I'm spoiled by doing most of my mail handling in an
>> environment which encourages treating mail as data to be arbitrarily
>> processed, instead of just viewed through a gui.
>
>OTOH, Outlook users are spoiled by that GUI, deeply integrated with
>spambayes.  It's truly a joy to use, day-to-day.  Training spambayes
>effectively via the Outlook UI remains more than a bit of a puzzle, though,
>and that extends in part to everyone who isn't prepared to retrain from
>scratch at the drop of a pin.  There's a growing disconnect that way between
>what developers are happy to do, and what "real users" are able to tolerate.
>That's worth some thought too.

I don't use a gui at all for my normal mail, so I really don't
know what it would be like to have spambayes 'tightly integrated'.
As it is, I've got a couple folders set up for spambayes use, and
some procmail stuff... but any retraining or corrections only take
effect in my nightly rebuild-the-database-from-scratch, unless I
go out of my way to kick off a rebuild early.

I think the biggest disconnect by far is whether or not people are
willing to keep every single piece of mail they get for months or
years at a time.  That's what I'm doing now... but I think I can
count the number of people who do that on one hand.

The next test that I'm actually interested in doing is a comparison
between training on everything and training on everything that isn't
1.00 or 0.00 (rounded).  I may post a regime for that shortly.
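
The selection rule for that comparison might look something like the sketch
below.  The function name is hypothetical and this is not the code of any
checked-in regime; it only illustrates the "isn't 1.00 or 0.00 (rounded)"
test:

```python
def should_train_nonedge(score):
    # Skip messages whose score rounds to exactly 0.00 or 1.00.
    return round(score, 2) not in (0.0, 1.0)

scores = [0.0004, 0.03, 0.5, 0.97, 0.9991]
trained = [s for s in scores if should_train_nonedge(s)]
```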

- Alex

From skip at pobox.com  Mon Dec 22 16:58:59 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 16:59:09 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEIKHPAB.tim.one@comcast.net>
References: <16359.21574.614482.990083@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCEEIKHPAB.tim.one@comcast.net>
Message-ID: <16359.26915.135330.791705@montanaro.dyndns.org>


    Tim> Changing the regexp to use [a-z] instead of \w would weed out all
    Tim> that stuff. 

I'll give that a try.  Thanks.

    >> Perhaps we should add
    >> 
    >> header = re.sub(r'\s+', ' ', header)
    >> 
    >> to the "for header ..." loop in any case?

    Tim> There are many "for header" loops, and I'm not sure which one(s)
    Tim> you're talking about here.  If you want to do this somewhere,

    Tim>     header = ' '.join(header.split())

    Tim> is faster.

Okay.  I was just referring to the loop over the Received headers in the
section of code we've been messing with.

    >> I'm willing to tuck the more general received sifting into the
    >> tokenizer controlled by a new experimental option.  Let me know if
    >> you want me to take that step.

    Tim> No, I don't want another experimental option just for this.  It
    Tim> seems clear enough already that "may be forged" is potentially
    Tim> interesting, and also that "may be forged" isn't the only
    Tim> potentially interesting string.  We should suck up a bunch of them,
    Tim> or none of them.  The classifier will learn which are and aren't
    Tim> useful, and it sure looks like that will vary depending on user
    Tim> (that one of my ISPs is Comcast and one of yours isn't is not a
    Tim> good reason to poo-poo the clues Comcast leaves behind <wink>).

Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
sender)".  I posted a note to comp.mail.misc asking for equivalents to "(may
be forged)" for other MTAs.  I'll see if anything interesting turns up which
warrants investigation.

Skip


From tim.one at comcast.net  Mon Dec 22 17:28:33 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 17:28:38 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16359.26915.135330.791705@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>

[Skip Montanaro]
> Okay.  I was just referring to the loop over the Received headers in
> the section of code we've been messing with.

Cool!  The line structure clearly doesn't do anything except get in the way
for us there.

> ...
> Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
> sender)".  I posted a note to comp.mail.misc asking for equivalents
> to "(may be forged)" for other MTAs.  I'll see if anything
> interesting turns up which warrants investigation.

Don't you think this is a "stupid beats smart" kind of thing?  I do.
Besides those strings, "(no client certificate requested)" is 100%
correlated with ham for me now, and "(misconfigured sender)" is curiously
mixed.  I don't know who's generating them, but after weeding out the ones
containing digits there are so few remaining I don't give a rip.  MTAs will
change over time, MTAs in other countries may use different words, spammers
trying to forge Received lines are (if history is any guide) quite likely to
screw up small details ... the classifier will learn all this on its own,
provided it's not blinded to the raw data by a presumption that we know in
advance what will and won't be useful.

be-stupid-be-happy<wink>-ly y'rs  - tim


From skip at pobox.com  Mon Dec 22 18:13:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 18:13:54 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>
References: <16359.26915.135330.791705@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>
Message-ID: <16359.31400.650350.732281@montanaro.dyndns.org>


    >> Okay, I'll leave "(may be forged)" in and add Comcast's "(untrusted
    >> sender)".

    Tim> Don't you think this is a "stupid beats smart" kind of thing?  

For the moment I'd like to at least make a passing stab at understanding
what those phrases mean (or at least what generates them).

If anyone else would like to generate some raw data, you could run something
like this:

    from spambayes.mboxutils import getmbox
    import re, pprint
    d = {}
    for msg in getmbox("<Directory Full of Mail>"):
      hdrs = msg.get_all("received", ())
      for hdr in hdrs:
        for hit in pat.findall(' '.join(hdr.split())):
          d[hit] = d.get(hit,0)+1
    l = [(d[k], k) for k in d if d[k] > 2]
    l.sort()
    pprint.pprint(l)

using a relatively recent cvs checkout (one that has the more general
definition of getmbox()).  The conditional in the list comprehension is just
to trim the output to a reasonable size.  Using a couple training databases I
get:

    [(3, '(HELO bean)'),
     (3, '(HELO ckalin)'),
     (3, '(HELO default)'),
     (3, '(HELO laptop)'),
     (3, '(No client certificate requested)'),
     (3, '(authenticated user wgmachado)'),
     (3, '(may be fabricated)'),
     (4, '(HELO jim)'),
     (4, '(HELO vaio)'),
     (4, '(Postfix MTA)'),
     (4, '(account dave HELO nefarious)'),
     (4, '(verified OK)'),
     (5, '(HELO there)'),
     (6, '(HELO lion)'),
     (7, '(HELO bogdanm)'),
     (8, '(HELO opus)'),
     (8, '(misconfigured sender)'),
     (15, '(NEW ZEALAND STANDARD TIME)'),
     (15, '(untrusted sender)'),
     (17, '(HELO localhost)'),
     (18, '(from localhost)'),
     (19, '(SMTP Server)'),
     (26, '(MET DST)'),
     (28, '(NEW ZEALAND DAYLIGHT TIME)'),
     (435, '(may be forged)')]

I am really starting to worry about those kiwis.  Are these header phrases
part of their master plan for world domination?  Tom Ridge just raised our
alert level in the US to "orange".  Is there a correlation?  Do you think I
should call 9-1-1?

    Tim> be-stupid-be-happy<wink>-ly y'rs  - tim

Every time I try that I'm happy until Ellen hits me with a 2-by-4.  Then my
head hurts like hell for about three days.  <wink>

Skip

From popiel at wolfskeep.com  Mon Dec 22 18:35:05 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 22 18:35:34 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	of "Mon, 22 Dec 2003 13:54:35 PST."
	<20031222215435.CE64A2DF61@cashew.wolfskeep.com> 
References: <LNBBLJKPBEHFEDALKOLCIEHFHPAB.tim.one@comcast.net>
	<20031222215435.CE64A2DF61@cashew.wolfskeep.com> 
Message-ID: <20031222233505.608672DF61@cashew.wolfskeep.com>

In message:  <20031222215435.CE64A2DF61@cashew.wolfskeep.com>
             "T. Alexander Popiel" <popiel@wolfskeep.com> writes:
>
>The next test that I'm actually interested in doing is a comparison
>between training on everything and training on everything that isn't
>1.00 or 0.00 (rounded).  I may post a regime for that shortly.

Regime 'nonedge' is now checked in for this.  I'll be running the
tests with it shortly.
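
A minimal sketch of the selection rule described above, assuming a
classifier score in the range 0.0 to 1.0.  The function names are
hypothetical and this is not the checked-in 'nonedge' regime code, only an
illustration of "train on everything that isn't 1.00 or 0.00 (rounded)":

```python
def is_edge(score, ndigits=2):
    """Return True if the score rounds to exactly 0.00 or 1.00."""
    return round(score, ndigits) in (0.0, 1.0)


def nonedge_should_train(score):
    """Train on a message only when its score is not at either extreme."""
    return not is_edge(score)


if __name__ == "__main__":
    for score in (0.001, 0.02, 0.5, 0.9999):
        print(score, nonedge_should_train(score))
```

The idea is that messages the classifier is already certain about add
little, so only the less certain ones are fed back into training.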

- Alex

From richie at entrian.com  Mon Dec 22 18:57:27 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec 22 18:57:39 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16359.31400.650350.732281@montanaro.dyndns.org>
References: <16359.26915.135330.791705@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>
	<16359.31400.650350.732281@montanaro.dyndns.org>
Message-ID: <ebveuv8ncupf0trqhgs47p2o1ges2lnc53@4ax.com>


> If anyone else would like to generate some raw data

Your script didn't define 'pat' - I've assumed you meant:

pat = re.compile(r'\(\w+(?:\s+\w+)+\)')

Here's what I get from my corpus of 20,000 verified spams:

[(3, '(HELO 0j3x2or)'),
 (3, '(HELO 2vqmm)'),
 (3, '(HELO 3bn0dn2)'),
 (3, '(HELO 3frty7)'),
 (3, '(HELO 6qzmi3)'),
 (3, '(HELO QRJATYDI)'),
 (3, '(HELO ben)'),
 (3, '(HELO d9vyix)'),
 (3, '(HELO ic6nlfq)'),
 (3, '(HELO laabud)'),
 (3, '(HELO ojeudcb)'),
 (3, '(HELO pebbyrl)'),
 (3, '(HELO pm9he0)'),
 (3, '(HELO r26)'),
 (3, '(HELO richie)'),
 (3, '(HELO vzjqt6x)'),
 (3, '(HELO xhz5j)'),
 (3, '(HELO yu5s)'),
 (3, '(untrusted sender)'),
 (4, '(built Aug 19 2002)'),
 (4, '(built May 7 2001)'),
 (6, '(HELO kos)'),
 (6, '(built Jul 28 2003)'),
 (6, '(built Oct 18 2002)'),
 (7, '(built Feb 21 2002)'),
 (8, '(HELO localhost)'),
 (9, '(built Sep 8 2003)'),
 (11, '(HELO pm69)'),
 (12, '(built Feb 13 2003)'),
 (15, '(HELO pm65)'),
 (18, '(built Mar 18 2003)'),
 (21, '(built May 14 2003)'),
 (27, '(SMTP Server)'),
 (149, '(may be forged)')]

And these from the 12,000 or so messages in the spambayes and spambayes-dev
archives - not 100% spam-free, but very very nearly:

[(3, '(HELO GR43)'),
 (3, '(HELO WPWD0038)'),
 (3, '(HELO diffy2)'),
 (3, '(HELO gamer)'),
 (3, '(built Jul 12 2002)'),
 (4, '(HELO jimws)'),
 (4, '(HELO localhost)'),
 (6, '(HELO dj2klap)'),
 (6, '(built Feb 21 2002)'),
 (6, '(built Sep 8 2003)'),
 (7, '(userid 1)'),
 (8, '(EHLO localhost)'),
 (8, '(MET DST)'),
 (8, '(No client certificate requested)'),
 (8, '(SquirrelMail authenticated user gaza)'),
 (8, '(built Jul 28 2003)'),
 (9, '(0 bits)'),
 (11, '(HELO STRIPER)'),
 (11, '(built Jan 23 2003)'),
 (11, '(built Oct 18 2002)'),
 (11, '(sSMTP sendmail emulation)'),
 (13, '(HELO jim)'),
 (13, '(SMTP Server)'),
 (16, '(built Nov 6 2002)'),
 (21, '(built Nov 25 2002)'),
 (26, '(HELO striper)'),
 (27, '(built Jul 29 2002)'),
 (28, '(built Jan 7 2003)'),
 (33, '(misconfigured sender)'),
 (34, '(userid 4)'),
 (35, '(HELO lion)'),
 (51, '(may be forged)'),
 (59, '(built Feb 13 2003)'),
 (86, '(built May 14 2003)'),
 (99, '(built Sep 23 2002)'),
 (100, '(built Mar 18 2003)'),
 (101, '(built May 13 2002)'),
 (158, '(untrusted sender)'),
 (364, '(built Aug 5 2002)')]

So "(may be forged)" would be a weak spam clue for me, while "(untrusted
sender)" would be a strong ham clue - but 133 of those 158 are from Tim...
Even taking Tim out of the equation, it's 25-to-3 in favour of ham.  The
other 25 are from maybe a dozen other people.  Ah - all are either
attbi.com or comcast.net.  Here's an example of an attbi.com one:

Received: from hal2
	(h00e01840da57.ne.client2.attbi.com[24.91.108.212](untrusted sender))
	by attbi.com (rwcrmhc11) with SMTP
	id <2003061814314101300an5bve>; Wed, 18 Jun 2003 14:31:42 +0000
Message-ID: <ENELLFEIPIANCGOIGFOEOEMIDPAA.RCaro@CMC.us>

Make of all that what you will.
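
For a rough feel for the numbers above, the counts can be turned into an
unsmoothed spam ratio.  This is only a back-of-the-envelope sketch: it
treats each phrase occurrence as one message, takes "12,000 or so" as
exactly 12,000, and is not the calculation SpamBayes itself performs.  The
function name is made up for illustration.

```python
def raw_spam_ratio(spam_hits, n_spam, ham_hits, n_ham):
    """Unsmoothed ratio: spam rate / (spam rate + ham rate)."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    return spam_rate / (spam_rate + ham_rate)


N_SPAM, N_HAM = 20000, 12000  # corpus sizes quoted above (ham is approximate)

forged = raw_spam_ratio(149, N_SPAM, 51, N_HAM)
untrusted = raw_spam_ratio(3, N_SPAM, 158, N_HAM)
untrusted_no_tim = raw_spam_ratio(3, N_SPAM, 25, N_HAM)

if __name__ == "__main__":
    print("(may be forged)            %.3f" % forged)
    print("(untrusted sender)         %.3f" % untrusted)
    print("(untrusted sender), no Tim %.3f" % untrusted_no_tim)
```

A value near 0.5 is a weak clue either way; values near 0 point towards
ham and values near 1 towards spam.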

-- 
Richie Hindle
richie@entrian.com


From gbrown at alumni.caltech.edu  Mon Dec 22 20:31:43 2003
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Mon Dec 22 20:32:17 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGHHPAB.tim.one@comcast.net>
Message-ID: <01ff01c3c8f4$8760f680$6601a8c0@Glenn>

> [T. Alexander Popiel]
> > Based on these clues, I'd say that you trained on one of these
> > messages as ham.  That'll certainly encourage a ham classification
> > for them.
> 
> Yup, looks certain -- or else Glenn makes some mighty fine distinctions
> about which kinds of porn spam he *wants* to see <wink>.

I had "recovered from spam" that very message before scoring it and sending
the output.  My intention was to remove the message from the "spam" db,
but I forgot it moved the message to "inbox" instead of "junk suspects".

I'm sure this SNAFU effectively killed this thread, but the meme is planted.
If character repetition attacks become a problem, time will tell, and the
solution is easy...

--Glenn


From skip at pobox.com  Mon Dec 22 21:17:07 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 22 21:17:18 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <ebveuv8ncupf0trqhgs47p2o1ges2lnc53@4ax.com>
References: <16359.26915.135330.791705@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>
	<16359.31400.650350.732281@montanaro.dyndns.org>
	<ebveuv8ncupf0trqhgs47p2o1ges2lnc53@4ax.com>
Message-ID: <16359.42403.517834.642502@montanaro.dyndns.org>


    Richie> Your script didn't define 'pat' - I've assumed you meant:

    Richie> pat = re.compile(r'\(\w+(?:\s+\w+)+\)')

Whoops.  I was cutting-n-pasting from an interpreter session.  'pat' was
actually

    pat = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)', re.I)

but yours is close enough.  Thanks for the input/output.
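
For anyone who wants to repeat this, here is a self-contained sketch of the
counting loop with both patterns defined.  The loop over Received: headers
and the sample message are assumptions made for illustration (the original
script read messages via getmbox()); count_phrases, PAT_SKIP and PAT_RICHIE
are hypothetical names.

```python
import re
from collections import Counter
from email import message_from_string

# Letters only, case-insensitive (the pattern from the interpreter session).
PAT_SKIP = re.compile(r'\([a-z]+(?:\s+[a-z]+)+\)', re.I)
# \w also accepts digits and underscores (Richie's reconstruction).
PAT_RICHIE = re.compile(r'\(\w+(?:\s+\w+)+\)')


def count_phrases(messages, pat, min_count=3):
    """Count parenthesized multi-word phrases in Received: headers."""
    counts = Counter()
    for msg in messages:
        for hdr in msg.get_all('received', []):
            # Collapse folded header whitespace before matching.
            for hit in pat.findall(' '.join(hdr.split())):
                counts[hit] += 1
    return sorted((n, phrase) for phrase, n in counts.items()
                  if n >= min_count)


SAMPLE = """\
Received: from foo (HELO jim) (1.2.3.4)
\tby example.com (built Aug 19 2002); Mon, 22 Dec 2003 18:00:00 +0000
Received: from bar.example.net (may be forged)
\tby example.com; Mon, 22 Dec 2003 18:00:01 +0000
Subject: test

body
"""

if __name__ == "__main__":
    msgs = [message_from_string(SAMPLE)]
    print(count_phrases(msgs, PAT_SKIP, min_count=1))
    print(count_phrases(msgs, PAT_RICHIE, min_count=1))
```

Because \w matches digits, the second pattern also picks up phrases such as
"(built Aug 19 2002)" and HELO names containing digits, which the
letters-only pattern skips.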

    Richie> Here's what I get from my corpus of 20,000 verified spams:

    ...
    Richie>  (3, '(untrusted sender)'),
    ...
    Richie>  (149, '(may be forged)')]

    Richie> And these from the 12,000 or so messages in the spambayes and
    Richie> spambayes-dev archives - not 100% spam-free, but very very
    Richie> nearly:

    ...
    Richie>  (51, '(may be forged)'),
    ...
    Richie>  (158, '(untrusted sender)'),
    ...

    Richie> "(untrusted sender)".... Ah - all are either attbi.com or
    Richie> comcast.net.  Here's an example of an attbi.com one:

Yup, this tag is almost certainly added by Comcast's MTA (they bought AT&T's
cable internet business not that long ago).

It's interesting that you seem to have a lot of HELO's with the same value.
Frequent correspondents perhaps?  I don't see that many HELO's (some from
localhost).  Are they generated close to your machine (in a late Received:
header)?

Skip


From tim.one at comcast.net  Mon Dec 22 21:44:12 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 22 21:44:16 2003
Subject: [spambayes-dev] siickkk and deprrravved stufff totallly grossssse
In-Reply-To: <01ff01c3c8f4$8760f680$6601a8c0@Glenn>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEJOHPAB.tim.one@comcast.net>

[Glenn Brown]
> I had "recovered from spam" that very message before scoring it and
> sending the output.  My intention was to remove the message from
> the "spam" db, but I forgot it moved the message to "inbox" instead
> of "junk suspects".

Ah!  You're still well-advised to better balance your training data.  The
imbalance now is hurting you.

> I'm sure this SNAFU effectively killed this thread, but the meme is
> planted. If character repetition attacks become a problem, time will
> tell, and the solution is easy...

That can't be known without testing, a great many tokens aren't "words" at
all, and SpamBayes isn't limited to English even if they were.  IOW, it may
or may not prove an effective gimmick, but nobody can claim to know one way
or the other without testing.

There are mnay, many ohter wyas to obcsure werds t00, but they all have in
comon that they make the s p a m m e r luk like an 1d1ot, and so cut
response rate.


From mhammond at skippinet.com.au  Tue Dec 23 01:29:25 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 23 01:29:43 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
Message-ID: <001801c3c91e$1cde2bf0$2c00a8c0@eden>

Hi all,

I have just uploaded an installer for a new experimental binary of
SpamBayes.  This binary includes *both* the Outlook addin and the sb_server
applications.  The installer attempts to detect the most appropriate one to
install.

Everything is built from CVS sources as of today.  Hopefully, this will mean
the Outlook addin has a number of bugs fixed over the 0.8 release.  However,
it is possible there are a number of bugs *not* in 0.8, and even the
possibility it will not work at all for many people (as this is released
with different 'python->.exe' technology than previous versions).

The sb_server applications all seem to work fine too, so non-Outlook
users are also encouraged to try this version.  Note that it comes with
almost no documentation (as there is none!) and that this is the first
release of such a binary, so this too is bleeding edge.

Thus, only brave people willing to test out stuff with almost no release
notes should try it :)  To further dissuade you, I am leaving for a week or
so holiday, and will not be in a position to respond to any mail or bugs
relating to this build. That said, it works well for me and the testing I
have done on a number of machines.

If anyone is keen, please visit
http://starship.python.net/crew/mhammond/spambayes/

Happy holidays!

Mark.


From skip at pobox.com  Tue Dec 23 10:43:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 23 10:44:03 2003
Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier
	assumptions?
Message-ID: <16360.25269.160738.272779@montanaro.dyndns.org>


The comment for DBDictClassifier._wordinfoset says:

    # "Singleton" words (i.e. words that only have a single instance)
    # take up more than 1/2 of the database, but are rarely used
    # so we don't put them into the wordinfo cache, but write them
    # directly to the database
    # If the word occurs again, then it will be brought back in and
    # never be a singleton again.
    # This seems to reduce the memory footprint of the DBDictClassifier by
    # as much as 60%!!!  This also has the effect of reducing the time it
    # takes to store the database

With the recent testing of bigrams the clause "but are rarely used" would
seem to be at least partially false.  I'm not too concerned about memory
footprint of the classifier, since I have lots of memory and use
sb_filter.py, not one of the long-running servers or plugins.  I also wonder
about the contention that it reduces the database store time.  It's probably
true that the time spent at shutdown is shorter, but that time has been
amortized over the entire runtime of the program.

Perhaps we should reexamine the caching in DBDictClassifier.  I would like
it to be able to inherit a bit more functionality from its base class.  If
the assumptions it makes aren't entirely accurate, much of the extra work
maintaining caches might be avoided.

Skip

From nobody at spamcop.net  Tue Dec 23 11:40:32 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Tue Dec 23 11:40:34 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <16359.19517.682605.648156@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net>

Thanks to all for replying.  However, I am still a bit confused by the
advice (or like we say in sunny Wisconsin,  Uff Dah!).  Skip suggests trying
out MySQL or PostgreSQL to implement the various bidirectional mappings (I
assume this means trash the existing database and create new ones).  Alex
suggests that bidirectional maps are overkill and not to bother.  Alex also
has some scripts that do much of what I am trying to do, but it sounds like
they will only work in a procmail environment and not with Outlook, which is
where I am stuck.  I run an Outlook client in IMO mode and fetch mail with
POP3.  Tim appeared to agree with Alex that I shouldn't mess with the main
database but I should nonetheless experiment and I know he likes the
bidirectional maps.  I understand that there are also a bunch of testing
frameworks/harnesses checked in and standard data sets to test against,
though it sounds like they don't work with Outlook, which is a real pity.

So I'm again asking for direction in the initial, most important decisions.
For testing message and hapax expiration with various training regimens
under the Outlook environment (if that is even possible or reasonable):

1) Do you recommend that I use the Outlook code base or ditch the Outlook
plug-in and install the sbproxy version from source?  I hate to lose the
integration and I don't even know if the proxy produces mbox-style mail
folders that the myriad scripts already written can work with.

2) Do you recommend I start with the existing database and modify it, or as
Skip suggested, change over to a database that doesn't have the multi-thread
corruption problem?

3) And finally, Skip previously suggested that I check out the CVS trunk.
Is that still your recommendation?


Thanks for all your help.  I just want to avoid taking initial mis-steps
that would make anything I put together useless to anybody else.  I also
don't want to duplicate efforts that others who are experienced have already
taken.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From skip at pobox.com  Tue Dec 23 12:07:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 23 12:07:22 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net>
References: <16359.19517.682605.648156@montanaro.dyndns.org>
	<MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net>
Message-ID: <16360.30271.161758.305580@montanaro.dyndns.org>


    Seth> 2) Do you recommend I start with the existing database and modify
    Seth> it, or as Skip suggested, change over to a database that doesn't
    Seth> have the multi-thread corruption problem?

That's not why I suggested MySQL or PostgreSQL.  Sure, thread safety would
be a nice side-effect, but for testing I probably wouldn't care much about
that.  I suggested them because it would be easy to experiment with
different database structures.

    Seth> 3) And finally, Skip previously suggested that I check out the CVS
    Seth> trunk.  Is that still your recommendation?

For testing, yes.

I'd also recommend you ditch the Outlook plugin for testing.  If you've ever
done any Unix programming you'll probably find cobbling stuff together much
easier without the overhead of a GUI.

Skip

From nobody at spamcop.net  Tue Dec 23 13:18:29 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Tue Dec 23 13:19:07 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <16360.30271.161758.305580@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAAEKPGPAA.nobody@spamcop.net>

> [Skip Montanaro]
> I'd also recommend you ditch the Outlook plugin for testing.  If
> you've ever
> done any Unix programming you'll probably find cobbling stuff
> together much
> easier without the overhead of a GUI.

Just to be clear, I would then use the sbserver code, as I run Windows, not
Unix.  I have done Unix scripts in the past, and certainly appreciated the
flexibility and ease of messing around, but I no longer have a Unix setup.
Are the scripts that people are checking in (like Alex's nonedge script)
compatible with the mail folders produced (if any) by the sbserver code?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From skip at pobox.com  Tue Dec 23 13:25:37 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 23 13:25:44 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAAEKPGPAA.nobody@spamcop.net>
References: <16360.30271.161758.305580@montanaro.dyndns.org>
	<MHEGIFHMACFNNIMMBACAAEKPGPAA.nobody@spamcop.net>
Message-ID: <16360.34977.846047.659325@montanaro.dyndns.org>


    Seth> Just to be clear, I would then use the sbserver code, as I run
    Seth> Windows, not Unix.

Yeah, or sb_filter.py and/or sb_mboxtrain.py.  Note that I'm assuming you're
going to test your changes on a collection of saved mail, not on your
incoming mail feed. 

    Seth> I have done Unix scripts in the past, and certainly appreciated
    Seth> the flexibility and ease of messing around, but I no longer have a
    Seth> Unix setup.  Are the scripts that people are checking in (like
    Seth> Alex's nonedge script) compatible with the mail folders produced
    Seth> (if any) by the sbserver code?

sb_server.py is a proxy.  It doesn't create long-term storage for messages.
It only annotates messages it fetches on your behalf from your POP3 server.

Skip

From tim at fourstonesExpressions.com  Tue Dec 23 15:10:06 2003
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Tue Dec 23 15:10:13 2003
Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier
	assumptions?
In-Reply-To: <16360.25269.160738.272779@montanaro.dyndns.org>
References: <16360.25269.160738.272779@montanaro.dyndns.org>
Message-ID: <opr0ngy4d2it6vze@mail.fourstonesExpressions.com>

On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro <skip@pobox.com> wrote:

> Perhaps we should reexamine the caching in DBDictClassifier.  I would 
> like
> it to be able to inherit a bit more functionality from its base class.  
> If
> the assumptions it makes aren't entirely accurate, much of the extra work
> maintaining caches might be avoided.

I have no idea where that comment came from... The scheme seems bogus to 
me.  It's a word, it occurs once or many times, there's no reason to treat 
it differently.  If we have memory consumption problems, then that's the 
problem to fix.  We've had a bunch of discussion about using other db 
systems (zodb, mysql, etc.).  Perhaps this is yet another reason to 
"modernize" our database.

-- 

Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com
See my writing at www.xanga.com/obj3kshun

From skip at pobox.com  Tue Dec 23 15:26:06 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 23 15:26:15 2003
Subject: [spambayes-dev] comment assertion error? revisit DBDictClassifier
	assumptions?
In-Reply-To: <opr0ngy4d2it6vze@mail.fourstonesExpressions.com>
References: <16360.25269.160738.272779@montanaro.dyndns.org>
	<opr0ngy4d2it6vze@mail.fourstonesExpressions.com>
Message-ID: <16360.42206.839409.41850@montanaro.dyndns.org>


    Tim> On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro <skip@pobox.com> wrote:
    >> Perhaps we should reexamine the caching in DBDictClassifier.

    Tim> I have no idea where that comment came from... 

That much I can tell you.  Mark wrote the comment on May 30th.  Here's the
checkin comment:

    2 changes to the way the DB classifier manages words:

    * As per Tim P's mail, keep a list of "changed words" with a flag
    indicating "change" or "delete".  This prevents the database save
    from updating every single word ever loaded by the db.

    * From Sean, a change that prevents caching of hapaxes.  Such words are
    saved directly to the DB.  This reduces the memory footprint significantly
    (as these words are not kept in memory) and helps save times.

    This change makes "incremental" saving of the database happen in a
    reasonable time, and doesn't degrade after a complete retrain etc.

    I'm off for a weekend holiday - someone can just back this out if I
    screwed it up <wink>

Perhaps Mark can elaborate when he returns from holiday.

If we are going to cache lookups in the file-based classifiers, I'd prefer
to restructure things so we can reuse behavior defined in
classifier.Classifier wherever possible.  That means that self.wordinfo
should refer to the real file storage, not a cache.  _wordinfoget() and
friends can then rely on the versions in classifier.Classifier and front that
functionality with caches or apply other annotations.  This all breaks
down when you consider the SQL-based classifiers, but they've only ever been
experimental (I think - is anyone using them on a regular basis?), so I
think it's okay for the maintenance burden to be higher for them.

Skip

From nobody at spamcop.net  Tue Dec 23 15:33:24 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Tue Dec 23 15:40:53 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <16360.34977.846047.659325@montanaro.dyndns.org>
Message-ID: <MHEGIFHMACFNNIMMBACAMELEGPAA.nobody@spamcop.net>

>     Seth> Just to be clear, I would then use the sbserver code, as I run
>     Seth> Windows, not Unix.
>
> [Skip Montanaro]
> Yeah, or sb_filter.py and/or sb_mboxtrain.py.  Note that I'm
> assuming you're
> going to test your changes on a collection of saved mail, not on your
> incoming mail feed.

In that case, is it possible to leave the Outlook binary installed for my
incoming mail stream while I use sb_mboxtrain.py and sb_filter.py for stored
mbox testing?  My system doesn't seem to have a PythonPath environment
variable, so I would guess this is possible, so long as I can keep all the
relevant paths different.  If I can have the Outlook binary and non-Outlook
source working at the same time, is there a way to convert my saved Outlook
mail folders to mbox format so that I _can_ see how the changes I make work
on my own mail stream as well?

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From kennypitt at hotmail.com  Tue Dec 23 16:04:10 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Dec 23 16:04:50 2003
Subject: [spambayes-dev] comment assertion error? revisit
	DBDictClassifierassumptions?
In-Reply-To: <opr0ngy4d2it6vze@mail.fourstonesExpressions.com>
Message-ID: <Law11-OE39pHEbgkJqz0001135f@hotmail.com>

Tim Stone wrote:
> On Tue, 23 Dec 2003 09:43:49 -0600, Skip Montanaro <skip@pobox.com>
> wrote: 
> 
>> Perhaps we should reexamine the caching in DBDictClassifier.  I
>> would like it to be able to inherit a bit more functionality from
>> its base class. If the assumptions it makes aren't entirely
>> accurate, much of the extra work maintaining caches might be avoided.
> 
> I have no idea where that comment came from... The scheme seems bogus
> to me.  It's a word, it occurs once or many times, there's no reason
> to treat it differently.  If we have memory consumption problems,
> then that's the problem to fix..  We've had a bunch of discussion
> about using other db systems (zodb, mysql, etc.).  Perhaps this is
> yet another reason to "modernize" our database.

The comment appears in the _wordinfoset() function, which means it is
called when a message is trained.  I believe the original reasoning was
probably that there are a lot of tokens in a newly trained message that
have never been seen before, and quite likely will never be seen again.
It would be a waste of memory to cache lots of singleton tokens that
will never be used to classify another message, so the token is saved to
the database on disk but is discarded from the memory cache.  If the
token is ever needed when classifying a message in the future, then it
will be read in from the database and will then be kept in the memory
cache.
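
A rough sketch of the scheme as described above, using a plain dict as a
stand-in for the on-disk database.  The class and method names are
hypothetical and the counts are simple (spamcount, hamcount) tuples; this
is not the DBDictClassifier code, only an illustration of writing
singletons straight through while caching everything else.

```python
class HapaxSkippingStore:
    """Cache word counts in memory, except words seen only once."""

    def __init__(self, db):
        self.db = db          # dict-like persistent store (stand-in)
        self.cache = {}       # words worth keeping in memory
        self.changed = set()  # cached words not yet written back

    def get(self, word):
        if word in self.cache:
            return self.cache[word]
        counts = self.db.get(word)
        if counts is not None:
            # The word was needed again, so keep it in memory from now on.
            self.cache[word] = counts
        return counts

    def set(self, word, counts):
        spamcount, hamcount = counts
        if spamcount + hamcount <= 1:
            # Singleton: write directly to the database, do not cache.
            self.db[word] = counts
            self.cache.pop(word, None)
            self.changed.discard(word)
        else:
            self.cache[word] = counts
            self.changed.add(word)

    def store(self):
        """Write back only the cached words that changed."""
        for word in self.changed:
            self.db[word] = self.cache[word]
        self.changed.clear()


if __name__ == "__main__":
    db = {}
    store = HapaxSkippingStore(db)
    store.set("v1agra", (1, 0))
    print(db, store.cache)
```

The trade-off is one immediate database write per new token in exchange
for a smaller cache and a shorter change list at save time.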

Because the uni/bigram scheme generates so many more tokens from the
same message, I would think this reasoning would apply even more so
there.

This same caching scheme could be applied to any of the random-access
database storage mechanisms, such as MySQL or Postgres.  It doesn't seem
like it would apply to pickles, however, because the complete list of
all known tokens is always kept in memory for a pickle.  Since
PickledClassifier also derives from Classifier, I would have to vote
against moving caching logic into the base Classifier class.  Maybe a
DBClassifierBase class derived from Classifier and containing the
caching logic for all database storage mechanisms would be in order.

Regarding the reduced store time, this "optimization" seems to be
oriented towards a train-on-everything strategy and a long running
application such as sb_server.  Keeping updates in memory means that the
counts for a token can be updated multiple times with only one database
write at the end, while writing out singletons immediately keeps the
size of the change list down so that the database update doesn't take
quite so long at shutdown.

With the caching and optimization in the database engines being what it
is today, it seems that we might be better off to always write changes
to the DB immediately and dispense with the whole self.changed_words
thing altogether.  When there are multiple processes that could be using
the database at the same time, any caching (read or write) that we do
ourselves outside the database engine has the potential to generate
inconsistencies in the data anyway.

Whew, that's a much longer response than I intended.  Guess that's what
happens when things get slow before the holidays.

-- 
Kenny Pitt


From richie at entrian.com  Tue Dec 23 17:19:33 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 23 17:19:46 2003
Subject: [spambayes-dev] default to mine_received_headers=True,
	"may be forged"
In-Reply-To: <16359.42403.517834.642502@montanaro.dyndns.org>
References: <16359.26915.135330.791705@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCIEJAHPAB.tim.one@comcast.net>
	<16359.31400.650350.732281@montanaro.dyndns.org>
	<ebveuv8ncupf0trqhgs47p2o1ges2lnc53@4ax.com>
	<16359.42403.517834.642502@montanaro.dyndns.org>
Message-ID: <28fhuvs6574p1v00amrap1j62v7s46vvrk@4ax.com>


[Skip]
> It's interesting that you seem to have a lot of HELO's with the same value.
> Frequent correspondents perhaps?

Based on looking at a couple of examples, each unique HELO corresponds
with one person.  "HELO lion", for instance, is David LeBlanc.  From the
spams, "HELO kos" are all instances of the same spam.  "HELO pm69" are all
opt-in spams from Interseer, a service that watches your website uptime at
the cost of (very mildly) spamming you.

> I don't see that many HELO's (some from localhost).  Are they generated
> close to your machine (in a late Received: header)?

The three I looked at were all in the initial Received header - that is,
the one that was added by the first MTA, and therefore appears last in the
message.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Tue Dec 23 19:14:10 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 23 19:14:23 2003
Subject: [spambayes-dev] FW: SF.NET Project Donation System
In-Reply-To: <029e01c3c74c$d84a8470$2c00a8c0@eden>
References: <ire9uv830bs6j3i7eta1kjt5m0brv2ek3j@4ax.com>
	<029e01c3c74c$d84a8470$2c00a8c0@eden>
Message-ID: <66mhuv0k1kjgmkbv786r6ja6gpuei5us0v@4ax.com>


[Richie]
> Anyone who's spent real money on the project, like Rob with the
> spambayes.org domain, could be reimbursed.

[Mark]
> I agree, but not sure how this could work in practical terms with the tax
> and holding issues.

We could temporarily add them to the top of the donations page, and use
your "Pay this guy first" idea.  Then users could donate directly to them.

[Mark]
> Our "donations" page could list the developers,
> and include a link to their personal sourceforge page.  What they say about
> themselves there is their issue.

Sounds sensible.  We could leave the PSF donation button there, and also
include a donation button for SourceForge (although they already take a 5%
cut of all donations, subject to a $1 minimum).  That should cover all the
bases.

[Tony]
> It seems to me that this would end up being more of a donate-for-support
> page, which leaves out those people that support but don't develop.  My
> personal suspicion is that people are more likely to want to donate for
> support than development, anyway.

Good point.  Maybe we should have a scheme whereby a non-developer
contributor can be nominated for inclusion on the donations page.

-- 
Richie Hindle
richie@entrian.com


From popiel at wolfskeep.com  Tue Dec 23 20:24:49 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 23 20:24:54 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Seth Goodman" <nobody@spamcop.net> of "Tue,
	23 Dec 2003 10:40:32 CST."
	<MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net> 
References: <MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net> 
Message-ID: <20031224012449.35EA12DF61@cashew.wolfskeep.com>

In message:  <MHEGIFHMACFNNIMMBACAGEKJGPAA.nobody@spamcop.net>
             "Seth Goodman" <nobody@spamcop.net> writes:

>Thanks to all for replying.

Eh, I'm just satisfying one of my own vices: babbling at people while
scrambling to do the groundwork to back up my babble.

>Alex suggests that bidirectional maps are overkill and not to bother.

Hrm.  I think I'd rephrase to say that the maps are overkill for most
all of the individual tests/regimes that you might be interested in.
Furthermore, while we're just trying things out, it seems to make more
sense to do the tests individually as we come up with them, instead of
trying to make some over-arching generalization that could be used to
implement any of them.

>Alex also has some scripts that do much of what I am trying to do, but
>it sounds like they will only work in a procmail environment and not
>with Outlook, which is where I am stuck.

My scripts don't really work in a mail environment at all; they work
in an environment where data content (which happens to be RFC 822
formatted mail messages) is stored in files in a specific directory
structure with a special naming convention.  This structure is:

  Data/
    Ham/
      reservoir/
      Set1/
      Set2/
      ...
      SetN/
    Spam/
      reservoir/
      Set1/
      Set2/
      ...
      SetN/

Inside each of the bottom-level directories is a set of files named
with a 4-digit number, a dash, and a 6-digit number, such as 0267-045075.
The 4-digit number is a day-of-arrival indicator (for grouping vs.
periodic processes like the fixed retraining in the 'corrected' regime),
and the 6-digit number is a unique sequence number (for ordering all the
messages for behaviour-over-time analysis).
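
To make the naming convention concrete, here is a small sketch that parses
such file names and lists the messages under Data/ in sequence order.  The
helper names (parse_name, messages_in_order) are hypothetical and are not
part of the testtools scripts.

```python
import re
from pathlib import Path
from typing import NamedTuple

# Four digits (day of arrival), a dash, six digits (sequence number).
NAME_RE = re.compile(r'^(\d{4})-(\d{6})$')


class MsgName(NamedTuple):
    day: int
    seq: int


def parse_name(filename):
    """Split a name like '0267-045075' into (day, seq)."""
    m = NAME_RE.match(filename)
    if m is None:
        raise ValueError("not a message file name: %r" % filename)
    return MsgName(int(m.group(1)), int(m.group(2)))


def messages_in_order(data_dir):
    """Return (MsgName, path) pairs from Data/*/*/ sorted by sequence."""
    found = []
    for path in Path(data_dir).glob('*/*/*'):
        if not path.is_file():
            continue
        try:
            found.append((parse_name(path.name), path))
        except ValueError:
            continue  # ignore files that do not follow the convention
    found.sort(key=lambda item: item[0].seq)
    return found


if __name__ == "__main__":
    print(parse_name("0267-045075"))
```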

Note that the above structure can be used for Tim's cv tests, too;
his framework uses the directory hierarchy but doesn't care about the
file names.

More information on how I generate and manipulate this structure is
in the incremental.HOWTO.txt in the testtools directory of the project.
Also, the README-DEVEL.txt in the root of the project explains a lot
more about this structure and the other tools for manipulating it.

>I run an Outlook client in IMO mode and fetch mail with POP3.

To get at your raw mail messages, I'd stick a POP3 proxy in there which
saved each message into a separate file... but I'm a protocol weenie,
and there might be easier ways to get at the data.
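
One of the easier ways might look like the following.  It is not a proxy,
just a sketch that polls a POP3 account with the standard poplib module and
writes each message to its own file using the day-sequence naming described
earlier.  The function names, host, user and password are placeholders, and
messages are left on the server.

```python
import poplib
from pathlib import Path


def fetch_all(host, user, password, port=110):
    """Return every message on the server as raw bytes (nothing deleted)."""
    conn = poplib.POP3(host, port)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        messages = []
        for i in range(1, count + 1):
            _resp, lines, _octets = conn.retr(i)
            messages.append(b"\n".join(lines) + b"\n")
        return messages
    finally:
        conn.quit()


def save_messages(messages, dest_dir, day=0, first_seq=1):
    """Write each raw message to dest_dir as DDDD-SSSSSS; return the paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for offset, raw in enumerate(messages):
        path = dest / f"{day:04d}-{first_seq + offset:06d}"
        path.write_bytes(raw)
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Placeholder credentials; replace before running against a real server.
    # save_messages(fetch_all("pop.example.com", "me", "secret"),
    #               "Data/Ham/reservoir")
    pass
```

Sorting the saved files into Ham and Spam still has to be done by hand or
with the existing tools.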

>I understand that there are also a bunch of testing frameworks/harnesses
>checked in

Yes.  The testtools directory is your friend.

>and standard data sets to test against

This we do not have (in any significant quantity), for multiple reasons:

1) If we have a standard data set, then we'll end up with a tool that's
   good at classifying that data set, not random people's mail.

2) While sharing spam is fairly innocuous, sharing ham opens up all sorts
   of privacy concerns... and if we filter out private info from the stuff
   we share, then we're systematically neglecting a portion of the data
   we're trying to represent.

3) We seem to enjoy nagging each other into running tests on private
   datasets.  There seems to be some thought that if we nag enough people,
   someone will actually read the code that's being tested and point out
   where we're being stupid. <.5 wink>

>though it sounds like they don't work with Outlook, which is a real pity.

They don't really work with any mail handler, as mentioned above; instead,
they work on organized data, so you can rerun tests time and time again
after various fidgets and fixes.

The reason why Outlook is a particular problem is that Outlook mutilates
mail, irretrievably destroying the RFC 822 structure that it may have
once been delivered in.  A similar structure can theoretically be
recreated, but like many recreations, some information (like the
separators used in MIME encapsulation, etc) is not the same.

>So I'm again asking for direction in the initial, most important decisions.
>For testing message and hapax expiration with various training regimens
>under the Outlook environment (if that is even possible or reasonable):
>
>1) Do you recommend that I use the Outlook code base or ditch the Outlook
>plug-in and install the sbproxy version from source?  I hate to lose the
>integration and I don't even know if the proxy produces mbox-style mail
>folders that the myriad scripts already written can work with.

I'm strongly in favor of ditching Outlook entirely.

>2) Do you recommend I start with the existing database and modify it, or as
>Skip suggested, change over to a database that doesn't have the multi-thread
>corruption problem?

I'm not even sure if the test harnesses use a database backend at all;
I think they may be keeping everything in memory.  Dunno.  I haven't
looked at that in ages.

What I would suggest is starting with the existing test harnesses and
building from there.

>3) And finally, Skip previously suggested that I check out the CVS trunk.
>Is that still your recommendation?

Definitely.  Last I heard, there's a bunch of stuff (including all the
test info) that's in CVS but not in the binary distributions.

>Thanks for all your help.  I just want to avoid taking initial mis-steps
>that would make anything I put together useless to anybody else.  I also
>don't want to duplicate efforts that others who are experienced have already
>taken.

Reproducing what's gone before is useful.  Duplicating it is not so
useful.  Where the line is drawn between the two is something I'll
leave to someone else. ;-)

- Alex

From popiel at wolfskeep.com  Tue Dec 23 20:32:33 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Dec 23 20:32:38 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Seth Goodman" <nobody@spamcop.net> of "Tue,
	23 Dec 2003 14:33:24 CST."
	<MHEGIFHMACFNNIMMBACAMELEGPAA.nobody@spamcop.net> 
References: <MHEGIFHMACFNNIMMBACAMELEGPAA.nobody@spamcop.net> 
Message-ID: <20031224013233.A3CCB2DF61@cashew.wolfskeep.com>

In message:  <MHEGIFHMACFNNIMMBACAMELEGPAA.nobody@spamcop.net>
             "Seth Goodman" <nobody@spamcop.net> writes:
>>
>> [Skip Montanaro]
>> Yeah, or sb_filter.py and/or sb_mboxtrain.py.  Note that I'm
>> assuming you're
>> going to test your changes on a collection of saved mail, not on your
>> incoming mail feed.
>
>In that case, is it possible to leave the Outlook binary installed for my
>incoming mail stream while I use sb_mboxtrain.py and sb_filter.py for stored
>mbox testing?

Certainly.  You can run test tools completely independent of your mail
feed, without affecting it at all.

>My system doesn't seem to have a PythonPath environment variable, so I
>would guess this is possible, so long as I can keep all the relevant
>paths different.

Exactly.  Most of the test scripts have path-futzing stuff at the top
to find local copies of the spambayes code, too, so it's theoretically
possible even if you do have a PythonPath set.

>is there a way to convert my saved Outlook mail folders to mbox format

This is much more problematic; Mark may have code for this, but as I
mentioned in my last mail, Outlook mutilates mail.  It may be easier to
just start collecting fresh by inserting something which saves incoming
mail before Outlook gets its grubby little hands all over it.

>so that I _can_ see how the changes I make work on my own mail stream
>as well?

This is the most laudable goal of all... for it is how we judge if things
are good or bad. :-)

- Alex

From tameyer at ihug.co.nz  Tue Dec 23 20:47:32 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 23 20:47:39 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7DAE@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677796@its-xchg4.massey.ac.nz>

[Seth Goodman]
> Just out of curiosity, does the proxy version 
> of SpamBayes have the same protection as the Outlook version 
> against training on the same msg_id twice?

Kinda.  It won't train a message with the same id twice, but that id is
generated when mail travels through the proxy.  So if you download the same
message (through the proxy) twice, then you'll have two messages identical
apart from the ids.  If you used the web interface to find a message after
you had already trained it, training it again will have no effect (unless
the classification is different, then it'll be fixed).

FWIW, the imap filter does the same thing, except that since mail isn't
downloaded (it's a filter not a proxy) mail does get given a permanent
unique id.

If you wanted to train a message twice and not download it twice, you could
simply duplicate the one sitting in the "Unknown" cache and give it a
different name (fitting the scheme).

=Tony Meyer


From tameyer at ihug.co.nz  Tue Dec 23 20:50:00 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 23 20:50:08 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7DFB@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677797@its-xchg4.massey.ac.nz>

> People now 
> typically don't have the slightest clue how to go from their 
> normal usage to a testing deployment...  or at least don't 
> know how to extract their mail from Outlook's clutches so 
> that they have data to work _on_.

Right now, the idea is simply that people run the "export.py" script in the
Outlook2000 directory (running from source, obviously), which churns out the
'standard' testing setup containing all the messages in the folders Outlook
knows about.  From there you run tests like anyone else.

=Tony Meyer


From tameyer at ihug.co.nz  Tue Dec 23 20:56:19 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 23 20:56:24 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D7F62@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677798@its-xchg4.massey.ac.nz>

[Seth Goodman]
> I understand that there are also a bunch of testing 
> frameworks/harnesses checked in and standard data sets to 
> test against, though it sounds like they don't work with 
> Outlook, which is a real pity.

You can still use all the testing tools with mail that you receive via
Outlook, though.  This is what I do, and (AFAIK) what Mark does.  Look at the
export.py script in the Outlook2000 directory.

> 1) Do you recommend that I use the Outlook code base or ditch 
> the Outlook plug-in and install the sbproxy version from 
> source?

Stick with the plug-in.  sb_server's not going to give you anything helpful
in the way of testing (my experimental TestToolsUI excluded <wink>).

> 3) And finally, Skip previously suggested that I check out 
> the CVS trunk. Is that still your recommendation?

Yes.

[Later]
> In that case, is it possible to leave the Outlook binary
> installed for my incoming mail stream while I use 
> sb_mboxtrain.py and sb_filter.py for stored mbox testing?

Yes, you can leave the binary installed.  You don't need to use sb_mboxtrain
or sb_filter if you're going to use the testing setup, though.  You're after
the scripts in the testtools directory, not the scripts one.  (If I
understand the recommendations that have been made so far).

> My system doesn't seem to have a PythonPath environment variable,
> so I would guess this is possible, so long as I can keep all
> the relevant paths different.

Just don't run "addin.py" in the Outlook2000 directory, and the plug-in
binary will keep on chugging.

> If I can have the Outlook binary and non-Outlook source
> working at the same time, is there a way to convert my
> saved Outlook mail folders to mbox format so that I _can_
> see how the changes I make work on my own mail stream as well?

export.py in the Outlook2000 directory.

Let me know if you have any troubles getting the testing setup going or
exporting the messages from Outlook.

=Tony Meyer


From tim.one at comcast.net  Tue Dec 23 21:11:48 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 23 21:11:53 2003
Subject: [spambayes-dev] comment assertion error?
	revisitDBDictClassifierassumptions?
In-Reply-To: <Law11-OE39pHEbgkJqz0001135f@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEADIAAB.tim.one@comcast.net>

[Kenny Pitt]

You're doing an excellent job of channeling Mark, and I have only a little
to add.  From a 5-mile view, we run a memory cache (which happens to be a
Python dict) on top of a disk-based database, in order that the system not
run too slow to bear.  The memory cache is effective at speeding normal
operation; that's why it's there.  It may err on the side of keeping too
much in memory.

> The comment appears in the _wordinfoset() function, which means it is
> called when a message is trained.  I believe the original reasoning
> was probably that there are a lot of tokens in a newly trained
> message that have never been seen before, and quite likely will never
> be seen again. It would be a waste of memory to cache lots of
> singleton tokens that will never be used to classify another message,
> so the token is saved to the database on disk but is discarded from
> the memory cache.  If the token is ever needed when classifying a
> message in the future, then it will be read in from the database and
> will then be kept in the memory cache.

All correct.

> Because the uni/bigram scheme generates so many more tokens from the
> same message, I would think this reasoning would apply even more so
> there.

Me too.

> This same caching scheme could be applied to any of the random-access
> database storage mechanisms, such as MySQL or Postgres.

That's right, and if looking up frequently referenced tokens goes faster in a
dict than reading from disk (hint:  it does <wink>), it will help them too.

> It doesn't seem like it would apply to pickles, however, because
> the complete list of all known tokens is always kept in memory for a
> pickle.

Also right.  Skip, what you described before makes me wonder why you'd want
a disk-based database:

    I'm not too concerned about memory footprint of the classifier,
    since I have lots of memory
    ...
    I also wonder about the contention that it reduces the database
    store time.

If you want peak classification and/or training speed, have lots of memory,
and don't care about initialization or finalization time, running a plain
Python dict (stored as a giant binary pickle) is definitely the way to go.
It's much faster, and it was much faster still before we added layers of
indirection to *allow* dict operations to get satisfied by "real" databases
instead.

FWIW, the memory cache may not apply much to ZODB either, since ZODB keeps
accessed Python objects (which is what ZODB stores) in its own memory cache.

> Since PickledClassifier also derives from Classifier, I would have
> to vote against moving caching logic into the base Classifier class.
> Maybe a DBClassifierBase class derived from Classifier and containing
> the caching logic for all database storage mechanisms would be in
> order.

Of course different storage mechanisms may want different caching
strategies.

> Regarding the reduced store time, this "optimization" seems to be
> oriented towards a train-on-everything strategy and a long running
> application such as sb_server.  Keeping updates in memory means that
> the counts for a token can be updated multiple times with only one
> database write at the end, while writing out singletons immediately
> keeps the size of the change list down so that the database update
> doesn't take quite so long at shutdown.

It was really aimed at incremental training.  When you hit, e.g., the
"Delete as Spam" button in the Outlook addin with even just one msg
selected, the Berkeley db on disk is synch'ed after training.  This makes
for a *very* perceptible delay if the cache contains lots of info that
differs from what's on disk.  Startup and shutdown time are also important
in this context, and amortizing those costs has major "perceived usability"
benefits.
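
To make the scheme under discussion concrete, here is a simplified sketch, not the actual DBDictClassifier code. A plain dict stands in for the on-disk database, and the class and method names (CachedTokenStore, get, set, store) are invented for illustration. New singleton tokens are written straight through and kept out of the cache; other updates stay in memory and are flushed in one batch.

```python
class CachedTokenStore:
    """Dict cache over a slower backing store (illustrative only)."""

    def __init__(self, backing):
        self.backing = backing   # stand-in for the disk-based database
        self.cache = {}          # tokens that have actually been looked up
        self.changed = set()     # cached tokens not yet written back

    def get(self, token):
        # A lookup pulls the token into memory, where it then stays.
        if token in self.cache:
            return self.cache[token]
        count = self.backing.get(token)
        if count is not None:
            self.cache[token] = count
        return count

    def set(self, token, count):
        if count == 1 and token not in self.cache:
            # Never-seen-before singleton: save to disk now, skip the cache.
            self.backing[token] = count
        else:
            # Everything else is updated in memory and flushed later.
            self.cache[token] = count
            self.changed.add(token)

    def store(self):
        # One batch of writes; fewer pending changes means a shorter sync.
        for token in self.changed:
            self.backing[token] = self.cache[token]
        self.changed.clear()
```

The trade-off is visible in store(): the more singletons are written immediately, the less there is to flush when the user is waiting on a sync.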

If, e.g., you run from a giant pickled Python dict instead, you can expect
to wait several seconds (at best) whenever loading it from, or storing it
to, disk.

> With the caching and optimization in the database engines being what
> it is today, it seems that we might be better off to always write
> changes to the DB immediately and dispense with the whole
> self.changed_words thing altogether.

This should be measured; it's not (or shouldn't be) a religious issue.  I
have no experience with general-purpose database engines that are actually
fast; only some that aren't as slow as others <0.5 wink>.

> When there are multiple processes that could be using the database
> at the same time, any caching (read or write) that we do ourselves
> outside the database engine has the potential to generate
> inconsistencies in the data anyway.

A conclusion there, one way or the other, depends on specific details.
Concurrent read-write access is never simple, and I'm not sure anyone uses
spambayes that way anyway.


From tameyer at ihug.co.nz  Tue Dec 23 21:16:19 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 23 21:16:23 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13048D8049@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467779A@its-xchg4.massey.ac.nz>

[Alex]
> The reason why Outlook is a particular problem is that 
> Outlook mutilates mail, irretrievably destroying the RFC 822 
> structure that it may have once been delivered in.  A similar 
> structure can theoretically be recreated, but like many 
> recreations, some information (like the separators used in 
> MIME encapsulation, etc) is not the same.
[...]
> I'm strongly in favor of ditching Outlook entirely.

The export.py script does a reasonable job of putting everything back
together again.  Actually, I believe it does the exact same job as when
getting a message to pass to tokenizer for general use.  So although popping
a proxy in between Outlook and the POP3 server to catch raw messages would
certainly be more pure and correct (sb_server can do this, BTW, just set the
cache expiry limit *really* high and don't bother classifying any
messages), for practical purposes using the data that Outlook gives is just
as useful. (Since if anything got accepted into the core those using the
Outlook plug-in would be dealing with those effects).  This is a (another)
good reason for us to try each other's patches (and I will get to the
incremental ones soon, honest! <wink>) since some of us have Outlook-altered
messages to test, and others have nice pure message streams.

> I'm not even sure if the test harnesses use a database 
> backend at all; I think they may be keeping everything in 
> memory.  Dunno.  I haven't looked at that in ages.

They keep everything in memory unless you've enabled the 'save the
classifier' option (can't remember what it's called; too lazy to check), and
then it pickles them.

=Tony Meyer


From tim.one at comcast.net  Wed Dec 24 01:07:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 24 01:07:30 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE68jx2XDniY50000e32d@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEAKIAAB.tim.one@comcast.net>

[Kenny Pitt]
> ...
> Whenever you use direct BDB through the pybsddb/bsddb3/bsddb module
> in a multi-thread/multi-user scenario, you always have to start with
> a call to initialize the DB environment before you can do anything
> else.  You expressed some concern over the breakage on Win98 of the
> tests in test_dbshelve.py.  Unfortunately, the line that always fails
> is that very first and most basic initialization call, the same one
> that we would need to call for any use in SpamBayes.

I don't think there's a problem with that:

C:\Python23>python lib/bsddb/test/test_dbshelve.py -v
test01_basics (__main__.DBShelveTestCase) ... ok
test02_cursors (__main__.DBShelveTestCase) ... ok
test01_basics (__main__.BTreeShelveTestCase) ... ok
test02_cursors (__main__.BTreeShelveTestCase) ... ok
test01_basics (__main__.HashShelveTestCase) ... ok
test02_cursors (__main__.HashShelveTestCase) ... ok
test01_basics (__main__.ThreadBTreeShelveTestCase) ... ok
test02_cursors (__main__.ThreadBTreeShelveTestCase) ... ok
test01_basics (__main__.ThreadHashShelveTestCase) ... ok
test02_cursors (__main__.ThreadHashShelveTestCase) ... ok
test01_basics (__main__.EnvBTreeShelveTestCase) ... ERROR
test02_cursors (__main__.EnvBTreeShelveTestCase) ... ok
test01_basics (__main__.EnvHashShelveTestCase) ... ERROR
test02_cursors (__main__.EnvHashShelveTestCase) ... ok
test01_basics (__main__.EnvThreadBTreeShelveTestCase) ... ERROR
test02_cursors (__main__.EnvThreadBTreeShelveTestCase) ... ok
test01_basics (__main__.EnvThreadHashShelveTestCase) ... ERROR
test02_cursors (__main__.EnvThreadHashShelveTestCase) ... ok

Note that the 4 Env instances of test02_cursors pass.  They're doing the
full-blown open-with-env bit too.  It's the 4 Env instances of
test01_basics that fail, and all of them die with the same traceback:

Traceback (most recent call last):
  File "lib/bsddb/test/test_dbshelve.py", line 75, in test01_basics
    self.do_open()
  File "lib/bsddb/test/test_dbshelve.py", line 238, in do_open
    self.env.open(homeDir, self.envflags | db.DB_INIT_MPOOL |
                                           db.DB_CREATE)
DBAgainError: (11, 'Resource temporarily unavailable -- unable to join
                    the environment')

This isn't the *first* time test01_basics opens with an env, though.  Line
75 is here:

    def test01_basics(self):
        if verbose:
            print '\n', '-=' * 30
            print "Running %s.test01_basics..." % self.__class__.__name__

        self.populateDB(self.d)
        self.d.sync()
        self.do_close()
        self.do_open() # ********************* LINE 75
        d = self.d

The test setUp() method already does self.do_open() once by the time
test01_basics begins.  So there's something screwed up about how the test
tries to close and reopen the dbshelve (self.d) on this box.  Figuring out
exactly what would require digging into the guts of the stinkin' dbshelve
module, to see how *its* stinkin' close method screws up <wink>.

If I comment out lines 74 and 75 (the back-to-back close()/open() pair), the
4 env instances of test01_basics all pass.  In fact, they all pass if I just
comment out line 74.  They also pass if I replace lines 74 and 75 with:

        self.tearDown()
        self.setUp()
        self.populateDB(self.d)

The only way they don't pass is to do exactly what the test does <wink>.

> ...
> Maybe the best thing is to throw some test code into SpamBayes
> and see if it will even start up on Win98.

Yes.

> I don't have access to a Win98 test system, but if I can code up
> enough support that we can try this out, would you be willing to give
> it a test?

Certainly.

> It will probably be after the holidays before I can get to it, but
> we'll see.

That's fine, there's no rush.  Especially since it will work <wink>.


From tim.one at comcast.net  Wed Dec 24 02:33:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 24 02:33:31 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467779A@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEAMIAAB.tim.one@comcast.net>

[Tony Meyer]
> The export.py script does a reasonable job of putting everything back
> together again [from Outlook].

Thanks, Tony!  I'm mortified to admit I had forgotten where this script
lived.

> Actually, I believe it does the exact same job as when getting a
> message to pass to tokenizer for general use.

In particular, exactly the same as when scoring a message, or training on
one.  The MIME armor (if any) is gone, (at least all) non text/* attachments
are gone, and if the original headers contained Content-Type or
Content-Transfer-Encoding specs, they're gone too.  If it was
multipart/alternative with text/plain and text/html sections, they're both
slammed into the body, without indication of where one ends and the other
begins.

But that's the way we score Outlook email, and it's darned hard to do
better.  Outlook's message store is a complicated beast, and predates
current email standards; they tacked MIME email on top of a sprawling store
that didn't know anything about MIME, spraying bits and pieces all over the
place.  Pretty cool <wink>.

> So although popping a proxy in between Outlook and the POP3 server to
> catch raw messages would certainly be more pure and correct
> (sb_server can do this, BTW, just set the cache expiry limit *really*
> high and don't bother classifying any messages), for practical
> purposes using the data that Outlook gives is just as useful.

For anyone using spambayes via the Outlook addin, it's *better* to use
export.py than to capture the incoming email bytestream.  SpamBayes can't
reconstruct the original bytestream from Outlook (not out of laziness, it's
simply impossible), so how the classifier would do if it *could* see the
original bytestream is irrelevant to real-life Outlook use.

It's close enough that I doubt it matters much.


From kennypitt at hotmail.com  Wed Dec 24 08:44:48 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Dec 24 08:45:26 2003
Subject: [spambayes-dev] comment assertion error?
	revisitDBDictClassifierassumptions?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEADIAAB.tim.one@comcast.net>
Message-ID: <Law11-OE32sBXBLfhg300011e1a@hotmail.com>

Tim Peters wrote:
> [Kenny Pitt]
>> With the caching and optimization in the database engines being what
>> it is today, it seems that we might be better off to always write
>> changes to the DB immediately and dispense with the whole
>> self.changed_words thing altogether.
> 
> This should be measured; it's not (or shouldn't be) a religious
> issue.  I have no experience with general-purpose database engines
> that are actually fast; only some that aren't as slow as others
> <0.5 wink>. 

As always, never assume anything without thorough testing, right? <wink>

>> When there are multiple processes that could be using the database
>> at the same time, any caching (read or write) that we do ourselves
>> outside the database engine has the potential to generate
>> inconsistencies in the data anyway.
> 
> A conclusion there, one way or the other, depends on specific details.
> Concurrent read-write access is never simple, and I'm not sure anyone
> uses spambayes that way anyway.

As far as I can tell, this should only happen with
sb_filter/sb_mboxtrain.  All the other solutions that I know about
(Outlook, sb_server, sb_imapfilter, sb_xmlrpcserver) have a single
server process that handles all database access.  Out of any remaining
solutions, I also suspect they are rarely used since I hardly ever see
them mentioned on any of the mailing lists.

This leads to a question regarding the proposed direct BerkeleyDB
storage.  If we never access the database from more than one process at
the same time, do we really need a full-fledged multi-process
environment for Berkeley?  You can do private, multi-thread environments
that provide sufficient locking with less overhead for a single process.
Any guesses from anyone as to what cases would require cross-process
locking?

-- 
Kenny Pitt


From kennypitt at hotmail.com  Wed Dec 24 08:56:59 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Dec 24 08:57:39 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEAKIAAB.tim.one@comcast.net>
Message-ID: <Law11-OE64uv08Be4ZE00011ce4@hotmail.com>

Tim Peters wrote:
> [Kenny Pitt]
>> ...
>> Unfortunately, the line that always fails
>> is that very first and most basic initialization call, the same one
>> that we would need to call for any use in SpamBayes.
> 
> I don't think there's a problem with that:
> 
> ...
> 
> Note that the 4 Env instances of test02_cursors pass.  They're doing
> the full-blown open-with-env bit too.  It's the 4 Env instances of
> test01_basics that fail, and all of them die with the same traceback:
> 
> ...
> 
> So there's something screwed up about how the
> test tries to close and reopen the dbshelve (self.d) on this box. 
> Figuring out exactly what would require digging into the guts of the
> stinkin' dbshelve module, to see how *its* stinkin' close method
> screws up <wink>. 

I suspect some timing issue with the Windows disk cache not immediately
flushing stuff to disk.  That's just idle speculation, of course, but I
have seen similar things in other development projects.

> If I comment out lines 74 and 75 (the back-to-back close()/open()
> pair), the 4 env instances of test01_basics all pass.
>
> ...
> 
> The only way they don't pass is to do exactly what the test does
> <wink>. 
> 
>> ...
>> Maybe the best thing is to throw some test code into SpamBayes
>> and see if it will even start up on Win98.
> 
> Yes.

Good to know, thanks.  I'll proceed along that line, then.  I can't
think of a good reason that we should need to close and then immediately
reopen the same database.

-- 
Kenny Pitt


From skip at pobox.com  Wed Dec 24 10:07:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 24 10:07:10 2003
Subject: [spambayes-dev] test_storage.py failing
Message-ID: <16361.43929.945893.137105@montanaro.dyndns.org>


I ran the spambayes/test/test_storage.py this morning for the first time (on
a fresh CVS checkout) and got several instances of the same error.  Here's
one example:

    ERROR: testHapax (__main__.DBStorageTestCase)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "test_storage.py", line 137, in setUp
        return _StorageTestBase.setUp(self)
      File "test_storage.py", line 20, in setUp
        self.classifier = self.__class__.StorageClass(self.db_name)
    TypeError: __init__() takes exactly 1 argument (2 given)

I inserted

    print self.__class__.StorageClass

right above the class call and in the error case it's always instantiating
DBDictClassifier which does take a db_name argument, so I'm a bit confused
about why this is generating an error.  My brain is not in a high enough
gear to see why.  I think I'll just go do a little last minute Christmas
shopping and let someone else figure it out.

Skip

From tim.one at comcast.net  Wed Dec 24 12:35:33 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Dec 24 12:35:37 2003
Subject: [spambayes-dev] test_storage.py failing
In-Reply-To: <16361.43929.945893.137105@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECEIAAB.tim.one@comcast.net>

[spambayes-dev-bounces@python.org]
[Skip]
> I ran the spambayes/test/test_storage.py this morning for the first
> time (on a fresh CVS checkout) and got several instances of the same
> error.

How many is several?  Note that there are only 5 tests here, so if several
means more than 2, it's possible that *all* the tests died this way for you.
That would be a clue.

>  Here's one example:
>
>     ERROR: testHapax (__main__.DBStorageTestCase)
>
> ----------------------------------------------------------------------
> Traceback (most recent call last):
> File "test_storage.py", line 137, in setUp
>      return _StorageTestBase.setUp(self)
> File "test_storage.py", line 20, in setUp
>      self.classifier = self.__class__.StorageClass(self.db_name)
> TypeError: __init__() takes exactly 1 argument (2 given)
>
> I inserted
>
>     print self.__class__.StorageClass
>
> right above the class call and in the error case it's always
> instantiating DBDictClassifier which does take a db_name argument, so
> I'm a bit confused about why this is generating an error.

I expect you need to run python with -v to see how the imports are getting
satisfied -- the only guess I have is that you're not getting the classes
the test expects to get.
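
A quicker check than wading through -v output might be to ask the class itself where it came from. The helper below is a hypothetical diagnostic (describe_class is not in the test suite), written for current Python rather than 2.2/2.3; it reports the module name, the file that module was loaded from, and the parameter names of __init__.

```python
import inspect
import sys


def describe_class(cls):
    """Return (module name, module file or None, __init__ parameter names)."""
    module = sys.modules.get(cls.__module__)
    # Built-in or interactively defined modules may have no __file__.
    filename = getattr(module, "__file__", None)
    params = list(inspect.signature(cls.__init__).parameters)
    return cls.__module__, filename, params
```

If the reported file is not in the fresh checkout, or the parameter list lacks the expected db_name, that would point to a stale copy of the module shadowing the intended one.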

Here are runs on my box (Pythons 2.3.3 and 2.2.3):

C:\Code\spambayes>echo %PYTHONPATH%
\code\spambayes

C:\Code\spambayes>\python23\python spambayes/test/test_storage.py -v
testHapax (__main__.PickleStorageTestCase) ... ok
test_bug777026 (__main__.PickleStorageTestCase) ... ok
testHapax (__main__.DBStorageTestCase) ... ok
testNoDBMAvailable (__main__.DBStorageTestCase) ... ok
test_bug777026 (__main__.DBStorageTestCase) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.050s

OK

C:\Code\spambayes>\python22\python spambayes/test/test_storage.py -v
testHapax (__main__.PickleStorageTestCase) ... ok
test_bug777026 (__main__.PickleStorageTestCase) ... ok
testHapax (__main__.DBStorageTestCase) ... ok
testNoDBMAvailable (__main__.DBStorageTestCase) ... ok
test_bug777026 (__main__.DBStorageTestCase) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.110s

Note that I checked in some code cleanup for test_storage.py right before
typing this msg, but I got the same results before too.


From nobody at spamcop.net  Wed Dec 24 13:46:14 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Wed Dec 24 13:46:23 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEAMIAAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACAMENHGPAA.nobody@spamcop.net>

Thanks so much Alex, Skip, Tony and Tim!  This gives me all the rope I need,
as they say.  I'll dig in and ask for specific help when I run into problems
later.  I've been saving my whole mail stream for a while, so I do have
something to test on.  First task is get export.py working, next explore the
test programs that others have already checked in.

Thanks again.

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From richie at entrian.com  Wed Dec 24 16:31:31 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 24 16:31:46 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <001801c3c91e$1cde2bf0$2c00a8c0@eden>
References: <001801c3c91e$1cde2bf0$2c00a8c0@eden>
Message-ID: <p91kuvkrdt81vqqivrklhq87f0qvqrq7u0@4ax.com>


[Mark]
> I have just uploaded an installer for a new experimental binary of
> SpamBayes.  This binary includes *both* the Outlook addin and the sb_server
> applications.

Nice one!  Barring a few minor glitches (which I'll enter into the SF
tracker when I get the chance) sb_tray worked like a charm for me.

-- 
Richie Hindle
richie@entrian.com


From popiel at wolfskeep.com  Thu Dec 25 15:24:04 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Thu Dec 25 15:24:08 2003
Subject: [spambayes-dev] Reduced training test results
Message-ID: <20031225202404.757652DF61@cashew.wolfskeep.com>

Training on just those messages whose score isn't 0.00 or 1.00
(rounded) seems to be a huge win over training on everything.
Not so much because the accuracy is better (though accuracy
does seem to be improved by neglecting those messages that it's
already certain about), but because of a hugely reduced training
set (and thus database).  Specifically, training on everything
yielded a database with 70,000 messages, while training only
on the non-extreme put only about 3,500 messages into the database.
Unfortunately, I don't have firm numbers on token counts.

Also of significant interest is that the classifier doesn't seem
to decay as badly over time.  With training on everything, the
unsure rate in particular (and fn to a much lesser extent) goes
up significantly after about 200 days worth of traffic, though
the fp rate stays low.  With just training on those things that
aren't already certain, the unsure rate climbs much more slowly
after 200 days (with the cumulative rate staying relatively flat),
while the fp and fn rates stay at very low values.

Details of my experiment parameters:

I've got about 77000 messages in my dataset, covering a span of
418 days.  Of these, about 21500 are ham, and nearly 56000 are spam.
I include virus/worm messages in my spam, and the "latest windows
update" worm makes its presence felt around day 360.

I divided my dataset into 10 subsets, and ran the incremental.py
harness over these 10 times, excluding 1 set each time, as per normal
cv-ish behaviour.  Thus, each of my measurements is replicated 10
times, with slightly different input data.
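
The leave-one-set-out loop described above can be sketched roughly as
follows.  This is only an illustration: the round-robin split and the
names make_subsets and leave_one_out are hypothetical, not taken from
incremental.py.

```python
def make_subsets(messages, n=10):
    """Deal messages round-robin into n subsets (hypothetical split)."""
    return [messages[i::n] for i in range(n)]


def leave_one_out(subsets):
    """Yield (excluded_index, kept_messages) once per subset."""
    for skip in range(len(subsets)):
        kept = [m for i, s in enumerate(subsets) if i != skip for m in s]
        yield skip, kept


if __name__ == "__main__":
    subsets = make_subsets(list(range(100)))
    for skip, kept in leave_one_out(subsets):
        print(skip, len(kept))
```

Each pass sees nine tenths of the data, so the ten measurements differ
only in which tenth was left out.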

Finally, I did the above-mentioned 10 runs using both the 'perfect'
and 'nonedge' regimes.  The 'perfect' regime trains on every message
using the proper ham/spam classification, while the 'nonedge'
regime trains only on those messages that were not correctly
classified with 0.00 or 1.00 (rounded) scores.
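
As a minimal sketch of the two regimes, assuming the score is a float
in [0.0, 1.0] and is rounded to two places before the comparison (the
function name should_train is hypothetical, not part of the harness):

```python
def should_train(score, is_spam, regime):
    """Decide whether a scored message is fed back into training.

    score is the classifier output in [0.0, 1.0]; is_spam is the
    message's true classification.
    """
    if regime == "perfect":
        return True  # train on every message
    if regime == "nonedge":
        rounded = round(score, 2)
        # Skip only messages that were both certain and correct.
        certain_and_right = rounded == (1.0 if is_spam else 0.0)
        return not certain_and_right
    raise ValueError("unknown regime: %r" % (regime,))
```

Note that under this reading a ham scoring 1.00 still gets trained,
since it was certain but wrong.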

I've plotted both the cumulative and 7-day average values for
error rates (fp, fn, and unsure) and training counts (ham and spam).
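
For readers who want to reproduce curves like these, here is one way
the two kinds of rate could be computed from per-day counts.  It is a
sketch under assumptions: the windowed rate pools the counts over the
trailing 7 days rather than averaging daily rates, which may differ
from how the plots were actually made.  Python 3 division is assumed.

```python
def cumulative_rates(errors, totals):
    """Error rate over all days up to and including each day."""
    rates, err_sum, tot_sum = [], 0, 0
    for e, t in zip(errors, totals):
        err_sum += e
        tot_sum += t
        rates.append(err_sum / tot_sum if tot_sum else 0.0)
    return rates


def windowed_rates(errors, totals, window=7):
    """Error rate over the trailing `window` days ending at each day."""
    rates = []
    for day in range(len(errors)):
        lo = max(0, day - window + 1)
        e = sum(errors[lo:day + 1])
        t = sum(totals[lo:day + 1])
        rates.append(e / t if t else 0.0)
    return rates
```

The cumulative curve smooths out late changes, which is why the
windowed curve is the better place to look for inflection points.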

Pictures (and a copy of this writeup) are on my website at:
  http://www.wolfskeep.com/~popiel/spambayes/nonedge

- Alex

PS. Sorry this took so long, but running the perfect regime
    on such a large dataset took a couple days on my machine...
    I need more memory! ;-)

From tim.one at comcast.net  Thu Dec 25 17:10:23 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 25 17:10:29 2003
Subject: [spambayes-dev] A new and altogether different bsddb breakage
In-Reply-To: <Law11-OE64uv08Be4ZE00011ce4@hotmail.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEEPIAAB.tim.one@comcast.net>

[Kenny Pitt]
> ...
> I suspect some timing issue with the Windows disk cache not
> immediately flushing stuff to disk.  That's just idle speculation, of
> course, but I have seen similar things in other development projects.

It's true that doing fileobject.flush() on Windows doesn't make any
guarantee about writing anything to disk.  Python 2.3 grew an os.fsync
implementation for Windows, and os.fsync(fileobject.fileno()) does write to
disk on Windows (and sometimes takes a veeeeery long time to do so!).  That
calls the MS C _commit() function under the covers, which in turn calls the
Win32 FlushFileBuffers().
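
A minimal sketch of the flush-then-fsync pattern described above; the
helper name write_durably is hypothetical, and only the standard
flush(), fileno() and os.fsync() calls are used:

```python
import os


def write_durably(path, data):
    """Write bytes to path and ask the OS to commit them to disk."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()              # push Python's buffer to the OS
        os.fsync(f.fileno())   # ask the OS to write its cache to disk
```

The flush() has to come first: os.fsync() only knows about data that
has already been handed to the operating system.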

> ...
> I can't think of a good reason that we should need to close and then
> immediately reopen the same database.

Me neither, but I bet we can find a way if we need to.  In particular, you
pointed to Sleepycat docs before containing cautions about how things need
to be set up under Windows, and I'm almost certain the test suite doesn't do
that.


From tim.one at comcast.net  Thu Dec 25 17:10:24 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Dec 25 17:10:36 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACAMELEGPAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEFAIAAB.tim.one@comcast.net>

[Seth Goodman]
> ...
> If I can have the Outlook binary and non-Outlook source working at
> the same time,

Probably, but I don't really know.  I run the Outlook addin directly from a
CVS checkout of spambayes, and have never used the binary installer (I don't
object to it <wink>, it's just that using it would consume a little more
non-existent "spare time").

> is there a way to convert my saved Outlook mail folders to mbox
> format

export.py in the spambayes Outlook2000 directory works fine, and I just
checked in a pile of changes so it works even finer.

> so that I _can_ see how the changes I make work on my own mail
> stream as well?

That's potentially more difficult than what you've (or I've!) been doing:
to run "what if I changed this or that?" experiments, you need to save every
email you ever get, and ensure that each one is correctly classified.  Else
you're not reproducing your original email stream, so it's anyone's guess
then what you'd really be testing.

Two days ago I created a new .pst file, with two folders "All ham" and "All
spam".  Since then I've been copying each message I get into one of them.
When it comes time to use export.py, I'll have to temporarily fiddle my
spambayes config to say that "All ham" is my (only) ham folder and "All
spam" my (only) spam folder (export.py gets its idea of where your ham and
spam training data are from your Outlook spambayes config file).

Copying all incoming msgs is a bit of a PITA for me, and if you use Outlook
rules too (I don't) to sort ham into different folders, may be a royal PITA.
So it goes -- Outlook wasn't designed for running spam-filter experiments
(then again, no email client was, and that's why we have a "standard"
test-data directory structure of our own).

Ah, I've noted before that I throw away half my Unsures unclassified,
because I can't tell whether they're ham or spam (these are usually barely
intelligible msgs addressed to public "admin" or "help" kinds of addresses).
I'm making an arbitrary guess about each of those too, and saving a copy in
"All ham" or "All spam".  I *expect* a relatively high Unsure rate because
of this aspect of my email mix.  No part of the testing framework can be
talked into believing that Unsure is the *desired* outcome for a msg,
though, so I either have to make a guess about each, or damage the
experimental setup in unknown ways by not saving *all* my email.


From skip at pobox.com  Thu Dec 25 18:08:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Dec 25 18:08:47 2003
Subject: [spambayes-dev] test_storage.py failing
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCECEIAAB.tim.one@comcast.net>
References: <16361.43929.945893.137105@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCCECEIAAB.tim.one@comcast.net>
Message-ID: <16363.28152.314476.785433@montanaro.dyndns.org>


    Tim> [Skip]
    >> I ran the spambayes/test/test_storage.py this morning for the first
    >> time (on a fresh CVS checkout) and got several instances of the same
    >> error.

    Tim> How many is several?  Note that there are only 5 tests here, so if
    Tim> several means more than 2, it's possible that *all* the tests died
    Tim> this way for you.  That would be a clue.

"several" was 3 in this case - all the DBDictClassifier tests.

    Tim> I expect you need to run python with -v to see how the imports are
    Tim> getting satisfied -- the only guess I have is that you're not
    Tim> getting the classes the test expects to get.

They were coming from the right place.  I eventually figured out that
distutils didn't overwrite my installed copy when I tried installing from a
new CVS version.

Sorry for the false alarm.  I wonder if I should file a bug report against
distutils...

Skip

From popiel at wolfskeep.com  Thu Dec 25 18:59:14 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Thu Dec 25 18:59:18 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Thu,
	25 Dec 2003 17:10:24 EST."
	<LNBBLJKPBEHFEDALKOLCAEFAIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEFAIAAB.tim.one@comcast.net> 
Message-ID: <20031225235914.8CE0E2DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCAEFAIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>Ah, I've noted before that I throw away half my Unsures unclassified,
>because I can't tell whether they're ham or spam

>No part of the testing framework can be talked into believing that
>Unsure is the *desired* outcome for a msg,

Hrm.  Good point.  Perhaps we should fix this, adding a third branch
to the testing framework's data directory tree, and then convincing
the test code to use messages in that third branch in the classify
phase, but not in the train phase.  And then we'd have the six
error states of ham->spam, ham->unsure, unsure->ham, unsure->spam,
spam->ham, and spam->unsure.
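
The six states are just the ordered pairs of distinct categories.  A
small sketch of how a tally might look, with hypothetical names
(nothing here exists in the test framework today):

```python
from itertools import permutations

CATEGORIES = ("ham", "unsure", "spam")

# Every (expected, actual) pair where the two differ is an error state.
ERROR_STATES = ["%s->%s" % pair for pair in permutations(CATEGORIES, 2)]


def tally(results):
    """Count error states from (expected, actual) pairs.

    Pairs where expected == actual are correct and are not counted.
    """
    counts = dict.fromkeys(ERROR_STATES, 0)
    for expected, actual in results:
        if expected != actual:
            counts["%s->%s" % (expected, actual)] += 1
    return counts
```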

Hrm.  Not for me to do today, though... I'm still running more
variations of the stuff I posted about earlier.  Redoing the
fpfnunsure test that I did last March (with my new dataset so it's
comparable), and then adding in 200 day message expiry to my
nonedge regime.

- Alex

From skip at pobox.com  Fri Dec 26 09:35:40 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 26 09:35:52 2003
Subject: [spambayes-dev] Reduced training test results
In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
Message-ID: <16364.18236.225460.401395@montanaro.dyndns.org>


    Alex> Also of significant interest is that the classifier doesn't seem
    Alex> to decay as badly over time.  With training on everything, the
    Alex> unsure rate in particular (and fn to a much lesser extent) goes up
    Alex> significantly after about 200 days worth of traffic, though the fp
    Alex> rate stays low.  With just training on those things that aren't
    Alex> already certain, the unsure rate climbs much more slowly after 200
    Alex> days (with the cumulative rate staying relatively flat), while the
    Alex> fp and fn rates stay at very low values.

    Alex> Details of my experiment parameters:

    Alex> I've got about 77000 messages in my dataset, covering a span of
    Alex> 418 days.  Of these, about 21500 are ham, and nearly 56000 are spam.
    Alex> I include virus/worm messages in my spam, and the "latest windows
    Alex> update" worm makes its presence felt around day 360.

Is it possible that the ham/spam ratio isn't as bad when you don't train on
everything? 

Skip

From nobody at spamcop.net  Fri Dec 26 12:07:52 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Fri Dec 26 12:07:57 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEFAIAAB.tim.one@comcast.net>
Message-ID: <MHEGIFHMACFNNIMMBACACEAPHAAA.nobody@spamcop.net>

> [Tim Peters]
> Two days ago I created a new .pst file, with two folders "All
> ham" and "All
> spam".  Since then I've been copying each message I get into one of them.

That's exactly what I've been doing for a while, so that's encouraging.  I
have local ham and spam corpus folders in outlook.pst that I move (or copy)
_all_ messages into when I am finished with them.  I toss a few unsures, but
most go into one bucket or the other.  Those two folders in Outlook.pst get
autoarchived into SpamCorpus1.pst when messages in them are more than three
days old.  That gives me time to manually track statistics (another PITA).


> [Tim Peters]
> When it comes time to use export.py, I'll have to temporarily fiddle my
> spambayes config to say that "All ham" is my (only) ham folder and "All
> spam" my (only) spam folder (export.py gets its idea of where your ham and
> spam training data are from your Outlook spambayes config file).

Which place in the SpamBayes manager is the one that changes the config that
export.py uses?  There are ham and spam folder specifications in more than
one place:  filtering, training and watched folders at least, there may be
more.


> [Tim Peters]
> Copying all incoming msgs is a bit of a PITA for me, and if you
> use Outlook
> rules too (I don't) to sort ham into different folders, may be a
> royal PITA.
> So it goes -- Outlook wasn't designed for running spam-filter experiments
> (then again, no email client was, and that's why we have a "standard"
> test-data directory structure of our own).

Yeah, I use a lot of rules and sub-folders, so I have developed a "recipe"
to make sure I don't screw up the semi-manual sorting (the thought of
learning VB and the insides of Outlook is painful; my hat's off to Mark).
One thing I do that may or may not be typical is that I let Outlook rules
take care of all the mailing list traffic.  That includes almost no spam and
so I don't train or classify it (the list admins do a good job).  Therefore,
I _don't_ include it in my ham corpus.  This gives me a roughly 1:5 ham/spam
corpus, instead of roughly even, but that's the mail stream that SpamBayes
sees.  I _do_ make sure the training sets have equal numbers of messages.
At present, my corpus is about 7,500 messages total.  This may not be enough
to "divide into ten sets", etc.  Or is it?


--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From dave at boost-consulting.com  Fri Dec 26 13:16:55 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Fri Dec 26 13:17:06 2003
Subject: [spambayes-dev] NEWTRICKS
Message-ID: <uekur8mw8.fsf@boost-consulting.com>


I keep getting quite a few spams which fit the descriptions below
(from NEWTRICKS.txt):

  - Punctuation sometimes gets inserted in otherwise spammy words or phrases,
    e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X -emgffj".  It might be
    helpful to try stripping punctuation.  (Idea from Paul Sorenson)

  - Similarly, some letters get replaced by numbers, e.g.: "V1agra" instead of
    "Viagra".  Mapping numbers to suitable letters might help in some
    situations.

Since "this file is for ideas that have or have not yet been tried",
I'd love to know what constitutes "trying".  Is there some official
testing procedure or corpus we can test against?  I'd like to know
whether any change I make is worth proposing.  Of course I can try it
on my own databases of Ham and Spam first...
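
To make the two ideas concrete, here is a rough sketch of what such a
normalization pass might look like.  It is not the spambayes
tokenizer, and the digit-to-letter map is a hypothetical guess at
"suitable letters"; Python 3 is assumed.

```python
import string

# Hypothetical digit-to-letter map; which substitutions are worth undoing
# is exactly the kind of thing that would need testing on real corpora.
DIGIT_TO_LETTER = {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "7": "t"}

_STRIP_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punctuation(word):
    """Remove ASCII punctuation: 'sel)ection' -> 'selection'."""
    return word.translate(_STRIP_PUNCT)


def undo_digits(word):
    """Map look-alike digits to letters, but leave pure numbers alone."""
    if not any(c.isalpha() for c in word):
        return word
    return "".join(DIGIT_TO_LETTER.get(c, c) for c in word)


def normalize(text):
    """Normalize each whitespace-separated word and drop empty results."""
    words = (undo_digits(strip_punctuation(w)) for w in text.split())
    return " ".join(w for w in words if w)
```

Stripping every punctuation character also mangles legitimate text
(URLs, prices, hyphenated words), so a change like this would need to
be measured rather than assumed to help.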

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Fri Dec 26 13:36:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Dec 26 13:36:33 2003
Subject: [spambayes-dev] NEWTRICKS
In-Reply-To: <uekur8mw8.fsf@boost-consulting.com>
References: <uekur8mw8.fsf@boost-consulting.com>
Message-ID: <16364.32665.369857.975422@montanaro.dyndns.org>


    Dave> I keep getting quite a few spams which fit the descriptions below
    Dave> (from NEWTRICKS.txt):

    Dave>   - Punctuation sometimes gets inserted in otherwise spammy words
    Dave>     or phrases, e.g.: "Ch-eck ou=t ou-r sel)ection _of grea)t R_X
    Dave>     -emgffj".  It might be helpful to try stripping punctuation.
    Dave>     (Idea from Paul Sorenson)

    Dave>   - Similarly, some letters get replaced by numbers, e.g.:
    Dave>     "V1agra" instead of "Viagra".  Mapping numbers to suitable
    Dave>     letters might help in some situations.

    Dave> Since "this file is for ideas that have or have not yet been
    Dave> tried", I'd love to know what constitutes "trying".  Is there some
    Dave> official testing procedure or corpus we can test against?  I'd
    Dave> like to know whether any change I make is worth proposing.  Of
    Dave> course I can try it on my own databases of Ham and Spam first...

I tried the first (eliding punctuation from words).  From a testing
standpoint it turns out to not be all that useful, I think for a couple
reasons:

    * There are plenty of other spammy clues in such messages which are
      sufficient to kick these messages into spam range.  Most of this stuff
      winds up scoring at 0.95 or above for me.  If they don't score as spam
      for you, train on a few and see how it does then.

    * Training databases full of old-ish mail won't contain many of these
      sorts of messages, so enabling punctuation removal won't change things
      very much.

Skip

From popiel at wolfskeep.com  Fri Dec 26 13:44:21 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Dec 26 13:44:25 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Fri,
	26 Dec 2003 08:35:40 CST."
	<16364.18236.225460.401395@montanaro.dyndns.org> 
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<16364.18236.225460.401395@montanaro.dyndns.org> 
Message-ID: <20031226184421.D3E342DF61@cashew.wolfskeep.com>

In message:  <16364.18236.225460.401395@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Alex> Also of significant interest is that the classifier doesn't seem
>    Alex> to decay as badly over time.  With training on everything, the
>    Alex> unsure rate in particular (and fn to a much lesser extent) goes up
>    Alex> significantly after about 200 days worth of traffic, though the fp
>    Alex> rate stays low.  With just training on those things that aren't
>    Alex> already certain, the unsure rate climbs much more slowly after 200
>    Alex> days (with the cumulative rate staying relatively flat), while the
>    Alex> fp and fn rates stay at very low values.
>
>    Alex> Details of my experiment parameters:
>
>    Alex> I've got about 77000 messages in my dataset, covering a span of
>    Alex> 418 days.  Of these, about 21500 are ham, and nearly 56000 are spam.
>    Alex> I include virus/worm messages in my spam, and the "latest windows
>    Alex> update" worm makes its presence felt around day 360.
>
>Is it possible that the ham/spam ratio isn't as bad when you don't train on
>everything? 

Eyeballing the graphs, it seems that the ratio is slightly _more_
unbalanced for the nonedge regime, rather than less.

Also, from looking closer at the 7-day span graphs, I see that the
inflection point is at about 120 days, not 200.

- Alex

From popiel at wolfskeep.com  Fri Dec 26 14:09:23 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Dec 26 14:09:28 2003
Subject: [spambayes-dev] NEWTRICKS 
In-Reply-To: Message from David Abrahams <dave@boost-consulting.com> of "Fri,
	26 Dec 2003 13:16:55 EST." <uekur8mw8.fsf@boost-consulting.com> 
References: <uekur8mw8.fsf@boost-consulting.com> 
Message-ID: <20031226190923.5708E2DF61@cashew.wolfskeep.com>

In message:  <uekur8mw8.fsf@boost-consulting.com>
             David Abrahams <dave@boost-consulting.com> writes:
>
>Since "this file is for ideas that have or have not yet been tried",
>I'd love to know what constitutes "trying".  Is there some official
>testing procedure or corpus we can test against?  I'd like to know
>whether any change I make is worth proposing.  Of course I can try it
>on my own databases of Ham and Spam first...

Heh.  We just went through this question with Seth Goodman.  Basic
summation of the last week or so of advice is: Grab the latest CVS
image, then read README-DEVEL.txt and incremental.HOWTO.txt.  Lots
of good info in there.  Collect your own ham & spam corpora, put
them into the appropriate directory structure, then run the testing
tools over them with different options/classifiers/tokenizers/whatnot.
Post results and enough explanation so that people can try to
replicate your results using their own corpora.

- Alex

From popiel at wolfskeep.com  Fri Dec 26 14:21:08 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Dec 26 14:21:12 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Seth Goodman" <nobody@spamcop.net> of "Fri,
	26 Dec 2003 11:07:52 CST."
	<MHEGIFHMACFNNIMMBACACEAPHAAA.nobody@spamcop.net> 
References: <MHEGIFHMACFNNIMMBACACEAPHAAA.nobody@spamcop.net> 
Message-ID: <20031226192108.B38002DF61@cashew.wolfskeep.com>

In message:  <MHEGIFHMACFNNIMMBACACEAPHAAA.nobody@spamcop.net>
             "Seth Goodman" <nobody@spamcop.net> writes:
>
>One thing I do that may or may not be typical is that I let Outlook rules
>take care of all the mailing list traffic.  That includes almost no spam and
>so I don't train or classify it (the list admins do a good job).  Therefore,
>I _don't_ include it in my ham corpus.

Reasonable.

>This gives me a roughly 1:5 ham/spam corpus, instead of roughly even, but
>that's the mail stream that SpamBayes sees.

This is the stuff I'd tend to use for the testing, as opposed to your
equal-sized training sets.

>At present, my corpus is about 7,500 messages total.  This may not be enough
>to "divide into ten sets", etc.  Or is it?

I think we did our first classifier shootouts with a minimum of 2,000
messages, so you should be fine.  You may not have enough to see some
of the longer-term effects I'm now witnessing (with inflection points
at 120 and 200+ days), but you should be able to get started, at least.
And heck, those inflection points (or the timing thereof) may be
peculiarities of my own data.  It'd be good to see.

- Alex

From stephena at hiwaay.net  Fri Dec 26 16:32:45 2003
From: stephena at hiwaay.net (Stephen Anderson)
Date: Fri Dec 26 16:32:49 2003
Subject: [spambayes-dev] Two SB on One Computer
Message-ID: <Pine.OSF.4.44.0312261528370.1058453-100000@bee.hiwaay.net>

Hi,

I searched through the archives and I couldn't find anything conclusive on 
this.  Please don't hesitate to point me back to the archives if you know 
I've missed something.

I'm using the sb_server (pop3proxy) on an XP computer as a service.  I'd 
like to install it as two separate services and use two separate databases 
and separate web management ports so two different users can each have 
their own customized spam filter.

I tickled through the service script and eye-balled the sb_server but I'm 
not sure what all assumptions are made that would make two instances of SB 
overlap.   Can anybody give me some insight on what things they think I 
will have to watch out for?  Thank you!


Stephen Anderson
<stephena@HiWAAY.net>

===========================================================================

http://wecanstopspam.org


From tim.one at comcast.net  Fri Dec 26 21:13:04 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 26 21:13:15 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031225235914.8CE0E2DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEIJIAAB.tim.one@comcast.net>

[Tim]
>> ...
>> Ah, I've noted before that I throw away half my Unsures unclassified,
>> because I can't tell whether they're ham or spam
>> ...
>> No part of the testing framework can be talked into believing that
>> Unsure is the *desired* outcome for a msg, though ...

[T. Alexander Popiel]
> Hrm.  Good point.  Perhaps we should fix this, adding a third branch
> to the testing framework's data directory tree, and then convincing
> the test code to use messages in that third branch in the classify
> phase, but not in the train phase.  And then we'd have the six
> error states of ham->spam, ham->unsure, unsure->ham, unsure->spam,
> spam->ham, and spam->unsure.

The added complication is unattractive -- I'm OK with guessing "the right"
category, even while believing that doesn't make sense <wink>.


From tim.one at comcast.net  Fri Dec 26 21:13:05 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 26 21:13:19 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <MHEGIFHMACFNNIMMBACACEAPHAAA.nobody@spamcop.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEIJIAAB.tim.one@comcast.net>

[Seth Goodman]
> ...
> Which place in the SpamBayes manager is the one that changes the
> config that export.py uses?  There are ham and spam folder
> specifications in more than one place:  filtering, training and
> watched folders at least, there may be more.

Training.  This will become clear when you run export.py, since it displays
the names of the folders it's exporting.  Don't hesitate to run export.py.
It doesn't change your .pst files in any way -- it's harmless, and the files
it creates can be thrown away at will.

> ...
> One thing I do that may or may not be typical is that I let Outlook
> rules take care of all the mailing list traffic.  That includes
> almost no spam and so I don't train or classify it (the list admins
> do a good job).  Therefore, I _don't_ include it in my ham corpus.
> This gives me a roughly 1:5 ham/spam corpus, instead of roughly even,
> but that's the mail stream that SpamBayes sees.

Yet it remains possible that the best training strategy for your mix
requires artificially forcing a particular ratio.  Picture an extreme:  if
your actual incoming ratio is a million to one ...

> I _do_ make sure the training sets have equal numbers of messages.  At
> present, my corpus is about 7,500 messages total.  This may not be
> enough to "divide into ten sets", etc.  Or is it?

It's plenty.  The last multi-corpus "death match" experiments here required
that participants use exactly 10 sets of ham and 10 sets of spam, each set
having exactly 200 messages.  That's a grand total of 4,000 msgs.

However, it's not clear *what* to test anymore.  At the start, this project
was aimed at high-volume mailing lists, where the admins were thought most
likely to train on giant sets of ham and spam a few times per year.
Randomized cross-validation testing is a fine approach for that use.

There are apparently only a few people who use spambayes that way, though,
and among the rest of us no two seem to train in the same way.  Incremental
training, and preserving the order in which messages arrive, seem
overwhelmingly more interesting to most real users.

So what may be more important now, building on Alex's incremental testers,
isn't the sheer number of messages so much as the span of time they cover.
Indeed, for new users, it's important to know how this filter behaves after
training on just a few messages.  That's my particular interest with the
experimental mixed unigram/bigram scheme:  the hope is that it "learns
faster".  In earlier tests, I never found anything that beat the pure
unigram scheme *given enough training data*, but few users have 20,000
recent ham and spam to start off with.

OTOH, I don't have enough exhaustive personal email saved away to measure
anything other than how the system performs across a few days, and a scheme
that "learns fast" starting from nothing *may* also be slow to adapt to
changes over time (we all know a bright kid who never outgrew their
6th-grade worldview, right <wink>?).

Oh well.  There have always been more ideas to test than were possible to
cover.


From tim.one at comcast.net  Fri Dec 26 21:13:06 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Dec 26 21:13:25 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031226192108.B38002DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEIJIAAB.tim.one@comcast.net>

[Seth Goodman]
>> This gives me a roughly 1:5 ham/spam corpus, instead of roughly
>> even, but that's the mail stream that SpamBayes sees.

[T. Alexander Popiel]
> This is the stuff I'd tend to use for the testing, as opposed to your
> equal-sized training sets.

If Seth is going to test incremental training regimes, then yes, his entire
email stream (well, the parts of it scored by spambayes -- he said he uses
Outlook rules to exempt a large part of it from getting scored at all)
should be included.

If he wants to do cross-validation testing, he should still bust it all up
into the same number of sets.  timcv's "ham-keep" and "spam-keep" options
can be used then to select random equal-sized (or non-equal-sized) subsets
dynamically.
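
The idea of picking a random subset of a fixed size from each set can
be sketched as below.  This illustrates the concept only; it is not
how timcv implements those options, and the names keep_random and
select_per_set are hypothetical.

```python
import random


def keep_random(messages, keep, seed=0):
    """Return `keep` messages chosen at random; all of them if fewer exist."""
    messages = list(messages)
    if keep >= len(messages):
        return messages
    return random.Random(seed).sample(messages, keep)


def select_per_set(sets, keep, seed=0):
    """Apply keep_random to each set, seeding per set so runs are repeatable."""
    return [keep_random(s, keep, seed + i) for i, s in enumerate(sets)]
```

Using different keep values for ham and spam gives the non-equal-sized
case, for example to test a deliberately skewed ratio.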

In your (Alex's) recent "nonedge" incremental training experiment, it looks
like your training data grew to about a 5.5::1 spam::ham ratio after 400
days.  I know my personal classifiers start acting flaky whenever I've let
them get imbalanced by more than 2::1 in either direction.  So if I had your
data, I'd be curious to try variations that force better balance.  I have my
data, but it's less than a week old <wink>.  You have enough data that it
may well be more interesting to you to try variations including expiration
(the second derivative of your "Cumulative Trained Counts" ham training
curve appears slightly negative, but your spam training curve appears mostly
straight except for two points where it clearly gets steeper -- a hypothesis
is that your ham isn't changing much over time, but that your spam is, the
weight of the old spam training data is making it harder to adjust to the
spam changes, and that this gets worse over time; OTOH, with the spam::ham
training imbalance getting worse over time too, it may just be that the
classifier is getting flakier over time too for that reason alone).
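
A tiny helper for watching the training balance might look like this.
The 2-to-1 limit is only the personal rule of thumb mentioned above,
not a project-wide threshold, and the function names are hypothetical.
Python 3 division is assumed.

```python
def imbalance(nham, nspam):
    """Return (ratio, heavier_class) for trained message counts."""
    if nham <= 0 or nspam <= 0:
        raise ValueError("need at least one trained ham and one trained spam")
    if nspam >= nham:
        return nspam / nham, "spam"
    return nham / nspam, "ham"


def looks_imbalanced(nham, nspam, limit=2.0):
    """True when either class outnumbers the other by more than limit to 1."""
    ratio, _ = imbalance(nham, nspam)
    return ratio > limit
```

A variation that forces better balance could call looks_imbalanced
before each training step and skip messages of the heavier class.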


From tim.one at comcast.net  Sat Dec 27 00:52:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 00:52:41 2003
Subject: [spambayes-dev] NEWTRICKS
In-Reply-To: <uekur8mw8.fsf@boost-consulting.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIPIAAB.tim.one@comcast.net>

[David Abrahams]
> I keep getting quite a few spams which fit the descriptions below
> (from NEWTRICKS.txt):

I'm sure everyone gets them; the interesting question is whether they're
evading your spambayes filter.  They don't seem to give mine particular
trouble (of course I train on those that score Unsure; I'm not sure I've
ever seen one score as Ham).

> ... [descriptions of attempted obfuscation via insertion of
>      punctuation, and replacing letters by digits] ...

> Since "this file is for ideas that have or have not yet been tried",
> I'd love to know what constitutes "trying".  Is there some official
> testing procedure or corpus we can test against?  I'd like to know
> whether any change I make is worth proposing.  Of course I can try it
> on my own databases of Ham and Spam first...

There's no official corpus, else we'd be teaching the system to recognize
that corpus.  Alex gave the right pointers to docs for the testing
framework.


From tim.one at comcast.net  Sat Dec 27 00:52:37 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 00:52:45 2003
Subject: [spambayes-dev] Reduced training test results
In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEIPIAAB.tim.one@comcast.net>

[T. Alexander Popiel]
> Training on just those messages whose score isn't 0.00 or 1.00
> (rounded) seems to be a huge win over training on everything.
> Not so much because the accuracy is better (though accuracy
> does seem to be improved by neglecting those messages that it's
> already certain about),

I'm afraid TOE gives too much weight to systematically correlated tokens.
My experience with python.org mailing lists has pointed in that direction
since the start, but it's probably more general than that.  In a recap
nutshell, every piece of email coming from python.org has (with
mine_received_headers enabled) about a dozen tokens effectively saying "I
came from python.org".  I get several hundred ham like that every day, but
also a few spam per week.  Under TOE, the "python.org clues" get spamprobs
approaching 0, and a dozen very strong ham tokens is hard to overcome.  As a
result, it's *hard* for a spam leaking thru python.org to score as spam on
my end -- even under mistake-based training, where the spamprobs on
python.org-tokens are much higher than they'd be under TOE.

I expect most (maybe all) of the developers here have similar long-term
sources of ham, feeding you daily with correlated tokens effectively
identifying the source.

An irony is that I don't need those python.org tokens:  the *content* of
those msgs is solidly hammy even without them.  Maybe we should ignore our
strongest clues <0.5 wink>.

> but because of a hugely reduced training set (and thus database).
> Specifically, training on everything yielded a database with 70,000
> messages, while training only on the non-extreme put only about
> 3,500 messages into the database.  Unfortunately, I don't have firm
> numbers on token counts.

That's OK.  It was rigorously established before that the # of tokens either
does or doesn't go up with the square root, or some other function, of the
message count <wink>.

> Also of significant interest is that the classifier doesn't seem
> to decay as badly over time.  With training on everything, the
> unsure rate in particular (and fn to a much lesser extent) goes
> up significantly after about 200 days worth of traffic,

That's peculiar.  Did you try this with different starting dates, and find
that "about 200 days" was invariant across starting dates -- or did you try
a single starting date, and note that something funny happened about 200
days after that single starting date?  I think the latter, in which case
it's natural to speculate that something significant changed around then in
your ham and/or spam mix.

Thanks for the report, Alex!  Good work.


From popiel at wolfskeep.com  Sat Dec 27 01:01:24 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 01:01:30 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Fri,
	26 Dec 2003 21:13:04 EST."
	<LNBBLJKPBEHFEDALKOLCCEIJIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCCEIJIAAB.tim.one@comcast.net> 
Message-ID: <20031227060125.07CC32DF80@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCCEIJIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[Tim]
>>> ...
>>> Ah, I've noted before that I throw away half my Unsures unclassified,
>>> because I can't tell whether they're ham or spam
>>> ...
>>> No part of the testing framework can be talked into believing that
>>> Unsure is the *desired* outcome for a msg, though ...
>
>[T. Alexander Popiel]
>> Hrm.  Good point.  Perhaps we should fix this, adding a third branch
>> to the testing framework's data directory tree, ...
>> And then we'd have the six error states ...
>
>The added complication is unattractive -- I'm OK with guessing "the right"
>category, even while believing that doesn't make sense <wink>.

Oh, good.  That wasn't a bit of hackery I was looking forward to.
If you're OK with ignoring it, then I certainly am.

- Alex

From popiel at wolfskeep.com  Sat Dec 27 01:19:35 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 01:20:37 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Fri,
	26 Dec 2003 21:13:06 EST."
	<LNBBLJKPBEHFEDALKOLCGEIJIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCGEIJIAAB.tim.one@comcast.net> 
Message-ID: <20031227061940.7C5492DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCGEIJIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>In your (Alex's) recent "nonedge" incremental training experiment, it looks
>like your training data grew to about a 5.5::1 spam::ham ratio after 400
>days.

Yup.  I have a nice picture now of the ratio over time at the bottom
of the report at:
http://www.wolfskeep.com/~popiel/spambayes/nonedge

>I know my personal classifiers start acting flaky whenever I've let
>them get imbalanced by more than 2::1 in either direction.

Interestingly enough, though, the nonedge did better than TOE, despite
a worse imbalance.

>So if I had your data, I'd be curious to try variations that force better
>balance.

I'd love to... but I haven't been able to come up with anything which
maintains the balance better without extreme artificiality.  If you
think of any regimes that make sense, I'd be more than happy to run them.
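
For reference, the 2::1 rule of thumb quoted above boils down to a check like the following.  This is only a sketch: `imbalance` and `is_flaky_territory` are hypothetical helper names, and the 2.0 limit is just the figure from the quoted remark, not a tested threshold.

```python
def imbalance(nspam, nham):
    """Ratio of the larger training count to the smaller one."""
    if nspam == 0 or nham == 0:
        return float("inf")
    return max(nspam, nham) / min(nspam, nham)

def is_flaky_territory(nspam, nham, limit=2.0):
    """True when the training data is lopsided by more than `limit` to 1
    in either direction."""
    return imbalance(nspam, nham) > limit
```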

>You have enough data that it
>may well be more interesting to you to try variations including expiration

*grin* That's part of what's been burning my CPU ever since I posted
the last report.  I'll have another report, including that, probably
within 3 days.  Still have more to test... and my runs are taking
between 6 and 20 hours each, depending on the memory used by the
classifiers.

- Alex

From popiel at wolfskeep.com  Sat Dec 27 01:39:02 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 01:39:06 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Fri,
	26 Dec 2003 21:13:05 EST."
	<LNBBLJKPBEHFEDALKOLCEEIJIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCEEIJIAAB.tim.one@comcast.net> 
Message-ID: <20031227063902.9D9372DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCEEIJIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>So what may be more important now, building on Alex's incremental testers,
>isn't the sheer number of messages so much as the span of time they cover.

I've been having this supposition, too, but was afraid of scaring
people off by voicing it.  After all, I don't know if anyone else
has been anal enough to have been maintaining growing corpora for
over a year...

>OTOH, I don't have enough exhaustive personal email saved away to measure
>anything other than how the system performs across a few days, and a scheme
>that "learns fast" starting from nothing *may* also be slow to adapt to
>changes over time (we all know a bright kid who never outgrew their
>6th-grade worldview, right <wink>?).

Heh.  I could be convinced to run the bigram scheme over my dataset
after I'm done with my current set of tests... though I may need
a gig of memory to do it. ;-)  My current 256 meg is dying under
the load of TOE with expiry.

- Alex

From popiel at wolfskeep.com  Sat Dec 27 01:43:13 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 01:43:17 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	27 Dec 2003 00:52:37 EST."
	<LNBBLJKPBEHFEDALKOLCCEIPIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCCEIPIAAB.tim.one@comcast.net> 
Message-ID: <20031227064313.AB1702DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCCEIPIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>
>> Also of significant interest is that the classifier doesn't seem
>> to decay as badly over time.  With training on everything, the
>> unsure rate in particular (and fn to a much lesser extent) goes
>> up significantly after about 200 days worth of traffic,
>
>That's peculiar.  Did you try this with different starting dates, and find
>that "about 200 days" was invariant across starting dates -- or did you try
>a single starting date, and note that something funny happened about 200
>days after that single starting date?  I think the latter, in which case
>it's natural to speculate that something significant changed around then in
>your ham and/or spam mix.

It was in fact the latter, and I'm just now prepping for spinning my
dataset by 80 and 160 days to revalidate.  Even odder things are
happening at specific times in the expiry stuff, and I want to see if
it's specific real times, or time after training commences...

- Alex

From tim.one at comcast.net  Sat Dec 27 04:58:31 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 04:58:43 2003
Subject: [spambayes-dev] New sort+group.py
Message-ID: <LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net>

Attached is a major rewrite of testtools/sort+group.py.  Anyone who uses
that, please give it a try.  If nobody gripes, I'll check it in.  (If you're
on Linux, the attached probably has Windows line ends, and you may need to
change that.)

It's used exactly the same way as before, and creates filenames with the
same pattern as before, *except* that any pre-existing extension (like
".txt" on Windows) is preserved.  Extensions are necessary for sane life on
Windows, but the code currently checked in strips extensions as part of
renaming.
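
A small sketch of the naming scheme just described, mirroring the format string in the attached script.  The helper name `new_basename` is hypothetical, and it is written for current Python rather than 2.3.

```python
import os

SECONDS_PER_DAY = 24 * 60 * 60

def new_basename(when_received, earliest, index, old_basename):
    """Build '<day group>-<sequence><ext>', keeping any existing extension."""
    extension = os.path.splitext(old_basename)[-1]
    # Whole days elapsed since the earliest message in the corpus.
    group = int((when_received - earliest) // SECONDS_PER_DAY)
    return "%04d-%06d%s" % (group, index, extension)
```

For example, a message received a little over three days after the earliest one, at position 42 in the sorted list and originally called "msg17.txt", comes out as "0003-000042.txt".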

The major thrust of the changes is to order msgs by full-precision UTC
timestamp.  It was sorting just by date (not time), and didn't account for
different ISPs being in different time zones.  It also failed to
parse many of the Received headers in my email, partly because Comcast's
Received headers don't make any attempt to keep the date-time part on a
single line.  Other failures were due to "unusual" spellings in the
date-time part.  Instead, email.Utils.parsedate_tz() is used to parse this
stuff, and it hasn't failed on any of the email I've tried so far.

Almost all Received headers I see have hour:minute:second info, and since I
do incremental training during the day, as email comes in, it's important to
me that the email be ordered at finer granularity than "a day".  A second
should be good enough <wink>.  My various ISPs are in different time zones
too, and normalizing to UTC should help model that, e.g., the first time I
see a new spam campaign it's much more likely to arrive from my MSN account
than from my Comcast account.
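
To illustrate the UTC normalization, here is a minimal sketch in current Python (where the module is spelled email.utils rather than email.Utils).  The two Received values and their host names are invented; `received_timestamp` is a hypothetical helper that follows the same steps as get_time() in the attachment.

```python
from email.utils import parsedate_tz, mktime_tz

def received_timestamp(header_value):
    """Return a UTC timestamp for the date-time after the last ';',
    or None if it can't be found or parsed."""
    i = header_value.rfind(';')
    if i < 0:
        return None
    # The date may be folded across lines; collapse all whitespace runs.
    datestring = ' '.join(header_value[i + 1:].split())
    as_tuple = parsedate_tz(datestring)
    if as_tuple is None:
        return None
    return mktime_tz(as_tuple)

# Two made-up headers stamped by servers in different time zones.
east = "from mx.example.net by host.example.com; Sat, 27 Dec 2003 04:58:31 -0500"
west = ("from mx.example.org by host.example.com;\r\n"
        "\tSat, 27 Dec 2003\r\n\t01:58:40 -0800")

# Sorted by local clock time, `west` would come first; in UTC it is
# actually the later of the two.
in_order = sorted([west, east], key=received_timestamp)
```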
-------------- next part --------------
#! /usr/bin/env python

### Sort and group the messages in the Data hierarchy.
### Run this prior to mksets.py for setting stuff up for
### testing of chronological incremental training.

"""Usage: sort+group.py

This program has no options!  Muahahahaha!
"""

import sys
import os
import glob
import time

from email.Utils import parsedate_tz, mktime_tz

loud = True
SECONDS_PER_DAY = 24 * 60 * 60

# Scan the file with path fpath for its first Received header, and return
# a UTC timestamp for the date-time it specifies.  If anything goes wrong
# (can't find a Received header; can't parse the date), return None.
# This is the best guess about when we received the msg.
def get_time(fpath):
    fh = file(fpath, "rb")
    # Find first Received header.
    for line in fh:
        if line.lower().startswith("received:"):
            break
    else:
        print "\nNo Received header found."
        fh.close()
        return None
    # Paste on the continuation lines.
    received = line
    for line in fh:
        if line[0] in ' \t':
            received += line
        else:
            break
    fh.close()
    # RFC 2822 says the date-time field must follow a semicolon at the end.
    i = received.rfind(';')
    if i < 0:
        print "\n" + received
        print "No semicolon found in Received header."
        return None
    # We only want the part after the semicolon.
    datestring = received[i+1:]
    # It may still be split across lines (like "Wed, \r\n\t22 Oct ...").
    datestring = ' '.join(datestring.split())
    as_tuple = parsedate_tz(datestring)
    if as_tuple is None:
        print "\n" + received
        print "Couldn't parse the date: %r" % datestring
        return None
    return mktime_tz(as_tuple)

def main():
    """Main program; parse options and go."""

    data = []   # list of (time_received, path) pairs
    now = time.time()
    if loud:
        print "Scanning everything"
    for name in glob.glob('Data/*/*/*'):
        if loud:
            sys.stdout.write("%-78s\r" % name)
            sys.stdout.flush()
        when_received = get_time(name)
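        data.append((when_received or now, name))

    if not data:
        # Nothing matched Data/*/*/*; stop before data[0] is indexed below.
        print "No messages found; nothing to do."
        return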

    if loud:
        print ""
        print "Sorting ..."
    data.sort()

    # First rename all the files to a form we can't produce in the end.
    # This is to protect against name clashes in case the files are
    # already named according to the scheme we use.
    if loud:
        print "Renaming first pass ..."
    for dummy, name in data:
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        os.rename(name, os.path.join(dirname, "-"+basename))

    if loud:
        print "Renaming second pass ..."
    earliest = data[0][0]  # timestamp of earliest msg received
    for i, (when_received, name) in enumerate(data):
        dirname = os.path.dirname(name)
        basename = os.path.basename(name)
        extension = os.path.splitext(basename)[-1]
        group = int((when_received - earliest) / SECONDS_PER_DAY)
        newbasename = "%04d-%06d%s" % (group, i, extension)
        os.rename(os.path.join(dirname, "-"+basename),
                  os.path.join(dirname, newbasename))

if __name__ == "__main__":
    main()
From richie at entrian.com  Sat Dec 27 05:07:33 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Dec 27 05:07:37 2003
Subject: [spambayes-dev] Two SB on One Computer
In-Reply-To: <Pine.OSF.4.44.0312261528370.1058453-100000@bee.hiwaay.net>
References: <Pine.OSF.4.44.0312261528370.1058453-100000@bee.hiwaay.net>
Message-ID: <m5mquvoo1ck2oki2d9cp0unlv2pnpa749c@4ax.com>

Hi Stephen,

> I'm using the sb_server (pop3proxy) on an XP computer as a service.  I'd 
> like to install it as two separate services and use two separate databases 
> and separate web management ports so two different users can each have 
> their own customized spam filter.

You can't (I don't think) do this with the service, but you can certainly
do it when running sb_server from the command line, or via the Startup
group.  Just run each one in its own working directory with its own
bayescustomize.ini:

 o Create a directory for each instance of sb_server.

 o In each, create a bayescustomize.ini with minimal settings.  This:

   [html_ui]
   port=1234

   is probably enough.  Set up the rest through http://localhost:1234

 o Run sb_server from the command line in each directory.
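
The steps above can be sketched as a small setup script.  This is only an illustration: the directory naming and the second port number are assumptions, and everything beyond the [html_ui] port setting is still meant to be configured through each instance's web interface.

```python
import os

def make_instance_dirs(base_dir, ports):
    """Create one working directory per sb_server instance, each with a
    minimal bayescustomize.ini that only sets the web UI port."""
    ini_paths = []
    for port in ports:
        # Hypothetical naming scheme: one directory per UI port.
        instance_dir = os.path.join(base_dir, "sb_server_%d" % port)
        os.makedirs(instance_dir, exist_ok=True)
        ini_path = os.path.join(instance_dir, "bayescustomize.ini")
        with open(ini_path, "w") as ini:
            ini.write("[html_ui]\nport=%d\n" % port)
        ini_paths.append(ini_path)
    return ini_paths
```

Calling make_instance_dirs(".", [1234, 1235]) would leave two directories; sb_server is then started from the command line inside each one.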

-- 
Richie Hindle
richie@entrian.com


From dave at boost-consulting.com  Sat Dec 27 07:08:32 2003
From: dave at boost-consulting.com (David Abrahams)
Date: Sat Dec 27 07:08:40 2003
Subject: [spambayes-dev] NEWTRICKS
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEIPIAAB.tim.one@comcast.net> (Tim
	Peters's message of "Sat, 27 Dec 2003 00:52:36 -0500")
References: <LNBBLJKPBEHFEDALKOLCAEIPIAAB.tim.one@comcast.net>
Message-ID: <u65g25upr.fsf@boost-consulting.com>

"Tim Peters" <tim.one@comcast.net> writes:

> [David Abrahams]
>> I keep getting quite a few spams which fit the descriptions below
>> (from NEWTRICKS.txt):
>
> I'm sure everyone gets them, the interesting question is whether they're
> evading your spambayes filter.  

They are showing up as Unsure; I wouldn't see them otherwise.

> They don't seem to give mine particular trouble (of course I train
> on those that score Unsure; 

Me too.

> I'm not sure I've ever seen one score as Ham).

Me neither.

>> ... [descriptions of attempted obfuscation via insertion of
>>      punctuation, and replacing letters by digits] ...
>
>> Since "this file is for ideas that have or have not yet been tried",
>> I'd love to know what constitutes "trying".  Is there some official
>> testing procedure or corpus we can test against?  I'd like to know
>> whether any change I make is worth proposing.  Of course I can try it
>> on my own databases of Ham and Spam first...
>
> There's no official corpus, else we'd be teaching the system to recognize
> that corpus.  Alex gave the right pointers to docs for the testing
> framework.

Thanks.  We'll see if my Christmas downtime lasts long enough for me
to be able to try that ;-)

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From popiel at wolfskeep.com  Sat Dec 27 14:29:33 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 14:29:38 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	27 Dec 2003 04:58:31 EST."
	<LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net> 
Message-ID: <20031227192933.AD0CC2DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>Attached is a major rewrite of testtools/sort+group.py.

Yay!  I'll be happy to admit I just sort of threw that together.

>Anyone who uses that, please give it a try.

Trying... but it seems to have major problems with python2.2.
It barfs on enumerate(), and it doesn't seem to be picking up
continuation lines, either, so I suspect the file reading style
you're using isn't grokked correctly, either.

>The major thrust of the changes is to order msgs by full-precision UTC
>timestamp.  It was sorting just by date (not time), and didn't account for
>different ISPs being in different time zones.

Oops.  Doh.  Thanks for catching that.  All my mail gets received
and timestamped by my local machine, so the timezones weren't an
issue... but ignoring time of day entirely is rather embarrassing.

- Alex, who is now trying to get it to work with python2.2...

From tim.one at comcast.net  Sat Dec 27 14:51:46 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 14:51:49 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: <20031227192933.AD0CC2DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEKHIAAB.tim.one@comcast.net>

[T. Alexander Popiel]
> Trying... but it seems to have major problems with python2.2.

Ah, fiddlesticks -- does someone here still give a rip about Python 2.2?  I
don't.  2.2 is dead -- it's no longer maintained, 2.3.3 is out and
universally regarded as stabler & faster than 2.2.3, and development has
moved on to 2.4.  I'd like to drop all our 2.2 compatibility cruft; it's a
growing mass of dead weight.

> It barfs on enumerate(), and it doesn't seem to be picking up
> continuation lines, either, so I suspect the file reading style
> you're using isn't grokked correctly, either.

Right, it wouldn't.  The easiest pithy explanation is that file objects in
2.2 *have* iterators, but file objects in 2.3 *are* iterators.  To use the
same style of code under both requires getting an explicit iterator,

    it = iter(fh)

and then doing

    for line in it:

everywhere instead of

    for line in fh:

As is, the

    for line in fh:

lines under 2.2 are really jumping across internal file buffers.  That was
crazy behavior, and that's why it got repaired for 2.3 (but the fix couldn't
be backported to 2.2 lest some crazy code relied on the broken behavior).
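
A minimal sketch of the explicit-iterator style described above, written for current Python with an in-memory file standing in for a real message.  The function name and the sample headers are made up.  Both loops share one iterator, so the second loop resumes exactly where the first one stopped.

```python
import io

def first_header_with_continuations(fh, name):
    """Return the first `name:` header plus its folded continuation
    lines, or None if the header never appears."""
    it = iter(fh)          # one explicit iterator, shared by both loops
    prefix = name.lower() + ":"
    for line in it:
        if line.lower().startswith(prefix):
            break
    else:
        return None
    header = line
    for line in it:        # resumes right after the header line
        if line[:1] in (" ", "\t"):
            header += line
        else:
            break
    return header

sample = io.StringIO(
    "From: someone@example.com\n"
    "Received: from mx.example.net\n"
    "\tby host.example.com; Sat, 27 Dec 2003 04:58:31 -0500\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
received = first_header_with_continuations(sample, "Received")
```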

>> The major thrust of the changes is to order msgs by full-precision
>> UTC timestamp.  It was sorting just by date (not time), and didn't
>> account for different ISPs being in different time zones.

> Oops.  Doh.  Thanks for catching that.  All my mail gets received
> and timestamped by my local machine, so the timezones weren't an
> issue... but ignoring time of day entirely is rather embarrassing.

Not your fault:  it's not possible to find out time of day under 2.2 either
<wink>.


From popiel at wolfskeep.com  Sat Dec 27 15:03:15 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 15:03:19 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	27 Dec 2003 04:58:31 EST."
	<LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net> 
Message-ID: <20031227200315.3E1242DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCOEJDIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>Attached is a major rewrite of testtools/sort+group.py.

Here's a patch to make it work with python2.2.  It appears that the
'for line in fh:' syntax for file reading in 2.2 buffered a bunch of
lines which were then unavailable for use to subsequent similar loops
in the case of the first loop terminating early.  Also, enumerate()
didn't seem to exist, so I just maintained the count manually.

Enjoy.

- Alex

--- sort+group.py.noworky	Sat Dec 27 11:58:04 2003
+++ sort+group.py	Sat Dec 27 11:57:37 2003
@@ -26,20 +26,21 @@
 def get_time(fpath):
     fh = file(fpath, "rb")
     # Find first Received header.
-    for line in fh:
+    line = fh.readline()
+    while line != "\r\n" and line != "\n" and line != "":
         if line.lower().startswith("received:"):
             break
+        line = fh.readline()
     else:
         print "\nNo Received header found."
         fh.close()
         return None
     # Paste on the continuation lines.
     received = line
-    for line in fh:
-        if line[0] in ' \t':
-            received += line
-        else:
-            break
+    line = fh.readline()
+    while line and line[0] in ' \t':
+        received += line
+        line = fh.readline()
     fh.close()
     # RFC 2822 says the date-time field must follow a semicolon at the end.
     i = received.rfind(';')
@@ -90,7 +91,9 @@
     if loud:
         print "Renaming second pass ..."
     earliest = data[0][0]  # timestamp of earliest msg received
-    for i, (when_received, name) in enumerate(data):
+    i = 0
+    for when_received, name in data:
+        i = i + 1
         dirname = os.path.dirname(name)
         basename = os.path.basename(name)
         extension = os.path.splitext(basename)[-1]

From popiel at wolfskeep.com  Sat Dec 27 16:45:50 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 16:45:55 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	27 Dec 2003 14:51:46 EST."
	<LNBBLJKPBEHFEDALKOLCAEKHIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEKHIAAB.tim.one@comcast.net> 
Message-ID: <20031227214550.D698C2DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCAEKHIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> Trying... but it seems to have major problems with python2.2.
>
>Ah, fiddlesticks -- does someone here still give a rip about Python 2.2?

Yeah, I do.  There are no more recent versions packaged for Debian stable.
Sorry.  This will likely change when sarge makes it to stable... but that's
likely not going to happen for at least 6 months.

>Not your fault:  it's not possible to find out time of day under 2.2 either
><wink>.

*pthbbbt*

- Alex

From tim.one at comcast.net  Sat Dec 27 21:41:53 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 21:41:55 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: <20031227214550.D698C2DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCELFIAAB.tim.one@comcast.net>

[Tim]
>> Ah, fiddlesticks -- does someone here still give a rip about Python
>> 2.2?

[T. Alexander Popiel]
> Yeah, I do.  There are no more recent versions packaged for Debian
> stable. Sorry.  This will likely change when sarge makes it to
> stable... but that's likely not going to happen for at least 6 months.

Heh.  You're on a Linux system and can't upgrade a package?  Makes me glad
I'm running Windows, where others don't dictate what I can run on my own
machine <wink>.

I made sort+group.py 2.2.3-friendly, far as I can tell, but since 2.3 came
out I don't use 2.2 for anything anymore -- if I introduce more
incompatibilities, I won't know.


From tim.one at comcast.net  Sat Dec 27 22:07:56 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Dec 27 22:07:59 2003
Subject: [spambayes-dev] Code changes
Message-ID: <LNBBLJKPBEHFEDALKOLCMELGIAAB.tim.one@comcast.net>

The meaning of Outlook2000/export.py's -n option has changed.  Here's the
checkin comment:

    INCOMPATIBLE CHANGE:  the -n option now gives the number of Set
    subdirectories desired, instead of a number of msgs per Set subdir
    "to shoot for".  If you want to run, e.g., 10-fold cross-validation,
    you have to have exactly 10 Set folders, and the # of msgs per folder
    is of much less importance.  Also added a note recommending to run
    rebal.py afterwards.  rebal is the expert in setting up randomized
    Set subdirectories, and the export.py script probably should have
    stuck to just extracting msgs from Outlook.
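
To make the "number of Sets" idea concrete, here is a rough sketch of dealing messages into a fixed number of sets.  It is not how export.py or rebal.py are implemented; `deal_into_sets` and its `seed` parameter are hypothetical.

```python
import random

def deal_into_sets(message_names, nsets, seed=None):
    """Shuffle the messages, then deal them round-robin into exactly
    `nsets` lists, so set sizes differ by at most one."""
    rng = random.Random(seed)
    shuffled = list(message_names)
    rng.shuffle(shuffled)
    sets = [[] for _ in range(nsets)]
    for i, name in enumerate(shuffled):
        sets[i % nsets].append(name)
    return sets
```

The point is that the count of sets is fixed up front (10 for 10-fold cross-validation), and the number of messages per set simply falls out of however many messages there are.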

utilities/rebal.py has grown a -t option, which makes it (once again) easy
to use with a standard test setup.  It was originally easy to use that way,
but grew -r and -s options, presumably added by someone with a non-standard
test setup.  Unfortunately, those with a standard test setup had to use them
too, and they're both clumsy and error-prone to use with a standard test
setup.  -t can't be used in the same run with -r or -s.  Those with a
standard test setup no longer need to worry about -r or -s, just -t; vice
versa for those with a non-standard test setup.

The changes to testtools/sort+group.py discussed here have been checked in,
after fiddling to play nice with Python 2.2.3 too.


From popiel at wolfskeep.com  Sat Dec 27 22:32:16 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Dec 27 22:32:20 2003
Subject: [spambayes-dev] New sort+group.py 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	27 Dec 2003 21:41:53 EST."
	<LNBBLJKPBEHFEDALKOLCCELFIAAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCCELFIAAB.tim.one@comcast.net> 
Message-ID: <20031228033216.2897F2DF61@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCCELFIAAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[Tim]
>>> Ah, fiddlesticks -- does someone here still give a rip about Python
>>> 2.2?
>
>[T. Alexander Popiel]
>> Yeah, I do.  There are no more recent versions packaged for Debian
>> stable. Sorry.  This will likely change when sarge makes it to
>> stable... but that's likely not going to happen for at least 6 months.
>
>Heh.  You're on a Linux system and can't upgrade a package?  Makes me glad
>I'm running Windows, where others don't dictate what I can run on my own
>machine <wink>.

Eh, it's not that I can't... it's that if I do, I either have to go through
a lot of hassle to package it myself, or I have to go through a lot of
hassle to make all the packages that depend on python ignore the fact
that there's no python package listed as installed.  A pain either way.
(I _might_ be able to grab a package version out of sarge and recompile
to avoid a library-version-incompatibility cascade which would require
me to upgrade half my system to possibly broken versions... but still,
a nuisance).

>I made sort+group.py 2.2.3-friendly, far as I can tell, but since 2.3 came
>out I don't use 2.2 for anything anymore -- if I introduce more
>incompatibilities, I won't know.

*nod* I'll tell ya if you broke it for me. ;-)

- Alex

From sourceforge at metrak.com  Sun Dec 28 01:16:32 2003
From: sourceforge at metrak.com (Paul Sorenson)
Date: Sun Dec 28 01:16:36 2003
Subject: [spambayes-dev] error training on dbx file
Message-ID: <00e901c3cd0a$23566f20$c48b0fcb@home.classware.com.au>

With code I checked out from CVS in the last 24 hours or so, I got the error
below when trying to train a dbx file via the web interface.

I have python 2.3.3 installed and the box involved is running Windows XP.
oe_mailbox.py doesn't appear to import time.

500 Server error

Traceback (most recent call last):
  File "C:\usr\spambayes\spambayes\Dibbler.py", line 457, in found_terminator
    getattr(plugin, name)(**params)
  File "C:\usr\spambayes\spambayes\UserInterface.py", line 479, in onTrain
    content = self._convertToMbox(content)
  File "C:\usr\spambayes\spambayes\UserInterface.py", line 521, in _convertToMbox
    content = oe_mailbox.convertToMbox(content)
  File "C:\usr\spambayes\spambayes\oe_mailbox.py", line 465, in convertToMbox
    dbxBuffer += "From spambayes@spambayes.org %s\n%s" \
NameError: global name 'strftime' is not defined


From sourceforge at metrak.com  Sun Dec 28 01:26:24 2003
From: sourceforge at metrak.com (Paul Sorenson)
Date: Sun Dec 28 01:26:24 2003
Subject: [spambayes-dev] Re: error training on dbx file
Message-ID: <00fa01c3cd0b$847ffae0$c48b0fcb@home.classware.com.au>

Please ignore this error.  The user has been deleted :-)

From rob at hooft.net  Mon Dec 29 04:37:58 2003
From: rob at hooft.net (Rob Hooft)
Date: Mon Dec 29 04:39:16 2003
Subject: [spambayes-dev] Reduced training test results
In-Reply-To: <20031225202404.757652DF61@cashew.wolfskeep.com>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
Message-ID: <3FEFF5F6.1090004@hooft.net>

T. Alexander Popiel wrote:
> Training on just those messages whose score isn't 0.00 or 1.00
> (rounded) seems to be a huge win over training on everything.

Told you:
See the section "Train on Errors, Unsures, and non-obvious correct 
decisions" at http://www.entrian.com/sbwiki/TrainingIdeas

Happy that it comes out as I thought it would, though.

> Not so much because the accuracy is better (though accuracy
> does seem to be improved by neglecting those messages that it's
> already certain about), but because of a hugely reduced training
> set (and thus database). 

Both are effects I can feel in practice!

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From skip at pobox.com  Mon Dec 29 09:05:10 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 29 12:41:16 2003
Subject: [spambayes-dev] Reduced training test results
In-Reply-To: <3FEFF5F6.1090004@hooft.net>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<3FEFF5F6.1090004@hooft.net>
Message-ID: <16368.13462.404757.694070@montanaro.dyndns.org>


    Rob> T. Alexander Popiel wrote:
    >> Training on just those messages whose score isn't 0.00 or 1.00
    >> (rounded) seems to be a huge win over training on everything.

    Rob> Told you:
    Rob> See the section "Train on Errors, Unsures, and non-obvious correct 
    Rob> decisions" at http://www.entrian.com/sbwiki/TrainingIdeas

I think we need to split that page into multiple chunks.  I (directly and
indirectly) contributed a fair amount of content to that page, but my eyes
just glaze over now when reading it.  Anybody got some pretty graphs to
break up the text?

Skip

From popiel at wolfskeep.com  Mon Dec 29 12:51:22 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 29 12:51:29 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: Message from Rob Hooft <rob@hooft.net> 
	of "Mon, 29 Dec 2003 10:37:58 +0100." <3FEFF5F6.1090004@hooft.net> 
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<3FEFF5F6.1090004@hooft.net> 
Message-ID: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com>

In message:  <3FEFF5F6.1090004@hooft.net>
             Rob Hooft <rob@hooft.net> writes:
>T. Alexander Popiel wrote:
>> Training on just those messages whose score isn't 0.00 or 1.00
>> (rounded) seems to be a huge win over training on everything.
>
>Told you:
>See the section "Train on Errors, Unsures, and non-obvious correct 
>decisions" at http://www.entrian.com/sbwiki/TrainingIdeas

Hrm.  I suppose that I ought to actually look at the wiki. ;-)

Is there any way for me to upload my plots to go along with any
discussion that I might add to the above page?  I could just
reference them on my machine, but it seems better to keep the
wiki content all in one place.

>> Not so much because the accuracy is better (though accuracy
>> does seem to be improved by neglecting those messages that it's
>> already certain about), but because of a hugely reduced training
>> set (and thus database). 
>
>Both are effects I can feel in practice!

FWIW, using this training style with my nightly retrains cut my
database size in half (from 21 meg to 10 meg).  This is with a
4-month horizon, too, so the difference would likely be even
greater with a longer span.
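
The training style discussed in this thread (train only on messages whose rounded score isn't 0.00 or 1.00) can be sketched as below.  This is an illustration, not the incremental harness's regime code: `should_train`, `train_nonedge`, and the `score`/`train` callables are hypothetical names standing in for whatever classifier is in use.

```python
def should_train(score):
    """Skip messages the classifier is already certain about, i.e. those
    whose score rounds to 0.00 or 1.00."""
    return round(score, 2) not in (0.0, 1.0)

def train_nonedge(messages, score, train):
    """Score each (msg, is_spam) pair first, and train only on the
    non-extreme ones.  Returns how many messages were trained on."""
    trained = 0
    for msg, is_spam in messages:
        if should_train(score(msg)):
            train(msg, is_spam)
            trained += 1
    return trained
```

Because each message is scored before it is (possibly) trained on, the database only grows when a message tells the classifier something it didn't already know with near certainty.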

- Alex

From popiel at wolfskeep.com  Mon Dec 29 13:28:36 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 29 13:28:41 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Mon,
	29 Dec 2003 08:05:10 CST."
	<16368.13462.404757.694070@montanaro.dyndns.org> 
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<3FEFF5F6.1090004@hooft.net>
	<16368.13462.404757.694070@montanaro.dyndns.org> 
Message-ID: <20031229182836.EA1012DE88@cashew.wolfskeep.com>

In message:  <16368.13462.404757.694070@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Rob> T. Alexander Popiel wrote:
>    >> Training on just those messages whose score isn't 0.00 or 1.00
>    >> (rounded) seems to be a huge win over training on everything.
>
>    Rob> Told you:
>    Rob> See the section "Train on Errors, Unsures, and non-obvious correct 
>    Rob> decisions" at http://www.entrian.com/sbwiki/TrainingIdeas
>
>I think we need to split that page into multiple chunks.

Agreed.  I think that using the subpage mechanism would be good.

>Anybody got some pretty graphs to break up the text?

Several, now... though I may need to see if I can rescale them to something
less than full-page.  Hrm.

Also, a few of these training ideas are already represented by regimes for
the incremental harness... and I've got a couple more to check in.  I also
recognize that my names for the regimes are, umm, less than optimal; if
people have better names for such, please speak up.  As an example, I'm
probably going to rename the 'perfect' regime to 'TrainOnEverything'.
Suggestions for capitalization style?  Should it be TrainOnEverything,
train_on_everything, or something else?

- Alex

From richie at entrian.com  Mon Dec 29 13:32:29 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Dec 29 13:32:35 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<3FEFF5F6.1090004@hooft.net>
	<20031229175122.C3E6A2DE88@cashew.wolfskeep.com>
Message-ID: <jms0vv075ujt7ioinbe3faguge1atvvmdd@4ax.com>


[Alex]
> Is there any way for me to upload my plots to go along with any
> discussion that I might add to the above page?  I could just
> reference them on my machine, but it seems better to keep the
> wiki content all in one place.

You can't upload images into the Wiki, no.  You can either reference
images on another server, as you say, or if you make them available to me
then I'll upload them onto the Wiki server and let you know their URLs.

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Mon Dec 29 13:57:54 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 29 13:58:07 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: <20031229182836.EA1012DE88@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEALIBAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> Suggestions for capitalization style?  Should it be TrainOnEverything,
> train_on_everything, or something else?

I like the latter.  Barry's experience with the email package is that
especially non-native English readers have an easier time with underscores
than with CamelCasing.  Underscores are also more natural if these end up as
specifiable values in .ini files.


From skip at pobox.com  Mon Dec 29 15:14:42 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 29 15:14:54 2003
Subject: [spambayes-dev] Reduced training test results 
In-Reply-To: <20031229175122.C3E6A2DE88@cashew.wolfskeep.com>
References: <20031225202404.757652DF61@cashew.wolfskeep.com>
	<3FEFF5F6.1090004@hooft.net>
	<20031229175122.C3E6A2DE88@cashew.wolfskeep.com>
Message-ID: <16368.35634.271833.34086@montanaro.dyndns.org>


    Alex> Is there any way for me to upload my plots to go along with any
    Alex> discussion that I might add to the above page?  I could just
    Alex> reference them on my machine, but it seems better to keep the wiki
    Alex> content all in one place.

Dunno.  You'll have to poke around the Wiki help.

Skip

From tim.one at comcast.net  Mon Dec 29 15:37:17 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Dec 29 15:37:42 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031227061940.7C5492DF61@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEBHIBAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> Yup.  I have a nice picture now of the ratio over time at the bottom
> of the report at:
> http://www.wolfskeep.com/~popiel/spambayes/nonedge

Hmm.  That appears to be using a log scale for the Y (ratio) axis, so what
*appears* to be straight-line growth in the ratio after about day 150 is
really exponential growth.  That could get bad over time <wink>.

> ...
> Interestingly enough, though, the nonedge did better than TOE, despite
> a worse imbalance.

Yup, I saw that.

>> So if I had your data, I'd be curious to try variations that force
>> better balance.

> I'd love to... but I haven't been able to come up with anything which
> maintains the balance better without extreme artificiality.  If you
> think of any regimes that make sense, I'd be more than happy to run
> them.

Oh, there are billions of things that could be tried.  Who knows what might
pay?  Picking just enough edge ham at random for training to force balance
is one idea.  The definition of "nonedge" is arbitrarily mutable too:
there's nothing a priori compelling about "0.00 or 1.00 after rounding to 2
decimal digits after the radix point".  For example, maybe it's better to
use 3 decimal digits, or 1, or maybe it's really best to use 2 digits after
the radix point when the score is expressed in base 7 <wink -- but "two
decimal digits" is just an artifact of how scores get displayed>.
Asymmetric bounds also have some attraction, since, e.g., in mistake-based
training "by hand" I always end up moving the ham cutoff closer to 0 than
the spam cutoff is to 1.  IOW, empirically, in my own email mix, and based
on one kind of lazy training, my region of certainty for ham is smaller than
my region of certainty for spam.  This makes some sense to me, since my ham
is more uniform than my spam.
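
Two of the variations above can be sketched as follows.  The cutoff values and the names is_nonedge and pad_with_edge_ham are assumptions made for illustration only; nothing here has been tested on real mail:

```python
import random

# Hypothetical cutoffs: the ham edge sits closer to 0 than the spam edge
# sits to 1, mirroring the asymmetry described above.
HAM_EDGE = 0.002
SPAM_EDGE = 0.99


def is_nonedge(score, ham_edge=HAM_EDGE, spam_edge=SPAM_EDGE):
    """True when the score falls strictly between the two edge cutoffs."""
    return ham_edge < score < spam_edge


def pad_with_edge_ham(nonedge_ham, nonedge_spam, edge_ham, rng=random):
    """Add just enough randomly chosen edge ham to match the spam count.

    Returns the ham messages to train on.  If ham already matches or
    outnumbers spam, nothing is added.
    """
    shortfall = len(nonedge_spam) - len(nonedge_ham)
    if shortfall <= 0:
        return list(nonedge_ham)
    edge_ham = list(edge_ham)
    extra = rng.sample(edge_ham, min(shortfall, len(edge_ham)))
    return list(nonedge_ham) + extra
```

Passing a seeded random.Random instance as rng makes a run repeatable, which matters when comparing regimes.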

Heh.  Except at Christmas, and probably through the first week of next year,
when I get piles of msgs from people I only hear from once a year.


From popiel at wolfskeep.com  Mon Dec 29 17:00:14 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 29 17:00:18 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Mon,
	29 Dec 2003 15:37:17 EST."
	<LNBBLJKPBEHFEDALKOLCAEBHIBAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEBHIBAB.tim.one@comcast.net> 
Message-ID: <20031229220014.2B7C02DE88@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCAEBHIBAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> ...
>> Yup.  I have a nice picture now of the ratio over time at the bottom
>> of the report at:
>> http://www.wolfskeep.com/~popiel/spambayes/nonedge
>
>Hmm.  That appears to be using a log scale for the Y (ratio) axis, so what
>*appears* to be straight-line growth in the ratio after about day 150 is
>really exponential growth.  That could get bad over time <wink>.

Yeah, I used log scale for the ratio... log makes more sense to me for
ratios.  I can trivially replot on linear scale if you want. ;-)

>Oh, there are billions of things that could be tried.  Who knows what might
>pay?

Aye, there are.  I don't have billions of CPU-days to burn, though,
so I'm trying to winnow down to stuff that's likely to pay off.
Theoretical beauty is one measure that sort of appeals.

><wink -- but "two decimal digits" is just an artifact of how scores get
>displayed>.

No argument there.  I have no particular love for that rule, either.

>Asymmetric bounds also have some attraction, since, e.g., in mistake-based
>training "by hand" I always end up moving the ham cutoff closer to 0 than
>the spam cutoff is to 1.

One thing that's occurred to me is to have the training cutoffs at
N sigma from mean (where N == .5?) for the two populations; how you'd
bootstrap that is an open question, of course.
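
A rough sketch of that idea, assuming the cutoffs are placed N sigma from each population's mean in the direction of the other population, and that a set of already-scored ham and spam exists (which is exactly the bootstrap question).  The name sigma_cutoffs is hypothetical:

```python
from statistics import mean, pstdev


def sigma_cutoffs(ham_scores, spam_scores, n=0.5):
    """Return (ham_cutoff, spam_cutoff) placed n sigma from each mean.

    Ham scoring above ham_cutoff and spam scoring below spam_cutoff would
    be the candidates for training.  Both cutoffs are clamped to [0, 1].
    """
    ham_cut = mean(ham_scores) + n * pstdev(ham_scores)
    spam_cut = mean(spam_scores) - n * pstdev(spam_scores)
    return min(1.0, max(0.0, ham_cut)), min(1.0, max(0.0, spam_cut))
```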

- Alex

From mhammond at skippinet.com.au  Mon Dec 29 17:21:26 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Dec 29 17:21:48 2003
Subject: [spambayes-dev] test_storage.py failing
In-Reply-To: <16363.28152.314476.785433@montanaro.dyndns.org>
Message-ID: <0e7401c3ce5a$19e1e900$2c00a8c0@eden>

> They were coming from the right place.  I eventually figured out that
> distutils didn't overwrite my installed copy when I tried
> installing from a
> new CVS version.
>
> Sorry for the false alarm.  I wonder if I should file a bug
> report against
> distutils...

I have seen similar things with distutils.  If the 'installed' file has a
later date than the file being installed, distutils decides not to install
it.

I struck this when I modified an installed version of a file, making a quick
hack of a change for debugging.  My idea was that by changing it in the
installed copy, I wouldn't need to undo the change, and would just rely on
distutils to overwrite with the correct copy.

I went so far as stepping through distutils in a debugger before I saw that
was the intent of the code.  I agree it sucks, but it doesn't appear to be a
bug.

Mark.


From skip at pobox.com  Mon Dec 29 17:55:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Dec 29 17:55:31 2003
Subject: [spambayes-dev] test_storage.py failing
In-Reply-To: <0e7401c3ce5a$19e1e900$2c00a8c0@eden>
References: <16363.28152.314476.785433@montanaro.dyndns.org>
	<0e7401c3ce5a$19e1e900$2c00a8c0@eden>
Message-ID: <16368.45275.633710.948924@montanaro.dyndns.org>


    >> Sorry for the false alarm.  I wonder if I should file a bug report
    >> against distutils...

    Mark> I have seen similar things with distutils.  If the 'installed'
    Mark> file has a later date than the file being installed, distutils
    Mark> decides not to install it.

I think cvs exacerbates the problem by setting the mod time of checked out
files to the last checkin date.  Here's a listing of the Outlook2000
subdirectory of my ~/tmp/spambayes directory:

    % ls -ltr Outlook2000/
    total 284
    -rw-rw-r--    1 skip     staff        4102 Oct  3 00:23 README.txt
    -rw-rw-r--    1 skip     staff        1779 Dec 14 05:23 default_bayes_customize.ini
    -rw-rw-r--    1 skip     staff        7199 Dec 15 23:06 train.py
    -rw-rw-r--    1 skip     staff        3539 Dec 15 23:06 oastats.py
    -rw-rw-r--    1 skip     staff        6205 Dec 15 23:06 config_wizard.py
    -rw-rw-r--    1 skip     staff       57355 Dec 19 00:25 msgstore.py
    -rw-rw-r--    1 skip     staff        7105 Dec 19 00:27 filter.py
    -rw-rw-r--    1 skip     staff       39497 Dec 20 05:21 manager.py
    -rw-rw-r--    1 skip     staff       18414 Dec 21 21:13 config.py
    -rw-rw-r--    1 skip     staff       73521 Dec 21 21:16 addin.py
    -rw-rw-r--    1 skip     staff        7407 Dec 21 21:17 about.html
    drwxrwxr-x   15 skip     staff         510 Dec 23 22:00 dialogs
    drwxrwxr-x    7 skip     staff         238 Dec 23 22:00 docs
    drwxrwxr-x   11 skip     staff         374 Dec 23 22:00 sandbox
    drwxrwxr-x   10 skip     staff         340 Dec 23 22:00 installer
    drwxrwxr-x    6 skip     staff         204 Dec 23 22:00 images
    -rw-rw-r--    1 skip     staff       29666 Dec 23 22:06 tester.py
    -rw-r--r--    1 skip     staff        8209 Dec 29 15:58 export.py
    drwxrwxr-x    5 skip     staff         170 Dec 29 15:58 CVS

Note that I created the entire tree just a few days before Christmas, yet
the README.txt file has a timestamp of October 3rd.

    Mark> I went so far as stepping through distutils in a debugger before I
    Mark> saw that was the intent of the code.  I agree it sucks, but it
    Mark> doesn't appear to be a bug.

Maybe there's a flag in cvs which will set the timestamp appropriately.
Alternatively, I suppose a 'find . -type f | xargs touch' would work for us
Unix geeks.  Still, it's surprising.

(Best thing would be to install if the source and destination files have
different checksums.)
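
A sketch of what a checksum-based install could look like.  This is not how distutils behaves; file_digest and install_if_changed are hypothetical helpers written for this example:

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install_if_changed(src, dst):
    """Copy src over dst unless their contents are identical.

    Timestamps are ignored, so a destination file with a later mod time
    is still overwritten when its contents differ.  Returns True if a
    copy was made.
    """
    src, dst = Path(src), Path(dst)
    if dst.exists() and file_digest(src) == file_digest(dst):
        return False
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True
```

With this rule, a locally hacked installed copy is replaced on the next install even though it is newer than the file from CVS.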

Skip


From ltnieh at earthlink.net  Mon Dec 29 21:01:27 2003
From: ltnieh at earthlink.net (Luther Nieh)
Date: Mon Dec 29 21:01:26 2003
Subject: [spambayes-dev] Full re-initialization
Message-ID: <PAEIKFCFJBPAOLNKOEEMEEHMCIAA.ltnieh@earthlink.net>


Hello SpamBayes Tech support,

Thank you for developing this useful software.  I have used it for a few
weeks and I feel it has been doing its job as described in your
documentation.

One day, as I was cleaning up the M/S Outlook folders, I accidentally
deleted the spam email folder.  Now, SpamBayes does not seem to work
anymore.  It said it couldn't send spam emails to that folder, even though I
had manually re-created the spam email folder.  I even attempted to remove
and re-install SpamBayes without any improvement.  Please let me know what
one does to have the program re-initialize itself as if it was a new
installation.

Thank you for your help.

Regards,


Luther Nieh
ltnieh@sunnydesign.com
12/29/03


From popiel at wolfskeep.com  Mon Dec 29 23:29:17 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Dec 29 23:29:22 2003
Subject: [spambayes-dev] Full re-initialization 
In-Reply-To: Message from "Luther Nieh" <ltnieh@earthlink.net> of "Mon,
	29 Dec 2003 18:01:27 PST."
	<PAEIKFCFJBPAOLNKOEEMEEHMCIAA.ltnieh@earthlink.net> 
References: <PAEIKFCFJBPAOLNKOEEMEEHMCIAA.ltnieh@earthlink.net> 
Message-ID: <20031230042918.115EB2DE88@cashew.wolfskeep.com>

In message:  <PAEIKFCFJBPAOLNKOEEMEEHMCIAA.ltnieh@earthlink.net>
             "Luther Nieh" <ltnieh@earthlink.net> writes:
>
>Hello SpamBayes Tech support,

Well, we're not really tech support here... it's just the people
who wrote the code and us hangers-on who heckle from the sides.

>One day, as I was cleaning up the M/S Outlook folders, I accidentally
>deleted the spam email folder.  Now, SpamBayes does not seem to work
>anymore.  It said it couldn't send spam emails to that folder, even
>though I had manually re-created the spam email folder.

In this case, Outlook is "helpful" and remembers that you "moved" the
spam mail folder to the trash.  What you should do is go back to the
configuration panel where you set the spam mail folder, and point it
back to your re-created spam mail folder.  Once that's done, all should
be happy again.

- Alex (who doesn't use Outlook and thus can't get super-specific)

From mhammond at skippinet.com.au  Tue Dec 30 01:09:22 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 30 01:09:31 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <p91kuvkrdt81vqqivrklhq87f0qvqrq7u0@4ax.com>
Message-ID: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden>

[Richie]
> [Mark]
> > I have just uploaded an installer for a new experimental binary of
> > SpamBayes.  This binary includes *both* the Outlook addin
> and the sb_server
> > applications.
>
> Nice one!  Barring a few minor glitches (which I'll enter into the SF
> tracker when I get the chance) sb_tray worked like a charm for me.

Great!  I didn't see any new items in the tracker though.  If they are
trivial, just mail them to me.

Otherwise, did anyone else try this build?  Either for Outlook or sb_server?
I fear I may have "disclaimed" the build a little too much, as this is the
only reply I got, and I see no new bugs etc.

Thanks,

Mark.


From theller at python.net  Tue Dec 30 04:54:51 2003
From: theller at python.net (Thomas Heller)
Date: Tue Dec 30 03:55:35 2003
Subject: [spambayes-dev] Re: test_storage.py failing
References: <16363.28152.314476.785433@montanaro.dyndns.org>
	<0e7401c3ce5a$19e1e900$2c00a8c0@eden>
Message-ID: <ad5aljf8.fsf@python.net>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

>> They were coming from the right place.  I eventually figured out that
>> distutils didn't overwrite my installed copy when I tried
>> installing from a
>> new CVS version.
>>
>> Sorry for the false alarm.  I wonder if I should file a bug
>> report against
>> distutils...
>
> I have seen similar things with distutils.  If the 'installed' file has a
> later date than the file being installed, distutils decides not to install
> it.
>
> I struck this when I modified an installed version of a file, making a quick
> hack of a change for debugging.  My idea was that by changing it in the
> installed copy, I wouldn't need to undo the change, and would just rely on
> distutils to overwrite with the correct copy.
>
> I went so far as stepping through distutils in a debugger before I saw that
> was the intent of the code.  I agree it sucks, but it doesn't appear to be a
> bug.

There's a '--force' command line option for distutils' install command
for that.

Thomas


From richie at entrian.com  Tue Dec 30 08:50:02 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Dec 30 08:50:10 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden>
References: <p91kuvkrdt81vqqivrklhq87f0qvqrq7u0@4ax.com>
	<0f2801c3ce9b$78a1aca0$2c00a8c0@eden>
Message-ID: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com>


[Richie]
> Nice one!  Barring a few minor glitches (which I'll enter into the SF
> tracker when I get the chance) sb_tray worked like a charm for me.

[Mark]
> Great!  I didn't see any new items in the tracker though.  If they are
> trivial, just mail them to me.

Sorry Mark, Christmas has been pretty hectic.  8-)

Here are the notes I made.  Lots of these aren't anything to do with the
binary packaging, but I'll send the lot anyway.  If you need any more
detail, just ask:

I have Outlook, and the installer says "Outlook appears to be installed".
But I don't use Outlook, so I clear that checkbox, check the Server box,
and hit Next.  Then I think, "Hang on, I might as well have a look at the
Outlook plugin" so I hit Back.  It now says "Outlook does not appear to be
installed".  A bit misleading.

At the end of the install I checked both "View welcome.html" and "View
proxy_readme".  Only welcome.html appeared.

I can launch many instances of sb_tray without complaint.

The ini file for the proxy appeared in "C:\Documents and
Settings\rjh\Application Data\SpamBayes\Proxy" as you'd expect, but the
database and cache directories appeared in "C:\Program
Files\SpamBayes\bin".  Then after restarting, another set of database and
cache directories appeared in "C:\Documents and Settings\rjh".  I guess
sb_tray writes them into the working directory, and the installer's
working directory is "C:\Program Files\SpamBayes\bin" when it launches
sb_tray for the first time.  Then when you start it from the Start menu,
the working directory is "C:\Documents and Settings\rjh".

If I right-click the tray icon and go "Stop spambayes", the icon goes red
after a second or two and the proxy stops.  When I go right-click / Start,
the proxy doesn't start, and the icon still shows red.  If I move the
cursor over the icon - without clicking - it goes green, but the proxy
still hasn't started.  Right-clicking presents a "Stop" command, which
makes the icon go red again, but as soon as I move the cursor over the
icon it returns to green.  I have to exit and restart before the proxy
will restart.  I'd question whether we need the Stop/Start command - why
would I want the tray icon to stay there but the application to not run?
Stop vs. Exit is not a clear distinction.  Things like firewalls and virus
scanners need this because they can be intrusive, but sb_tray is not
intrusive - if your email client is configured to use it then it must be
running for your email to work, and if your email client isn't configured
to use it then it has no effect.

After training through the web interface, the home page still says
"Database has no training information ..." even though the stats say
"Total emails trained: Spam: 3 Ham: 18".  Only after changing and saving
the configuration does it update to say "Database only has 18 good and 3
spam ..."  Then after subsequent training it still says "18 good and 3
spam".

Defaulting the "Maximum results" field in the Find pane to 1 seems wrong.
It made sense when all you could do was search for a message ID (because
they're unique) but if I'm searching for text, I'll want to see all the
hits.

The Find pane only looks in the unknown cache, so it won't find anything
once you've trained.  It ought to look in the ham and spam caches as well.

I deliberately induced a false positive (by training on a thousand spams
with no hams trained) then corrected it via the Review page, and the
statistics now say "1 being false negatives" (plural: ack!) and "0 being
false positives".  That's the wrong way round.

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Tue Dec 30 08:55:30 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 30 08:55:41 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com>
Message-ID: <0fa601c3cedc$97e6d400$2c00a8c0@eden>

> Sorry Mark, Christmas has been pretty hectic.  8-)

Woo hoo - me too - and I've a blinder planned for tomorrow night ;)  Thanks
for that!  I'll reply in detail for each of these points - either a 'fixed'
or the bug number.

Happy new year!

Mark.


From skip at pobox.com  Tue Dec 30 09:37:26 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 30 09:37:35 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com>
References: <p91kuvkrdt81vqqivrklhq87f0qvqrq7u0@4ax.com>
	<0f2801c3ce9b$78a1aca0$2c00a8c0@eden>
	<6e03vvcjj7rt568k13e89l3m9m17m40pgl@4ax.com>
Message-ID: <16369.36262.575261.379159@montanaro.dyndns.org>

Mark,

Thanks for the new installer.  I tried it out on my little-used Win2k
machine.  While it seemed to install fine, the tray icon does nothing but
briefly change the pointer to an hourglass.  I double-clicked the
sb_server.exe icon and it popped up a window then immediately went away.
Following the suggestion in the readme_proxy.html file I tried right-mousing
the tray menu item but saw nothing but the usual Windows fluff
("Save/Restore Desktop Icons", "Open", ..., "Properties").

Skip


From nobody at spamcop.net  Tue Dec 30 11:32:18 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Tue Dec 30 11:36:48 2003
Subject: [spambayes-dev] RE: [Spambayes] How low can you go? 
In-Reply-To: <20031229220014.2B7C02DE88@cashew.wolfskeep.com>
Message-ID: <MHEGIFHMACFNNIMMBACAGELKHAAA.nobody@spamcop.net>

> [T. Alexander Popiel]
> One thing that's occurred to me is to have the training cutoffs at
> N sigma from mean (where N == .5?) for the two populations; how you'd
> bootstrap that is an open question, of course.

Great idea.  The first pass could just be set to two constant thresholds,
then start computing the mean, SD and new thresholds.  This should converge
fairly quickly.

Another idea is to use the two means, but decide how many SD's to go for
each one based on the incoming ham/spam ratio.  This requires you to make an
assumption about the distributions.

Along the same lines, one more possibility is to construct a cumulative
distribution function (CDF) of new mail received, then set the training
thresholds such that you would train an equal number of ham/spam.  This also
lets you set the total number of messages trained, or at least to limit it
to a maximum value.  Since this is a batch (nightly?) process rather than
continuous, the CDF calculation is a posteriori so both the ratio and number
of new trained messages will be achieved exactly.
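
One way the batch version might look, assuming the night's messages have already been scored and labelled.  Sorting the scores plays the role of the empirical CDF here; balanced_thresholds is a hypothetical name:

```python
def balanced_thresholds(ham_scores, spam_scores, max_per_class):
    """Pick cutoffs that select an equal number of ham and spam.

    The k least-certain ham (highest scores) and the k least-certain spam
    (lowest scores) are selected, where k is capped by max_per_class and
    by the size of the smaller class.  Returns (ham_cutoff, spam_cutoff),
    or None when either class is empty.  Train on ham scoring at or above
    ham_cutoff and spam scoring at or below spam_cutoff.
    """
    k = min(max_per_class, len(ham_scores), len(spam_scores))
    if k <= 0:
        return None
    ham_cut = sorted(ham_scores, reverse=True)[k - 1]
    spam_cut = sorted(spam_scores)[k - 1]
    return ham_cut, spam_cut
```

Tied scores at a cutoff can push the count slightly past k, so an exact count would need selection by rank rather than by threshold.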

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From nobody at spamcop.net  Tue Dec 30 11:39:46 2003
From: nobody at spamcop.net (Seth Goodman)
Date: Tue Dec 30 11:40:06 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <0f2801c3ce9b$78a1aca0$2c00a8c0@eden>
Message-ID: <MHEGIFHMACFNNIMMBACAIELLHAAA.nobody@spamcop.net>

I tried the straight Outlook add-in and so far, no bugs to report!  It
appeared to recognize my old databases just fine, but I retrained to take
advantage of anything you smoothed out in the tokenizer.  I like the new
spam clues page with the internal ham and spam scores plus the number of
significant tokens listed.  Nice job!

--
Seth Goodman

  Humans:   off-list replies to sethg [at] GoodmanAssociates [dot] com

  Spambots: disregard the above


From tim.one at comcast.net  Tue Dec 30 14:17:56 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 30 14:18:07 2003
Subject: [spambayes-dev] A URL experiment
Message-ID: <LNBBLJKPBEHFEDALKOLCGEGNIBAB.tim.one@comcast.net>

Over on the spambayes list yesterday, we were discussing a particularly good
identity-theft scam spam, purporting to be from PayPal.  It linked
extensively to PayPal's real site, and about the only fishy lexical thing
was a highly obfuscated href (full of % escapes).

We don't do anything special with % escapes in URLs now.  Maybe we should.
The attached patch does.
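
For readers on Python 3, the core of the idea can be sketched standalone as below.  This is an illustration, not the patch itself: escape_tokens is a hypothetical helper, and urllib.parse.unquote is the Python 3 home of what the patch calls urllib.unquote:

```python
import re
from urllib.parse import unquote


def escape_tokens(guts):
    """Return (tokens, unquoted_guts) for the part of a URL after the scheme.

    One "url:%xx" token is generated per % escape found, so a heavily
    obfuscated URL yields many correlated tokens.  The unquoted text is
    returned so the rest of the URL can be tokenized as usual.
    """
    # Same loose pattern as the patch: '%' followed by any two characters.
    escapes = re.findall(r"%..", guts)
    tokens = ["url:" + escape for escape in escapes]
    return tokens, unquote(guts)
```

For example, escape_tokens("www.example.com/%70%61y") gives the tokens "url:%70" and "url:%61" along with the unobfuscated text "www.example.com/pay".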

I don't have enough personal email saved to make for a good test, but who
cares <wink>.  I just took what I had, slammed it randomly into 10 even
sets, and did "the usual" cross-validation business on it.  All of this
email is less than a week old, is all the email I've gotten since then, is
atypical for me (Christmas time -> a lot less email than usual, but a spike
in personal email), and runs 3:1 in favor of ham.  None of that matters,
though -- *whatever* you have, and however you train, the interesting
question is just how it does with the patch, compared to without it.

I ran my 10-fold CV with "the default" settings for Outlook.  These match
the current (CVS) project defaults, with the addition of

[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True

I'm *not* using mine_received_headers or x-use_bigrams in these tests.

befores -> afters
-> <stat> tested 151 hams & 52 spams against 1359 hams & 468 spams
[19 repetitions of that]

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.662  0.662  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 1 to 1 tied
mean fp % went from 0.0662251655629 to 0.0662251655629 tied

false negative percentages
    1.923  1.923  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    1.923  1.923  tied
    1.923  1.923  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fn went from 3 to 3 tied
mean fn % went from 0.576923076924 to 0.576923076924 tied

ham mean                     ham sdev
   0.44    0.44   +0.00%        4.52    4.52   +0.00%
   0.34    0.34   +0.00%        4.11    4.11   +0.00%
   0.27    0.27   +0.00%        3.16    3.16   +0.00%
   0.17    0.17   +0.00%        1.51    1.51   +0.00%
   1.06    1.06   +0.00%        9.11    9.12   +0.11%
   0.00    0.00 +(was 0)        0.01    0.01   +0.00%
   0.78    0.78   +0.00%        8.16    8.16   +0.00%
   0.42    0.43   +2.38%        5.19    5.21   +0.39%
   0.01    0.01   +0.00%        0.11    0.11   +0.00%
   0.07    0.07   +0.00%        0.90    0.90   +0.00%

ham mean and sdev for all runs
   0.36    0.36   +0.00%        4.77    4.78   +0.21%

spam mean                    spam sdev
  96.41   96.43   +0.02%       13.52   13.51   -0.07%
  98.51   98.56   +0.05%        6.99    6.99   +0.00%
  97.80   97.80   +0.00%        6.42    6.41   -0.16%
  98.21   98.22   +0.01%        7.31    7.30   -0.14%
  93.00   93.03   +0.03%       16.68   16.66   -0.12%
  97.40   97.41   +0.01%        8.29    8.27   -0.24%
  97.58   97.70   +0.12%       12.30   12.18   -0.98%
  97.01   97.02   +0.01%       14.38   14.37   -0.07%
  95.90   96.03   +0.14%       11.61   11.46   -1.29%
  98.86   98.86   +0.00%        6.12    6.11   -0.16%

spam mean and sdev for all runs
  97.07   97.11   +0.04%       11.09   11.05   -0.36%

ham/spam mean difference: 96.71 96.75 +0.04

Not much to talk about there!  Pretty much indistinguishable, although the
spam mean went up a tad consistently, and the spam sdev down a tad
consistently.

table.py's "best cost" output shows that I could have reduced the optimal
cost by 1 unsure if I changed my cutoffs:

filename:   before   after
ham:spam:  1510:520 1510:520
fp total:        1       1
fp %:         0.07    0.07
fn total:        3       3
fn %:         0.58    0.58
unsure t:       39      39
unsure %:     1.92    1.92
real cost:  $20.80  $20.80
best cost:  $17.60  $17.40
h mean:       0.36    0.36
h sdev:       4.77    4.78
s mean:      97.07   97.11
s sdev:      11.09   11.05
mean diff:   96.71   96.75
k:            6.10    6.11

So the change would have been the tiniest of wins for me.  For you?

BTW, the fp here was an "end of year sale" blaring HTML ad from Gateway.
That's ham to me, but there are no other msgs from Gateway in this email.
It contains enough Gateway-specific lexicalisms that training on one is
enough to score future ones as solid ham.  The PayPal scam that started this
remained a solid FN.
-------------- next part --------------
Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -u -r1.27 tokenizer.py
--- tokenizer.py	30 Dec 2003 16:26:33 -0000	1.27
+++ tokenizer.py	30 Dec 2003 18:45:59 -0000
@@ -1011,9 +1011,25 @@
         Stripper.__init__(self, url_re.search, re.compile("").search)
 
     def tokenize(self, m):
+        import urllib
+
         proto, guts = m.groups()
         tokens = ["proto:" + proto]
         pushclue = tokens.append
+
+        # %nn escapes are usually intentional obfuscation.  Generate a lot
+        # of correlated tokens if the URL contains a lot of them.  The
+        # classifier will learn which specific ones are and aren't spammy.
+        escapes = re.findall(r'%..', guts)
+        tokens.extend(["url:" + escape for escape in escapes])
+
+        try:
+            # Tokenize the unobfuscated URL.
+            guts = urllib.unquote(guts)
+        except:
+            pushclue("url:invalid escapes")
+            # And guts is unchanged; however, I don't think urllib.unquote()
+            # ever raises an exception now.
 
         # Lose the trailing punctuation for casual embedding, like:
         #     The code is at http://mystuff.org/here?  Didn't resolve.
From skip at pobox.com  Tue Dec 30 16:45:55 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Dec 30 16:46:10 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEGNIBAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCGEGNIBAB.tim.one@comcast.net>
Message-ID: <16369.61971.794205.276663@montanaro.dyndns.org>


    Tim> Over on the spambayes list yesterday, we were discussing a
    Tim> particularly good identity-theft scam spam, purporting to be from
    Tim> PayPal.  It linked extensively to PayPal's real site, and about the
    Tim> only fishy lexical thing was a highly obfuscated href (full of %
    Tim> escapes).

    Tim> We don't do anything special with % escapes in URLs now.  Maybe we
    Tim> should.  The attached patch does.

I tried a somewhat different approach (patch is attached) and got similar
results (all ties at the more gross level, slight increase in spam mean and
slight decrease in spam sdev, no change to ham at all (*)):

stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams

false positive percentages
    0.000  0.000  tied          
    0.400  0.400  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.08 to 0.08 tied          

false negative percentages
    3.333  3.333  tied          
    5.000  5.000  tied          
    7.333  7.333  tied          
    5.667  5.667  tied          
    4.000  4.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 76 to 76 tied          
mean fn % went from 5.06666666667 to 5.06666666667 tied          

ham mean                     ham sdev
   1.64    1.64   +0.00%        8.44    8.44   +0.00%
   0.99    0.99   +0.00%        8.29    8.29   +0.00%
   2.82    2.82   +0.00%       12.52   12.52   +0.00%
   1.58    1.58   +0.00%        8.29    8.29   +0.00%
   1.30    1.30   +0.00%        8.04    8.04   +0.00%

ham mean and sdev for all runs
   1.66    1.66   +0.00%        9.30    9.30   +0.00%

spam mean                    spam sdev
  93.80   93.82   +0.02%       19.39   19.35   -0.21%
  90.56   90.58   +0.02%       24.31   24.26   -0.21%
  89.24   89.27   +0.03%       27.03   27.04   +0.04%
  89.27   89.27   +0.00%       25.51   25.50   -0.04%
  92.72   92.74   +0.02%       21.67   21.67   +0.00%

spam mean and sdev for all runs
  91.12   91.14   +0.02%       23.81   23.80   -0.04%

ham/spam mean difference: 89.46 89.48 +0.02


(*) Operational question: Given that my training data is somewhat small at
the moment (roughly 1000-1500 each of ham and spam), would I be better off
testing with fewer larger sets (e.g, 5 sets w/ 250 msgs each) or with more
smaller sets (e.g, 10 sets w/ 125 msgs each)?

Skip

-------------- next part --------------
Index: spambayes/Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v
retrieving revision 1.97
diff -c -r1.97 Options.py
*** spambayes/Options.py        30 Dec 2003 16:26:33 -0000      1.97
--- spambayes/Options.py        30 Dec 2003 21:42:48 -0000
***************
*** 145,150 ****
--- 145,155 ----
       """(DEPRECATED) Extract day of the week tokens from the Date: header.""",
       BOOLEAN, RESTORE),
  
+     ("x-pick_apart_urls", "Extract clues about url structure", False,
+      """(EXPERIMENTAL) Note whether url contains non-standard port or
+      user/password elements.""",
+      BOOLEAN, RESTORE),
+ 
      ("replace_nonascii_chars", "Replace non-ascii characters", False,
       """If true, replace high-bit characters (ord(c) >= 128) and control
       characters with question marks.  This allows non-ASCII character
Index: spambayes/tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.27
diff -c -r1.27 tokenizer.py
*** spambayes/tokenizer.py      30 Dec 2003 16:26:33 -0000      1.27
--- spambayes/tokenizer.py      30 Dec 2003 21:42:48 -0000
***************
*** 13,18 ****
--- 13,20 ----
  import time
  import os
  import binascii
+ import urlparse
+ import urllib
  try:
      from sets import Set
  except ImportError:
***************
*** 1014,1019 ****
--- 1016,1038 ----
          proto, guts = m.groups()
          tokens = ["proto:" + proto]
          pushclue = tokens.append
+ 
+         if options["Tokenizer", "x-pick_apart_urls"]:
+             url = proto + "://" + guts
+             num_pcs = url.count("%")
+             if num_pcs:
+                 pushclue("url:%d %%s" % num_pcs)
+             url = urllib.unquote(url)
+             scheme, netloc, path, params, query, frag = urlparse.urlparse(url)
+             user_pwd, host_port = urllib.splituser(netloc)
+             if user_pwd is not None:
+                 pushclue("url:has user")
+             host, port = urllib.splitport(host_port)
+             if port is not None:
+                 if scheme == "http" and port != '80':
+                     pushclue("url:non-standard http port")
+                 elif scheme == "https" and port != '443':
+                     pushclue("url:non-standard https port")
  
          # Lose the trailing punctuation for casual embedding, like:
          #     The code is at http://mystuff.org/here?  Didn't resolve.
From tameyer at ihug.co.nz  Tue Dec 30 17:51:29 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 30 17:51:37 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F95@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>

[I'll leave the install stuff for Mark, but I can sort out the rest of
these].

> The ini file for the proxy appeared in "C:\Documents and 
> Settings\rjh\Application Data\SpamBayes\Proxy" as you'd 
> expect, but the database and cache directories appeared in 
> "C:\Program Files\SpamBayes\bin".

Did the ini file have the appropriate [Storage] lines in it?  It's meant to
add them in there, storing the directories in that directory, too.  You
didn't already have an ini file in there, did you?  (It only adds those
lines if it's a new file, so that it doesn't overwrite someone's settings).

> I'd question whether 
> we need the Stop/Start command - why would I want the tray 
> icon to stay there but the application to not run?

I was thinking this just yesterday. I'm not sure what the original reasoning
behind having it was (and it may have been me that put it there ;).

+1 to getting rid of it, unless someone does know the reasoning.  We can
dump the 'stopped' icon, then, too.  (I'd like to see a '!' icon, though,
which appeared when there were important status messages to review).

> After training through the web interface, the home page still 
> says "Database has no training information ..." even though 
> the stats say "Total emails trained: Spam: 3 Ham: 18".

Good spotting.  I've checked in a fix for this.

> Defaulting the "Maximum results" field in the Find pane to 1 
> seems wrong. It made sense when all you could do was search 
> for a message ID (because they're unique) but if I'm 
> searching for text, I'll want to see all the hits.

Fair enough.  Line 435 of ui.html; change it to whatever you like most :)

> The Find pane only looks in the unknown cache, so it won't 
> find anything once you've trained.  It ought to look in the 
> ham and spam caches as well.

Are you positive?  The code has it looking in all three, and a quick test
here had it finding messages in more than one.

> I deliberately induced a false positive (by training on a 
> thousand spams with no hams trained) then corrected it via 
> the Review page, and the statistics now say "1 being false 
> negatives" (plural: ack!) and "0 being false positives".  
> That's the wrong way round.

Oops, my bad.  I've checked in a fix for this.  I think I've fixed all the
plurals, too.  If you've still got that false positive statistic around,
could you give it a run from cvs?

=Tony Meyer


From tameyer at ihug.co.nz  Tue Dec 30 20:32:02 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 30 20:32:09 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F9F@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777BD@its-xchg4.massey.ac.nz>

[Skip]
> Thanks for the new installer.  I tried it out on my 
> little-used Win2k machine.  While it seemed to install fine, 
> the tray icon does nothing but briefly change the pointer to 
> an hourglass.  I double-clicked the sb_server.exe icon and it 
> popped up a window then immediately went away.

In your temp directory (C:\Documents and Settings\[username]\Local
Settings\Temp in Win2k, I think) there should be some SpamBayesServerN.log
files (where N is a number).  Could you grab any that are there and mail
them to me/the list?

(I haven't seen this myself, so it would be good to figure out what it is).

=Tony Meyer


From tameyer at ihug.co.nz  Tue Dec 30 20:36:44 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 30 20:36:48 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304985F55@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1D@its-xchg4.massey.ac.nz>

[Mark]
> Otherwise, did anyone else try this build?  Either for 
> Outlook or sb_server? I fear I may have "disclaimed" the 
> build a little too much, as this is the only reply I got, and 
> I see no new bugs etc.

I briefly tried it, and all seemed ok to me (Outlook XP and the others on
WinXP).  I've also been trying various experimental builds of my own (both
on the XP box and on Win98 at home).  I can't see how they would be
different to your build since it's using the same process.  (Latest CVS each
time).

Sorry I didn't report back earlier - didn't your message say that you were
away for a couple of weeks and wouldn't be looking at anything until then?
I got a 'not urgent' impression from it :)

No bugs to report - the install has worked fine for me, and I've fixed
anything wrong I've found with the source itself.  I've made some
improvements to the documentation, too - is it sufficient for a 8.5 release,
do you think?

=Tony Meyer


From tim.one at comcast.net  Tue Dec 30 20:59:44 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 30 20:59:46 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <16369.61971.794205.276663@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>

[Skip Montanaro]
> I tried a somewhat different approach (patch is attached) and got
> similar results (all ties at the more gross level, slight increase in
> spam mean and slight decrease in spam sdev, no change to ham at all
> (*)):

3-way compare on my data:

filename:   before   after    skip
ham:spam:  1510:520        1510:520
                   1510:520
fp total:        1       1       1
fp %:         0.07    0.07    0.07
fn total:        3       3       3
fn %:         0.58    0.58    0.58
unsure t:       39      39      39
unsure %:     1.92    1.92    1.92
real cost:  $20.80  $20.80  $20.80
best cost:  $17.60  $17.40  $17.80
h mean:       0.36    0.36    0.36
h sdev:       4.77    4.78    4.77
s mean:      97.07   97.11   97.08
s sdev:      11.09   11.05   11.03
mean diff:   96.71   96.75   96.72
k:            6.10    6.11    6.12

The "best cost" measure actually got marginally worse, but not significantly
so.

Note that this part of the patch can't be helping much:

+             num_pcs = url.count("%")
+             if num_pcs:
+                 pushclue("url:%d %%s" % num_pcs)

That is, raw counts are almost never useful -- if I have a URL in a spam
that embeds 40 escapes, that does nothing to indict a URL with 39 (or 41)
escapes.  Pumping out log2(a_count) usually does more good.  I *expect* the
approach in my patch would work better, though (generating lots of
correlated tokens -- there are good reasons to escape some punctuation
characters in URLs, but the only good reason to escape a letter or digit is
to obfuscate; let the classifier see these things, and it will learn that on
its own, as appropriate, for each escape code; then a URL escaping several
letters or digits will get penalized more the more heavily it employs this
kind of obfuscation).
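
To make the log2 remark concrete, here is a minimal sketch (not code from
either patch; the function name and the token spelling are made up for
illustration).  It emits one token per power-of-two bucket, so URLs with 39,
40 and 41 escapes all produce the same clue:

```python
def log2_count_token(count):
    """Return a bucketed count token, or None when there is nothing to count.

    Hypothetical helper: the token format is an assumption, not SpamBayes'.
    """
    if count <= 0:
        return None
    # For a positive int, bit_length() - 1 == floor(log2(count)):
    # 1 -> 0, 2-3 -> 1, 4-7 -> 2, ..., 32-63 -> 5
    bucket = count.bit_length() - 1
    return "url:escapes 2**%d" % bucket
```

The buckets get wider as the count grows, which is the point: the classifier
sees "a few" versus "dozens" rather than one token per exact count.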

> (*) Operational question: Given that my training data is somewhat
> small at the moment (roughly 1000-1500 each of ham and spam), would I
> be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs
> each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)?

If you ask me <wink>, cross-validation should *always* be done with a
minimum of 10 sets, regardless of how much data you have.  There are many
reasons for this, from statistical reliability of the grand averages at the
end (they're subject to central-limit theorem constraints, and the more sets
the more reliable they are, growing with the square root of the # of sets);
to that it's extremely important to see run-by-run comparisons (how many
runs won, lost, tied), and just about any distribution of those numbers is
achievable by chance with few sets (IOW, "9 won, 1 tied, 0 lost" is very
much harder to account for by chance than "4 won, 1 tied, 0 lost"; likewise
"1 won, 8 tied, 1 lost" is much less likely to be produced by a significant
(good or bad) change than "1 won, 3 tied, 1 lost").
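
One rough way to put numbers on the won/lost comparison is a one-sided sign
test.  This is an illustration swapped in here, not something the test
scripts compute, and it assumes ties are simply dropped and that each
non-tied run would be a fair coin flip if the change did nothing:

```python
from math import comb

def chance_of_at_least(wins, losses):
    """P(at least `wins` wins among the non-tied runs) under a fair coin."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

print(chance_of_at_least(4, 0))  # 0.0625
print(chance_of_at_least(9, 0))  # 0.001953125
```

Under those assumptions, "4 won, 0 lost" turns up by luck about one time in
16, while "9 won, 0 lost" turns up about one time in 512.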

Note, though, that cross-validation is modeling the performance of a
train-on-everything strategy, and in random time order to boot.  If that's
not how you train, the results may be irrelevant to what you'll see in real
life.  It should be good enough to weed out really bad ideas-- and highlight
really good ones --regardless, though.


From tameyer at ihug.co.nz  Tue Dec 30 22:17:49 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 30 22:18:02 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz>

My results (this is with a chunk of my most recent mail, with timcv.py
-n10).

Tim's patch:

bases.txt -> nntims.txt
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
-> <stat> tested 397 hams & 384 spams against 3271 hams & 3715 spams
-> <stat> tested 385 hams & 433 spams against 3283 hams & 3666 spams
-> <stat> tested 407 hams & 397 spams against 3261 hams & 3702 spams
-> <stat> tested 350 hams & 412 spams against 3318 hams & 3687 spams
-> <stat> tested 338 hams & 405 spams against 3330 hams & 3694 spams
-> <stat> tested 359 hams & 416 spams against 3309 hams & 3683 spams
-> <stat> tested 358 hams & 405 spams against 3310 hams & 3694 spams
-> <stat> tested 348 hams & 411 spams against 3320 hams & 3688 spams
-> <stat> tested 369 hams & 441 spams against 3299 hams & 3658 spams
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
-> <stat> tested 397 hams & 384 spams against 3271 hams & 3715 spams
-> <stat> tested 385 hams & 433 spams against 3283 hams & 3666 spams
-> <stat> tested 407 hams & 397 spams against 3261 hams & 3702 spams
-> <stat> tested 350 hams & 412 spams against 3318 hams & 3687 spams
-> <stat> tested 338 hams & 405 spams against 3330 hams & 3694 spams
-> <stat> tested 359 hams & 416 spams against 3309 hams & 3683 spams
-> <stat> tested 358 hams & 405 spams against 3310 hams & 3694 spams
-> <stat> tested 348 hams & 411 spams against 3320 hams & 3688 spams
-> <stat> tested 369 hams & 441 spams against 3299 hams & 3658 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.246  0.246  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.557  0.557  tied          
    0.559  0.559  tied          
    0.287  0.287  tied          
    0.000  0.000  tied          

won   0 times
tied 10 times
lost  0 times

total unique fp went from 6 to 6 tied          
mean fp % went from 0.164881884948 to 0.164881884948 tied          

false negative percentages
    0.253  0.253  tied          
    0.781  0.781  tied          
    0.462  0.462  tied          
    0.756  0.756  tied          
    0.243  0.243  tied          
    0.247  0.247  tied          
    0.240  0.240  tied          
    0.494  0.494  tied          
    0.973  0.973  tied          
    0.454  0.454  tied          

won   0 times
tied 10 times
lost  0 times

total unique fn went from 20 to 20 tied          
mean fn % went from 0.490257037938 to 0.490257037938 tied          

ham mean                     ham sdev
   1.18    1.17   -0.85%        7.76    7.67   -1.16%
   0.99    0.99   +0.00%        6.64    6.64   +0.00%
   0.84    0.85   +1.19%        6.14    6.14   +0.00%
   1.99    2.10   +5.53%        9.46    9.73   +2.85%
   0.49    0.49   +0.00%        3.59    3.57   -0.56%
   0.85    0.87   +2.35%        5.45    5.46   +0.18%
   1.16    1.16   +0.00%        9.30    9.29   -0.11%
   1.20    1.30   +8.33%        8.13    8.66   +6.52%
   1.55    1.55   +0.00%        8.05    8.05   +0.00%
   0.47    0.47   +0.00%        3.22    3.15   -2.17%

ham mean and sdev for all runs
   1.08    1.10   +1.85%        7.13    7.21   +1.12%

spam mean                    spam sdev
  98.75   98.75   +0.00%        8.72    8.72   +0.00%
  97.67   97.69   +0.02%       11.26   11.24   -0.18%
  98.08   98.14   +0.06%       10.12    9.97   -1.48%
  98.16   98.16   +0.00%       10.19   10.20   +0.10%
  98.35   98.41   +0.06%        8.77    8.69   -0.91%
  98.45   98.47   +0.02%        8.97    8.86   -1.23%
  98.35   98.41   +0.06%        9.73    9.69   -0.41%
  98.25   98.36   +0.11%        9.16    8.96   -2.18%
  97.93   97.97   +0.04%       11.99   11.98   -0.08%
  98.92   98.93   +0.01%        7.62    7.62   +0.00%

spam mean and sdev for all runs
  98.30   98.34   +0.04%        9.72    9.66   -0.62%

ham/spam mean difference: 97.22 97.24 +0.02

Skip's patch:

bases.txt -> pickskips.txt
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
-> <stat> tested 397 hams & 384 spams against 3271 hams & 3715 spams
-> <stat> tested 385 hams & 433 spams against 3283 hams & 3666 spams
-> <stat> tested 407 hams & 397 spams against 3261 hams & 3702 spams
-> <stat> tested 350 hams & 412 spams against 3318 hams & 3687 spams
-> <stat> tested 338 hams & 405 spams against 3330 hams & 3694 spams
-> <stat> tested 359 hams & 416 spams against 3309 hams & 3683 spams
-> <stat> tested 358 hams & 405 spams against 3310 hams & 3694 spams
-> <stat> tested 348 hams & 411 spams against 3320 hams & 3688 spams
-> <stat> tested 369 hams & 441 spams against 3299 hams & 3658 spams
-> <stat> tested 357 hams & 395 spams against 3311 hams & 3704 spams
-> <stat> tested 397 hams & 384 spams against 3271 hams & 3715 spams
-> <stat> tested 385 hams & 433 spams against 3283 hams & 3666 spams
-> <stat> tested 407 hams & 397 spams against 3261 hams & 3702 spams
-> <stat> tested 350 hams & 412 spams against 3318 hams & 3687 spams
-> <stat> tested 338 hams & 405 spams against 3330 hams & 3694 spams
-> <stat> tested 359 hams & 416 spams against 3309 hams & 3683 spams
-> <stat> tested 358 hams & 405 spams against 3310 hams & 3694 spams
-> <stat> tested 348 hams & 411 spams against 3320 hams & 3688 spams
-> <stat> tested 369 hams & 441 spams against 3299 hams & 3658 spams

false positive percentages
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.246  0.246  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.557  0.557  tied          
    0.559  0.559  tied          
    0.287  0.287  tied          
    0.000  0.000  tied          

won   0 times
tied 10 times
lost  0 times

total unique fp went from 6 to 6 tied          
mean fp % went from 0.164881884948 to 0.164881884948 tied          

false negative percentages
    0.253  0.253  tied          
    0.781  0.781  tied          
    0.462  0.462  tied          
    0.756  0.756  tied          
    0.243  0.243  tied          
    0.247  0.247  tied          
    0.240  0.240  tied          
    0.494  0.494  tied          
    0.973  0.973  tied          
    0.454  0.454  tied          

won   0 times
tied 10 times
lost  0 times

total unique fn went from 20 to 20 tied          
mean fn % went from 0.490257037938 to 0.490257037938 tied          

ham mean                     ham sdev
   1.18    1.18   +0.00%        7.76    7.76   +0.00%
   0.99    0.99   +0.00%        6.64    6.64   +0.00%
   0.84    0.84   +0.00%        6.14    6.14   +0.00%
   1.99    1.99   +0.00%        9.46    9.46   +0.00%
   0.49    0.50   +2.04%        3.59    3.60   +0.28%
   0.85    0.87   +2.35%        5.45    5.55   +1.83%
   1.16    1.16   +0.00%        9.30    9.30   +0.00%
   1.20    1.21   +0.83%        8.13    8.14   +0.12%
   1.55    1.55   +0.00%        8.05    8.06   +0.12%
   0.47    0.47   +0.00%        3.22    3.22   +0.00%

ham mean and sdev for all runs
   1.08    1.08   +0.00%        7.13    7.14   +0.14%

spam mean                    spam sdev
  98.75   98.78   +0.03%        8.72    8.56   -1.83%
  97.67   97.70   +0.03%       11.26   11.25   -0.09%
  98.08   98.08   +0.00%       10.12   10.12   +0.00%
  98.16   98.17   +0.01%       10.19   10.15   -0.39%
  98.35   98.38   +0.03%        8.77    8.73   -0.46%
  98.45   98.46   +0.01%        8.97    8.97   +0.00%
  98.35   98.38   +0.03%        9.73    9.68   -0.51%
  98.25   98.29   +0.04%        9.16    9.05   -1.20%
  97.93   97.95   +0.02%       11.99   11.98   -0.08%
  98.92   98.93   +0.01%        7.62    7.62   +0.00%

spam mean and sdev for all runs
  98.30   98.32   +0.02%        9.72    9.68   -0.41%

ham/spam mean difference: 97.22 97.24 +0.02

3-way compare:

filename:    bases  nntims pickskips
ham:spam:  3668:4099       3668:4099
                   3668:4099      
fp total:        6       6       6
fp %:         0.16    0.16    0.16
fn total:       20      20      20
fn %:         0.49    0.49    0.49
unsure t:      178     173     175
unsure %:     2.29    2.23    2.25
real cost: $115.60 $114.60 $115.00
best cost:  $93.00  $91.20  $92.40
h mean:       1.08    1.10    1.08
h sdev:       7.13    7.21    7.14
s mean:      98.30   98.34   98.32
s sdev:       9.72    9.66    9.68
mean diff:   97.22   97.24   97.24
k:            5.77    5.76    5.78

Rather like Tim's results, really, at least to my ignorant eyes.

=Tony Meyer


From tameyer at ihug.co.nz  Tue Dec 30 22:32:56 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Dec 30 22:33:01 2003
Subject: [spambayes-dev] pop3proxy_tray error
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130458F6E9@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777C0@its-xchg4.massey.ac.nz>

[Kenny on the 5th of December]
> I tried to stop SpamBayes from the right-click menu, and then 
> start it again.  Here's the output I got when it tried to 
> restart SpamBayes.
[...]
>     serverStrings = ["%s:%s" % (s, p) for s, p in self.servers]
> TypeError: iteration over non-sequence

I finally got around to finding this and fixing it :)  I think it's also the
problem that Richie found.

Fixed in sb_server 1.16, I hope.

=Tony Meyer


From tim.one at comcast.net  Tue Dec 30 22:46:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Dec 30 22:46:40 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEJHIBAB.tim.one@comcast.net>

[Tony Meyer, tries the patches]

Thanks, Tony!

> ...
> Rather like Tim's results, really, at least to my ignorant eyes.

The results are both as weakly positive as things get, but at least neither
patch is doing any harm.  As before, I'd rather see Skip try to deal with %
escapes the way my patch did -- that's a common obfuscation trick, and I bet
it accounts for the small reduction in Unsures you saw.  My patch should do
a lot more to penalize that trick than Skip's.

Both patches tokenize the de-obfuscated URL, so they're a wash in that
respect.

Skip's patch also exposes higher-level concepts to the classifier, like
"non-standard port number".  I don't see that often, but when I do it's
usually in email from my work account (e.g., trying to get me to preview a
pre-release site change, accessed via a non-standard port so it doesn't
interfere with the production site).  That's OK, though:  *my* classifier
will learn that's a ham clue in my email mix -- so it goes.
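
For readers on a current Python, the structural clues in Skip's patch can be
sketched with urllib.parse instead of the Python 2 urllib/urlparse calls the
patch uses.  This is only an approximation of the idea: the function name,
the DEFAULT_PORTS table and the "url:unparseable" token are assumptions added
here, not part of the patch.

```python
from urllib.parse import unquote, urlsplit

# Assumed table of "standard" ports; the patch only checks http and https.
DEFAULT_PORTS = {"http": 80, "https": 443}

def url_structure_clues(url):
    """Return clue tokens about user info and non-standard ports in a URL."""
    clues = []
    try:
        # Unquote first, as the patch does, so escaped '@' or ':' are seen.
        parts = urlsplit(unquote(url))
        port = parts.port          # raises ValueError for a malformed port
    except ValueError:
        return ["url:unparseable"]  # hypothetical token, not in the patch
    if parts.username is not None:
        clues.append("url:has user")
    default = DEFAULT_PORTS.get(parts.scheme)
    if port is not None and default is not None and port != default:
        clues.append("url:non-standard %s port" % parts.scheme)
    return clues
```

For example, url_structure_clues("http://user:pw@example.com:8080/x") yields
both the "has user" and the "non-standard http port" clues, while a plain
"http://example.com/" yields none.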

Since everyone is getting some good out of Skip's changes (and I don't think
his treatment of % escapes is making a difference), and also getting some
good out of mine (which don't try to do anything except get some good out of
% escapes), combining the two will do better than either, or cancel each other
out <0.5 wink>.


From mhammond at skippinet.com.au  Tue Dec 30 23:31:45 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 30 23:31:58 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
Message-ID: <10fe01c3cf57$02248cc0$2c00a8c0@eden>

> > The ini file for the proxy appeared in "C:\Documents and
> > Settings\rjh\Application Data\SpamBayes\Proxy" as you'd
> > expect, but the database and cache directories appeared in
> > "C:\Program Files\SpamBayes\bin".
>
> Did the ini file have the appropriate [Storage] lines in it?
> It's meant to add them in there, storing the directories in
> that directory, too.
> You didn't already have an ini file in there, did you?  (It only
> adds those
> lines if it's a new file, so that it doesn't overwrite
> someone's settings).

I really don't like the code in Options.py that handles the default values
for these storage items.  I'm not sure it is to blame, but it did cause me
to see a new .db file created in the cwd, rather than the data directory -
as my INI file already existed, it didn't get the default FQN for the new
option.

IMO, the ini files should generally store relative path names, being
relative to the directory of the config file being used.  This means we
never allow the cwd to determine anything other than the location of the
main config file, as all paths resolve via the directory of this file.  A
single Options.resolve_path() should be able to do this for us.

Code speaks louder than words - I'm suggesting:

Options.py, line 1156, the code starting:
                        # If the file doesn't exist, then let's get the user to
                        # store their databases and caches here as well, by
                        # default, and save the file.
                        db_name = os.path.join(windowsUserDirectory,
                                               "statistics_database.db")

And all similar setting of the options to FQNs die.  The default remains
"statistics_database.db" .  All code that uses this option
('persistent_storage_file') does so via a new function:

def get_pathname_option(section, value):
  filename = options.get(section, value)
  if os.path.isabs(filename):
    return filename  # already absolute: use it as-is
  # maybe expanduser() to *nix?
  # relative: resolve against the directory of the config file in use
  return os.path.join(os.path.dirname(optionsPathname), # existing global
                      filename)

Or-something-like-that ly,

Mark.


From mhammond at skippinet.com.au  Tue Dec 30 23:45:26 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Dec 30 23:45:39 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1D@its-xchg4.massey.ac.nz>
Message-ID: <110101c3cf58$e95d3fa0$2c00a8c0@eden>

> I briefly tried it, and all seemed ok to me (Outlook XP and
> the others on
> WinXP).  I've also been trying various experimental builds of
> my own (both
> on the XP box and on Win98 at home).  I can't see how they would be
> different to your build since it's using the same process.
> (Latest CVS each
> time).

I've a few win32all changes yet to release in binary - but from memory most
are pretty trivial.  I'll do a win32all at the same time (next year :)

> Sorry I didn't report back earlier - didn't your message say
> that you were
> away for a couple of weeks and wouldn't be looking at
> anything until then?

It probably did, but I meant a few days :)  I always had to get back in time
for this huge bender of a party we have planned!  About to take off now.

> I got a 'not urgent' impression from it :)

It certainly wasn't urgent, and I'm glad it worked so well.  I was too
pessimistic to believe "no news is good news", but it seems to have been the
case!

> is it sufficient for
> a 8.5 release,
> do you think?

I think an 0.85 would be perfect, in both source and binary.  We then go to
0.9, and we could still end up with 1.0 by March!

Happy new year!

Mark.


From skip at pobox.com  Wed Dec 31 00:02:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 31 00:02:29 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>
References: <16369.61971.794205.276663@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>
Message-ID: <16370.22611.419923.477159@montanaro.dyndns.org>


    Tim> Note that this part of the patch can't be helping much:

    Tim> +             num_pcs = url.count("%")
    Tim> +             if num_pcs:
    Tim> +                 pushclue("url:%d %%s" % num_pcs)

    Tim> That is, raw counts are almost never useful -- if I have a URL in a
    Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
    Tim> 39 (or 41) escapes.  Pumping out log2(a_count) usually does more
    Tim> good.  

I realized that before trying, but not having any raw data upon which to
base things, I left it as-is.  If I enable it I'll look at some results to
see what tokens are actually generated and how they seem to correlate with
ham and spam.  One other possibility would be a sort of "Watership Down"
approach: "1, 2, 3, many" (or something similar - rabbits can't count very
high).  The problem with log2(count) in this situation is there seems to be
a practical limit to how many % signs a URL might have (maybe 50?), so
something that creates buckets using division (counts // 5 ???) might do a
decent job of lumping things together.
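
Both bucketing ideas are easy to sketch.  The function names, the cutoff of 3
and the bucket width of 5 below are placeholders taken from the wording
above, not tested values:

```python
def many_bucket(count, limit=3):
    """'1, 2, 3, many': exact counts up to `limit`, then one catch-all."""
    return str(count) if count <= limit else "many"

def div_bucket(count, width=5):
    """Fixed-width buckets by integer division: 0-4 -> 0, 5-9 -> 1, ..."""
    return count // width
```

Either way, neighbouring counts can still straddle a bucket edge (39 and 40
land in different div_bucket buckets), so the choice is about how many
distinct tokens the classifier has to learn, not about removing edges.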

I'm off work the next couple of days and have some house guests in from out
of town, so I probably won't look at this much.  I will try to at least
build a database from my current training set using this feature and see how
things shake out.  (Maybe tomorrow morning before everyone's up and about.)

    Tim> I *expect* the approach in my patch would work better, though
    Tim> (generating lots of correlated tokens -- there are good reasons to
    Tim> escape some punctuation characters in URLs, but the only good
    Tim> reason to escape a letter or digit is to obfuscate; let the
    Tim> classifier see these things, and it will learn that on its own, as
    Tim> appropriate, for each escape code; then a URL escaping several
    Tim> letters or digits will get penalized more the more heavily it
    Tim> employs this kind of obfuscation).

My problem with that approach is the stuff the spammers escape can be
essentially random, as in the bogus URL you received.  I think you might get
scads of hapaxes (or at least low-count escapes).  Stuff with high-counts
will be legitimate (%20 and so forth).  Conclusions obviously await some
eyeballing of databases.
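
A quick way to do that eyeballing outside the classifier is to tally escapes
over a pile of URLs and list the ones seen only once.  This is a sketch under
assumptions: the helper names are made up, it only matches well-formed
two-hex-digit escapes, and it counts occurrences rather than messages, so it
is not the same thing as the training database's per-message counts:

```python
import re
from collections import Counter

ESCAPE_RE = re.compile(r'%[0-9A-Fa-f]{2}')

def escape_counts(urls):
    """Count how often each %nn escape occurs across an iterable of URLs."""
    counts = Counter()
    for url in urls:
        counts.update(e.lower() for e in ESCAPE_RE.findall(url))
    return counts

def hapaxes(counts):
    """Return the escapes that were seen exactly once, sorted."""
    return sorted(e for e, n in counts.items() if n == 1)
```

If most spam escapes show up in hapaxes() and only things like %20 have high
counts, that would support the worry above; if the same letter and digit
escapes keep recurring, it would support the per-escape tokens.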

    >> (*) Operational question: Given that my training data is somewhat
    >> small at the moment (roughly 1000-1500 each of ham and spam), would I
    >> be better off testing with fewer larger sets (e.g, 5 sets w/ 250 msgs
    >> each) or with more smaller sets (e.g, 10 sets w/ 125 msgs each)?

    Tim> If you ask me <wink>, cross-validation should *always* be done with
    Tim> a minimum of 10 sets, regardless of how much data you have.  There
    Tim> are many reasons for this, from statistical reliability of the
    Tim> grand averages at the end (they're subject to central-limit theorem
    Tim> constraints, and the more sets the more reliable they are, growing
    Tim> with the square root of the # of sets); 

Thanks, I will rebalance my training database to 10 sets and see how that
goes. 
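
For reference, dealing a message list into N sets and holding one out at a time could be sketched roughly as below. This is an illustration only, not the SpamBayes test tooling; split_into_sets and cross_validation_folds are hypothetical names, and the messages are stand-in strings:

```python
import random


def split_into_sets(messages, nsets=10, seed=0):
    """Shuffle messages and deal them round-robin into nsets groups."""
    shuffled = list(messages)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::nsets] for i in range(nsets)]


def cross_validation_folds(sets):
    """Yield (training, testing) pairs, holding out one set at a time."""
    for i, test_set in enumerate(sets):
        train = [msg for j, s in enumerate(sets) if j != i for msg in s]
        yield train, test_set


# 1250 stand-in messages give 10 sets of 125 each.
hams = ["ham-%04d" % i for i in range(1250)]
sets = split_into_sets(hams, nsets=10)
for train, test in cross_validation_folds(sets):
    assert len(train) + len(test) == len(hams)
```

With 10 sets each run trains on 90% of the data, against 80% with 5 sets, so the smaller test sets come with a slightly larger training set per run.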

    Tim> Note, though, that cross-validation is modeling the performance of
    Tim> a train-on-everything strategy, and in random time order to boot.

The random time order isn't so important to me at the moment, because all
the messages I'm using are recent (received within the past month or so).
The "train on everything" aspect is more interesting.  I find the
cross-validation tests never perform as well as in real life. ;-)

    Tim> If that's not how you train, the results may be irrelevant to what
    Tim> you'll see in real life.  It should be good enough to weed out
    Tim> really bad ideas-- and highlight really good ones --regardless,
    Tim> though.

There's the rub.  What might be really good ideas at this point will
probably only result in very small changes in performance because the
baseline system is currently so good.

Skip


From skip at pobox.com  Wed Dec 31 00:04:33 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 31 00:04:48 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz>
Message-ID: <16370.22753.639278.165195@montanaro.dyndns.org>


    Tony> 3-way compare:

    Tony> filename:    bases  nntims pickskips
    Tony> ham:spam:  3668:4099       3668:4099
    Tony>                    3668:4099      
    Tony> fp total:        6       6       6
    Tony> fp %:         0.16    0.16    0.16
    Tony> fn total:       20      20      20
    ...

What do you use to generate these three-way comparisons?

Skip

From richie at entrian.com  Wed Dec 31 08:54:25 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 08:54:34 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/scripts
	sb_server.py, 1.15, 1.16
In-Reply-To: <E1AbX68-0007pU-00@sc8-pr-cvs1.sourceforge.net>
References: <E1AbX68-0007pU-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <cqj5vv0n82fgpgj8ivgls80e9htb83591k@4ax.com>


[Tony]
> Modified Files:
> 	sb_server.py 
> Log Message:
> When we stopped sb_server and then restarted, we didn't init the state, so it
> wouldn't work.  Fix that.
> 
> [...]
> 
>   def prepare():
> +     state.init()
>       state.prepare()

This edit keeps appearing and disappearing.  Mark removed that line in
order to fix the fact that none of sb_server's command line arguments
worked.  Tony has now put it back in order to fix a restart problem, which
has once again broken all the command line arguments.

The docstring for State.__init__() describes how the code was originally
intended to work:

"""Initialises the State object that holds the state of the app.
The default settings are read from Options.py and bayescustomize.ini
and are then overridden by the command-line processing code in the
__main__ code below."""

The __main__ code is now in a function called run(), and the code to read
the options is now in State.init().  Calling State.init() a second time,
as prepare() now does, overwrites the command line options set up by
run().

It does seem weird that there's a State.prepare() and a global prepare(),
but the global prepare() calls both State.init() and State.prepare().  I
don't know much about this code (if it was checked in under my name, it
must have been my evil twin that wrote it 8-)

Does anyone have a clear idea of what each of prepare(), State.__init__(),
State.init(), and State.prepare() are all intended to do?  I think the
command line option code needs to be inserted somewhere into one of them,
but I'm not 100% sure what each of them is for.

PS. (next eight hours):  Mark: You are too drunk to reply.
PS. (after eight hours): Mark: How's the head?  8-)

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Wed Dec 31 09:38:40 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 09:38:49 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <10fe01c3cf57$02248cc0$2c00a8c0@eden>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
	<10fe01c3cf57$02248cc0$2c00a8c0@eden>
Message-ID: <non5vv0qn19fuh4ameoufv2dtjei15v08p@4ax.com>


[Mark]
> IMO, the ini files should generally store relative path names, being
> relative to the directory of the config file being used.

+1, definitely.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Wed Dec 31 09:59:15 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 09:59:22 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1304985F95@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2A1C@its-xchg4.massey.ac.nz>
Message-ID: <fsn5vvci3724s88kce5e703hnnq2o5fpvu@4ax.com>


[Tony]
> Did the ini file have the appropriate [Storage] lines in it?  It's meant to
> add them in there, storing the directories in that directory, too.  You
> didn't already have an ini file in there, did you?  (It only adds those
> lines if it's a new file, so that it doesn't overwrite someone's settings).

The environment's at work, so I don't know.  I can find out on Friday.

> Fair enough.  Line 435 of ui.html; change it to whatever you like most :)

Done.  20.

> The code has it looking in all three, and a quick test
> here had it finding messages in more than one.

You're quite right.  I've no idea what happened last time - I'll
double-check on Friday.

> If you've still got that false positive statistic around,
> could you give it a run from cvs?

Yes, that's now working.  Thanks for that, and the other fixes.

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Wed Dec 31 10:53:27 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Dec 31 10:53:32 2003
Subject: [spambayes-dev] A URL experiment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>
References: <16369.61971.794205.276663@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>
Message-ID: <16370.61687.75647.724533@montanaro.dyndns.org>


    Tim> Note that this part of the patch can't be helping much:

    Tim> +             num_pcs = url.count("%")
    Tim> +             if num_pcs:
    Tim> +                 pushclue("url:%d %%s" % num_pcs)

    Tim> That is, raw counts are almost never useful -- if I have a URL in a
    Tim> spam that embeds 40 escapes, that does nothing to indict a URL with
    Tim> 39 (or 41) escapes.  Pumping out log2(a_count) usually does more
    Tim> good.  

<aside type="slight">

"url:has user" seems to be fairly spammy for me:

    % spamcounts -r -d ~/tmp/hammie.db '^url:has user'
    db: /Users/skip/tmp/hammie.db
    token,nspam,nham,spam prob
    url:has user,42,4,0.91016660508

</aside>

Okay, here are the raw number of URL percents as present in my current
ham/spam database:

    npcs    nspam   nham
    1       21      46  
    2       4       1   
    3       2       2   
    4       1       2   
    5       0       1   
    6       2       2   
    7       1       1   
    8       0       2   
    14      2       0   
    15      0       1   
    16      1       0   
    18      1       0   
    23      1       0   
    24      1       0   
    28      1       0   
    30      1       0   
    38      2       0   
    40      1       0   
    42      1       0   
    74      1       0   
    75      1       0   
    84      1       0   
    97      1       0   
    103     1       0   
    109     1       0   
    191     1       0   

I redid my patch to generate tokens like so:

    pushclue("url:%%%d" % int(log2(num_pcs)))

Converting the first column to int(log(n,2)) then rebuilding the database
gives: 

    log(npcs)   nspam   nham
    0           21      46
    1           6       3
    2           4       2
    3           2       2
    4           5       0
    5           3       0
    6           2       0
    7           1       0
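
The mapping from raw count to bucket can be sketched as follows. The helper name log2_bucket_token is made up, and bit_length() is swapped in for int(log2(n)) as an equivalent for positive integers that avoids float rounding. It only shows the mapping and does not attempt to reproduce the counts in the table:

```python
def log2_bucket_token(num_pcs):
    """Token for a URL with num_pcs (>= 1) percent escapes.

    The bucket is floor(log2(num_pcs)), so each bucket covers twice
    the range of the previous one: 1, 2-3, 4-7, 8-15, ...
    """
    if num_pcs < 1:
        raise ValueError("num_pcs must be at least 1")
    # For positive ints, bit_length() - 1 equals floor(log2(n)).
    return "url:%%%d" % (num_pcs.bit_length() - 1)


# 39, 40 and 41 escapes all produce the same token.
for n in (1, 2, 3, 39, 40, 41, 191):
    print(n, log2_bucket_token(n))
```

This is the property Tim was after: a URL with 39 escapes and one with 41 now generate the same clue, so training on one says something about the other.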

The new cv test results are essentially the same (I still have just five
sets):

stds.txt -> pickurlss.txt
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams
-> <stat> tested 250 hams & 300 spams against 1000 hams & 1200 spams

false positive percentages
    0.000  0.000  tied          
    0.400  0.400  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          
    0.000  0.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fp went from 1 to 1 tied          
mean fp % went from 0.08 to 0.08 tied          

false negative percentages
    3.333  3.333  tied          
    5.000  5.000  tied          
    7.333  7.333  tied          
    5.667  5.667  tied          
    4.000  4.000  tied          

won   0 times
tied  5 times
lost  0 times

total unique fn went from 76 to 76 tied          
mean fn % went from 5.06666666667 to 5.06666666667 tied          

ham mean                     ham sdev
   1.64    1.64   +0.00%        8.44    8.45   +0.12%
   0.99    0.99   +0.00%        8.29    8.29   +0.00%
   2.82    2.82   +0.00%       12.52   12.52   +0.00%
   1.58    1.58   +0.00%        8.29    8.29   +0.00%
   1.30    1.30   +0.00%        8.04    8.04   +0.00%

ham mean and sdev for all runs
   1.66    1.66   +0.00%        9.30    9.30   +0.00%

spam mean                    spam sdev
  93.80   93.83   +0.03%       19.39   19.31   -0.41%
  90.56   90.59   +0.03%       24.31   24.26   -0.21%
  89.24   89.28   +0.04%       27.03   27.04   +0.04%
  89.27   89.27   +0.00%       25.51   25.50   -0.04%
  92.72   92.74   +0.02%       21.67   21.67   +0.00%

spam mean and sdev for all runs
  91.12   91.14   +0.02%       23.81   23.79   -0.08%

ham/spam mean difference: 89.46 89.48 +0.02

Skip

From popiel at wolfskeep.com  Wed Dec 31 14:06:34 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 31 14:06:39 2003
Subject: [spambayes-dev] Semi-results for TOE, TOAE, and expiry
Message-ID: <20031231190634.B0F0D2DF7C@cashew.wolfskeep.com>

Yes, a few days ago I promised a further report on various things,
including the effects of alternate start points in my dataset and
expiry on the train_on_everything and train_on_almost_everything
regimes.  Unfortunately, all I have at the moment is some
preliminary results and a big heap of frustration pointed at
my computer.

1. There doesn't appear to be anything particularly magical
   about 120 days after start.  Rotating my data forward or
   backward 80 days shows that (a) there was a particular
   event/change in my data at about 120 days after I started
   collecting that affects the accuracy of further 
   classifications, and (b) the general curve of getting
   better for a few months then decaying for the rest of
   time still holds even when the data is rotated... but
   the curve is not as distinct when not reinforced by (a).

2. Expiry (as I implemented it) appears to be a very bad thing
   for long-term TOAE.  I implemented it to expire trained
   messages after 120 days, without completely rebuilding the
   classifier.  This resulted in significantly degraded accuracy
   after about 250 days, though that may just be due to an ever
   increasing spam/ham imbalance.

   There was a sharp drop in the amount of spam training for
   about 30 days after the initial expiry date, and then a net
   spam training rate about equivalent to non-expiring TOAE
   until the "latest windows update" worm, after which spam
   training about doubled the non-expiry version.  This
   seems to show that spam mutation has a strong effect on
   4-month expiry for TOAE.

   On the other hand, net ham training was fairly consistently
   slightly negative after expiry commenced, showing that
   once it got a good idea of what ham was and threw out the
   oddballs that got trained on initially, it didn't need much
   to categorize ham.

   By the end of the mess (at 418 days) the spam:ham ratio was
   over 15:1, and the unsure rate was around 3% (compared to
   non-expiring with 4.5:1 and 1%).

3. Expiry for TOE seems neutral (compared to non-expiring TOE),
   to the best of my ability to eyeball the three runs that
   actually completed.
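
The expiry scheme described in point 2 above (untraining messages older than 120 days without rebuilding the classifier) can be sketched as below. TinyClassifier is a toy token counter standing in for the real classifier, and every name here is hypothetical; it is not the code used for these runs:

```python
from collections import Counter


class TinyClassifier:
    """Toy token counter standing in for a real classifier (hypothetical)."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.spam_counts = Counter()
        self.ham_counts = Counter()

    def learn(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(set(tokens))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, tokens, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.subtract(set(tokens))
        # Drop tokens whose count has fallen to zero.
        for tok in set(tokens):
            if counts[tok] <= 0:
                del counts[tok]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


def expire(classifier, trained, today, max_age=120):
    """Untrain every message trained more than max_age days ago.

    trained is a list of (day, tokens, is_spam) tuples; the tuples
    that are still in force are returned.
    """
    kept = []
    for day, tokens, is_spam in trained:
        if today - day > max_age:
            classifier.unlearn(tokens, is_spam)
        else:
            kept.append((day, tokens, is_spam))
    return kept


clf = TinyClassifier()
trained = [
    (0, ["cheap", "free"], True),
    (100, ["meeting", "free"], False),
    (130, ["free", "worm"], True),
]
for day, tokens, is_spam in trained:
    clf.learn(tokens, is_spam)
trained = expire(clf, trained, today=130)
print(clf.nspam, clf.nham, len(trained))
```

Because expiry only removes what was trained, the spam:ham balance after expiry depends entirely on what each regime chose to train on in the most recent window.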

The graphs I have are at:
http://www.wolfskeep.com/~popiel/spambayes/plots/expire.html

My primary machine (cashew.wolfskeep.com) unfortunately doesn't
seem to have the capability to maintain reliable service while
running these tests anymore.  They're just too big, and it
doesn't have the memory/CPU to do everything all at once
(including running my web server, a mysql engine, my mail feed,
etc.).  Plus, it appears that Linux 2.4.18 doesn't take too
kindly to multiple processes trying to access/manipulate a
single directory with over 100,000 files in it; anything that
touches that directory after things have started going wonky
just hangs in disk-wait.  I'm suspecting a deadlock in the
filesystem layer on extended directory operations... probably
due to not enough file cache (see my memory problems) to hold
the entire structure at once.  I haven't poked deep enough into
the ext2 drivers to be sure, though.

Anyway, I'm not going to be able to do all that much until
I get this straightened out.  I'll add graphs and stuff to the
wiki as I have time, but that's likely to be all for a bit...

- Alex

From popiel at wolfskeep.com  Wed Dec 31 14:15:45 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 31 14:15:50 2003
Subject: [spambayes-dev] A URL experiment 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Tue,
	30 Dec 2003 23:04:33 CST."
	<16370.22753.639278.165195@montanaro.dyndns.org> 
References: <1ED4ECF91CDED24C8D012BCF2B034F130498603B@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13026F2A1F@its-xchg4.massey.ac.nz>
	<16370.22753.639278.165195@montanaro.dyndns.org> 
Message-ID: <20031231191545.E64892DF7C@cashew.wolfskeep.com>

In message:  <16370.22753.639278.165195@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Tony> 3-way compare:
>
>    Tony> filename:    bases  nntims pickskips
>    Tony> ham:spam:  3668:4099       3668:4099
>    Tony>                    3668:4099      
>    Tony> fp total:        6       6       6
>    Tony> fp %:         0.16    0.16    0.16
>    Tony> fn total:       20      20      20
>    ...
>
>What do you use to generate these three-way comparisons?

That's table.py output.

- Alex

From popiel at wolfskeep.com  Wed Dec 31 14:18:58 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Dec 31 14:19:23 2003
Subject: [spambayes-dev] A URL experiment 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Tue,
	30 Dec 2003 23:02:11 CST."
	<16370.22611.419923.477159@montanaro.dyndns.org> 
References: <16369.61971.794205.276663@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEIPIBAB.tim.one@comcast.net>
	<16370.22611.419923.477159@montanaro.dyndns.org> 
Message-ID: <20031231191858.85C5B2DF7C@cashew.wolfskeep.com>

In message:  <16370.22611.419923.477159@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>There's the rub.  What might be really good ideas at this point will
>probably only result in very small changes in performance because the
>baseline system is currently so good.

Aye, I think that's what killed this sort of tokenizer/classifier
testing a year ago...

- Alex

From tameyer at ihug.co.nz  Wed Dec 31 16:45:55 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 31 16:46:02 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/scriptssb_server.py, 1.15, 1.16
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C191@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A20@its-xchg4.massey.ac.nz>

> This edit keeps appearing and disappearing.  Mark removed 
> that line in order to fix the fact that none of sb_server's 
> command line arguments worked.  Tony has now put it back in 
> order to fix a restart problem, which has once again broken 
> all the command line arguments.

Oops.  Sorry, I must try and pay more attention...

> Does anyone have a clear idea of what each of prepare(), 
> State.__init__(), State.init(), and State.prepare() are all 
> intended to do?  I think the command line option code needs 
> to be inserted somewhere into one of them, but I'm not 100% 
> sure what each of them is for.

I'm not sure about intent, but this is what they currently do:

__init__(): calls init()
init():     opens the log file
            sets up the list of servers/ports
            loads options from configuration file
            resets statistics
prepare():  opens mutex
(and in calling createWorkers())
            opens db
            opens the corpora
            creates the trainers

Should init() be only done once for every time that sb_server is run, and
prepare() each time it is started/stopped?  In that case it should be:

__init__(): calls init()
init():     opens the log file
            loads options from configuration file
            resets statistics
prepare():  opens mutex
            sets up the list of servers/ports
(and in calling createWorkers())
            opens db
            opens the corpora
            creates the trainers

This makes prepare() a kind of anti-close().  This is probably the fix I
should have applied - moving setting up the list of servers/ports to
prepare() instead of calling init() again.  It needs to be done *sometime*
after close(), though, and before the call to createWorkers().  Whether the
log file and statistics should be reset on start/stop, I don't know, but I
suspect not.  This would mean that the command-line code could be anywhere
between __init__() and prepare().
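
A minimal sketch of that split, assuming init() runs once per process and prepare() runs on every start. The State class below is a stand-in written for illustration; the option names, port numbers, and attributes are invented and do not come from sb_server.py:

```python
class State:
    """Hypothetical stand-in for sb_server's State object."""

    def __init__(self):
        self.init()

    def init(self):
        # Once per process: defaults that the command line may override.
        self.options = {"port": 110}
        self.stats = {"classified": 0}
        self.servers = None
        self.prepared = False

    def prepare(self):
        # Once per start: rebuild whatever close() tore down,
        # using the options as they stand now.
        self.servers = [("localhost", self.options["port"])]
        self.prepared = True

    def close(self):
        self.servers = None
        self.prepared = False


state = State()
state.options["port"] = 8110   # pretend this came from the command line
state.prepare()
state.close()
state.prepare()                # a restart does not call init() again
print(state.servers)
```

Because prepare() never calls init(), the override set between __init__() and the first prepare() survives the stop/start cycle.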

+1 to docstrings for init() and prepare(), though :)

=Tony Meyer


From tameyer at ihug.co.nz  Wed Dec 31 17:15:41 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Dec 31 17:15:47 2003
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C0C8@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A21@its-xchg4.massey.ac.nz>

> I really don't like the code in Options.py that handles the 
> default values for these storage items.  I'm not sure it is 
> to blame, but it did cause me to see a new .db file created 
> in the cwd, rather than the data directory - as my INI file 
> already existed, it didn't get the default FQN for the new option.

The idea was that if you already had an ini file, then you had already set
things up (you were an existing user), and so we wouldn't want to fiddle
with your setup, whatever it was, because that might mean that we lose track
of your databases.  (This could even be with the default "hammie.db" in the
cwd for persistent_storage_file, if the user always runs the script from the
same directory).

Part of the problem is that when consolidating the storage name options, I
picked "hammie.db" over "~/.hammiedb", which was by far the better option.
I didn't realise that it would expand quite nicely on Windows (well, 2k and
XP; I presume earlier as well), and I presume on Macs as well.

> IMO, the ini files should generally store relative path 
> names, being relative to the directory of the config file 
> being used.

+1.  I can go through and check this in if you like (once others have ripped
into the idea ;).

> Code speaks louder than words - I'm suggesting:
[...]
> The default remains "statistics_database.db".

The proper default, of course, is still "hammie.db".  When I put that code
in to put things into a better place for Windows users I figured that since
they wouldn't have an existing db, a more easily understandable name would
be good, too, but I didn't think that I could change it for everyone.  Do we
continue setting this option to "statistics_database.db" in that place?
(Without the FQN).  The same code also gives new default names for the cache
directories and messageinfo db.

> All code that uses this option
> ('persistent_storage_file') does so via a new function:

Presumably also these:
  [URLRetriever] x-cache_directory
  [Storage] messageinfo_storage_file
  [Storage] spam_cache
  [Storage] ham_cache
  [Storage] unknown_cache

What about these?
  [TestDriver] spam_directories
  [TestDriver] ham_directories

> def get_pathname_option(section, value):
>   filename = options.get(section, value)
>   if not os.path.isabs(filename):
>     return filename

Shouldn't that be "if os.path.isabs(filename):"?

>   # maybe expanduser() to *nix?

I think this is necessary, yes, but before this (because
os.path.isabs("~/hammie.db") is False).

What about this?

def get_pathname_option(section, value):
    filename = os.path.expanduser(options.get(section, value))
    if os.path.isabs(filename):
        return filename
    return os.path.join(os.path.dirname(optionsPathname), # existing global
                        filename)
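
A self-contained variant for trying the behaviour out, with the config file location passed in rather than read from the optionsPathname global. The name resolve_pathname and the example paths are hypothetical:

```python
import os


def resolve_pathname(value, config_path):
    """Expand ~ and resolve relative paths against the config file's directory."""
    filename = os.path.expanduser(value)
    if os.path.isabs(filename):
        return filename
    config_dir = os.path.dirname(os.path.abspath(config_path))
    return os.path.join(config_dir, filename)


# Example config location; any path would do.
cfg = os.path.join("some", "dir", "bayescustomize.ini")
print(resolve_pathname("hammie.db", cfg))
print(resolve_pathname("~/.hammiedb", cfg))
```

Expanding the user directory first matters because "~/hammie.db" is not absolute until it has been expanded.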

How do people feel about having this happen implicitly when one of these
options is used, rather than explicitly?  (I worry that we'll miss an
occurrence of one of them, or that someone (maybe me!) will add new code and
forget to use the get_pathname_option function).  Something like this:

[Current code in OptionsClass.py]
    def get(self):
        '''Get option value.'''
        return self.value

[Proposed]
    def expand_path(self, value):
        '''Expand ~ and make the path relative to the config file.'''
        filename = os.path.expanduser(value)
        if os.path.isabs(filename):
            return filename
        # optionsPathname is the existing global
        return os.path.join(os.path.dirname(optionsPathname), filename)

    def get(self):
        '''Get option value.

        If the option is a path, then get relative to the
        configuration file.'''
        # maybe also VARIABLE_PATH?
        if self.allowed_values in [PATH, FILE_WITH_PATH]:
            return self.expand_path(self.value)
        return self.value

=Tony Meyer


From richie at entrian.com  Wed Dec 31 20:54:51 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Dec 31 20:55:01 2003
Subject: [spambayes-dev] Strange performance dip and DBRunRecoveryError
	retreat
Message-ID: <tpt6vvo49go5u86u4f3rd1894hn2ub9s8t@4ax.com>


As part of trying to reproduce the DBRunRecoveryError problems (a task
that I'm giving up on for now - see below) I've written a script to hammer
the core SpamBayes code, repeatedly training and classifying using
faked-up messages.  It manages about 40 train-and-classify loops per
second on my 2.4GHz P4, *except* between about 100 and 400 messages, when
the performance drops to about a tenth of that and then recovers.

I've done enough investigation to know that the time is being spent in the
core SpamBayes code and not my script, that it's only the occasional
message that takes a long time (around a second in a few cases) and that
it can be either training or classifying that slows down.

I've committed the script as testtools/hammer.py, and I offer this as a
curiosity to anyone interested.  I'm not going to pursue this myself
because I've never seen a similar complaint about real-world SpamBayes
use.

The script includes code to build fake emails that look similar to
real-world ones, but which are all unique and include random elements.
Maybe this will be useful to someone in the future.  It works by taking a
small collection of real emails and chopping pieces out of them at random,
then stitching them back together.
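
The chop-and-stitch idea can be sketched as below. This is not the code in testtools/hammer.py; make_fake_message, its parameters, and the sample strings are invented for illustration:

```python
import random


def make_fake_message(real_messages, rng, pieces=4, piece_len=40):
    """Stitch random slices of real messages into a new fake body."""
    chunks = []
    for _ in range(pieces):
        source = rng.choice(real_messages)
        # Pick a start so the slice fits; short sources are used whole.
        start = rng.randrange(max(1, len(source) - piece_len + 1))
        chunks.append(source[start:start + piece_len])
    # A random element, so two fakes are very unlikely to be identical.
    chunks.append("id-%016x" % rng.getrandbits(64))
    return " ".join(chunks)


samples = [
    "Hi all, the meeting has moved to Thursday at ten, same room as before.",
    "Limited time offer: click here to claim your free prize today!",
    "Attached is the patch for the tokenizer; tests pass on my machine.",
]
rng = random.Random(0)
fakes = [make_fake_message(samples, rng) for _ in range(100)]
print(fakes[0])
```

Seeding the generator keeps a run repeatable, which helps when trying to reproduce a slowdown at a particular message count.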

I don't think the script is going to be a lot of use in tracking down
DBRunRecoveryErrors - it *will* reproduce them as it is, but only by
mimicking a bug that was fixed in 1.0a6, and people have still been
complaining about DBRunRecoveryErrors in 1.0a6 and 1.0a7.

Having read up on full-mode bsddb, and bsddb-backed ZODB (including the
phrases "The underlying Berkeley database technology requires maintenance,
careful system resource planning, and tuning for performance." and
"BerkeleyDB never deletes "old" log files. Eventually, if you do not
maintain your Berkeley database by deleting "old" log files, you will run
out of disk space") I've given up - for the moment at least - on trying to
use full-mode bsddb (with or without ZODB).  sb_server users should use a
pickle and be done with it.  Maybe we should change the default.  Maybe
it's five to two and I should be in bed.

-- 
Richie Hindle
richie@entrian.com


From sourceforge at metrak.com  Wed Dec 31 22:55:58 2003
From: sourceforge at metrak.com (Paul Sorenson)
Date: Wed Dec 31 22:56:05 2003
Subject: [spambayes-dev] converToMbox broken
Message-ID: <00b601c3d01b$2a13f030$c48b0fcb@home.classware.com.au>

Further to my insane ramblings the other day about trying to train on dbx
files with the web interface, the only way I could get
oe_mailbox.convertToMbox() to work was by adding:

from time import *

at the top of the file.  Otherwise you get "global symbol strftime not
found"

I am using python 2.3.2 on linux this time round, the other day I tried it
on a WinXP box with 2.3.3.
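
A narrower alternative to the star import would be to import the module and qualify the call. The sketch below shows the pattern only; format_mbox_date is a made-up helper, and what oe_mailbox actually does with strftime has not been checked here:

```python
import time


def format_mbox_date(timestamp):
    """Format a Unix timestamp (UTC) as a fixed-layout date string.

    The layout is an assumption for this example, not necessarily
    the one oe_mailbox writes.
    """
    return time.strftime("%a %b %d %H:%M:%S %Y", time.gmtime(timestamp))


print(format_mbox_date(0))
```

Writing "from time import strftime" would fix the missing name as well, without pulling every name from the time module into the file's namespace.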

Cheers