From Tim@mail.powweb.com  Fri Nov  1 00:07:10 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 18:07:10 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com>
Message-ID: <NH1PLL61693URE06FBEDXWOK4XB0.3dc1c5ae@riven>

There is considerable discussion in the original papers as to the usefulness of Spambayes not only for filtering spam, but also for altering the behavior of 
spammers.  This second consideration is actually much more powerful than the first.  If spam can be successfully dealt with, in a way that allows evolution as 
spammers evolve, then eventually spammers will be so restricted that their activities will not be profitable.  THIS is the real goal.  So, to that end, we must 
make Spambayes useful to a huge audience.  That means the windoze platform (cough hack gag).

Direct delivery users are almost invariably the crème de la crème of users.  They will know how to wire Spambayes into their world, almost instinctively.  But 
the people that we need to have using this product (because there are SO many of 'em) are the type who might actually have trouble configuring a 
pop3proxy... a brain-dead-simple installation is required.

So... unless we want this to simply be interesting research, we gotta take it to the masses....


10/31/2002 5:51:45 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <SRMJLJVPA8WQQ43874XEC51RQ1USNZX.3dc1bcb9@riven>
>             Tim Stone Four Stones Forum <tim@fourstonesforum.com> writes:
>
>>But can it be useful to the masses?  The pop3proxy is the right way to go
>>in my opinion.
>
>You folks make me feel like such a fuddy-duddy, still using MH
>from a shell account with the mailboxes fetched through the
>filesystem, instead of through some network mailbox protocol...
>Heck, I don't even have software to access a POP mailbox installed...
>
>I guess that raises the question: what is our target audience,
>and how strictly do we want to cater to them?  Do we want to
>offer support for processing in direct-delivery situations,
>even though it's only old-school fuddy-duddies like myself
>who use them, anymore?
>
>- Alex
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Fri Nov  1 00:14:41 2002
From: neale@woozle.org (Neale Pickett)
Date: 31 Oct 2002 16:14:41 -0800
Subject: [Spambayes] Database reduction
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
Message-ID: <w53k7jyf8ni.fsf@woozle.org>

So then, Tim Peters <tim.one@comcast.net> is all like:

> [cool database trick]

The bigger problem, at least for hammie, is that pickling wordinfo
instances makes huge strings, the majority of which is redundant
information.  When pickling a Bayes object, the pickler is smart enough
not to repeatedly say "this is a wordinfo object" but rather, I assume,
"this is of type 2", only having to name type 2 once.  However, hammie
pickles each wordinfo individually, keyed by a string.  This makes for
fast lookups, but giant databases.
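
To make the size difference concrete, here is a rough sketch. It is not
hammie's actual storage code: a plain dict stands in for a wordinfo
record, and the field names (spamcount, hamcount) and the word list are
made up for illustration. It compares one pickle of the whole collection
against one pickle per record, and against per-record pickles of bare
tuples that leave out the repeated field names.

```python
import pickle

# Hypothetical stand-in for hammie's per-word records; the field names
# are assumptions made for this sketch.
records = {
    "word%d" % i: {"spamcount": i, "hamcount": 2 * i}
    for i in range(1000)
}

# One pickle for everything: the pickler memoizes the repeated field
# names, so each name is spelled out only once.
whole_size = len(pickle.dumps(list(records.values()), protocol=2))

# One pickle per record (as when each value is stored under its own key):
# every pickle has to spell out the field names again.
per_record_size = sum(
    len(pickle.dumps(rec, protocol=2)) for rec in records.values()
)

# Per-record pickles of bare tuples: no field names to repeat.
per_tuple_size = sum(
    len(pickle.dumps((rec["spamcount"], rec["hamcount"]), protocol=2))
    for rec in records.values()
)

print("one big pickle:    ", whole_size)
print("per-record dicts:  ", per_record_size)
print("per-record tuples: ", per_tuple_size)
```

Storing tuples (or any encoding that leaves the names out) is one
possible direction for shrinking the database while keeping per-key
lookups; whether it pays off for hammie would need measuring.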

Tim just mentioned a performance tweak; is this an indicator that now
would be a good time to resume trying to reduce hammie's database size?

Neale

From rmunn@pobox.com  Fri Nov  1 00:37:12 2002
From: rmunn@pobox.com (Robin Munn)
Date: Thu, 31 Oct 2002 18:37:12 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com>
References: <SRMJLJVPA8WQQ43874XEC51RQ1USNZX.3dc1bcb9@riven>
	<20021031235145.C0376F59F@cashew.wolfskeep.com>
Message-ID: <20021101003712.GA28132@rmunnlfs>


On Thu, Oct 31, 2002 at 03:51:45PM -0800, T. Alexander Popiel wrote:
> In message:  <SRMJLJVPA8WQQ43874XEC51RQ1USNZX.3dc1bcb9@riven>
>              Tim Stone Four Stones Forum <tim@fourstonesforum.com> writes:
>
> >But can it be useful to the masses?  The pop3proxy is the right way to go
> >in my opinion.
>
> You folks make me feel like such a fuddy-duddy, still using MH
> from a shell account with the mailboxes fetched through the
> filesystem, instead of through some network mailbox protocol...
> Heck, I don't even have software to access a POP mailbox installed...
>
> I guess that raises the question: what is our target audience,
> and how strictly do we want to cater to them?  Do we want to
> offer support for processing in direct-delivery situations,
> even though it's only old-school fuddy-duddies like myself
> who use them, anymore?

The "itch" that I'm scratching is that I'm tired of seeing all my
non-techie friends using inferior technology because the quality
open-source solutions are too complicated for them and/or have
user-unfriendly interfaces. So I'm inclined to focus on solutions that
will cater to the general public's needs first; techies capable of
scratching their own itches are going to be a distant second on my
priority list, personally. Certainly we should offer support for as many
configurations as possible, including direct-delivery situations, but I
want to first focus on a solution for the general non-techie public.

I agree that pop3proxy is the optimal way to go, but it does require the
ISP's cooperation to install. I also want a solution that the end user
can install without needing the ISP's cooperation; something that could
integrate into, say, Outlook Express and add a "Block this junk mail"
button (which adds the message to the spam corpus) to the E-mail reading
interface. Taking this kind of approach will lead to more work for us,
but would make the project useful sooner for all kinds of users.
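
As a rough idea of what such a button might do behind the scenes, here
is a minimal sketch. Everything in it is hypothetical: the handler
names, the in-memory counters, and the crude tokenizer are assumptions
for illustration, not the Spambayes tokenizer or any mail client's
plugin API.

```python
import re
from collections import Counter

# Hypothetical in-memory training state; a real filter would persist this.
spam_counts = Counter()
ham_counts = Counter()


def tokenize(text):
    """Very crude tokenizer (an assumption, not the Spambayes tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def on_block_junk_clicked(message_text):
    """What a 'Block this junk mail' button handler might do:
    count each distinct token once toward the spam corpus."""
    spam_counts.update(tokenize(message_text))


def on_keep_clicked(message_text):
    """Counterpart for mail the user wants: count tokens as ham."""
    ham_counts.update(tokenize(message_text))


on_block_junk_clicked("Buy cheap pills now")
on_keep_clicked("Lunch at noon?")
print(spam_counts["pills"], ham_counts["lunch"])
```

The point of the sketch is that the user only ever sees the button; the
corpus bookkeeping stays out of sight.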

What would be needed for a user-install-only interface?

1. It must integrate into the user's email client as seamlessly as
possible. This means researching the plugin API of Outlook, Eudora,
Pegasus Mail, Mozilla, et al.

2. The algorithm and filtering component must also run in the background
without any user intervention required after the initial install. This
means being able to install as a Windows NT service or into the StartUp
folder of Windows 9x.

3. There *MUST* be good documentation. We all know the user is going to
run the installer program before reading the documentation, but we must
include a "How to train your filter to recognize junk mail" document
that the installer displays after finishing installation. This means
actually writing said documentation. :-)

Those are the three things that I think are essential to a version of
spambayes that can be installed and used profitably by non-techie
end-users. Meanwhile, I'll try to help out with pop3proxy.

--
Robin Munn <rmunn@pobox.com>
http://www.rmunn.com/
PGP key ID: 0x6AFB6838    50FF 2478 CFFB 081A 8338  54F7 845D ACFD 6AFB 6838


From rmunn@pobox.com  Fri Nov  1 00:46:00 2002
From: rmunn@pobox.com (Robin Munn)
Date: Thu, 31 Oct 2002 18:46:00 -0600
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail"
Message-ID: <20021101004600.GB28132@rmunnlfs>


When we start writing user documentation, I propose using the term "junk
mail" instead of "spam" and "non-junk mail" (or some other term) instead
of "ham". I believe this will reduce confusion among non-techies who are
encountering spam terminology for the first time. They'll have enough
new ideas to learn trying to install and run a filter; let's not add
jargon to what they have to learn.

Other possibilities:

"junk email" instead of "spam"
"valid email" instead of "ham"

"unwanted email" instead of "spam"
"wanted email" instead of "ham"

[Insert your clever idea here]

Comments, anyone?

--
Robin Munn <rmunn@pobox.com>
http://www.rmunn.com/
PGP key ID: 0x6AFB6838    50FF 2478 CFFB 081A 8338  54F7 845D ACFD 6AFB 6838


From guido@python.org  Fri Nov  1 00:56:13 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 31 Oct 2002 19:56:13 -0500
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail"
In-Reply-To: Your message of "Thu, 31 Oct 2002 18:46:00 CST."
             <20021101004600.GB28132@rmunnlfs> 
References: <20021101004600.GB28132@rmunnlfs> 
Message-ID: <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net>

> When we start writing user documentation, I propose using the term
> "junk mail" instead of "spam" and "non-junk mail" (or some other
> term) instead of "ham". I believe this will reduce confusion among
> non-techies who are encountering spam terminology for the first
> time. They'll have enough new ideas to learn trying to install and
> run a filter, let's not add jargon to what they have to learn.

If they don't know the word "spam", they don't need a spam filter yet.

I agree that we need something better than "ham".  Non-spam works for
me; "good mail" too.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Tim@mail.powweb.com  Fri Nov  1 00:56:15 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 18:56:15 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <20021101003712.GA28132@rmunnlfs>
Message-ID: <IDHGIKIYTNK3XRLSPQO75HZYIC83JH.3dc1d12f@riven>

Well, a pop3proxy is certainly capable of running on a client machine.  See http://www.software.bisswanger.de/en/index.php?seite=smtp for an example of 
a similar proxy that inserts SMTPAuth into a non-SMTPAuth enabled mailer, such as Opera.  That said, it would certainly be simpler to plug into the individual 
mailers in a much more seamless manner.  I'm not quite sure that this is even possible with various mailers.  If it is, great.  If not, then running a proxy process 
locally is a reasonable solution, and easier to implement in the near term.
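
For a feel of what the local proxy would do to each message on its way
from the POP3 server to the mail client, here is a minimal sketch. The
socket handling is left out, and both the classify() placeholder and the
X-Filter-Verdict header name are hypothetical; the idea is only that the
proxy adds a header which the mailer's ordinary filter rules can then
key on.

```python
from email import message_from_string


def classify(text):
    """Placeholder classifier (hypothetical): returns 'spam' or 'ham'."""
    return "spam" if "free money" in text.lower() else "ham"


def tag_message(raw_message, header_name="X-Filter-Verdict"):
    """Return the message text with a verdict header added."""
    msg = message_from_string(raw_message)
    del msg[header_name]  # drop any verdict header forged by the sender
    msg[header_name] = classify(raw_message)
    return msg.as_string()


raw = "From: a@example.com\nSubject: hello\n\nFree money inside\n"
tagged = tag_message(raw)
print(tagged)
```

The user would then need just one rule in the mailer ("if the header
says spam, move to the junk folder"), which most clients can do without
any plugin.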

I found this SMTP proxy by checking the Opera site when my host converted to SMTPAuth.  The Opera folks felt this was easy enough to install (which it 
was) to put in their FAQ as the answer to how to use SMTPAuth with their mailer.  I think most folks could pull it off...

The problem with plugging into mail clients is that the plugin architecture can change over time, which produces an ongoing maintenance effort.  There will 
also be multiple codebases to maintain, as each plugin architecture (if one exists) will be different.  There are dozens of different mail clients... consider AOL 
for example.  Can we plug into their mailer?  It's used by millions of people...

- Tim S.



From Tim@mail.powweb.com  Fri Nov  1 01:08:41 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 19:08:41 -0600
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail"
Message-ID: <IEBANHLNLNJ4UQVSGF76MKC94WECF0.3dc1d419@riven>

Unwanted email is the best idea in your list.  This technology is useful for filtering mail from any number of sources, not just spammers.  How about mail from 
an ex-* who won't leave you alone...  While perhaps other kinds of mail are not quite as predictable as general spam, they could be filtered nonetheless... 
The reality is that the average person that will use this technology doesn't make a distinction between what we call spam and mail from their ex-boyfriend.   
It's all unwanted crap and they want to filter it.

- Tim



From vanhorn@whidbey.com  Fri Nov  1 01:20:58 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Thu, 31 Oct 2002 17:20:58 -0800
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk 
 mail"
References: <20021101004600.GB28132@rmunnlfs>
Message-ID: <3DC1D6FA.343FD9F0@whidbey.com>

I think Spam and Ham will be perfectly clear to the users I support. They may be
realtors, but they aren't braindead, and a little humor helps in teaching.

Van

Robin Munn wrote:

> When we start writing user documentation, I propose using the term "junk
> mail" instead of "spam" and "non-junk mail" (or some other term) instead
> of "ham". I believe this will reduce confusion among non-techies who are
> encountering spam terminology for the first time. They'll have enough
> new ideas to learn trying to install and run a filter, let's not add
> jargon to what they have to learn.
>
> Other possibilities:
>
> "junk email" instead of "spam"
> "valid email" instead of "ham"
>
> "unwanted email" instead of "spam"
> "wanted email" instead of "ham"
>
> [Insert your clever idea here]
>
> Comments, anyone?
>
> --
> Robin Munn <rmunn@pobox.com>
> http://www.rmunn.com/
> PGP key ID: 0x6AFB6838    50FF 2478 CFFB 081A 8338  54F7 845D ACFD 6AFB 6838
>

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From mhammond@skippinet.com.au  Fri Nov  1 01:25:14 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 1 Nov 2002 12:25:14 +1100
Subject: [Spambayes] Terminology in user documentation: "spam" vs.
	"junkmail"
In-Reply-To: <IEBANHLNLNJ4UQVSGF76MKC94WECF0.3dc1d419@riven>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEFBHHAA.mhammond@skippinet.com.au>

> The reality is that the average person that will use this
> technology doesn't make a distinction between what we call spam
> and mail from their ex-boyfriend.
> It's all unwanted crap and they want to filter it.

Agreed.  Most people in my social circle who have heard of spam often think
it is a general term for "any mail with a large To list".  I've received a
number of legit mails starting with eg. "Sorry for the spam, but I thought
this too funny not to send to all of you" (it wasn't <wink>).

At the end of the day though, this is an issue for the front-ends rather
than the engine.  The Outlook add-in tends to use "spam or unwanted email"
and "good email".  "junk email" seems good too.  It wouldn't surprise me to
find the word "spam" excised eventually.

Front-end issues will probably drive small engine details though - nothing
beats real experience with a tool <wink>

Mark.


From guido@python.org  Fri Nov  1 01:31:07 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 31 Oct 2002 20:31:07 -0500
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes
	INTEGRATION.txt,NONE,1.1
In-Reply-To: Your message of "Thu, 31 Oct 2002 17:23:30 PST."
             <E187QX4-0006yo-00@usw-pr-cvs1.sourceforge.net> 
References: <E187QX4-0006yo-00@usw-pr-cvs1.sourceforge.net> 
Message-ID: <200211010131.gA11V8d03085@pcp02138704pcs.reston01.va.comcast.net>

[Skip checked in:]
> first scribbled notes about integrating Spambayes with different email
> packages.

Hm, maybe the spambayes website could be brought a bit more up to date
too?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Tim@mail.powweb.com  Fri Nov  1 01:33:01 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 19:33:01 -0600
Subject: [Spambayes] Terminology in user documentation: "spam" vs.
	"junkmail"
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPKEFBHHAA.mhammond@skippinet.com.au>
Message-ID: <VU1TSP95B9KZYSPNL31RM41ZYPNQLTS.3dc1d9cd@riven>

You got it, Mark.  Front ends are where this is really gonna happen, and your point is well taken.  I filter 'FWD:' stuff, even if it's from someone I know, 
because it is invariably unwanted.  In my Spambayes database, FWD: will have a spam weight of as close to 1 as possible...  Can I manually configure that 
weight?  lol
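
A toy sketch of what a manual weight could look like. This is not the
Spambayes API, and the numbers are made up; it uses Graham-style
combining, prod(p) / (prod(p) + prod(1 - p)), which differs from the
combining scheme Spambayes itself uses. A hand-set table of overrides
simply takes precedence over the learned per-token probabilities.

```python
import math

# Learned per-token spam probabilities (made-up numbers for illustration).
learned = {"fwd:": 0.60, "meeting": 0.10, "funny": 0.55}

# Hand-set weights take precedence over whatever training produced.
overrides = {"fwd:": 0.99}


def token_prob(token, unknown=0.5):
    """Probability for one token: override first, then learned, then neutral."""
    if token in overrides:
        return overrides[token]
    return learned.get(token, unknown)


def combine(tokens):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p))."""
    spam = math.prod(token_prob(t) for t in tokens)
    ham = math.prod(1.0 - token_prob(t) for t in tokens)
    return spam / (spam + ham)


score = combine(["fwd:", "funny", "meeting"])
print(round(score, 3))
```

The override is set to 0.99 rather than 1.0 on purpose: at exactly 1.0
the ham product becomes zero, so any message containing the token would
score 1.0 no matter what else it says.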

So we've got two examples of unwanted email that have not really been represented in the current training corpus: unwanted individual mails (e.g. "I love you, 
please take me back") and forwards of urban legends, thoughts of the day, funny stories I've heard, and the like.   It'd be interesting to see how the stats fall 
out if we were to (somehow) incorporate this kind of email into the current corpus.

Another thought is that my definition of unwanted may certainly differ from your definition, even as it pertains to spam.  Perhaps you really want to see 
something from quickinspirations.com  (as unbelievable as that might seem to me... ;) Thus, the only really good terminology here is something like 'unwanted 
email'.

- Tim


10/31/2002 7:25:14 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> The reality is that the average person that will use this
>> technology doesn't make a distinction between what we call spam
>> and mail from their ex-boyfriend.
>> It's all unwanted crap and they want to filter it.
>
>Agreed.  Most people in my social circle who have heard of spam often think
>it is a general term for "any mail with a large To list".  I've received a
>number of legit mails starting with eg. "Sorry for the spam, but I thought
>this too funny not to send to all of you" (it wasn't <wink>).
>
>At the end of the day though, this is an issue for the front-ends rather
>than the engine.  The outlook addin tends to use "spam or unwanted email"
>and "good email".  "junk email" seems good too.  It wouldn't surprise me to
>find the word "spam" excised eventually.
>
>Front-end issues will probably drive small engine details though - nothing
>beats real experience with a tool <wink>
>
>Mark.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From skip@pobox.com  Fri Nov  1 01:25:32 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 19:25:32 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <20021031211819.GC27454@rmunnlfs>
References: <20021031211819.GC27454@rmunnlfs>
Message-ID: <15809.55308.1945.988931@montanaro.dyndns.org>


    Robin> I just joined the spambayes mailing list a couple of days ago and
    Robin> have been trying to skim through the archives. It looks like a
    Robin> lot of time is being spent on algorithm refining and not as much
    Robin> time on email client integration or end-user documentation. 

That's true, largely because that's what the focus of the initial phase of
the project was supposed to be.  Even if it gets no farther than it is
today, the process has been highly educational for me, because we have an
expert in algorithm design (that'd be Tim) exposing his thought processes
and mechanics for the rest of us.

That said, I think the classification stuff has gone about as far as it's
going to go.  Future changes to the tokenizer are also likely to be
incremental, so the major changes over the next while will be in email
integration.  Mark Hammond has done a terrific service for all the
pointy-haired folks out there by adding some modules to the system which
allow this stuff to work rather elegantly from Outlook (from what I can
divine reading the list - I don't use Outlook).  A number of other people
have integrated it in various ways with Unix-based mail
systems. Jeremy and I both use VM from X/Emacs.  We've approached the
problem of integration somewhat differently.  There's also the pop3proxy
script which Richie Hindle wrote as another way of integrating spambayes
into an MTA/MUA setup.  Neil Schemenauer also has a pair of scripts he uses
(look for neil*.py in the spambayes source tree).

This has all sort of sporadically been "documented" in the mailing list.  I
just checked in INTEGRATION.txt to the CVS repository.  Consider it a few
scribbled notes about integration, based upon my own experience.  I'm sure
others have much to share as well.

Skip

From skip@pobox.com  Fri Nov  1 01:37:56 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 19:37:56 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <20021101003712.GA28132@rmunnlfs>
References: <SRMJLJVPA8WQQ43874XEC51RQ1USNZX.3dc1bcb9@riven>
        <20021031235145.C0376F59F@cashew.wolfskeep.com>
        <20021101003712.GA28132@rmunnlfs>
Message-ID: <15809.56053.158.812689@montanaro.dyndns.org>


    Robin> The "itch" that I'm scratching is that I'm tired of seeing all my
    Robin> non-techie friends using inferior technology because the quality
    Robin> open-source solutions are too complicated for them and/or have
    Robin> user-unfriendly interfaces. 

I think Outlook users can eventually be handled by an installer which
installs Mark's Outlook modules and Python if necessary.  Should be point
and shoot.  They need not ever know that Python is there.

Skip

From skip@pobox.com  Fri Nov  1 01:41:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 19:41:47 -0600
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail"
In-Reply-To: <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net>
References: <20021101004600.GB28132@rmunnlfs>
        <200211010056.gA10uDH02868@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15809.56283.323513.587530@montanaro.dyndns.org>


    Guido> I agree that we need something better than "ham".  Non-spam works
    Guido> for me; "good mail" too.

I don't think that's necessarily the case.  "Ham" has a certain panache.  It
rolls off the tongue better than anything else I've seen.  It distinguishes
Spambayes from the herd a bit, and may be a clever little bit of marketing.
(I noted before that some people in the SpamAssassin community have picked
up the term.)  I wouldn't change it until it's demonstrated to be a
liability.

Skip

From skip@pobox.com  Fri Nov  1 01:34:31 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 19:34:31 -0600
Subject: [Spambayes] Database reduction
In-Reply-To: <w53k7jyf8ni.fsf@woozle.org>
References: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
        <w53k7jyf8ni.fsf@woozle.org>
Message-ID: <15809.55847.349091.23441@montanaro.dyndns.org>


    Neale> When pickling a Bayes object, the pickler is smart enough not to
    Neale> repeatedly say "this is a wordinfo object" but rather, I assume,
    Neale> "this is of type 2", only having to name type 2 once.  However,
    Neale> hammie pickles each wordinfo individually, keyed by a string.
    Neale> This makes for fast lookups, but giant databases.

You can always define your own __getstate__ and __setstate__ methods for the
Wordinfo class which process a more compact form of the object's state.
Or am I misunderstanding what you said?
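
A minimal sketch of that suggestion, for illustration only.  The WordInfo
class, its two fields, and the roundtrip helper are hypothetical stand-ins,
not the project's actual code.  The state is reduced to a bare tuple so the
pickle does not repeat the attribute names for every record:

```python
import pickle


class WordInfo:
    """Hypothetical per-word record holding two counters."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        # Pickle a bare tuple instead of the instance __dict__.
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        # Rebuild the attributes from the compact tuple.
        self.spamcount, self.hamcount = state


def roundtrip(obj):
    """Pickle and unpickle obj; pickle calls the two hooks above."""
    return pickle.loads(pickle.dumps(obj))
```

Calling roundtrip(WordInfo(3, 1)) should give back an equivalent object
while storing only the two counts as state.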

    Neale> Tim just mentioned a performance tweak; is this an indicator that
    Neale> now would be a good time to resume trying to reduce hammie's
    Neale> database size?

I reduced the size of my database significantly after my training run by
deleting wordinfo where the hamcount was 1 and the spamcount was 0 or vice
versa.
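
The pruning step described above can be sketched as follows.  This assumes
the word table is a plain dict mapping each word to a (spamcount, hamcount)
pair, which is an illustrative layout and not necessarily how the database
is stored; prune_singletons is a hypothetical name:

```python
def prune_singletons(wordinfo):
    """Return a copy of the table without words seen exactly once.

    wordinfo maps word -> (spamcount, hamcount).  A word is dropped when
    its counts are (1, 0) or (0, 1), i.e. it appeared in a single message.
    """
    return {
        word: counts
        for word, counts in wordinfo.items()
        if counts not in ((1, 0), (0, 1))
    }
```

Whether dropping these entries affects classification accuracy is a
separate question that this sketch does not address.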

Skip

From skip@pobox.com  Fri Nov  1 01:29:14 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 19:29:14 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <20021031235145.C0376F59F@cashew.wolfskeep.com>
References: <SRMJLJVPA8WQQ43874XEC51RQ1USNZX.3dc1bcb9@riven>
        <20021031235145.C0376F59F@cashew.wolfskeep.com>
Message-ID: <15809.55530.598489.521312@montanaro.dyndns.org>


    Alex> I guess that raises the question: what is our target audience, and
    Alex> how strictly do we want to cater to them?

On a sheer numbers basis, your target audience is definitely Outlook and
Outlook Express users.  The rest of it is just noise.  Mark Hammond seems to
have taken good care of the Outlook users.

Skip

From tim.one@comcast.net  Fri Nov  1 02:32:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 31 Oct 2002 21:32:04 -0500
Subject: [Spambayes] Terminology in user documentation: "spam" vs.
 "junk	mail"
In-Reply-To: <15809.56283.323513.587530@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEMCCDAB.tim.one@comcast.net>

[Guido]
> I agree that we need something better than "ham".
> Non-spam works for me; "good mail" too.

[Skip Montanaro]
> I don't think that's necessarily the case.  "Ham" has a certain
> panache.  It rolls off the tongue better than anything else I've
> seen.  It distinguishes Spambayes from the herd a bit, and may
> be a clever little bit of marketing.
> (I noted before that some people in the SpamAssassin community have
> picked up the term.)  I wouldn't change it until it's demonstrated
> to be a liability.

Another data point:  I gave a live demo of the Outlook 2000 client last
week, to a group of people who were taking a Python+Zope training class at
Zope Corp.  They laughed out loud at the "spam vs ham" distinction, which
surprised me because I've come to think of them as purely technical terms
identifying a region in chi-squared probability space.  That may intensify
suspicions that they were laughing at me instead of with me <wink>, but I do
think they genuinely enjoyed the word play.  The only thing that got a
bigger laugh was that a "Want a BIG penis NOW?" spam happened to arrive
during the demo.  If the choice is between "spam" and "ham", or "spam" and
"big penis", I'm weakly in favor of the former.

buncha-stuffed-shirts-ly y'rs  - tim


From tim.one@comcast.net  Fri Nov  1 02:38:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 31 Oct 2002 21:38:26 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15809.55530.598489.521312@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEMDCDAB.tim.one@comcast.net>

[Skip Montanaro]
> On a sheer numbers basis, your target audience is definitely Outlook and
> Outlook Express users.  The rest of it is just noise.

This is so sadly true.  If Netscape Communicator still survives in some
form, that would be a good one too.  I have a sister who uses that, and
better that someone else try to make her happy.

> Mark Hammond seems to have taken good care of the Outlook users.

Indeed he has, and it is indeed lots of hard work, and Outlook has to be the
most difficult email program in the universe to hook up with.

Except for Outlook Express, which doesn't appear to offer any programming
hooks at *all* (Outlook may supply thousands, to judge by the number of toes
that have been broken stumbling into them so far).


From Tim@mail.powweb.com  Fri Nov  1 02:40:27 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 20:40:27 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <15809.55530.598489.521312@montanaro.dyndns.org>
Message-ID: <KIRNBTSRM053VVPTSIETQB0XR21ZT98.3dc1e99b@riven>

The latest figures that I can find from Microsoft are that Outlook has 57% market share, Notes has 29%, Browser based email is 9%, and the rest is split 
between cc:Mail, GroupWise, Outlook Express, the Exchange client, and 2% of "Other."  So Outlook is certainly the low hanging fruit here, but Notes is a big 
client as well.  The Notes market will be a bit more difficult to reach, because as a product it is even more closed than Outlook... (imagine that...)  But I think 
that Notes users comprise a good sized segment, and I don't believe that there are any effective filters for that product.  That said, a POP3 proxy won't work 
for Notes, because Notes servers are not POP3... it's a replication scheme.  So there may not be any hope for the Notes thing.  So, I think we should do a 
nice integration with Outlook (which it seems as though Mark has already handled), and Outlook Express if possible, do a pop3proxy that can be used in 
most other circumstances, and leave the special cases to those who are interested enough to do whatever integration they wish...

That documentation effort should be enough to keep Robin busy for a while... ;)

- Tim

10/31/2002 7:29:14 PM, Skip Montanaro <skip@pobox.com> wrote:

>
>    Alex> I guess that raises the question: what is our target audience, and
>    Alex> how strictly do we want to cater to them?
>
>On a sheer numbers basis, your target audience is definitely Outlook and
>Outlook Express users.  The rest of it is just noise.  Mark Hammond seems to
>have taken good care of the Outlook users.
>
>Skip
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From mhammond@skippinet.com.au  Fri Nov  1 02:52:30 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 1 Nov 2002 13:52:30 +1100
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <KIRNBTSRM053VVPTSIETQB0XR21ZT98.3dc1e99b@riven>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEFMHHAA.mhammond@skippinet.com.au>

> The latest figures that I can find from Microsoft are that
> Outlook has 57% market share, Notes has 29%, Browser based email
> is 9%, and the rest is split
> between cc:Mail, GroupWise, Outlook Express, the Exchange client,
> and 2% of "Other."  So Outlook is certainly the low hanging fruit
> here, but Notes is a big
> client as well.

This is really for "internet mail", whereas I bet that the figures above are
"corporate" users.  Most corporate users I have spoken to simply don't have a
large spam problem - their work address is rarely publicly posted to the
Web, and the corporate internet mail gateway tends to have some rudimentary
spam filtering anyway.

Basically, I would be surprised to find many Lotus Notes users with a spam
problem.

> The Notes market will be a bit more difficult to
> reach, because as a product it is even more closed than
> Outlook... (imagine that...)

I find that a strange comment given the integration already achieved with
Outlook.  From an extensibility point-of-view, Outlook is almost as open as
I can imagine.

Mark.


From neale@woozle.org  Fri Nov  1 02:56:20 2002
From: neale@woozle.org (Neale Pickett)
Date: 31 Oct 2002 18:56:20 -0800
Subject: [Spambayes] non-ascii mail in hammiecli
In-Reply-To: <200210311817.57268.tdickenson@geminidataloggers.com>
References: <200210311817.57268.tdickenson@geminidataloggers.com>
Message-ID: <w53fzumf163.fsf@woozle.org>

So then, Toby Dickenson <tdickenson@devmail.geminidataloggers.co.uk> is all like:

> hammiecli is giving its input an xmlrpc binary wrapper, to avoid marshalling 
> problems with non-ascii input.  However hammiesrv wasn't doing the same for 
> its output.
> 
> diff attached

Rippin', Toby!  I've checked in your patch.  Thanks!
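
For illustration, the binary-wrapper idea can be sketched with Python 3's
xmlrpc.client module (the 2002 code would have used xmlrpclib).  The method
name "score" and the helpers wrap and unwrap are hypothetical; this is not
the checked-in patch:

```python
import xmlrpc.client


def wrap(raw: bytes) -> str:
    """Marshal raw message bytes as an XML-RPC call using a Binary wrapper."""
    # Binary base64-encodes the payload, so non-ASCII bytes survive transport.
    return xmlrpc.client.dumps((xmlrpc.client.Binary(raw),), methodname="score")


def unwrap(xml: str) -> bytes:
    """Recover the original bytes from the marshalled call."""
    params, _method = xmlrpc.client.loads(xml)
    return params[0].data
```

The same wrapping applies in both directions: whatever the server sends back
needs the Binary treatment as well if it may contain non-ASCII bytes.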

Neale

From jeremy@alum.mit.edu  Fri Nov  1 02:55:32 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Thu, 31 Oct 2002 21:55:32 -0500
Subject: [Spambayes] Fwd: Re: [Design] Contacts (Michael R. Bernstein)
Message-ID: <15809.60708.8788.803101@slothrop.zope.com>


---------------------- multipart/mixed attachment
Interesting effect.  I signed up for a couple of new mailing lists
(concerning the Kapor PIM project).  The discussions on them seem to
be very different than the stuff I usually get, and the conclusion is
that at least some of it is definitely spam.

It's unfortunate that they get marked as spam instead of unsure.  It
means that getting mail in a new subject area means that the
classifier will make some wildly wrong guesses until you get enough
new training data.

Of the 14 new messages, I see 1 ham, 3 spam, and 10 unsure.  I've
forwarded one of the high scoring spams.

Jeremy

Score: 0.998870303948

Clues
-----
*H* 0.00155925012726
*S* 0.999299858023
wrote: 0.00338600451467
michael 0.0155709342561
book, 0.0167286245353
except 0.0196506550218
joe 0.0266272189349
(in 0.0302013422819
from:"michael 0.0348837209302
interface 0.0348837209302
thu, 0.0412844036697
false 0.0652173913043
(so 0.0918367346939
foundation 0.0918367346939
machine, 0.0918367346939
machine. 0.0918367346939
origin 0.0918367346939
shared 0.0918367346939
subject:Design 0.0918367346939
widely 0.0918367346939
play 0.140636581012
(ie. 0.155172413793
(there 0.155172413793
apps 0.155172413793
belong 0.155172413793
chandler 0.155172413793
claiming 0.155172413793
count, 0.155172413793
horses 0.155172413793
managing, 0.155172413793
scripting 0.155172413793
solved 0.155172413793
subset 0.155172413793
tool. 0.155172413793
trojan 0.155172413793
header:In-Reply-To:1 0.212519247953
header:Errors-To:1 0.237464007184
can't 0.321673801617
proto:http 0.662469900861
information 0.662953865426
one 0.665248119685
header:Return-Path:1 0.665758340634
allow 0.670346027038
good 0.671104201444
get 0.671415294108
last 0.673179750848
place. 0.674836986007
are 0.679601700319
has 0.681104376636
used 0.688719773647
way 0.695441491093
distribution 0.695807170463
wrong 0.695807170463
users 0.69606431081
having 0.696133185508
needs 0.699732035795
skip:c 10 0.700608177332
other 0.702538098269
want 0.708850616487
e-mail 0.715394926258
running 0.716462999299
should 0.718351527065
someone 0.722025540013
part 0.723159240531
access 0.726693881975
also 0.727229519921
more 0.740683224491
user 0.741338432978
share 0.743372688795
may 0.745213686825
because 0.746505250332
data 0.747224736221
without 0.749466559635
all 0.751250287138
verify 0.752508123366
bit 0.75613165431
code 0.756797827917
these 0.757105434828
sign 0.760030537028
read 0.761534404086
address 0.762088324731
header:Message-Id:1 0.764545352748
open 0.76530284221
list 0.76574131017
>from 0.770761722645
avoid 0.770761722645
entry 0.770761722645
only 0.772148590346
provided 0.773230674523
first 0.776410372426
different 0.778505146553
copy 0.779604640492
further 0.783390240127
allows 0.785722995011
here. 0.785722995011
large 0.785722995011
building 0.788593119438
they're 0.788593119438
then 0.789698777271
whether 0.791623214186
those 0.795849646208
annoying 0.797771282426
make 0.800014773042
try 0.804967405411
saying 0.807062436029
mailing 0.814445440614
again. 0.815142308819
real 0.815142308819
must 0.817697541998
applications 0.81854602306
installed 0.81854602306
skip:i 10 0.822492441981
even 0.823228458985
person 0.823924036541
call 0.825417398415
secure 0.826130452181
skip:p 10 0.826254664449
isn't 0.826344529636
full 0.829689461024
once 0.831240651842
people 0.832090040931
skip:a 10 0.834052950494
joel 0.838351988458
requiring 0.838351988458
cannot 0.843680980631
case 0.843791821294
results 0.844279808423
...and 0.844827586207
hacked 0.844827586207
techniques, 0.844827586207
sharing 0.84571021875
capabilities 0.845855006747
outlook 0.845855006747
same 0.848300323403
computer 0.851704776271
every 0.855780693872
response 0.861670606819
easy 0.878526274377
prevent 0.8924082233
key 0.893108945959
nothing 0.898726242139
happen 0.904700604011
2.. 0.908163265306
trust 0.908217126704
private 0.919486477907
computer, 0.923563305955
easily 0.929335171741
application 0.935918946825
security 0.939824371865
contacts 0.948037345934
list, 0.970027495049
data. 0.973372781065
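
The Score line at the top of this listing is consistent with combining the
*H* and *S* clues as (S - H + 1) / 2.  A minimal check, assuming that
combining rule (combine is a hypothetical helper name; the two inputs are
copied from the listing):

```python
def combine(h: float, s: float) -> float:
    """Fold the ham indicator h and spam indicator s into one score in [0, 1]."""
    return (s - h + 1.0) / 2.0


# Values copied from the *H* and *S* lines of the clue listing.
score = combine(0.00155925012726, 0.999299858023)
```

Under this rule a message with S near 1 and H near 0 lands near 1.0, while a
message where both indicators are equal lands at 0.5.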


---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: "Michael R. Bernstein" <webmaven@lvcm.com>
Subject: Re: [Design] Contacts
Date: 31 Oct 2002 13:46:57 -0800
Size: 5181
Url: http://mail.python.org/pipermail/spambayes/attachments/20021031/022226d6/attachment.txt

---------------------- multipart/mixed attachment--


From tim.one@comcast.net  Fri Nov  1 03:01:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 31 Oct 2002 22:01:49 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCEFMHHAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEMHCDAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> The Notes market will be a bit more difficult to
> reach, because as a product it is even more closed than
> Outlook... (imagine that...)

[Mark Hammond]
> I find that a strange comment given the integration already achieved
> with Outlook.  From an extensibility point-of-view, Outlook is almost
> as open as I can imagine.

Indeed, that's part of the *problem* with Outlook, isn't it?  There are so
many different ways to hook into it (the Outlook object model, the MAPI
substrate, the Collaboration Data Objects layer, ...) I can't even hold them
all in my head, and it's never clear which of seemingly dozens of ways to
get a thing done may actually work.  It's the very definition of
poke-and-hope programming.

I did some Notes programming in a previous life, and hope never to do so
again.  That was more a matter of wading through seemingly dozens of ways
*not* to get a thing done, hoping to paste enough failure modes together so
that the end result almost appeared to work some of the time.  BTW, Notes
was the purest example of a whole being greater than the sum of its parts
I've ever seen:  each piece (from email to database) sucked, but the whole
was nevertheless very useful for workgroup collaboration.


From Tim@mail.powweb.com  Fri Nov  1 03:03:37 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 21:03:37 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCEFMHHAA.mhammond@skippinet.com.au>
Message-ID: <YT2YSPDBXS61EAIHSO3YIJGSM61MH82.3dc1ef09@riven>

Well you're talkin to a Notes user with a HUGE spam problem... hundreds per day.  And to make it worse, corporate users often cannot simply change their 
address to shut off the flow.  You may be right about the corporate versus personal user.  This research really doesn't specify which market they're talking 
about.  I suspect that the lion's share of personal use is through hotmail, yahoo mail, and other web based mailers, or AOL, earthlink, or other specialized 
clients, which are beasts of a different nature...  can we enable these?

So what are we saying here?  I think I'm getting (giving) mixed messages about what/who we should be targeting.  Outlook is expensive, thus mostly 
corporations use it.  There is a lot of research that suggests that corporations have a huge spam problem.  Non-web-based personal use may not be the most 
productive area to enable, but it certainly is a visible segment, and when people at home can deal with spam effectively, they'll take that story to work....

So I propose we enable Outlook, Mozilla (for our OS brethren, which will likely get Netscape, too), have a pop3proxy that can be run locally and easily 
configured to be used by a number of "noise" mailers, and go from there...

- Tim

10/31/2002 8:52:30 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> The latest figures that I can find from Microsoft are that
>> Outlook has 57% market share, Notes has 29%, Browser based email
>> is 9%, and the rest is split
>> between cc:Mail, GroupWise, Outlook Express, the Exchange client,
>> and 2% of "Other."  So Outlook is certainly the low hanging fruit
>> here, but Notes is a big
>> client as well.
>
>This is really for "internet mail", whereas I bet that the figures above are
>"corporate" users.  Most corporate users I have spoken to simply don't have a
>large spam problem - their work address is rarely publicly posted to the
>Web, and the corporate internet mail gateway tends to have some rudimentary
>spam filtering anyway.
>
>Basically, I would be surprised to find many Lotus Notes users with a spam
>problem.
>
>> The Notes market will be a bit more difficult to
>> reach, because as a product it is even more closed than
>> Outlook... (imagine that...)
>
>I find that a strange comment given the integration already achieved with
>Outlook.  From an extensibility point-of-view, Outlook is almost as open as
>I can imagine.
>
>Mark.
>
- Tim
www.fourstonesExpressions.com 


From Tim@mail.powweb.com  Fri Nov  1 03:04:16 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 21:04:16 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
Message-ID: <CE9PEAROTROEBRPGCPM54JJEC0RQ.3dc1ef30@riven>

Well you're talkin to a Notes user with a HUGE spam problem... hundreds per day.  And to make it worse, corporate users often cannot simply change their 
address to shut off the flow.  You may be right about the corporate versus personal user.  This research really doesn't specify which market they're talking 
about.  I suspect that the lion's share of personal use is through hotmail, yahoo mail, and other web based mailers, or AOL, earthlink, or other specialized 
clients, which are beasts of a different nature...  can we enable these?

So what are we saying here?  I think I'm getting (giving) mixed messages about what/who we should be targeting.  Outlook is expensive, thus mostly 
corporations use it.  There is a lot of research that suggests that corporations have a huge spam problem.  Non-web-based personal use may not be the most 
productive area to enable, but it certainly is a visible segment, and when people at home can deal with spam effectively, they'll take that story to work....

So I propose we enable Outlook, Mozilla (for our OS brethren, which will likely get Netscape, too), have a pop3proxy that can be run locally and easily 
configured to be used by a number of "noise" mailers, and go from there...

- Tim

10/31/2002 8:52:30 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>> The latest figures that I can find from Microsoft are that
>> Outlook has 57% market share, Notes has 29%, Browser based email
>> is 9%, and the rest is split
>> between cc:Mail, GroupWise, Outlook Express, the Exchange client,
>> and 2% of "Other."  So Outlook is certainly the low hanging fruit
>> here, but Notes is a big
>> client as well.
>
>This is really for "internet mail", whereas I bet that the figures above are
>"corporate" users.  Most corporate users I have spoken to simply don't have a
>large spam problem - their work address is rarely publicly posted to the
>Web, and the corporate internet mail gateway tends to have some rudimentary
>spam filtering anyway.
>
>Basically, I would be surprised to find many Lotus Notes users with a spam
>problem.
>
>> The Notes market will be a bit more difficult to
>> reach, because as a product it is even more closed than
>> Outlook... (imagine that...)
>
>I find that a strange comment given the integration already achieved with
>Outlook.  From an extensibility point-of-view, Outlook is almost as open as
>I can imagine.
>
>Mark.
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Fri Nov  1 03:20:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 31 Oct 2002 22:20:55 -0500
Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)
In-Reply-To: <15809.60708.8788.803101@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEMLCDAB.tim.one@comcast.net>

[Jeremy Hylton]
> Interesting effect.  I signed up for a couple of new mailing lists
> (concerning the Kapor PIM project).  The discussions on them seem to
> be very different than the stuff I usually get, and the conclusion is
> that at least some of it is definitely spam.
>
> It's unfortunate that they get marked as spam instead of unsure.  It
> means that getting mail in a new subject area means that the
> classifier will make some wildly wrong guesses until you get enough
> new training data.

If it gets a high spam score, that simply reflects how you've trained; e.g.,
for a tech guy to have the word "computer" as a high-spamprob word is
suspicious all by itself:

> computer 0.851704776271

Other oddities:

> applications 0.81854602306
> installed 0.81854602306
> full 0.829689461024
> once 0.831240651842
> people 0.832090040931


> results 0.844279808423
> ...and 0.844827586207
> hacked 0.844827586207
> techniques, 0.844827586207

Those four look like hapaxes, to judge from the scores.
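
The hapax observation can be checked numerically.  A minimal sketch,
assuming each word's score is the adjusted estimate (s*x + n*p) / (s + n),
where p is the raw spam ratio and n the number of messages containing the
word.  The strength s = 0.45 and prior x = 0.5 are assumed values, chosen
here because they reproduce the numbers in these listings; adjusted_prob is
a hypothetical helper name:

```python
def adjusted_prob(raw_prob: float, n: int, s: float = 0.45, x: float = 0.5) -> float:
    """Pull a raw spam ratio toward the prior x, weighted by evidence count n."""
    return (s * x + n * raw_prob) / (s + n)


seen_once_in_spam = adjusted_prob(1.0, 1)   # word seen in one spam, no ham
seen_once_in_ham = adjusted_prob(0.0, 1)    # word seen in one ham, no spam
seen_twice_in_ham = adjusted_prob(0.0, 2)   # word seen in two hams, no spam
```

Under these assumptions a word seen in exactly one spam scores about 0.8448
and a word seen in exactly one ham about 0.1552, which is why a run of
identical scores at those values points to words seen only once.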

> computer, 0.923563305955
> application 0.935918946825
> security 0.939824371865
> contacts 0.948037345934
> list, 0.970027495049
> data. 0.973372781065

I can only assume you've only trained it on Shakespeare ham <wink>.

Even stranger are the words that *don't* show up with high spamprobs for
you.  I haven't signed up for this list, but my personal classifier scored
the attachment very differently:

Spam Score: 2.85542e-008

'*H*'                          1
'*S*'                          1.40346e-010
'wrote:'                       0.001868
'subject:: ['                  0.0142256
'url:mailman'                  0.0151898
'url:listinfo'                 0.0182172
'url:lists'                    0.0193688

That you didn't have those as low-spamprob words suggests you've trained on
almost no mailing-list ham.

'otoh,'                        0.0215311
'interface'                    0.0310106
'(so'                          0.0348837
'scripting'                    0.0348837
'thu,'                         0.0412844
'url:org'                      0.0474607
'solved'                       0.050232
'it).'                         0.0505618
'x-mailer:ximian evolution 1.0.8' 0.0505618
'header:In-Reply-To:1'         0.0539033
'false'                        0.0564499
'header:Errors-To:1'           0.0634285
'>from:'                       0.0652174
'[snip]'                       0.0652174
'origin'                       0.0652174
'pointless'                    0.0652174
'protocol'                     0.0652174
'share'                        0.076162
'subject:] '                   0.0842127
'tool.'                        0.0918367
'challenge'                    0.0918367
'machine,'                     0.0918367
'api,'                         0.0918367
'techniques,'                  0.0918367
'subset'                       0.0918367
'key,'                         0.0918367
'apps'                         0.0918367
'except'                       0.0982036
'(in'                          0.114502
'copy'                         0.12034
'problem'                      0.12813
'encrypted'                    0.132432
'returned'                     0.138575
'quite'                        0.140119
'foundation'                   0.150981
'compromise'                   0.155172
'automating'                   0.155172
'widely.'                      0.155172
'url:osafoundation'            0.155172
'key.'                         0.155172
'horses'                       0.155172
'header:Received:5'            0.156821
'running'                      0.158385
'obviously'                    0.160346
'user'                         0.168307
'machine.'                     0.181282
'interesting'                  0.186509
'code'                         0.191394
'feature'                      0.196331
'(there'                       0.197397
'book,'                        0.197397
'list'                         0.199755
'also'                         0.202693
'probably'                     0.205907
'environment.'                 0.208559
'mine'                         0.208559
'michael'                      0.213552
'those'                        0.215792
'entry'                        0.218192
'installed'                    0.232266
'requiring'                    0.245609
'belong'                       0.245609
'distribution'                 0.248786
'shared'                       0.253444
'but'                          0.254219
'data'                         0.257012
'>from'                        0.262199
'insecure'                     0.262199
'open'                         0.268785
'think'                        0.276856
'wrong'                        0.283272
'which'                        0.284943
'saying'                       0.287973
'source'                       0.291711
'his'                          0.293522
'application'                  0.306776
'should'                       0.307599
'address'                      0.308855
'machine'                      0.309309
'users'                        0.321312
'e-mail'                       0.32136
'public'                       0.322013
'keys'                         0.334772
"can't"                        0.334823
'were'                         0.338587
'using'                        0.345285
'needs'                        0.34717
'used'                         0.348371
'mailing'                      0.349775
'having'                       0.351608
'header:Message-Id:1'          0.353295
'part'                         0.356067
'bit'                          0.359997
'skip:s 10'                    0.361335
"it's"                         0.362434
'anyone'                       0.365533
'once'                         0.369773
"won't"                        0.370267
'avoid'                        0.370705
'provided'                     0.376177
"isn't"                        0.379705
'joel'                         0.382591
'widely'                       0.382591
'with'                         0.384731
'there'                        0.390536
'that'                         0.394845
'even'                         0.600998
'sharing'                      0.605368
'key'                          0.60538
'header:Return-Path:1'         0.616771
'every'                        0.617388
'easy'                         0.62261
'full'                         0.625898
'large'                        0.628002
'trust'                        0.673698
'capabilities'                 0.674899
'computer,'                    0.682833
'secure'                       0.689415
'results'                      0.693054
'happen'                       0.711906
'contacts'                     0.716121
'here.'                        0.717214
'place.'                       0.720453
'easily'                       0.750357
'further'                      0.766401
'information'                  0.778198
'again.'                       0.779556
'securely,'                    0.844828
'trusting'                     0.844828
'trojan'                       0.844828
'hacked'                       0.844828
'from:"michael'                0.844828
'claiming'                     0.844828
'response'                     0.878172
'list,'                        0.881299
'2..'                          0.886933
'header:Mime-Version:1'        0.889037
'data.'                        0.889756
'url:design'                   0.908163
'...and'                       0.969799
'wealth'                       0.97619

And that you didn't get 'wealth' as a high-spamprob word suggests something
even weirder.

> Of the 14 new messages, I see 1 ham, 3 spam, and 10 unsure.  I've
> forwarded one of the high scoring spams.

Train on them; it will learn what you teach it, and nothing else.


From skip@pobox.com  Fri Nov  1 03:30:16 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 31 Oct 2002 21:30:16 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <CE9PEAROTROEBRPGCPM54JJEC0RQ.3dc1ef30@riven>
References: <CE9PEAROTROEBRPGCPM54JJEC0RQ.3dc1ef30@riven>
Message-ID: <15809.62792.793436.43805@montanaro.dyndns.org>


    Tim> Well you're talkin to a Notes user with a HUGE spam problem...

I take it Notes is responsible for screwing up your return address... ;-)

    Tim> So I propose we enable Outlook, Mozilla (for our OS brethren, which
    Tim> will likely get Netscape, too), have a pop3proxy that can be run
    Tim> locally and easily configured to be used by a number of "noise"
    Tim> mailers, and go from there...

Like all things open source, what gets implemented depends on who has an
itch that needs scratching.  Outlook was an obvious early choice, not
primarily because of its market share, but because Mark Hammond is an expert
in the area of Python/Windows integration and Tim happens to use Outlook as
his mail reader.  Mark couldn't have asked for a better beta tester.

Mozilla will happen when someone who uses Mozilla wants/needs it.

One thing to consider about mail programs is that outside the realm of
people whose software toolchest is dictated to them (generally corporate
types), most folks probably find something that works and then stick with it
until there is an overwhelming reason to change.  I have a MacOSX laptop now
but still use XEmacs+VM to read mail with a peculiar method of getting mail
off my server.  It took me a fair amount of time to decide to switch from
rmail to VM many years ago.  There are lots of reasons for this sort of
inertia, but I think it's generally dominated by familiarity with the user
interface, (perceived) difficulty converting mailboxes, and fear that new
tools won't be as stable.

So, go right ahead, solve the Notes integration problem.  We're here to lend
moral support.  ;-)

Skip

From Tim@mail.powweb.com  Fri Nov  1 03:41:05 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Thu, 31 Oct 2002 21:41:05 -0600
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <15809.62792.793436.43805@montanaro.dyndns.org>
Message-ID: <UT76B03YLJJE232EBWHBB8WR61DXR.3dc1f7d1@riven>

Hehe... point taken.   There I go, thinking like a marketer again.  Present the requirement to the developers and it'll magically get done... :)

As for Notes, that's S.E.P.  If my company doesn't care that I take 30 minutes out of my day to sort through my inbox, so be it.  My personal mail is my itch... 
so the pop3proxy is the scratch for me...  looking forward to the relief!  "Regular email user dude tries out Spambayes.  Stay tuned, news at 11..."

10/31/2002 9:30:16 PM, Skip Montanaro <skip@pobox.com> wrote:

>
>    Tim> Well you're talkin to a Notes user with a HUGE spam problem...
>
>I take it Notes is responsible for screwing up your return address... ;-)
>
>    Tim> So I propose we enable Outlook, Mozilla (for our OS brethren, which
>    Tim> will likely get Netscape, too), have a pop3proxy that can be run
>    Tim> locally and easily configured to be used by a number of "noise"
>    Tim> mailers, and go from there...
>
>Like all things open source, what gets implemented depends on who has an
>itch that needs scratching.  Outlook was an obvious early choice, not
>primarily because of its market share, but because Mark Hammond is an expert
>in the area of Python/Windows integration and Tim happens to use Outlook as
>his mail reader.  Mark couldn't have asked for a better beta tester.
>
>Mozilla will happen when someone who uses Mozilla wants/needs it.
>
>One thing to consider about mail programs is that outside the realm of
>people whose software toolchest is dictated to them (generally corporate
>types), most folks probably find something that works and then stick with it
>until there is an overwhelming reason to change.  I have a MacOSX laptop now
>but still use XEmacs+VM to read mail with a peculiar method of getting mail
>off my server.  It took me a fair amount of time to decide to switch from
>rmail to VM many years ago.  There are lots of reasons for this sort of
>inertia, but I think it's generally dominated by familiarity with the user
>interface, (perceived) difficulty converting mailboxes, and fear that new
>tools won't be as stable.
>
>So, go right ahead, solve the Notes integration problem.  We're here to lend
>moral support.  ;-)
>
>Skip
- Tim
www.fourstonesExpressions.com 


From anthony@interlink.com.au  Fri Nov  1 04:17:07 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:17:07 +1100
Subject: [Spambayes] Fwd: [Spambayes-checkins] spambayes timtest.py,1.29,1.30
Message-ID: <200211010417.gA14H7009458@localhost.localdomain>

---------------------- multipart/mixed attachment
Note for anyone with their own test harnesses that aren't
checked into the CVS. I've updated timtest and timcv (the
only users of the spam/ham keep options that I could find)
but if you use your own, you'll need to make a change.

Anthony


---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: "Anthony Baxter" <anthonybaxter@users.sourceforge.net>
Subject: [Spambayes-checkins] spambayes timtest.py,1.29,1.30
Date: Thu, 31 Oct 2002 20:13:13 -0800
Size: 4422
Url: http://mail.python.org/pipermail/spambayes/attachments/20021101/76527f7a/attachment.txt

---------------------- multipart/mixed attachment--

From anthony@interlink.com.au  Fri Nov  1 04:31:11 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:31:11 +1100
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail" 
In-Reply-To: <15809.56283.323513.587530@montanaro.dyndns.org> 
Message-ID: <200211010431.gA14VBT09600@localhost.localdomain>


A non-techie data point - my wife got what I meant by 'ham' ("everything
that's not spam is ham") and thought it was a good term to use. I'd prefer
it to something clumsy like 'wanted email' or suchlike.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthony@interlink.com.au  Fri Nov  1 04:50:36 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 15:50:36 +1100
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes
	INTEGRATION.txt,NONE,1.1 
In-Reply-To: 
	<200211010131.gA11V8d03085@pcp02138704pcs.reston01.va.comcast.net> 
Message-ID: <200211010450.gA14oai09774@localhost.localdomain>


>>> Guido van Rossum wrote
> [Skip checked in:]
> > first scribbled notes about integrating Spambayes with different email
> > packages.
> 
> Hm, maybe the spambayes website could be brought a bit more up to date
> too?

I've just chucked an 'Applications' page up there now. People should feel
free to add more.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthony@interlink.com.au  Fri Nov  1 05:35:35 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 16:35:35 +1100
Subject: [Spambayes] package-ifying spambayes.
Message-ID: <200211010535.gA15Zat10734@localhost.localdomain>

I'd like to think about turning spambayes into something a little
more suitable for installing (e.g. with setup.py). At the moment
we install directly into site-packages, and with module names like
'tokenizer', 'Histogram', 'msgs', 'TestDriver' and 'Options', this
is a bit naughty :) 

These would be my current suggestions for how to organise it, but I'd
like other suggestions, too:

prefix/lib/python/site-packages/
	     spambayes/    -- main body of code
                       chi2.py
                       classifier.py
                       hammie.py
                       heapq.py -- if not py > 2.2
                       Histogram.py
                       mboxutils.py
                       msgs.py
                       Options.py
                       sets.py -- if not py > 2.2
                       TestDriver.py
                       Tester.py
                       tokenizer.py

scripts: the current setup.py installs all of the following as scripts:
cmp.py        HistToGNU.py   neiltrain.py  splitndirs.py  unheader.py
hammiecli.py  loosecksum.py  rates.py      table.py
hammie.py     mboxcount.py   rebal.py      timcv.py
hammiesrv.py  mboxtest.py    runtest.sh    timtest.py

I'd suggest that the only things that should be installed by default 
as scripts are:
  hammiecli.py
  hammiesrv.py
  hammie.py
  neiltrain.py
  neilfilter.py -- but maybe with a better name?
  unheader.py   -- but maybe with a better name?
  pop3proxy.py  -- but maybe with a better name?
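
A minimal sketch of what a setup.py for this layout might look like.  The
metadata values are placeholders and the helper names (setup_kwargs, main)
are hypothetical, not anything in the current CVS tree.  heapq.py and sets.py
would only be bundled for older Pythons and are not handled here.

```python
# Hypothetical setup.py sketch for the proposed layout. Names and metadata
# are assumptions for illustration only.

# Scripts suggested above for default installation; the test-harness scripts
# (timtest.py, timcv.py, rates.py, ...) are deliberately left out.
USER_SCRIPTS = [
    "hammiecli.py",
    "hammiesrv.py",
    "hammie.py",
    "neiltrain.py",
    "neilfilter.py",
    "unheader.py",
    "pop3proxy.py",
]


def setup_kwargs():
    """Return the keyword arguments that would be passed to setup()."""
    return {
        "name": "spambayes",
        "version": "0.0",           # placeholder, not a real release number
        "packages": ["spambayes"],  # modules live under spambayes/, not top level
        "scripts": list(USER_SCRIPTS),
    }


def main():
    # distutils.core.setup accepts the same keyword arguments.
    from setuptools import setup
    setup(**setup_kwargs())

# A real setup.py would end by calling main().
```

With the modules inside a package, callers would import them as, say,
"from spambayes import tokenizer" rather than relying on a top-level module
named 'tokenizer'.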

Anyone else see anything I missed?

Anthony

From B-Morgan@concentric.net  Fri Nov  1 06:30:35 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Thu, 31 Oct 2002 23:30:35 -0700
Subject: [Spambayes] Outlook Plugin Questions
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMELJGMAA.mhammond@skippinet.com.au>
Message-ID: <NABBJOOEOFODEALNMJAJOEPNHAAA.B-Morgan@concentric.net>

Where should future questions about the Outlook 2000 plugin be directed?

I've got a fairly large set of Rules Wizard rules that separate my incoming
mail into folders.  Where does the Spambayes plugin fit into this?  I think
I'd like it to be "first" so that the rules wizard rules just operate on
what should be ham only.

I've used SpamWeasel and SpamAssassin Pro to help me build my initial spam
corpus.  Both of these products have added header fields with their
"conclusions".  Will the presence of these fields have any adverse effect on
this code?

After following the instructions in the About... text, I've added the Spam
field to my inbox display but it contains #ERROR for all messages.  Is there
something else I need to do?

Is there a "remove" or "uninstall" procedure?  If not, is one needed?

Thanks for your continued efforts.  I'll be happy to help with whatever I
can.

Regards,

Brad Morgan


From tim.one@comcast.net  Fri Nov  1 06:56:11 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 01:56:11 -0500
Subject: [Spambayes] RE: Outlook Plugin Questions
In-Reply-To: <NABBJOOEOFODEALNMJAJOEPNHAAA.B-Morgan@concentric.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEODCDAB.tim.one@comcast.net>

[Brad Morgan]
> Where should future questions about the Outlook 2000 plugin be directed?

I think right here for now.  Some people will probably want that to go to a
different mailing list, but while people are still in the early stages of
integrating this code, *I* think it's valuable to get the chance to read
about everyone's tribulations and triumphs.  There are lots of UI issues
that are going to be puzzles for everyone.

> I've got a fairly large set of Rules Wizard rules that separate
> my incoming mail into folders.  Where does the Spambayes plugin fit into
> this?

At the wrong end at the moment, and possibly forever.

> I think I'd like it to be "first" so that the rules wizard rules just
> operate on what should be ham only.

It may not be technically feasible to go first.  Mark is hooking "item
appears in folder" events, and those appear to trigger after the Rules
Wizard is done.  Outlook doesn't have a full object model, and the Rules
Wizard appears to be unhookable.

You can get much the same effect in the end, though:  In the Outlook addin's
Define Filters dialog, select *every one* of your destination folders in the
"Filter the following folders as msgs arrive" folder selector.  It's a
multi-selection tree view, so this is easy.  The Rules Wizard runs first
regardless, but each folder you selected alerts the addin whenever a msg
ends up there.  The addin can then move or copy (your choice) the msg to an
Unsure or Spam folder (as appropriate).

> I've used SpamWeasel and SpamAssassin Pro to help me build my initial spam
> corpus.  Both of these products have added header fields with their
> "conclusions".  Will the presence of these fields have any adverse effect
> on this code?

By default, we ignore almost all header fields now.  So, no.

> After following the instructions in the About... text, I've added the Spam
> field to my inbox display but it contains #ERROR for all  messages.  Is
> there something else I need to do?

I don't know; it's a new one on me; Mark may have a better clue, but I
wouldn't count on it at once -- the Outlook API for setting custom fields is
a mess, and doesn't appear to work as documented.  We've both spent hours
today trying to make better sense of it, and this subsystem is likely to
change.  In the meantime, just get rid of the Spam column if the #ERROR
things bother you.  The score is probably <wink> still there.  If you're
puzzled by any particular msg, select and click Anti-Spam -> Show spam clues
for current msg.

> Is there a "remove" or "uninstall" procedure?  If not, is one needed?

For what?  If just the Outlook addin, cd to the Outlook2000 directory and
run

    python addin.py --unregister

for now.  I suppose something fancier may get added later.


> Thanks for your continued efforts.  I'll be happy to help with whatever I
> can.

You're helping already!  Trying to use a thing is the best test we can get
now, and thanks.


From vanhorn@whidbey.com  Fri Nov  1 07:42:05 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Thu, 31 Oct 2002 23:42:05 -0800
Subject: [Spambayes] Email client integration -- what's needed?
References: <LNBBLJKPBEHFEDALKOLCMEMDCDAB.tim.one@comcast.net>
Message-ID: <3DC2304C.D634ADDD@whidbey.com>

Tim Peters wrote:

> [Skip Montanaro]
> > On a sheer numbers basis, your target audience is definitely Outlook and
> > Outlook Express users.  The rest of it is just noise.
>
> This is so sadly true.  If Netscape Communicator still survives in some
> form, that would be a good one too.  I have a sister who uses that, and
> better that someone else try to make her happy.

Outlook Express clearly has the big numbers among the numb, and Outlook among
the pointy haired, but looking over my incoming mail at various accounts I don't
think they are the majority yet. Bigger than anything else, but there is a lot
of "anything else" out there.

Thank God.

Van


--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From anthony@interlink.com.au  Fri Nov  1 07:54:47 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 01 Nov 2002 18:54:47 +1100
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
Message-ID: <200211010754.gA17slK11570@localhost.localdomain>

comments in tokenizer.py:

        # Dang -- I can't use Sender:.  If I do,
        #     'sender:email name:python-list-admin'
        # becomes the most powerful indicator in the whole database.
        #
        # From:         # this helps both rates
        # Reply-To:     # my error rates are too low now to tell about this
        #               # one (smalls wins & losses across runs, overall
        #               # not significant), so leaving it out

So now we have things like h/s mean/sdev, we get more useful data. I tried
enabling tokenization of both 'sender' and 'reply-to' (and both) along with
the 'from' line. The left-hand column is the default.

filename:           from  from+sender  from+replyto  from+sender+replyto
ham:spam:     11192:1826   11192:1826    11192:1826           11192:1826
fp total:              7            6             7                    6
fp %:               0.06         0.05          0.06                 0.05
fn total:              5            4             5                    4
fn %:               0.27         0.22          0.27                 0.22
unsure t:             80           82            80                   81
unsure %:           0.61         0.63          0.61                 0.62
real cost:        $91.00       $80.40        $91.00               $80.20
best cost:        $28.00       $27.20        $28.20               $25.80
h mean:             0.62         1.32          0.63                 1.11
h sdev:             4.27         4.42          4.19                 4.19
s mean:            98.69        98.66         98.68                98.65
s sdev:             7.69         7.86          7.74                 7.92
mean diff:         98.07        97.34         98.05                97.54
k:                  8.20         7.93          8.22                 8.05


Summary: 'sender' was an across-the-board lose for me. It knocked out
a fp and a fn, but did considerable damage to both ham mean and sdev,
and spam mean and sdev.
'reply-to' tightened up ham scores, and loosened spam scores (but not as
much). I'd suggest re-enabling reply-to with the following patch:

--- tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
+++ tokenizer.py        1 Nov 2002 07:51:34 -0000
@@ -1082,10 +1082,9 @@
         # becomes the most powerful indicator in the whole database.
         #
         # From:         # this helps both rates
-        # Reply-To:     # my error rates are too low now to tell about this
-        #               # one (smalls wins & losses across runs, overall
-        #               # not significant), so leaving it out
-        for field in ('from',):
+        # Reply-To:     # this tightens up ham for me (anthony) and makes spam
+        #               # slightly worse (but the ham improvement is more) 
+        for field in ('from', 'reply-to'):
             prefix = field + ':'
             x = msg.get(field, 'none').lower()
             for w in x.split():
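
For anyone repeating the test, here is a standalone sketch of what the
patched loop does with the 'from' and 'reply-to' headers.  The function name
is hypothetical, and the patch does not show what happens to each word inside
the inner loop, so yielding prefix + word directly is an assumption.

```python
from email import message_from_string


def tokenize_address_headers(msg, fields=("from", "reply-to")):
    """Yield 'field:word' tokens for each whitespace-separated header word."""
    for field in fields:
        prefix = field + ":"
        # A missing header produces the single token 'field:none'.
        value = msg.get(field, "none").lower()
        for word in value.split():
            yield prefix + word


raw = (
    "From: Alice Example <alice@example.com>\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
tokens = list(tokenize_address_headers(message_from_string(raw)))
# tokens == ['from:alice', 'from:example', 'from:<alice@example.com>',
#            'reply-to:none']
```

Note that a message with no Reply-To: still contributes a 'reply-to:none'
token, so the absence of the header is itself a clue.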

Someone else want to repeat this test?

Anthony

From richie@entrian.com  Fri Nov  1 09:17:12 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 01 Nov 2002 09:17:12 +0000
Subject: [Spambayes] Re: pop3proxy bug?
In-Reply-To: <e5qtru8kergge8t212u8jambe3vtkfta8r@4ax.com>
References: <E1856rA-0004AP-00@usw-sf-list1.sourceforge.net>
	<e5qtru8kergge8t212u8jambe3vtkfta8r@4ax.com>
Message-ID: <71h4suo5nknl0sifno0q2vql97jaf0hs9b@4ax.com>


> Once I've reproduced the problem on Linux, I'll apply, test and
> commit that fix - thanks.

Done.  And all without a Linux box.  Three very slooooow cheers for Bochs
and mxCGIPython...

I'm still not sure the pop3proxy self-test works properly on Linux, but I
think that's a threading issue in the test code itself - the main program
works fine.

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Fri Nov  1 14:33:10 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 08:33:10 -0600
Subject: [Spambayes] package-ifying spambayes.
In-Reply-To: <200211010535.gA15Zat10734@localhost.localdomain>
References: <200211010535.gA15Zat10734@localhost.localdomain>
Message-ID: <15810.37030.424053.826099@montanaro.dyndns.org>


    Anthony> I'd like to think about turning spambayes into something a
    Anthony> little more suitable for installing (e.g. with setup.py). At
    Anthony> the moment we install directly into site-packages, and with
    Anthony> module names like 'tokenizer', 'Histogram', 'msgs',
    Anthony> 'TestDriver' and 'Options', this is a bit naughty :)

Go for it.  My knowledge of distutils is minimal, at best.  I seem to recall
asking about this awhile ago.  Your structure seems fine to me.


    Anthony> scripts: the current setup.py installs all of the following as
    Anthony> scripts: 
    Anthony> cmp.py        HistToGNU.py   neiltrain.py  splitndirs.py  unheader.py
    Anthony> hammiecli.py  loosecksum.py  rates.py      table.py
    Anthony> hammie.py     mboxcount.py   rebal.py      timcv.py
    Anthony> hammiesrv.py  mboxtest.py    runtest.sh    timtest.py

    Anthony> I'd suggest that the only things that should be installed by
    Anthony> default as scripts are:
    Anthony>   hammiecli.py
    Anthony>   hammiesrv.py
    Anthony>   hammie.py
    Anthony>   neiltrain.py
    Anthony>   neilfilter.py -- but maybe with a better name?
    Anthony>   unheader.py   -- but maybe with a better name?
    Anthony>   pop3proxy.py  -- but maybe with a better name?

I take it you're segregating the script population into stuff which is
generally useful and stuff which is useful only for testing?  Seems
reasonable.  Again, my lack of distutils experience rears its ugly head.  I
think an install_test target might be useful (as in "python setup.py
install_test") but don't know how to implement it and was too lazy to figure
it out at the time.

Skip


From tdickenson@devmail.geminidataloggers.co.uk  Fri Nov  1 14:34:33 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 1 Nov 2002 14:34:33 +0000
Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEMLCDAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEMLCDAB.tim.one@comcast.net>
Message-ID: <200211011433.02149.tdickenson@geminidataloggers.com>

On Friday 01 November 2002 3:20 am, Tim Peters wrote:

> e.g., for a tech guy to have the word "computer" as high-spamprob word is
> suspicious all by itself:
> > computer 0.851704776271

It's a 0.88 for me too, due to "If you want to make money with your computer"
spam.


From skip@pobox.com  Fri Nov  1 14:39:50 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 08:39:50 -0600
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211010754.gA17slK11570@localhost.localdomain>
References: <200211010754.gA17slK11570@localhost.localdomain>
Message-ID: <15810.37430.792347.935488@montanaro.dyndns.org>


    Anthony> So now we have things like h/s mean/sdev, we get more useful
    Anthony> data. I tried enabling tokenization of both 'sender' and
    Anthony> 'reply-to' (and both) along with the 'from' line. 

I have a patch locally to generate to and cc tokens on a per-domain basis
(e.g. "To: skip@mojam.com, james@bond.net" would generate "to:@mojam.com"
and "to:@bond.net").  While this might not be useful for the current crop of
users, I think it will be useful for people who have abandoned email
addresses in the past for whatever reason.  *@calendar.com gets nothing but
spam these days, for example, though I don't publicize it any longer (it
still works).  I'm headed out right now but will generate some new results
while I wait for my car to be serviced and get something out later today.

Skip

From agmsmith@rogers.com  Fri Nov  1 15:13:46 2002
From: agmsmith@rogers.com (Alexander G. M. Smith)
Date: Fri, 01 Nov 2002 10:13:46 EST (-0500)
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail"
In-Reply-To: <20021101004600.GB28132@rmunnlfs>
Message-ID: <733561455-BeMail@CR593174-A>

Robin Munn  wrote:
> Other possibilities:
> "unwanted email" instead of "spam"
> "wanted email" instead of "ham"
> 
> [Insert your clever idea here]

I use "Genuine" mail for the good stuff in my documentation:
http://members.rogers.com/agmsmith/beos/AGMSBayesianSpam.Documentation/index.html

Also remember to include a reference to the Monty Python skit,
otherwise your documentation won't be complete!

- Alex


From skip@pobox.com  Fri Nov  1 16:26:21 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 10:26:21 -0600
Subject: [Spambayes] tokenizing to: and cc:
Message-ID: <15810.43821.983763.629427@montanaro.dyndns.org>


I made a change to tokenizer.py (just locally for now) to tokenize the
domains mentioned in to: and cc: headers:

Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.59
diff -c -r1.59 tokenizer.py
*** tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
--- tokenizer.py        1 Nov 2002 16:17:23 -0000
***************
*** 6,11 ****
--- 6,12 ----
  import email
  import email.Message
  import email.Errors
+ import email.Utils
  import re
  import math
  import time
***************
*** 1098,1104 ****
          for field in ('to', 'cc'):
              count = 0
              for addrs in msg.get_all(field, []):
!                 count += len(addrs.split(','))
              if count > 0:
                  yield '%s:2**%d' % (field, round(log2(count)))
  
--- 1099,1112 ----
          for field in ('to', 'cc'):
              count = 0
              for addrs in msg.get_all(field, []):
!                 addrs = map(email.Utils.parseaddr, addrs.split(','))
!                 count += len(addrs)
!                 if options.generate_to_domains:
!                     # also generate tokens containing the destination domain
!                     for name,addr in addrs:
!                         yield '%s:@%s' % (field,
!                                           (addr.split("@")[1:] or
!                                            ["local"])[0])
              if count > 0:
                  yield '%s:2**%d' % (field, round(log2(count)))

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.63
diff -c -r1.63 Options.py
*** Options.py  28 Oct 2002 20:19:46 -0000      1.63
--- Options.py  1 Nov 2002 16:19:38 -0000
***************
*** 111,116 ****
--- 111,120 ----
  # spam indicator.
  replace_nonascii_chars: False
  
+ # If true, generate tokens from the to and cc fields containing the destination
+ # domains, e.g. 'To: skip@pobox.com' would generate to:@pobox.com
+ generate_to_domains: False
+ 
  [TestDriver]
  # These control various displays in class TestDriver.Driver, and Tester.Test.
  
***************
*** 323,328 ****
--- 327,333 ----
                    'basic_header_tokenize': boolean_cracker,
                    'basic_header_tokenize_only': boolean_cracker,
                    'basic_header_skip': ('get', lambda s: Set(s.split())),
+                   'generate_to_domains': boolean_cracker,
                    'replace_nonascii_chars': boolean_cracker,
                   },
      'TestDriver': {'nbuckets': int_cracker,

I think this should be turned into a separate pass over the to: and cc:
headers to simplify the logic and move the option test out of the inner
loop.
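
A rough sketch of that separate-pass shape, written as a standalone function
against the modern email.utils module rather than as a patch to tokenizer.py.
The function name and the generate_to_domains keyword argument are
hypothetical stand-ins for the option lookup.

```python
from email import message_from_string
from email.utils import getaddresses
from math import log2


def tokenize_recipients(msg, generate_to_domains=False):
    """Yield recipient-count tokens, then (optionally) per-domain tokens."""
    # Pass 1: the existing log2-bucketed recipient count per header field.
    for field in ("to", "cc"):
        pairs = getaddresses(msg.get_all(field, []))
        if pairs:
            yield "%s:2**%d" % (field, round(log2(len(pairs))))

    # Pass 2: the option is tested once, outside the per-address loop.
    if generate_to_domains:
        for field in ("to", "cc"):
            for _name, addr in getaddresses(msg.get_all(field, [])):
                _local, sep, domain = addr.partition("@")
                # Addresses without a domain are lumped together as 'local'.
                yield "%s:@%s" % (field, domain if sep else "local")


raw = (
    "To: skip@mojam.com, james@bond.net\n"
    "Cc: someone\n"
    "\n"
    "body\n"
)
tokens = list(tokenize_recipients(message_from_string(raw), True))
# tokens == ['to:2**1', 'cc:2**0', 'to:@mojam.com', 'to:@bond.net', 'cc:@local']
```

Parsing the headers twice costs a little time but keeps the counting logic
untouched when the option is off.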

Here's a summary of the results:

    % python table.py base.txt to.txt
    -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
    ...
    filename:     base      to
    ham:spam:  2000:2000      
                       2000:2000
    fp total:        8       7
    fp %:         0.40    0.35
    fn total:       21      22
    fn %:         1.05    1.10
    unsure t:       95      87
    unsure %:     2.38    2.17
    real cost: $120.00 $109.40
    best cost:  $79.80  $79.60
    h mean:       0.79    0.79
    h sdev:       7.43    7.43
    s mean:      97.41   97.46
    s sdev:      12.53   12.53
    mean diff:   96.62   96.67
    k:            4.84    4.84

base.ini is

    [TestDriver]
    show_unsure: True

to.ini is

    [Tokenizer]
    generate_to_domains: True

    [TestDriver]
    show_unsure: True

All things considered, I think it did pretty well for me.  It dropped the
unsure percentage a bit and spread the ham and spam means a bit further
apart.  As I mentioned earlier, I think this option may be useful for people
with inactive, but still operational, email addresses.  Over time, those
addresses will tend to get nothing but spam.  (It would thus be important
not to train on messages sent to those addresses before, or shortly after,
they were abandoned.)
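
For reading the "real cost" row: the figures in the table are reproduced by
charging $10 per false positive, $1 per false negative and $0.20 per unsure.
Those weights are inferred from the numbers above and assumed to be the test
driver's defaults.  The function name is hypothetical.

```python
def real_cost(fp, fn, unsure, fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Weighted cost of a run; the default weights are assumed, not confirmed."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


base = real_cost(fp=8, fn=21, unsure=95)     # 80 + 21 + 19.00
with_to = real_cost(fp=7, fn=22, unsure=87)  # 70 + 22 + 17.40
print("base: $%.2f  to: $%.2f" % (base, with_to))
# prints "base: $120.00  to: $109.40", matching the table
```

Under those weights, most of the improvement comes from the one fewer false
positive, with the eight fewer unsures contributing $1.60 and the extra false
negative costing $1.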

Should I rework the patch and check it in?

Skip

From tdickenson@devmail.geminidataloggers.co.uk  Fri Nov  1 16:42:49 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 1 Nov 2002 16:42:49 +0000
Subject: [Spambayes] hammie appending headers
Message-ID: <200211011642.49043.tdickenson@devmail.geminidataloggers.co.uk>

Hammie currently appends the X-Hammie-Disposition header. Any existing
X-Hammie-Disposition headers are left intact. I think we should be removing
them, to prevent spammers (or testers ;-) adding headers that confuse
downstream filters.


Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.33
diff -c -4 -r1.33 hammie.py
*** hammie.py   27 Oct 2002 22:56:15 -0000      1.33
--- hammie.py   1 Nov 2002 16:39:25 -0000
***************
*** 266,273 ****
--- 266,274 ----
          else:
              disp = "Unsure"
          disp += "; %.2f" % prob
          disp += "; " + self.formatclues(clues)
+         del msg[header]
          msg.add_header(header, disp)
          return msg.as_string(unixfrom=(msg.get_unixfrom() is not None))

      def train(self, msg, is_spam):
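
A small standalone demonstration of why the one-line change is enough:
deleting a header name from an email Message removes every occurrence and
does nothing if the header is absent.  The function name and the sample
header values are made up for illustration.

```python
from email import message_from_string

HEADER = "X-Hammie-Disposition"


def stamp(msg, disposition, header=HEADER):
    """Replace any existing disposition headers with a single fresh one."""
    # Removes all occurrences; no error is raised when there are none.
    del msg[header]
    msg.add_header(header, disposition)
    return msg


raw = (
    "X-Hammie-Disposition: No; 0.00; forged by sender\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
msg = stamp(message_from_string(raw), "Yes; 0.99")
values = msg.get_all(HEADER)
# values == ['Yes; 0.99']
```

Without the del, the forged header would remain alongside the appended one
and a downstream filter could match whichever it saw first.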


From francois.granger@free.fr  Fri Nov  1 17:47:10 2002
From: francois.granger@free.fr (François Granger)
Date: Fri, 1 Nov 2002 18:47:10 +0100
Subject: [Spambayes] Recently discovered this work.
Message-ID: <a05100300b9e86e6b6a26@[192.168.1.11]>

Config: MacOS 9.1 MacPython 2.2.2

I started developing on the idea of Bayesian filtering at the end of 
August after reading the article. It took me time (spare time) to 
arrive at the point where I had a set of scripts with most of the 
functionality needed to do it. I discovered the work you have done a 
few days ago. I guess I can stop my development because it can't 
compare to yours.

Along the way, I discovered two issues.

The email package is fragile at decoding Eudora messages with 
enclosures, whether I get them by OSA (similar to COM on Windows) or by 
direct access to the mbox files. I went back to using the rfc822 
package instead because it was more robust, if less sophisticated. I 
don't know if this comes from Eudora not conforming to the 
standards.

I downloaded your software and tried to use the tokenizer on my 
stored mail messages to understand how it was working. I can't make 
it work, even after modifying it a little. I wrote a small script to 
show the issue. If anyone is interested, I can send both the script 
and a mail message on which it hangs.

As a side note I have two more questions.

The current software, as downloaded from SF on Oct 29, seems to be
difficult to use on MacOS 9. I would be interested in having the
POP3 proxy version working. The other way of using such a filter
would be to have plug-ins to interact with the various mail
clients. I implemented this in my development and have three plug-ins:
for mail stored as files, for Eudora, and for Entourage. They are not
really nice but the idea is there.

What about multilingual situations? On average, I think my spam
splits like this: 80% is English, 12% is French, 5% is Spanish, and
3% is German. That is not counting Asian ones, which I easily filter on
encoding and strange chars. How would this technique do in such a
situation? I started to develop a language discriminator in order to
automatically sort by main language and then use frequency databases
for each language. I don't know if this is needed?


-- 
Le courrier électronique est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies.
Pour des courriers propres : http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net  Fri Nov  1 20:24:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 15:24:42 -0500
Subject: [Spambayes] Recently discovered this work.
In-Reply-To: <a05100300b9e86e6b6a26@[192.168.1.11]>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEBFCFAB.tim.one@comcast.net>

[François Granger]
> Config: MacOS 9.1 MacPython 2.2.2

I don't have experience with either, but others here do.  Keep posting here
until they admit it <wink>.

> I started developing on the idea of Bayesian filtering at the end of
> August after reading the article. It took me time (spare time) to
> arrive at the point where I had a set of scripts with most of the
> functionality needed to do it. A few days ago I discovered the work
> you have done. I guess I can stop my development because it can't
> compare to yours.

Unclear:  at least yours worked for you!

> Along the way, I discovered two issues.
>
> The email package is fragile at decoding Eudora messages with
> enclosures, whether I get them by OSA (similar to COM on Windows) or by
> direct access to the mbox files. I went back to using the rfc822
> package instead because it was more robust, if less sophisticated. I
> don't know if this comes from Eudora not conforming to the
> standards.

I don't know either; we would need specific examples.

> I downloaded your software and tried to use the tokenizer on my
> stored mail messages to understand how it works. I can't make
> it work, even after modifying it a little. I wrote a small script
> to show the issue. If anyone is interested, I can send
> both the script and a mail message on which it hangs.

Hangs?  That's hard to imagine -- there aren't any unbounded loops in the
tokenizer.  It could be that a regexp search is taking a very long time,
although I tried to cut the legs off that possibility too.

> As a side note I have two more questions.
>
> The current software, as downloaded from SF on Oct 29, seems to be
> difficult to use on MacOS 9. I would be interested in having the
> POP3 proxy version working. The other way of using such a filter
> would be to have plug-ins to interact with the various mail
> clients. I implemented this in my development and have three plug-ins:
> for mail stored as files, for Eudora, and for Entourage. They are not
> really nice but the idea is there.

Sorry, I didn't find a question in there.

> What about multilingual situations? On average, I think my spam
> splits like this: 80% is English, 12% is French, 5% is Spanish, and
> 3% is German.  That is not counting Asian ones, which I easily filter on
> encoding and strange chars.

There appears no need to special-case Asian spam with this code.  It
generates a bunch of tokens that are virtually unique to Asian spam, and
they quickly get very high spamprobs upon training.  The non-default option

[Tokenizer]
replace_nonascii_chars: True

accelerates learning for Asian spam, but at the cost of replacing *all*
high-bit chars.

> How would this technique do in such a situation?

Can't say:  you didn't say how much of your ham (non-spam) is English,
French, Spanish and German.  I expect it will work fine, as all those
languages (as opposed to some Asian languages) use whitespace too, and the
tokenizer merely splits on whitespace.  This code is *certainly* better than
I am at distinguishing ham from spam in non-English languages, but that's
not saying much.  Try it!
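
The whitespace point can be illustrated with a minimal sketch. This is not the project's tokenizer, which does a good deal more; the name `split_tokens` and the lowercasing step are assumptions made only for illustration.

```python
def split_tokens(text):
    """Split text on runs of whitespace and lowercase each piece.

    Hypothetical helper: nothing here knows which language it is reading.
    """
    return [piece.lower() for piece in text.split()]


french = "Le courrier électronique est un moyen de communication."
english = "Electronic mail is a means of communication."

# Both sentences go through exactly the same code path.
print(split_tokens(french))
print(split_tokens(english))
```

Any language that separates words with whitespace produces usable tokens this way, which is why no per-language handling is expected to be necessary.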

> I started to develop a language discriminator in order to
> automatically sort by main language and then use frequency databases
> for each language. I don't know if this is needed?

I doubt it's necessary, and somewhat doubt it would even be helpful.  The
tokenizer has no concept of semantics, it's just crunching strings, and
doesn't know beans about English as opposed to anything else.  You may need
more training to get comparable results, or maybe not.  Nobody has tested
this yet.

> --
> Le courrier électronique est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies.

My personal email classifier was sure your msg was ham:

Spam Score: 2.82082e-007

'*H*'                          0.999999
'*S*'                          6.70774e-009

but some of the French words in your sig had high spamprobs:

'sur'                          0.908163
'les'                          0.969799
'est'                          0.973373

This reflects that I personally get a lot more French spam than French ham.
Your classifier is very likely to score these differently, and that's a
great strength of the system for personal use; do note that it has no idea
these words *are* French.  It doesn't even know they're words, for that
matter <wink>.
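
Why the same token can score so differently for two people can be sketched with a deliberately simplified per-word estimate. This is an assumption for illustration, not the project's actual formula: it just compares how often a word shows up in a user's spam with how often it shows up in their ham. The function name and the counts are made up.

```python
def word_spamprob(spam_hits, ham_hits, nspam, nham):
    """Simplified per-word spam probability from one user's training counts.

    Assumes the word was seen at least once, so the denominator is nonzero.
    """
    spam_ratio = spam_hits / nspam
    ham_ratio = ham_hits / nham
    return spam_ratio / (spam_ratio + ham_ratio)


# Made-up counts: someone who sees 'est' almost only in spam ...
mostly_english_user = word_spamprob(spam_hits=30, ham_hits=1, nspam=100, nham=100)
# ... versus someone whose ham is largely French.
french_user = word_spamprob(spam_hits=30, ham_hits=60, nspam=100, nham=100)

print(mostly_english_user, french_user)
```

The token is identical in both cases; only the personal training data differs.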


From tim.one@comcast.net  Fri Nov  1 20:39:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 15:39:01 -0500
Subject: [Spambayes] tokenizing to: and cc:
In-Reply-To: <15810.43821.983763.629427@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBGCFAB.tim.one@comcast.net>

[Skip Montanaro]
> I made a change to tokenizer.py (just locally for now) to tokenize the
> domains mentioned in to: and cc: headers:
>
> ...
>           for field in ('to', 'cc'):
>               count = 0
>               for addrs in msg.get_all(field, []):
> !                 addrs = map(email.Utils.parseaddr, addrs.split(','))
> !                 count += len(addrs)
> !                 if options.generate_to_domains:
> !                     for name,addr in addrs:
> !                         yield '%s:@%s' % (field,
> !                                           (addr.split("@")[1:] or
> !                                            ["local"])[0])
>               if count > 0:
>                   yield '%s:2**%d' % (field, round(log2(count)))

> ...
> I think this should be turned into a separate pass over the to: and cc:
> headers to simplify the logic and move the option test out of the inner
> loop.

The time cost is trivial.
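
For readers following along on a current Python, the quoted patch can be restated as a standalone sketch. It keeps the token formats shown in the patch (`to:@domain` and `to:2**n`), but swaps the comma split plus `parseaddr` for `email.utils.getaddresses`, which also copes with commas inside quoted display names. The function name `recipient_tokens` and its keyword argument are hypothetical.

```python
import math
from email import message_from_string
from email.utils import getaddresses


def recipient_tokens(msg, generate_to_domains=True):
    """Yield domain tokens and a log2 count bucket for To: and Cc: headers."""
    for field in ("to", "cc"):
        pairs = getaddresses(msg.get_all(field, []))
        if generate_to_domains:
            for _name, addr in pairs:
                # Addresses without an '@' are treated as local, as in the patch.
                domain = addr.split("@", 1)[1] if "@" in addr else "local"
                yield "%s:@%s" % (field, domain)
        if pairs:
            yield "%s:2**%d" % (field, round(math.log2(len(pairs))))


msg = message_from_string(
    "To: a@example.com, Bob <b@example.org>\nSubject: hi\n\nbody\n"
)
print(list(recipient_tokens(msg)))
```

Here the option test sits outside the inner loop, which is the restructuring Skip proposed.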

> Here's a summary of the results:
>
>     % python table.py base.txt to.txt
>     -> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
>     ...
>     filename:     base      to
>     ham:spam:  2000:2000
>                        2000:2000
>     fp total:        8       7
>     fp %:         0.40    0.35
>     fn total:       21      22
>     fn %:         1.05    1.10
>     unsure t:       95      87
>     unsure %:     2.38    2.17
>     real cost: $120.00 $109.40
>     best cost:  $79.80  $79.60

This says you could have got more benefit simply by changing your ham_cutoff
and spam_cutoff values.  If you had picked the best possible in both cases,
the total difference would have been 1 unsure msg (79.80 - 79.60 = 0.20, the
default "cost" of one unsure).  See your "all runs" histograms for more info
about that.
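
The arithmetic behind the "real cost" row can be checked with a small sketch. Only the $0.20 per unsure is stated above; the $10.00 per false positive and $1.00 per false negative are assumptions inferred from the quoted table, chosen because they reproduce its figures. The function name is made up.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted cost of one test run (weights other than unsure_cost are assumed)."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


base = total_cost(fp=8, fn=21, unsure=95)
to = total_cost(fp=7, fn=22, unsure=87)
print("base: $%.2f  to: $%.2f" % (base, to))
# base: $120.00  to: $109.40
```

Under these weights the $10.60 gap between the two runs comes from one fewer false positive, one more false negative and eight fewer unsures, all of which move when the cutoffs move.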

>     h mean:       0.79    0.79
>     h sdev:       7.43    7.43
>     s mean:      97.41   97.46
>     s sdev:      12.53   12.53
>     mean diff:   96.62   96.67
>     k:            4.84    4.84

> ...
> All things considered, I think it did pretty well for me.  It dropped the
> unsure percentage a bit

Changing cutoffs can do the same.

> and spread the ham and spam means a bit further apart.

A change of 0.05 relative to 96.62 is insignificant.

> As I mentioned earlier, I think this option may be useful
> for people with inactive, but still operational, email addresses.
> Over time, those addresses will tend to get nothing but spam.  (It
> would thus be important to not train on messages sent to those
> addresses before or shortly after during abandonment.)
>
> Should I rework the patch and check it in?

I'm -0, but would become +1 if it really helped someone, or nailed cases
that can't be nailed via a more general gimmick.

For example, what if you were to introduce an option to fully tokenize To:
and Cc: addresses instead?  We don't even catch "Undisclosed Recipients"
now.  We ignore addressees by default because it becomes a killer-strong
clue for bogus reasons when training with mixed-source corpora (e.g.,

    To: bruceg@whatever

is in thousands & thousands of BruceG's spams).


From tim.one@comcast.net  Fri Nov  1 20:49:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 15:49:46 -0500
Subject: [Spambayes] RE: Re: [Design] Contacts (Michael R. Bernstein)
In-Reply-To: <200211011433.02149.tdickenson@geminidataloggers.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBICFAB.tim.one@comcast.net>

[Tim Peters, on Jeremy's poorly-scoring example]
> e.g., for a tech guy to have the word "computer" as
> high-spamprob word is suspicious all by itself:
>
>> computer 0.851704776271

[Toby Dickenson]
> It's a 0.88 for me too, due to "If you want to make money with
> your computer" spam.

I believe it.  In context, Jeremy had many computer*ish* words scoring with
high spamprobs, and many mailing-list lexicalisms not scoring with low
spamprobs, and some obvious spam words not scoring with high spamprobs.
Jeremy has said in the past that he's inclined to train only on mistakes,
and I've raised as many cautions about that as I can.  The system was
intended from the start to be trained on a random sampling of all your ham
and spam.  Every time someone has sent me a "surprising msg", my personal
classifier has absolutely nailed it in the correct category; I don't think
that's because I know a secret way to start Python <wink>, but suspect it's
because I've made sustained attempts to train my personal classifier on a
"random slice of real life" every day (including a representative sampling
of duplicates when I get a single ham or spam from multiple sources).  This
gives it a reality-driven view of the probabilities instead of a
mistake-driven view, and also adapts its view of both as time goes on.


From francois.granger@free.fr  Fri Nov  1 20:51:33 2002
From: francois.granger@free.fr (François Granger)
Date: Fri, 1 Nov 2002 21:51:33 +0100
Subject: [Spambayes] Email client integration -- what's needed?
Message-ID: <a05100305b9e899ce4de1@[192.168.1.11]>

At 18:07 -0600 on 31/10/02, in message Re: [Spambayes] Email client 
integration -- what's need, Tim@mail.powweb.com, 
Stone@mail.powweb.com,
	Four Stones Expre wrote:
>but also for altering the behavior of
>spammers.  This second consideration is actually much more powerful 
>than the first.
>
>So... unless we want this to simply be interesting research, we 
>gotta take it to the masses....

I think that this is the real aim: making it so hard for the spammers
that they stop.

For this, I think that we need a server version, not too strict, for
the mail server. It may catch 80 to 90% of the spam. And a client
version, maybe as a pop3 proxy, to remove the remaining spam at the
user level.
-- 
Le courrier électronique est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies.
Pour des courriers propres : http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net  Fri Nov  1 21:48:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 16:48:48 -0500
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211010754.gA17slK11570@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEBOCFAB.tim.one@comcast.net>

[Anthony Baxter]
> comments in tokenizer.py:
>
>      # Dang -- I can't use Sender:.  If I do,
>      #     'sender:email name:python-list-admin'
>      # becomes the most powerful indicator in the whole database.
>      #
>      # From:         # this helps both rates
>      # Reply-To:     # my error rates are too low now to tell about this
>      #               # one (smalls wins & losses across runs, overall
>      #               # not significant), so leaving it out
>
> So now we have things like h/s mean/sdev, we get more useful data.

I'm tempted to drop them!  mean/sdev were useful under schemes with real
systematic overlap between the population scores, but chi-combining is so
extreme that overlaps simply aren't due to random effects.

> I tried enabling tokenization of both 'sender' and 'reply-to' (and both)
> along with the 'from' line. The left-hand column is the default.
>
> filename:     from from+sender     from+sender+replyto
>                            from+replyto
> ham:spam:  11192:1826      11192:1826
>                    11192:1826      11192:1826
> fp total:        7       6       7       6
> fp %:         0.06    0.05    0.06    0.05
> fn total:        5       4       5       4
> fn %:         0.27    0.22    0.27    0.22
> unsure t:       80      82      80      81
> unsure %:     0.61    0.63    0.61    0.62
> real cost:  $91.00  $80.40  $91.00  $80.20
> best cost:  $28.00  $27.20  $28.20  $25.80
> h mean:       0.62    1.32    0.63    1.11
> h sdev:       4.27    4.42    4.19    4.19
> s mean:      98.69   98.66   98.68   98.65
> s sdev:       7.69    7.86    7.74    7.92
> mean diff:   98.07   97.34   98.05   97.54
> k:            8.20    7.93    8.22    8.05
>
> Summary: 'sender' was an across-the-board lose for me.

I disagree:  it looks like it had no significant effect either way; indeed,
I don't see a solid difference across any of these runs.

> It knocked out a fp and a fn, but did considerable damage to both
> ham mean and sdev, and spam mean and sdev.

This really doesn't matter for chi-combining.  The arithmetic mean is
supremely sensitive to scores "at the wrong end", and so is the sdev.  The
percentiles shown at the top of the all-runs histogram displays are much
better measures for extreme schemes (the median is barely affected at all by
a single exceptionally large or small value; the mean can be affected a
lot).  It's likely that some of your extreme FN and FP simply got even more
extreme across these runs, and just a handful of "bad case" msgs can have a
large effect on mean and sdev.  Under gary-combining, where the population
means were sometimes less than 50 points apart, and scores of both kinds
near 50 were relatively common, the k value seemed to have good predictive
power (high k <-> gary-combining worked well on the corpus).  But under
chi-combining, it appears to have none (indeed, in the table above, the two
runs with the lowest error rates and lowest "best cost" were the two with
the *lowest* k values, not the highest).
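
How strongly one badly scored message moves the mean and sdev, while leaving the median alone, can be seen in a minimal sketch. The scores are made up (99 hams scoring 0 and one scoring 100) and are not taken from any run discussed here.

```python
import statistics

# Made-up ham scores on a 0-100 scale: one extreme false positive.
ham_scores = [0.0] * 99 + [100.0]

mean = statistics.mean(ham_scores)
median = statistics.median(ham_scores)
sdev = statistics.pstdev(ham_scores)

print("mean=%.2f median=%.2f sdev=%.2f" % (mean, median, sdev))
```

A single outlier lifts the mean to 1.0 and the sdev to nearly 10, while the median stays at 0.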

IOW, for chi-combining, believe the error rates, not the mean/sdev
statistics.  Across all these runs, a score "in the middle" is about 8 sdevs
away from both means, and that's astronomically large when viewed against
the tightness of the chi-combining score distributions (extremely clustered
near 0.0 for ham and near 1.0 for spam -- look at the percentiles, or look
at your histograms, to *see* this).
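
The "about 8 sdevs" figure matches the k row of the quoted table. The formula below is an assumption inferred from that table rather than quoted from the code: the difference of the two means divided by the sum of the two sdevs. It reproduces the reported k values, and the function name is made up.

```python
def k_value(h_mean, h_sdev, s_mean, s_sdev):
    """Separation of the ham and spam means, in units of their combined sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# Columns 'from' and 'from+sender' of the quoted table.
k_from = k_value(h_mean=0.62, h_sdev=4.27, s_mean=98.69, s_sdev=7.69)
k_sender = k_value(h_mean=1.32, h_sdev=4.42, s_mean=98.66, s_sdev=7.86)
print("%.2f %.2f" % (k_from, k_sender))
```

Because both inputs are the outlier-sensitive mean and sdev, k inherits their sensitivity, which is consistent with it having little predictive power under chi-combining.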

> 'reply-to' tightened up ham scores, and loosened spam scores (but not as
> much). I'd suggest re-enabling reply-to with the following patch:
>
> --- tokenizer.py        31 Oct 2002 15:43:55 -0000      1.59
> +++ tokenizer.py        1 Nov 2002 07:51:34 -0000
> @@ -1082,10 +1082,9 @@
>          # becomes the most powerful indicator in the whole database.
>          #
>          # From:         # this helps both rates
> -        # Reply-To:     # my error rates are too low now to tell about this
> -        #               # one (smalls wins & losses across runs, overall
> -        #               # not significant), so leaving it out
> -        for field in ('from',):
> +        # Reply-To:     # this tightens up ham for me (anthony) and makes spam
> +        #               # slightly worse (but the ham improvement is more)

The comment isn't justified:  the increase in k value from 8.20 to 8.22 was
too tiny to be significant, and your "best cost" measure actually got worse
(but also by an insignificant amount).

> +        for field in ('from', 'reply-to'):
>              prefix = field + ':'
>              x = msg.get(field, 'none').lower()
>              for w in x.split():
>
> Someone else want to repeat this test?

I tried it before on my c.l.py test, and your test runs seem to have
confirmed the comment the patch removed -- it costs, but neither helps nor
hurts the bottom line.

Now I just tried it again on a more-general python.org test, and it did
manage to nudge 3 marginal false positives (of 5 total) below the line.  All
three were redeemed for a single reason, and I challenge you to think about
this and do something better about it than just tokenizing reply-to <wink>:

In all three cases, the new token that saved the ham's bacon was

    'reply-to:none'

IOW, there *wasn't* a Reply-To header at all in these three, and that is
indeed a mild ham clue.  The way things currently work, Reply-To *is* in the
default safe_headers list, which feeds into your old "count the mere # of
header lines of each given kind" scheme.  But if the count is 0, no token is
generated for that header line.  So, in the end, by default the *presence*
of a Reply-To header ends up being a mild spam clue, but the *absence* of a
Reply-To header doesn't get noted.  The only useful effect of tokenizing
Reply-To in my python.org test appeared to be making the absence of a
Reply-To header visible to the classifier.

So that's a test for you to think about:  despite that your results didn't
show any real improvement, if you were to just record the absence of a
Reply-To header as a positive clue, and judged your test results with the
same infectious optimism <wink>, would it have done just as good?  If so,
let's try to generalize that into a cheap and more-general "produce clues
about header absence too" gimmick (BTW, Jeremy suggested that long ago, but
nobody has followed up on it yet).
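
One way to picture a "produce clues about header absence too" gimmick is the sketch below. The header list, the function name and both token formats (`header:name:count` and `noheader:name`) are hypothetical and are not the project's actual tokens; the point is only that a count of zero yields a token instead of silence.

```python
from email import message_from_string

# Hypothetical stand-in for the real safe_headers option.
SAFE_HEADERS = ("reply-to", "x-mailer", "message-id")


def header_presence_tokens(msg, names=SAFE_HEADERS):
    """Yield one token per header name, whether or not the header is present."""
    for name in names:
        count = len(msg.get_all(name, []))
        if count:
            yield "header:%s:%d" % (name, count)
        else:
            # The absence itself becomes a clue the classifier can learn from.
            yield "noheader:%s" % name


msg = message_from_string("Message-ID: <x@y>\nSubject: hi\n\nbody\n")
print(list(header_presence_tokens(msg)))
```

Whether such absence tokens help in general would need the same kind of test runs as any other tokenizer change.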


From tim.one@comcast.net  Fri Nov  1 22:05:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 17:05:48 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15809.55308.1945.988931@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>

[Skip Montanaro]
> That's true, largely because that's what the focus of the initial phase of
> the project was supposed to be.  Even if it gets no farther than it is
> today, the process has been highly educational for me, because we have an
> expert in algorithm design (that'd be Tim) exposing his thought processes
> and mechanics for the rest of us.

I'm glad you've found it amusing <wink>.  I'm afraid "think for a second,
code for a minute, test for a day; repeat 6 times before you get a small
win" is par for the course when trying to push any decent scheme beyond the
80/20 rule (each additional 20% improvement requires 80% of all the effort
that went before).

> That said, I think the classification stuff has gone about as far as it's
> going to go.

Me too.  The classifier is hack-free now, as clean and uncompromising a
realization of the underlying math as anything can be.  The assumption of
word independence is a limitation of the approach, though.

> Future changes to the tokenizer are also likely to be incremental, so
> the major changes over the next while will be in email integration.

Yup!  Thanks to Sean and especially Mark lately, the non-Windows platforms
are a month behind on that too.  It's a curious thing about Windows:
because it is closed-source, the Windows market is homogenous enough that
one major effort there can make millions of happy campers.  I still hope
that the pop3proxy can do that for non-Windows systems too, and that's the
only advice I can offer:  find a way to use the proxy instead of pursuing
"deep integration" with unbounded dozens of quirky twenty-user email
clients.


From Tim@mail.powweb.com  Fri Nov  1 22:11:35 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Fri, 01 Nov 2002 16:11:35 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>
Message-ID: <ZWGFFDPMM85196FXRFCBRMYUWQ3Z.3dc2fc17@riven>

I think you're right about the pop3proxy.  Outlook is done, let that be, and
let the proxy handle the rest.  That's what I'm going to try to do shortly
here.  I've got the Opera mailer on the Windoze platform... There's no doubt I
can make the proxy work just fine, but I'm not at all sure I can train the
classifier.  It seems like the training regimen requires spam in the file
system, and at least with the Opera mailer, it stuffs mail into a single file
with a proprietary format.  There is no export function... We'll see how that
pans out.

11/1/2002 4:05:48 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Skip Montanaro]
>> That's true, largely because that's what the focus of the initial phase of
>> the project was supposed to be.  Even if it gets no farther than it is
>> today, the process has been highly educational for me, because we have an
>> expert in algorithm design (that'd be Tim) exposing his thought processes
>> and mechanics for the rest of us.
>
>I'm glad you've found it amusing <wink>.  I'm afraid "think for a second,
>code for a minute, test for a day; repeat 6 times before you get a small
>win" is par for the course when trying to push any decent scheme beyond the
>80/20 rule (each additional 20% improvement requires 80% of all the effort
>that went before).
>
>> That said, I think the classification stuff has gone about as far as it's
>> going to go.
>
>Me too.  The classifier is hack-free now, as clean and uncompromising a
>realization of the underlying math as anything can be.  The assumption of
>word independence is a limitation of the approach, though.
>
>> Future changes to the tokenizer are also likely to be incremental, so
>> the major changes over the next while will be in email integration.
>
>Yup!  Thanks to Sean and especially Mark lately, the non-Windows platforms
>are a month behind on that too.  It's a curious thing about Windows:
>because it is closed-source, the Windows market is homogenous enough that
>one major effort there can make millions of happy campers.  I still hope
>that the pop3proxy can do that for non-Windows systems too, and that's the
>only advice I can offer:  find a way to use the proxy instead of pursuing
>"deep integration" with unbounded dozens of quirky twenty-user email
>clients.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
- Tim
www.fourstonesExpressions.com 


From jeremy@alum.mit.edu  Fri Nov  1 22:18:09 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 1 Nov 2002 17:18:09 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>
References: <15809.55308.1945.988931@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>
Message-ID: <15810.64929.812472.459643@slothrop.zope.com>

>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

  TP> I still hope that the pop3proxy can do that for non-Windows
  TP> systems too, and that's the only advice I can offer: find a way
  TP> to use the proxy instead of pursuing "deep integration" with
  TP> unbounded dozens of quirky twenty-user email clients.

The pop proxy is great for people who use pop, but lots of people
don't.  Even for people who use pop, the proxy doesn't help with
training at all.  So I'm afraid it's just a mess for non-Windows
users.

Jeremy


From pje@telecommunity.com  Fri Nov  1 22:37:08 2002
From: pje@telecommunity.com (Phillip J. Eby)
Date: Fri, 01 Nov 2002 17:37:08 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>
References: <15809.55308.1945.988931@montanaro.dyndns.org>
Message-ID: <5.1.1.6.0.20021101172008.020d8ec0@telecommunity.com>

At 05:05 PM 11/1/02 -0500, Tim Peters wrote:

>Yup!  Thanks to Sean and especially Mark lately, the non-Windows platforms
>are a month behind on that too.  It's a curious thing about Windows:
>because it is closed-source, the Windows market is homogenous enough that
>one major effort there can make millions of happy campers.  I still hope
>that the pop3proxy can do that for non-Windows systems too, and that's the
>only advice I can offer:  find a way to use the proxy instead of pursuing
>"deep integration" with unbounded dozens of quirky twenty-user email
>clients.

And perhaps the proxy could include a web GUI to handle its other UI 
requirements.

The proxy could keep a history of "recently received" messages, along with 
their ham/spam/unsure status.  It would only permit downloading of ham 
messages, keeping the rest to itself.  Periodically, it would drop in a 
"unsure and spam summary" message that included the list of unsures 
followed by the spams, with a link to the web training UI.

The UI would let you inspect messages and mark them as ham or spam, and 
also allow you to go back and mark false negatives as spam, doing the 
necessary unlearning or relearning as needed.  By default, it would train 
itself on all "sure" messages, ham or spam.
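
The delivery rule described above can be sketched as follows. The cutoff values, function names and message ids are placeholders for illustration, not settings taken from the project.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a 0.0-1.0 score to 'ham', 'unsure' or 'spam' (cutoffs are placeholders)."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def split_for_download(scored_messages):
    """Separate messages the client may download from those the proxy holds back."""
    deliver, held = [], []
    for msg_id, score in scored_messages:
        status = classify(score)
        (deliver if status == "ham" else held).append((msg_id, status))
    return deliver, held


deliver, held = split_for_download([("a", 0.01), ("b", 0.50), ("c", 0.99)])
print(deliver, held)
```

The `held` list is what a periodic "unsure and spam summary" message would enumerate.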

This approach would ensure that the "right" training procedure gets 
followed, while keeping spam from ever entering the mail client.  If the 
installation procedure set up a desktop icon (or local platform equivalent 
thereof) to launch the Web UI, and set up the POP-proxy/webserver to run 
continuously or start-on-demand, the result could be "easy enough" for most 
people.

I think the POPFile (http://popfile.sf.net/) people are taking a similar 
approach to their Bayesian filtering proxy, complete with step-by-step 
screenshots of how to configure Outlook, Eudora, and Outlook Express to use 
their proxy.

One nice thing about the proxy approach is that a company could easily 
offer this as a commercial service, that would let you set up your home and 
office mail clients to pick up from the proxy, so you wouldn't have to 
train your filter in more than one place.  Of course, for such a service to 
work, it'd probably have to support some kind of SMTP proxying as well, 
since so many SMTP servers require POP-before-SMTP.  Hm.


From tim.one@comcast.net  Fri Nov  1 22:48:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 17:48:12 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15810.64929.812472.459643@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECHCFAB.tim.one@comcast.net>

[Jeremy Hylton]
> The pop proxy is great for people who use pop, but lots of people
> don't.

Name 362.  Ha!

> Even for people who use pop, the proxy doesn't help with training at all.
> So I'm afraid it's just a mess for non-Windows users.

I don't know that means it can't be less of a mess, though.  For example, I
expect we could use a common Training class, which manages a database of
opaque message objects and takes care of things like calling appropriate
classifier methods at appropriate times, and remembering which messages have
been trained on as what.  Mark invented a bunch of code like that for the
Outlook client, but there's really nothing Outlook-specific about it apart
from all the Outlook-specific bits <wink>.  Those could be factored out,
though.
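
A minimal sketch of what such a shared Training class might look like follows. Everything here is hypothetical: the class and method names, the in-memory dict standing in for a database, and the assumption that the classifier exposes `learn(tokens, is_spam)` and `unlearn(tokens, is_spam)`.

```python
class Trainer:
    """Remembers how each message was trained and retrains when that changes."""

    def __init__(self, classifier, tokenize):
        self.classifier = classifier
        self.tokenize = tokenize
        self.trained_as = {}  # message id -> True (spam) or False (ham)

    def train(self, msg_id, msg, is_spam):
        previous = self.trained_as.get(msg_id)
        if previous == is_spam:
            return  # already trained this way; nothing to do
        if previous is not None:
            # Undo the earlier, opposite training before learning again.
            self.classifier.unlearn(self.tokenize(msg), previous)
        self.classifier.learn(self.tokenize(msg), is_spam)
        self.trained_as[msg_id] = is_spam


class RecordingClassifier:
    """Stand-in classifier that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def learn(self, tokens, is_spam):
        self.calls.append(("learn", is_spam))

    def unlearn(self, tokens, is_spam):
        self.calls.append(("unlearn", is_spam))


classifier = RecordingClassifier()
trainer = Trainer(classifier, str.split)
trainer.train("m1", "buy now", False)  # first trained as ham
trainer.train("m1", "buy now", False)  # repeated request is ignored
trainer.train("m1", "buy now", True)   # corrected to spam
print(classifier.calls)
```

The message itself stays opaque to the trainer; only the supplied `tokenize` callable looks inside it, which is what would let mail clients other than Outlook share the same bookkeeping.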

A budding system architect could have a lot of fun sorting this out.  John
Draper seemed to be threatening to at one point, but didn't get much
mindshare at the time.  It's time now!


From seant@iname.com  Fri Nov  1 22:55:00 2002
From: seant@iname.com (Sean True)
Date: Fri, 1 Nov 2002 17:55:00 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEBPCFAB.tim.one@comcast.net>
Message-ID: <MJEHLHJKGINLONDMMKNEAEEOHFAA.seant@iname.com>

> Yup!  Thanks to Sean and especially Mark lately, the non-Windows platforms
> are a month behind on that too.  It's a curious thing about Windows:
> because it is closed-source, the Windows market is homogenous enough that
> one major effort there can make millions of happy campers.  I still hope
> that the pop3proxy can do that for non-Windows systems too, and that's the
> only advice I can offer:  find a way to use the proxy instead of pursuing
> "deep integration" with unbounded dozens of quirky twenty-user email
> clients.
>
Mark, mostly. I just complain.

I'd like to second the pop3proxy architecture as the way forward. If it
weren't for the fact that virus scanners were already sometimes adding
both pop3 and smtp proxies to the mix on Windows, I would have pushed
(er, whined) in favor of that architecture even for Outlook. (There is
also the problem of how to configure proxies for the case of multiple
pop3 accounts.) You can mangle the host, user, and password together in
a horrible-looking user name, but it is really a pain.

Nonetheless, it has a relatively clean API, and the same architecture could
be used to proxy SMTP output traffic, catching messages to ham@wherever and
spam@wherever, in order to talk to the training
system. The same code then might run pretty happily on both the client and
the server, depending on the needs of the installation.
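
The recipient check at the heart of that SMTP idea can be sketched in a few lines. The function name is made up, and matching only on the local part (ignoring the domain) is an assumption for illustration.

```python
def training_label(recipient):
    """Return True for spam@..., False for ham@..., None for ordinary mail."""
    local_part = recipient.split("@", 1)[0].strip().lower()
    return {"spam": True, "ham": False}.get(local_part)


for rcpt in ("spam@wherever", "Ham@wherever", "bob@example.com"):
    print(rcpt, training_label(rcpt))
```

A proxy would pass anything labelled `None` through untouched and hand the rest to the training system instead of relaying it.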

-- Sean


From Tim@mail.powweb.com  Fri Nov  1 23:02:29 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Fri, 01 Nov 2002 17:02:29 -0600
Subject: [Spambayes] Email client integration -- what's needed?
Message-ID: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>

This proposal has a lot of attractions.  Forwarding to ham@ and spam@ would be
a bit of a pain at first, but it would work for existing bodies of mail.
Training would be MUCH simpler with this method, and would not require some
fancy-schmancy installation or configuration glorp.

Multiple pop account management is a requirement for sure.  I'd say most
people that use pop use more than one.

Non-pop users?  There might be < 362 on non-windoze platforms, but count
web-mail in, and the vast majority of them are non-pop users...  We're not
going to be able to help them directly, we'll need to do some server-side
enablement somehow.  Perhaps a spambayes web-mail system is called for... who
knows... not my itch.

- Tim

11/1/2002 4:55:00 PM, "Sean True" <seant@iname.com> wrote:

>> Yup!  Thanks to Sean and especially Mark lately, the non-Windows platforms
>> are a month behind on that too.  It's a curious thing about Windows:
>> because it is closed-source, the Windows market is homogenous enough that
>> one major effort there can make millions of happy campers.  I still hope
>> that the pop3proxy can do that for non-Windows systems too, and that's the
>> only advice I can offer:  find a way to use the proxy instead of pursuing
>> "deep integration" with unbounded dozens of quirky twenty-user email
>> clients.
>>
>Mark, mostly. I just complain.
>
>I'd like to second the pop3proxy architecture as the way forward. If it
>weren't for the fact that virus scanners are already sometimes adding
>both pop3 and smtp proxies to the mix on Windows,
>I would have pushed (er, whined) in favor of that architecture even for
>Outlook. (There is also the problem of how to configure proxies for the case
>of multiple pop3 accounts). You can mangle the host, user, and password
>together in a horrible looking user name, but it is really a pain.
>
>Nonetheless, it has a relatively clean API, and the same architecture could
>be used to proxy SMTP output traffic, catching messages to ham@wherever and
>spam@wherever, in order to talk to the training
>system. The same code then might run pretty happily on both the client and
>the server, depending on the needs of the installation.
>
>-- Sean
- Tim
www.fourstonesExpressions.com 


From piersh@friskit.com  Fri Nov  1 23:20:58 2002
From: piersh@friskit.com (Piers Haken)
Date: Fri, 1 Nov 2002 15:20:58 -0800
Subject: [Spambayes] Outlook plugin errors with Exchange
Message-ID: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com>

I'd like to report some problems I'm having with the Outlook plugin. I
hope this is the right place.

My setup is as follows:
- Windows XP SP1
- Outlook XP
- Exchange 2000
- python 2.2.2
- win32all-150
- spambayes CVS (current)

I have 3 message stores:
- my main inbox on the exchange server.
- a 'Personal Folders' (.pst) file on my local drive for Auto-Archived
mail.
- my 'Hotmail' inbox.

I realize that this config may well be untested/unsupported, especially
the fact that my inbox message store is on an Exchange server, but
hopefully this info can be of some use to someone...

Also, this is my first time using python, so I'm sorry if I'm missing
something really simple here.

1) It looks like the plugin is having problems hooking the folder events
for the exchange message store. When I use the 'Filter Rules' dialog to
select my exchange inbox, I get the following exception in PythonWin:

Traceback (most recent call last):
  File "C:\Python22\spam\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 156, in OnButDoSomething
    doer(self)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 305, in define_filter
    dlg.mgr.addin.FiltersChanged()
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 281, in FiltersChanged
    self.UpdateFolderHooks()
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 289, in UpdateFolderHooks
    FolderItemsEvent)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 309, in _HookFolderEvents
    folder = msgstore_folder.GetOutlookItem()
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 250, in GetOutlookItem
    return self.msgstore.outlook.Session.GetFolderFromID(hex_item_id, hex_store_id)
  File "C:\Python22\lib\site-packages\win32com\gen_py\00062FFF-0000-0000-C000-000000000046x0x9x1\_NameSpace.py", line 48, in GetFolderFromID
    ret = self._oleobj_.InvokeTypes(8456, LCID, 1, (9, 0), ((8, 1), (12, 17)),EntryIDFolder, EntryIDStore)
pywintypes.com_error: (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', 'The operation failed.', None, 0, -2147221241), None)
win32ui: Error in Command Message handler for command ID 1029, Code 0

The error '-2147221241' is defined as:
	CDONTS.h:       CdoE_INVALID_ENTRYID    = 0x80040107,
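
As a quick sanity check of that mapping, the signed 32-bit number that pywintypes reports can be turned into the unsigned hex form used in the header file. This is only an illustrative sketch; the helper name hresult_to_hex is made up here and is not part of win32all or spambayes.

```python
def hresult_to_hex(code):
    """Render a signed 32-bit HRESULT as the usual unsigned hex string."""
    # Masking with 0xFFFFFFFF reinterprets the negative value as unsigned.
    return "0x%08X" % (code & 0xFFFFFFFF)


# The inner error code from the traceback above.
print(hresult_to_hex(-2147221241))
```

The same helper can be applied to the outer code in the com_error tuple to see which generic COM error wraps the MAPI one.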

I am able to select the inboxes in my hotmail and .pst messages stores
without getting an exception thrown.

2) okay, so I have events hooked on my hotmail and .pst stores. If I
move an unread message into _either_ of these folders, I get the
following exception:

pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 123, in OnItemAdd
    msgstore_message = self.manager.message_store.GetMessage(item.EntryID)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 211, in GetMessage
    mapi_object = self._OpenEntry(message_id)
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 152, in _OpenEntry
    return store.OpenEntry(item_id, iid, flags)
pywintypes.com_error: (-2147221241, 'OLE error 0x80040107', None, None)

I don't think this problem is exchange-related since it happens even if
I completely remove my exchange account from my outlook settings.

3) messages sent from one exchange account to another (ie, never going
over SMTP) have no headers. This may be a problem since the parser can
never infer the sender or any other metadata about the message. It might
be useful to have a special tag that says that the message has no
headers, since such email is very probably ham. Alternatively, some SMTP
headers could be faked up from the various MAPI properties.

4) for some reason, my outlook is prefixing the headers of SMTP mail
with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and
this is causing every SMTP message to throw an exception during parsing
(for example, when doing a 'show clues'):
Traceback (most recent call last):
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 275, in _Invoke_
    return self._invoke_(dispid, lcid, wFlags, args)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 280, in _invoke_
    return S_OK, -1, self._invokeex_(dispid, lcid, wFlags, args, None, None)
  File "C:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 101, in OnClick
    self.handler(*self.args)
  File "C:\Python22\spam\spambayes\Outlook2000\addin.py", line 192, in ShowClues
    score, clues = mgr.score(msgstore_message, evidence=True, scale=False)
  File "C:\Python22\spam\spambayes\Outlook2000\manager.py", line 258, in score
    email = msg.GetEmailPackageObject()
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 362, in GetEmailPackageObject
    msg = email.message_from_string(text)
  File "C:\Python22\spam\spambayes\email\__init__.py", line 39, in message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 46, in parse
    self._parseheaders(root, fp)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in _parseheaders
    raise Errors.HeaderParseError(
email.Errors.HeaderParseError: Not a header, not a continuation: ``Microsoft Mail Internet Headers Version 2.0''

This string is also shown in the 'options' dialog for the message (on
both OutlookXP and Outlook2K) so I think it's something that exchange
server adds to the message, ugh. Here's a patch that fixes this for me
and at least allows me to train on a full set of messages:

Index: email/Parser.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/email/Parser.py,v
retrieving revision 1.1.1.1
diff -u -r1.1.1.1 Parser.py
--- email/Parser.py	23 Sep 2002 13:18:55 -0000	1.1.1.1
+++ email/Parser.py	1 Nov 2002 22:17:34 -0000
@@ -101,6 +101,8 @@
                 elif lineno == 1 and line.startswith('--'):
                     # allow through duplicate boundary tags.
                     continue
+                elif lineno == 1 and line.startswith('Microsoft Mail Internet Headers Version '):
+                    continue
                 else:
                     raise Errors.HeaderParseError(
                         "Not a header, not a continuation: ``%s''"%line)


I'd like to get to the bottom of the event hooking problems so I can
actually have this stuff working live on my incoming spam. If anyone has
any hints on how to proceed, I'd be more than glad to hear them.

Thanks.
Piers.


From jeremy@alum.mit.edu  Fri Nov  1 23:18:51 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 1 Nov 2002 18:18:51 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMECHCFAB.tim.one@comcast.net>
References: <15810.64929.812472.459643@slothrop.zope.com>
	<LNBBLJKPBEHFEDALKOLCMECHCFAB.tim.one@comcast.net>
Message-ID: <15811.3035.967754.435766@slothrop.zope.com>

>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

  TP> [Jeremy Hylton]
  >> The pop proxy is great for people who use pop, but lots of people
  >> don't.

  TP> Name 362.  Ha!

Guido and at least 361 other people <wink>.

  >> Even for people who use pop, the proxy doesn't help with training
  >> at all.  So I'm afraid it's just a mess for non-Windows users.

  TP> I don't know that means it can't be less of a mess, though.  For
  TP> example, I expect we could use a common Training class, which
  TP> manages a database of opaque message objects and takes care of
  TP> things like calling appropriate classifier methods at
  TP> appropriate times, and remembering which messages have been
  TP> trained on as what.  Mark invented a bunch of code like that for
  TP> the Outlook client, but there's really nothing Outlook-specific
  TP> about it apart from the all Outlook-specific bits <wink>.  Those
  TP> could be factored out, though.

I should look at integrating Mark's code and my own training system
based on VM folders.  See what common code falls out.

Jeremy


From vanhorn@whidbey.com  Fri Nov  1 23:35:23 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Fri, 01 Nov 2002 15:35:23 -0800
Subject: [Spambayes] Email client integration -- what's needed?
References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
Message-ID: <3DC30FBB.7D203CF5@whidbey.com>

Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote:

> This proposal has a lot of attractions.  Forwarding to ham@ and spam@ would be a bit of a pain at first, but it would work for existing bodies of mail.  Training
> would be MUCH simpler with this method, and would not require some fancy-schmancy installation or configuration glorp.

In my desired configuration, as a MailScanner plugin like SpamAssassin, I had a thought that I think would work. My larger clients not only use my system for mail,
but they also have internal discussion lists. Since users are about a hundred times more likely to report spam than they are to send letters from their girlfriends
to ham@, joining ham@ to the discussion lists would be a good source of training material with industry-specific language.

It's a particular issue because there is so much spam related to mortgage financing, and most of my users are realtors or loan officers, so dictionary filters are
risky. And they tend to use a lot of exclamation marks.

Van

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From mhammond@skippinet.com.au  Sat Nov  2 00:27:12 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 11:27:12 +1100
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15811.3035.967754.435766@slothrop.zope.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEELFHHAA.mhammond@skippinet.com.au>

>   TP> trained on as what.  Mark invented a bunch of code like that for
>   TP> the Outlook client, but there's really nothing Outlook-specific
>   TP> about it apart from the all Outlook-specific bits <wink>.  Those
>   TP> could be factored out, though.
>
> I should look at integrating Mark's code and my own training system
> based on VM folders.  See what common code falls out.

I tend to filter the Python zen thusly:

% python -c "import this" | grep purity
Although practicality beats purity.

However, I have tried to think a little about what a generic system would
look like.  For example, I tried to create a generic "message" object
family:

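class MsgStore:
    def Close(self):
        raise NotImplementedError
    def GetFolderGenerator(self, folder_ids, include_sub):
        raise NotImplementedError
    def GetFolder(self, folder_id):
        raise NotImplementedError
    def GetMessage(self, message_id):
        raise NotImplementedError

class MsgStoreFolder:
    def GetMessageGenerator(self, folder):
        raise NotImplementedError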

class MsgStoreMsg:
    def GetEmailPackageObject(self, strip_mime_headers=True):
        # Return a "read-only" Python email package object
        # "read-only" in that changes will never be reflected to the real
        # store.
        raise NotImplementedError
    def SetField(self, name, value):
        # Abstractly set a user field name/id to a field value.
        # User field is for the user to see - status/internal fields
        # should get their own methods
        raise NotImplementedError
    def GetField(self, name):
        # Abstractly get the value of a user field name/id.
        raise NotImplementedError
    def Save(self):
        # Save changes after field changes.
        raise NotImplementedError
    def MoveTo(self, folder_id):
        # Move the message to a folder.
        raise NotImplementedError
    def CopyTo(self, folder_id):
        # Copy the message to a folder.
        raise NotImplementedError

The essence of our training code is then:

def train_folder(f, isspam, mgr, progress):
    # fancy progress reporting code omitted
    for message in f.GetMessageGenerator():
        train_message(message, isspam, mgr)

def train_message(msg, is_spam, mgr):
    # Train an individual message.
    # Returns True if newly added (message will be correctly
    # untrained if it was in the wrong category), False if already
    # in the correct category.  Catch your own damn exceptions.
    from tokenizer import tokenize
    stream = msg.GetEmailPackageObject()
    tokens = tokenize(stream)
    # Handle the case where we may have already been trained.
    was_spam = mgr.message_db.get(msg.searchkey)
    if was_spam is None:
        # never previously trained.
        pass
    elif was_spam == is_spam:
        # Already in DB - do nothing (full retrain will wipe msg db)
        # leave now.
        return False
    else:
        mgr.bayes.unlearn(tokens, was_spam, False)
    # OK - setup the new data.
    mgr.bayes.learn(tokens, is_spam, False)
    mgr.message_db[msg.searchkey] = is_spam
    mgr.bayes_dirty = True
    return True

As Tim says, not much Outlook specific here (some - eg, "msg.searchkey" -
but nothing too painful)
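
To make the retraining logic above concrete without any Outlook pieces, here is a minimal self-contained sketch. StubClassifier, train and the message key "msg-1" are all hypothetical stand-ins invented for this illustration; the real classifier and message database behave differently in detail (for one, the real learn/unlearn take a third argument).

```python
class StubClassifier:
    """Hypothetical stand-in for the real classifier; it only counts tokens."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for token in tokens:
            counts[token] = counts.get(token, 0) + 1

    def unlearn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for token in tokens:
            counts[token] -= 1
            if counts[token] == 0:
                del counts[token]


def train(classifier, message_db, key, tokens, is_spam):
    """Train one message, undoing earlier training in the other category."""
    tokens = list(tokens)
    was_spam = message_db.get(key)
    if was_spam == is_spam:
        return False  # already trained in this category, nothing to do
    if was_spam is not None:
        classifier.unlearn(tokens, was_spam)  # it was in the wrong category
    classifier.learn(tokens, is_spam)
    message_db[key] = is_spam
    return True


clf, db = StubClassifier(), {}
train(clf, db, "msg-1", ["cheap", "loans"], True)   # first trained as spam
train(clf, db, "msg-1", ["cheap", "loans"], False)  # later corrected to ham
repeat = train(clf, db, "msg-1", ["cheap", "loans"], False)
```

The point of keeping the per-message record is that a correction first removes the old counts, so a message never contributes to both categories at once.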

Mark.


From jeremy@alum.mit.edu  Sat Nov  2 00:29:43 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 1 Nov 2002 19:29:43 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEELFHHAA.mhammond@skippinet.com.au>
References: <15811.3035.967754.435766@slothrop.zope.com>
	<LCEPIIGDJPKCOIHOBJEPEELFHHAA.mhammond@skippinet.com.au>
Message-ID: <15811.7287.50962.651569@slothrop.zope.com>

>>>>> "MH" == Mark Hammond <mhammond@skippinet.com.au> writes:

  MH> I tend to filter the Python zen thusly:

  MH> % python -c "import this" | grep purity
  MH> Although practicality beats purity.

:-).

  MH> However, I have tried to think a little about what a generic
  MH> system would look like.  For example, I tried to create a
  MH> generic "message" object family:

  MH> class MsgStoreMsg:
  MH>     def GetEmailPackageObject(self, strip_mime_headers=True):
  MH>         # Return a "read-only" Python email package object
  MH>         # "read-only" in that changes will never be reflected to
  MH>         # the real store.
  MH>         raise NotImplementedError
  MH>     def SetField(self, name, value):
  MH>         # Abstractly set a user field name/id to a field value.
  MH>         # User field is for the user to see - status/internal
  MH>         # fields should get their own methods
  MH>         raise NotImplementedError
  MH>     def GetField(self, name):
  MH>         # Abstractly get a user field name/id to a field value.
  MH>         raise NotImplementedError
  MH>     def Save(self):
  MH>         # Save changes after field changes.
  MH>         raise NotImplementedError
  MH>     def MoveTo(self, folder_id):
  MH>         # Move the message to a folder.
  MH>         raise NotImplementedError
  MH>     def CopyTo(self, folder_id):
  MH>         # Copy the message to a folder.
  MH>         raise NotImplementedError

This part of the code doesn't work that well for my mail folders.  The
code to move messages from folder to folder needs to be written in
elisp.  I'm not sure how important that is.

The training code looks simple enough.  My version is:

    def update(self):
        """Update classifier from current folder contents."""
        changed1 = self._update(self.hams, False)
        changed2 = self._update(self.spams, True)
        if changed1 or changed2:
            self.classifier.update_probabilities()
        get_transaction().commit()

    def _update(self, folders, is_spam):
        changed = False
        for f in folders:
            added, removed = f.read()
            get_transaction().commit()
            if not (added or removed):
                continue
            changed = True

            # It's important not to commit a transaction until
            # after update_probabilities is called in update().
            # Otherwise some new entries will cause scoring to fail.
            for msg in added.keys():
                self.classifier.learn(tokenize(msg), is_spam, False)
            del added
            get_transaction().commit(1)
            for msg in removed.keys():
                self.classifier.unlearn(tokenize(msg), is_spam, False)
            del removed
            get_transaction().commit(1)
        return changed

The read method scans a folder and returns two sets of messages --
those added and removed since the last time it was read.
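
For readers without the VM folder code at hand, the added/removed bookkeeping can be sketched roughly as follows. FolderTracker and its use of plain message-id sets are assumptions made for this example; the actual read method returns mappings (note the .keys() calls above) and works on real folders.

```python
class FolderTracker:
    """Hypothetical helper that remembers which message ids a folder held."""

    def __init__(self):
        self._seen = set()

    def read(self, current_ids):
        """Return (added, removed) relative to the previous call."""
        current = set(current_ids)
        added = current - self._seen
        removed = self._seen - current
        self._seen = current
        return added, removed


tracker = FolderTracker()
first = tracker.read(["a", "b"])   # everything is new on the first scan
second = tracker.read(["b", "c"])  # "c" arrived, "a" was moved away
```

Messages in the added set would be learned and those in the removed set unlearned, which is what the _update loop above does per folder.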

Jeremy


From mhammond@skippinet.com.au  Sat Nov  2 00:42:38 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 2 Nov 2002 11:42:38 +1100
Subject: [Spambayes] Outlook plugin errors with Exchange
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIELGHHAA.mhammond@skippinet.com.au>

> I'd like to report some problems I'm having with the Outlook plugin. I
> hope this is the right place.

Me too <wink>.  I think it will be until we annoy everyone else
sufficiently!

> - spambayes CVS (current)

I'm sure it was at the time <wink>

> 1) It looks like the plugin is having problems hooking the folder events
> for the exchange message store. When I use the 'Filter Rules' dialog to
> select my exchange inbox, I get the following exception in PythonWin:

Hmmm.  This is a little strange.  I would not be surprised to find Outlook
can't hook Exchange folder events, but this is failing before that - it is
failing just getting a regular Outlook "MAPIFolder" object for that folder.

We may be able to get a little further offline.

> 2) okay, so I have events hooked on my hotmail and .pst stores. If I
> move an unread message into _either_ of these folders, I get the
> following exception:

This one is now fixed in CVS.

> 3) messages sent from one exchange account to another (ie, never going
> over SMTP) have no headers. This may be a problem since the parser can
> never infer the sender or any other metadata about the message. It might
> be useful to have a special tag that says that the message has no
> headers, since such email is very probably ham. Alternatively, some SMTP
> headers could be faked up from the various MAPI properties.

This is true, and known.  We could synthesize some headers in this case.
However, no one else involved on the project has this problem, so
contributions welcome.  You did read the "about" text, right? <wink>

> 4) for some reason, my outlook is prefixing the headers of SMTP mail
> with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and

Probably because this mail is coming in via the exchange mail gateway,
rather than directly fetched by the Outlook client's internet mail
capabilities.

> This string is also shown in the 'options' dialog for the message (on
> both OutlookXP and Outlook2K) so I think it's something that exchange
> server adds to the message, ugh. Here's a patch that fixes this for me
> and at least allows me to train on a full set of messages:
>
> Index: email/Parser.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/email/Parser.py,v
> retrieving revision 1.1.1.1
> diff -u -r1.1.1.1 Parser.py
> --- email/Parser.py	23 Sep 2002 13:18:55 -0000	1.1.1.1
> +++ email/Parser.py	1 Nov 2002 22:17:34 -0000
> @@ -101,6 +101,8 @@
>                  elif lineno == 1 and line.startswith('--'):
>                      # allow through duplicate boundary tags.
>                      continue
> +                elif lineno == 1 and line.startswith('Microsoft Mail
> Internet Headers Version '):
> +                    continue
>                  else:
>                      raise Errors.HeaderParseError(
>                          "Not a header, not a continuation:
> ``%s''"%line)

I wonder if there is another property on the message that holds the prefix?
I might send you some scripts to try <wink>

Mark.


From Tim@mail.powweb.com  Sat Nov  2 00:46:33 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Fri, 01 Nov 2002 18:46:33 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <3DC30FBB.7D203CF5@whidbey.com>
Message-ID: <DCKJ82PB9C0B0EBA0MILBAE9YS5YT.3dc32069@riven>

Ok, so what's rattling around in my head is a set of two proxies: a pop3 proxy and an smtp proxy.  The pop3 proxy, running either locally or on the mail server
machine, is responsible for classification of email, and delivery as appropriate (tbd).  The smtp proxy, again running either locally or on the mail server, is
responsible for training.  Mail sent to spam@ or ham@ is used by the proxy as training data, and isn't actually sent onward.  The proxies would simply have to be
configurable for what port to listen on, *and* what port to send on.  This configurability also handles the case where there are multiple proxies running on the
same system.  For instance, I already have an SMTP proxy running that I probably can't live without.  The Spambayes proxy would have to listen on a new
port and send to the port that my current proxy is listening on...
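
A rough sketch of the routing decision such an smtp proxy would make, leaving out all the actual networking. ProxyConfig, route, the field names and the port numbers are hypothetical choices for illustration only; nothing like this exists in the code base yet.

```python
from dataclasses import dataclass


@dataclass
class ProxyConfig:
    """Hypothetical settings: where to listen and where to pass mail on."""
    listen_port: int
    forward_port: int
    ham_local: str = "ham"
    spam_local: str = "spam"


def route(config, recipient):
    """Decide what to do with a message addressed to `recipient`."""
    local_part = recipient.split("@", 1)[0].lower()
    if local_part == config.spam_local:
        return "train-spam"   # swallow the message and train on it as spam
    if local_part == config.ham_local:
        return "train-ham"    # swallow the message and train on it as ham
    return "forward"          # ordinary mail goes on to forward_port


# Example: listen on a spare port and chain to an existing proxy on port 25.
config = ProxyConfig(listen_port=8025, forward_port=25)
```

Chaining several proxies is then just a matter of pointing each one's forward_port at the next one's listen_port.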

11/1/2002 5:35:23 PM, "G. Armour Van Horn" <vanhorn@whidbey.com> wrote:

>Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions wrote:
>
>> This proposal has a lot of attractions.  Forwarding to ham@ and spam@ would be a bit of a pain at first, but it would work for existing bodies of mail.  Training
>> would be MUCH simpler with this method, and would not require some fancy-schmancy installation or configuration glorp.
>
>In my desired configuration, as a MailScanner plugin like SpamAssassin, I had a thought that I think would work. My larger clients not only use my system for mail,
>but they also have internal discussion lists. Since users are about a hundred times more likely to report spam than they are to send letters from their girlfriends
>to ham@, joining ham@ to the discussion lists would be a good source of training material with industry-specific language.
>
>It's a particular issue because there is so much spam related to mortgage financing, and most of my users are realtors or loan officers, so dictionary filters are
>risky. And they tend to use a lot of exclamation marks.
>
>Van
- Tim
www.fourstonesExpressions.com 


From anthony@interlink.com.au  Sat Nov  2 04:25:07 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Sat, 02 Nov 2002 15:25:07 +1100
Subject: [Spambayes] 'sender' and 'reply-to' tokenising. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEBOCFAB.tim.one@comcast.net> 
Message-ID: <200211020425.gA24P7R19149@localhost.localdomain>


>>> Tim, smacking down my naive attempts at analysing test data:
> I'm tempted to drop them!  mean/sdev were useful under schemes with real
> systematic overlap between the population scores, but chi-combining is so
> extreme that overlaps simply aren't due to random effects.

So we're back with the problem we had with the Graham method, that
it's really really hard to analyse tokenizer changes because of the
lack of meaningful test data? Is it worth trying the tests with 
gary-combining to see if the tokenizer changes actually make things
better or worse? 

I don't think we're going to see any "easy big wins" from the 
tokenizer - but trying to figure out whether incremental changes
are positive or negative seems like it's going to be hard if
we can only use fp/fn numbers.

Anthony, confused.
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From tim.one@comcast.net  Sat Nov  2 04:36:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:36:23 -0500
Subject: [Spambayes] Outlook plugin errors with Exchange
In-Reply-To: <9891913C5BFE87429D71E37F08210CB92974FC@zeus.sfhq.friskit.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEEHCFAB.tim.one@comcast.net>

[Piers Haken]

Thank you for the excellent report!

> ...
> I realize that this config may well be untested/unsupported, especially
> the fact that my inbox message store is on an Exchange server, but
> hopefully this info can be of some use to someone...

There's no intention *not* to support Exchange server, but I've never been
near one and I'm not sure anyone else here is near one either.  Someone with
access to that will have to deal with it.  You're elected.

> Also, this is my first time using python, so I'm sorry if I'm missing
> something really simple here.

No, you did a great job of faking it <wink>.

> ...
> 3) messages sent from one exchange account to another (ie, never going
> over SMTP) have no headers. This may be a problem since the parser can
> never infer the sender or any other metadata about the message. It might
> be useful to have a special tag that says that the message has no
> headers, since such email is very probably ham. Alternatively, some SMTP
> headers could be faked up from the various MAPI properties.

By default, the tokenizer code ignores most header fields.  It would be good
to simulate a few, especially Subject and From.  Sticking something like
NOHEADERS in the synthesized Subject header would suffice to teach the
classifier that NOHEADERS-in-a-Subject-header is a strong ham clue, and
there's really no need to get fancier than that.
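
A minimal sketch of that idea, assuming the sender and subject have already been pulled out of the MAPI properties by some other code. The function name synthesize_headers and its arguments are hypothetical; only the standard email package calls are real.

```python
from email.message import Message


def synthesize_headers(sender, subject):
    """Build a small header block for a message that arrived with none."""
    msg = Message()
    msg["From"] = sender
    # The NOHEADERS marker becomes an ordinary Subject token for the
    # classifier to learn from.
    msg["Subject"] = "NOHEADERS " + subject
    return msg.as_string()


header_text = synthesize_headers("colleague@example.com", "quarterly report")
```

The resulting text can be glued in front of the message body before it is handed to the tokenizer.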

> 4) for some reason, my outlook is prefixing the headers of SMTP mail
> with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and
> this is causing every SMTP message to throw an exception during parsing
> (for example, when doing a 'show clues'):
...
>   File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in
> _parseheaders
>     raise Errors.HeaderParseError(
> email.Errors.HeaderParseError: Not a header, not a continuation:
> ``Microsoft Mail Internet Headers Version 2.0''

That would be an error!  The format of header lines is specified by a public
standard, and as the error msg said, that specific line is neither a valid
header line nor a valid continuation of a preceding header line.

> This string is also shown in the 'options' dialog for the message (on
> both OutlookXP and Outlook2K) so I think it's something that exchange
> server adds to the message, ugh.

Sounds very likely; I haven't seen this.

> Here's a patch that fixes this for me and at least allows me to
> train on a full set of messages:
>
> Index: email/Parser.py

The email pkg is a part of standard Python, and we (speaking as a Python
developer here) won't warp it to accept non-standard headers.  If it's
necessary to worm around this in the Outlook client, it should be easy to do
so by fiddling Outlook2000\msgstore.py's _GetMessageText().  For example,
this is untested but almost certainly close to working:

    if headers.startswith("Microsoft Mail"):
        headers = "X-MS-Mail-Gibberish: " + headers

It's enough just to check for the "Microsoft Mail" prefix, as the embedded
space alone makes it an invalid header line.  Stuffing a legitimate header
tag at the front should be enough to make the email pkg's parser happy
again.
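
Spelled out as a standalone sketch, with a made-up sample message, the workaround looks something like this. The helper name sanitize_headers is hypothetical, and as far as I know the parser in current Python releases is more forgiving than the one discussed here, so the example only shows the effect of the rewrite rather than reproducing the exception.

```python
import email


def sanitize_headers(headers):
    """Turn Exchange's banner line into a syntactically valid header."""
    if headers.startswith("Microsoft Mail"):
        headers = "X-MS-Mail-Gibberish: " + headers
    return headers


# Made-up sample text in the shape described in the report.
raw = (
    "Microsoft Mail Internet Headers Version 2.0\n"
    "From: someone@example.com\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
msg = email.message_from_string(sanitize_headers(raw))
print(msg["Subject"])
```

The banner survives as the value of the made-up X-MS-Mail-Gibberish header, so nothing is lost and the remaining header lines parse normally.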


From popiel@wolfskeep.com  Sat Nov  2 04:41:00 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 01 Nov 2002 20:41:00 -0800
Subject: [Spambayes] An alternate use
Message-ID: <20021102044100.7CC18F5AC@cashew.wolfskeep.com>

A couple things have been kicking around in my head, and they've
managed to come together in an interesting configuration and stick,
so I'm going to make a quiet little proposal and see how much
thunder it generates.


First off, the observations:

1. Based on recent reports, spambayes works better when given full
   data about everything that comes through, not just the mistakes.
   This is predicted by the theory, too.

2. spambayes is extremely sensitive to changes in the nature of
   ham, and is moderately likely to classify any new topics/venues
   as spam.

3. spambayes is still a techie toy (though perhaps not for much
   longer).  People with a little knowhow are going to have a
   much easier time training it than the average joe.

4. We want a large penetration into the mail-reading populace,
   to better force the spammers to change tactics.

5. Many people read mailing lists.  In fact, for high volume
   mail users, mailing lists probably make up the majority of
   their incoming mail (or at least their incoming ham).

6. A noticeable amount of spam gets relayed through mailing lists,
   and most personal filters are notoriously bad about passing
   it through because it comes from a whitelisted intermediary.

7. Most mailing lists keep archives of everything sent over the
   list.

8. Most mailing lists are single-topic, and anything off-topic
   is unwanted.


So, what I propose is that we specifically target mailing list
managers (mailman and ecartis being the two obvious first
targets) for spambayes integration.  I see two main modes for
this: just adding headers for the less intrusive, and actually
rejecting or forcing moderation for the heavily policed.

Training is easily accomplished by taking the list archives
as a ham corpus and one of the spam collections floating
around as a spam corpus.  Run the classifier over the training
data to kick out all the false positives and false negatives
for possible resorting, then retrain.  Only the list owner
has to be techie to do this, and list owners are more likely
to be techie than not (they set up a mailing list, after all).
Periodic retraining can be handled in the same way.

In the case of adding headers, we'll want to avoid collisions
with personal use of spambayes, too.  I suggest tagging the
X-Spambayes-Disposition header (or whatever we call it) with
some identifier for which classifier generated the rating,
so that multiple X-Spambayes-Disposition lines are distinguishable.
Something like:

  X-Spambayes-Disposition: Spam by spambayes@python.org
  X-Spambayes-Disposition: Unsure by pennmush@pennmush.org

Personal classifiers could leave off the 'by' section.

Heck, make it so that X-Spambayes-Disposition lines are turned
into words similar to the mailer lines, and then personal
classifiers can use the judgements of list classifiers as clues.
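
A rough sketch of that last idea, in Python.  The header name and the
"verdict by source" layout come from the proposal above; the token
format and the 'personal' label for lines without a 'by' part are
made up here purely for illustration.

```python
from email import message_from_string

# Header name as proposed above ("or whatever we call it").
HEADER = "X-Spambayes-Disposition"


def disposition_tokens(raw_message):
    """Yield one synthetic token per X-Spambayes-Disposition line."""
    msg = message_from_string(raw_message)
    for value in msg.get_all(HEADER, []):
        verdict, _, source = value.strip().partition(" by ")
        # Personal classifiers leave off the 'by' section; the label
        # 'personal' is a hypothetical placeholder for that case.
        source = source.strip().lower() or "personal"
        yield "disposition:%s:%s" % (verdict.strip().lower(), source)


sample = (
    "X-Spambayes-Disposition: Spam by spambayes@python.org\n"
    "X-Spambayes-Disposition: Unsure by pennmush@pennmush.org\n"
    "X-Spambayes-Disposition: Ham\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
tokens = list(disposition_tokens(sample))
print(tokens)
```

Keeping the source in the token lets a personal classifier learn
separately how much each list's judgement is worth.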


Doing this sort of integration into mailing list managers takes
advantage of some 'weaknesses' of spambayes, and could be of
great benefit to many people beyond just those with the
wherewithal to train and run the filter.

- Alex

From tim.one@comcast.net  Sat Nov  2 04:57:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 01 Nov 2002 23:57:08 -0500
Subject: [Spambayes] 'sender' and 'reply-to' tokenising.
In-Reply-To: <200211020425.gA24P7R19149@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEEJCFAB.tim.one@comcast.net>

[Tim, praising Anthony's enthusiastic attempts at analysing test data]
> I'm tempted to drop them!  mean/sdev were useful under schemes with real
> systematic overlap between the population scores, but  chi-combining is
> so extreme that overlaps simply aren't due to random effects.

[Anthony Baxter]
> So we're back with the problem we had with the Graham method, that
> it's really really hard to analyse tokenizer changes because of the
> lack of meaningful test data?

The problem I had with Graham-combining is that the more and better the
training data you had, the more embarrassing its errors became:  the middle
ground kept getting smaller, and eventually everything scored as 0.0 or 1.0,
and whether right or wrong.  chi-combining reliably scores highly ambiguous
msgs near 0.5, and its middle ground is (a) very accurate about when it's
confused, and (b) doesn't degenerate as training data increases.
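
For readers who haven't looked at the combining step, here is a
minimal sketch of chi-squared combining as I understand the scheme.
It is an illustration, not the project's code: the function names are
hypothetical, and it assumes every per-word probability is strictly
between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Chance that a chi-squared variable with v (even) degrees of
    freedom exceeds x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum a hair above 1.0.
    return min(total, 1.0)


def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5  # no evidence either way
    ln_not_spam = sum(math.log(1.0 - p) for p in probs)
    ln_spam = sum(math.log(p) for p in probs)
    s = 1.0 - chi2q(-2.0 * ln_not_spam, 2 * n)  # near 1 for spammy clues
    h = 1.0 - chi2q(-2.0 * ln_spam, 2 * n)      # near 1 for hammy clues
    return (s - h + 1.0) / 2.0


print(chi_combine([0.99] * 10))               # strongly spammy clues
print(chi_combine([0.01] * 10))               # strongly hammy clues
print(chi_combine([0.99] * 5 + [0.01] * 5))   # conflicting clues
```

The third call is the interesting one: strong clues in both directions
push s and h toward 1 together, so the score lands in the middle
instead of at an extreme.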

> Is it worth trying the tests with gary-combining to see if the tokenizer
> changes actually make things better or worse?
>
> I don't think we're going to see any "easy big wins" from the
> tokenizer - but trying to figure out whether incremental changes
> are positive or negative seems like it's going to be hard if
> we can only use fp/fn numbers.

The FP/FN/unsure rates are the only numbers that matter in the end, and
under chi-combining it's *much* easier to stare at mistakes and find
commonalities.  Given a reasonable amount of training data, errors almost
never score at 0.0 or 1.0 under chi, which makes it plausible that tokenizer
changes can redeem them.  This requires more work but is more rewarding.
For example, it was easy to identify exactly what about tokenizing Reply-To
saved 3 FP in my python.org test, and that suggested a focused area for
further work.  Precisely because there are very likely no big wins
remaining, progress now has to come from thinking about mistakes, finding
cheap ways to avoid them, and then running tests to ensure that new gimmicks
don't hurt anything else.  As with the only good effect I found from
Reply-To in my python.org test, I expect most such gimmicks will boil down
to letting the classifier see more of the msg -- but not so much that highly
correlated words lead to extreme mistakes.

There's still a lot of header info we ignore by default, and we still ignore
almost everything in almost all HTML tags, and almost everything in almost
all non-text/* sections, so there's still plenty of room for small
improvements.  Looking for something that increases the mean spread by 0.1%
when the means are already 16 sdev apart is a waste of time now, though.
Looking for something that cuts an FP without hurting FN or unsure is
golden.

progress-is-harder-now-but-that's-a-sign-of-success-ly y'rs  - tim


From tim.one@comcast.net  Sat Nov  2 05:32:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 02 Nov 2002 00:32:58 -0500
Subject: [Spambayes] An alternate use
In-Reply-To: <20021102044100.7CC18F5AC@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>

[T. Alexander Popiel]
> A couple things have been kicking around in my head, and they've
> managed to come together in an interesting configuration and stick,
> so I'm going to make a quiet little proposal and see how much
> thunder it generates.
>
>
> First off, the observations:
>
> 1. Based on recent reports, spambayes works better when given full
>    data about everything that comes through, not just the mistakes.
>    This is predicted by the theory, too.

I'd say "representative data" more than "full data".  A random slice of real
life, consistently applied, should be enough.

> 2. spambayes is extremely sensitive to changes in the nature of
>    ham, and is moderately likely to classify any new topics/venues
>    as spam.

Almost certainly true for a classifier trained mostly by mistakes, ignoring
the correctly classified msgs.  The latter are needed to transform spamprobs
from serendipitous hapaxes into robust indicators.
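
A small illustration of why hapaxes are fragile, using a
Robinson-style adjustment that shrinks a word's raw spam ratio toward
a neutral prior when the word has been seen only a few times.  The
function name and the values s=0.45 and x=0.5 are assumptions for this
sketch, and equal-sized ham and spam corpora are assumed throughout.

```python
def adjusted_spamprob(raw_prob, n, s=0.45, x=0.5):
    """Shrink raw_prob toward the prior x; n is how many trained
    messages contained the word, s is the weight given to the prior."""
    return (s * x + n * raw_prob) / (s + n)


# A hapax: seen once, in spam only.  Already a fairly strong clue.
hapax = adjusted_spamprob(1.0, 1)
# The same word after a single ham sighting: back to neutral.
hapax_contradicted = adjusted_spamprob(0.5, 2)
# A word seen in 20 spam, then once in ham: barely moves.
robust = adjusted_spamprob(20.0 / 21.0, 21)

print(hapax, hapax_contradicted, robust)
```

One contrary message wipes out the hapax (roughly 0.84 down to 0.5),
while the well-attested word stays above 0.94.  Training on correctly
classified msgs is what supplies those extra sightings.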

In my own classifier, I trained on *no* msgs from the Spambayes list at
first.  I left them out on purpose.  Recall that I reported on what happened
after I had a pretty decent classifier and scored more than 1,000 backed-up
spambayes msgs:  they were almost all scored as ham, despite not training on
the topic at all.  I expect this is more rule than exception for a properly
trained classifier.

What it *is* extremely sensitive to is advertising you sign up for.  I've
been at this thru a full billing cycle now, and marketing msgs from vendors
I want to do business with still score as Unsure before training on several
msgs from a specific vendor.  Spam that uses the same words can keep
knocking them back into Unsure territory too.

> 3. spambayes is still a techie toy (though perhaps not for much
>    longer).  People with a little knowhow are going to have a
>    much easier time training it than the average joe.

Absolutely.

> 4. We want a large penetration into the mail-reading populace,
>    to better force the spammers to change tactics.

Heh.  It's still an irony of this project that I've never particularly
minded getting 100 spam per day <wink>.

> 5. Many people read mailing lists.  In fact, for high volume
>    mail users, mailing lists probably make the majority of
>    their incoming mail (or at least their incoming ham).

True here.

> 6. A noticeable amount of spam gets relayed through mailing lists,
>    and most personal filters are notoriously bad about passing
>    it through because it comes from a whitelisted intermediary.

Indeed, that's why I still ignore most of the header lines.  python.org and
Mailman put so many "I touched this!" clues in the headers, and do such a
good job of stopping spam already, that if I pay attention to those clues
then almost none of the spam they let pass gets caught.

> 6. Most mailing lists keep archives of everything sent over the
>    list.

Yup.

> 7. Most mailing lists are single-topic, and anything off-topic
>    is unwanted.

Eh -- probably.  I started with the mailing-list version of
comp.lang.python, and there's a huge amount of traffic there that never
mentions Python.  The variety of ham on that group is quite amazing.  But it
contains almost no advertising beyond conference announcements, and I still
expect that accounts for the breathtaking results I get on my c.l.py tests
(2 mistakes out of 34,000 msgs, where one "mistake" is saying that a quote
of a full Nigerian-scam spam is itself spam).

> So, what I propose is that we specifically target mailing list
> managers (mailman and ecartis being the two obvious first
> targets) for spambayes integration.  I see two main modes for
> this: just adding headers for the less intrusive, and actually
> rejecting or forcing moderation for the heavily policed.

That's actually what started this project:  Barry Warsaw is GNU Mailman's
author, and he asked me to look into adapting Graham's scheme for
incorporation into Mailman.  Barry has been pretty much missing in action
here since then, but I expect him to take it up again.

> Training is easily accomplished by taking the list archives
> as a ham corpus and one of the spam collections floating
> around as a spam corpus.

That's exactly what I did, and it was anything but easy.  Mixed-source
corpora create a world of problems, and Mailman archives in particular save
*all* the Mailman distortions introduced into the headers.  Even on the more
general "python.org email" test I've been doing behind the scenes lately,
the headers are polluted by judgments from SpamAssassin, and goofy little
things like python.org's MTA inventing Message-Id lines out of thin air when
one doesn't come across on the wire.  There are lots and lots of traps here.

> Run the classifier over the training data to kick out all the false
> positives and false negatives for possible resorting, then retrain.
> Only the list owner has to be techie to do this, and list owners are
> more likely to be techie than not (they set up a mailing list, after
> all).  Periodic retraining can be handled in the same way.
>
> In the case of adding headers, we'll want to avoid collisions
> with personal use of spambayes, too.  I suggest tagging the
> X-Spambayes-Disposition header (or whatever we call it) with
> some identifier for which classifier generated the rating,
> so that multiple X-Spambayes-Disposition lines are distinguishable.
> Something like:
>
>   X-Spambayes-Disposition: Spam by spambayes@python.org
>   X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
>
> Personal classifiers could leave off the 'by' section.
>
> Heck, make it so that X-Spambayes-Disposition lines are turned
> into words similar to the mailer lines, and then personal
> classifiers can use the judgements of list classifiers as clues.

Easy to spoof, and I'm sure spammers would pick up on that quickly.

> Doing this sort of integration into mailing list managers takes
> advantage of some 'weaknesses' of spambayes, and could be of
> great benefit to many people beyond just those with the
> wherewithal to train and run the filter.

That was Barry's idea, yes <wink>.  I'll leave it to him to resume this
battle.  One idea we kicked around was to add a

    If this looks like spam, click here:  http://yadda.yadda.yorg/abc?=etc

line at the bottom of each mailing-list msg.  An automated system on the
server would collect and organize votes.  There's no intention that users
get to vote on what *is* spam, the real point is more devious:  a msg that
*nobody* claims is spam almost certainly isn't spam, so it's really most
valuable as a way to identify ham.  That is, if nobody claims msg X is spam
within a few days, it's almost certainly the case that X is safe to add to
the ham training.  That seems so certain that it could be automated.  Msgs
that got "several" spam votes would be brought to the list admin's
attention, for human judgment about whether to classify them as errors.
Automating *that* part gets too close to censorship-by-vocal-minority for my
tastes, so if Barry implemented that part I'd kill him <wink>.
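
The triage rule described above could be sketched like this.  The
three-day quiet period and the three-vote threshold are guesses
standing in for "a few days" and "several", and the function and
constant names are hypothetical.

```python
from datetime import datetime, timedelta

QUIET_PERIOD = timedelta(days=3)  # stand-in for "a few days"
REVIEW_THRESHOLD = 3              # stand-in for "several" votes


def triage(posted_at, spam_votes, now):
    """Return 'admin-review', 'train-as-ham', or 'wait' for one msg."""
    if spam_votes >= REVIEW_THRESHOLD:
        # A human decides; votes alone never train anything as spam.
        return "admin-review"
    if spam_votes == 0 and now - posted_at >= QUIET_PERIOD:
        # Nobody complained within the quiet period: safe ham.
        return "train-as-ham"
    return "wait"


posted = datetime(2002, 11, 2, 0, 32)
print(triage(posted, 0, posted + timedelta(days=4)))
print(triage(posted, 5, posted + timedelta(hours=1)))
print(triage(posted, 1, posted + timedelta(days=4)))
```

Note there is deliberately no 'train-as-spam' outcome: the only
automated step is the ham one.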


From popiel@wolfskeep.com  Sat Nov  2 06:29:39 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 01 Nov 2002 22:29:39 -0800
Subject: [Spambayes] An alternate use 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net> 
Message-ID: <20021102062939.3CD89F5AC@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>>
>> 1. Based on recent reports, spambayes works better when given full
>>    data about everything that comes through, not just the mistakes.
>>    This is predicted by the theory, too.
>
>I'd say "representative data" more than "full data".  A random slice of real
>life, consistently applied, should be enough.

Granted.

>> 4. We want a large penetration into the mail-reading populace,
>>    to better force the spammers to change tactics.
>
>Heh.  It's still an irony of this project that I've never particularly
>minded getting 100 spam per day <wink>.

Whereas my disgust with getting 70 spam per day (out of about 100
messages total) is one of the major things that prompted me to
actually try Graham's algorithm. ;-)

>> So, what I propose is that we specifically target mailing list
>> managers (mailman and ecartis being the two obvious first
>> targets) for spambayes integration.  I see two main modes for
>> this: just adding headers for the less intrusive, and actually
>> rejecting or forcing moderation for the heavily policed.
>
>That's actually what started this project:  Barry Warsaw is GNU Mailman's
>author, and he asked me to look into adapting Graham's scheme for
>incorporation into Mailman.  Barry has been pretty much missing in action
>here since then, but I expect him to take it up again.

Heh.  Glad to hear I'm not the only one thinking like this.
I don't claim to have new ideas... recycled ideas are easier. ;-)

>> Training is easily accomplished by taking the list archives
>> as a ham corpus and one of the spam collections floating
>> around as a spam corpus.
>
>That's exactly what I did, and it was anything but easy.  Mixed-source
>corpora create a world of problems, and Mailman archives in particular save
>*all* the Mailman distortions introduced into the headers.

Blech.  You're right... I just forgot about the troubles you had.
Ecartis is similar with the tainting of the archives.

>> In the case of adding headers, we'll want to avoid collisions
>> with personal use of spambayes, too.  I suggest tagging the
>> X-Spambayes-Disposition header (or whatever we call it) with
>> some identifier for which classifier generated the rating,
>> so that multiple X-Spambayes-Disposition lines are distinguishable.
>> Something like:
>>
>>   X-Spambayes-Disposition: Spam by spambayes@python.org
>>   X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
>>
>> Personal classifiers could leave off the 'by' section.
>>
>> Heck, make it so that X-Spambayes-Disposition lines are turned
>> into words similar to the mailer lines, and then personal
>> classifiers can use the judgements of list classifiers as clues.
>
>Easy to spoof, and I'm sure spammers would pick up on that quickly.

Yes, it would be easy to spoof, unless compared with routing
information... but doing that sort of comparison is beyond
the sorting rule capabilities of something like Outlook (and
Outlook is sadly one of the best GUI tools in that arena).
I'm not even sure procmail is up to the task without help
from a custom program.

On the other hand, we could build the smarts for it into
spambayes itself, for use in the personal classifier figuring
out when to trust the apparent list classifier... perhaps
I'll look into routing analysis for my next algorithmic
experiment.

>One idea we kicked around was to add a
>
>    If this looks like spam, click here:  http://yadda.yadda.yorg/abc?=etc
>
>line at the bottom of each mailing-list msg.  An automated system on the
>server would collect and organize votes.  There's no intention that users
>get to vote on what *is* spam, the real point is more devious:  a msg that
>*nobody* claims is spam almost certainly isn't spam, so it's really most
>valuable as a way to identify ham.  That is, if nobody claims msg X is spam
>within a few days, it's almost certainly the case that X is safe to add to
>the ham training.  That seems so certain that it could be automated.  Msgs
>that got "several" spam votes would be brought to the list admin's
>attention, for human judgment about whether to classify them as errors.
>Automating *that* part gets too close to censorship-by-vocal-minority for my
>tastes, so if Barry implemented that part I'd kill him <wink>.

Interesting, as a ham indicator.  Way too corruptible as a spam
indicator, I agree.

- Alex

From piersh@friskit.com  Sat Nov  2 12:11:16 2002
From: piersh@friskit.com (Piers Haken)
Date: Sat, 2 Nov 2002 04:11:16 -0800
Subject: [Spambayes] Outlook plugin errors with Exchange
Message-ID: <9891913C5BFE87429D71E37F08210CB9297502@zeus.sfhq.friskit.com>

> -----Original Message-----
> From: Tim Peters [mailto:tim.one@comcast.net]
> Sent: Friday, November 01, 2002 8:36 PM
> To: Piers Haken
> Cc: spambayes@python.org
> Subject: RE: [Spambayes] Outlook plugin errors with Exchange
>
>
> [Piers Haken]
>
> Thank you for the excellent report!

My pleasure. Thanks to you and your team for a great tool. I've been
lurking on this list for a while since I saw it mentioned on slashdot
with the intention of someday writing an outlook or exchange plugin
based on the algorithms you have finessed. But thanks to Mark and co. I
don't have to ;-)

> > ...
> > I realize that this config may well be untested/unsupported,
> > especially the fact that my inbox message store is on an Exchange
> > server, but hopefully this info can be of some use to someone...
>
> There's no intention *not* to support Exchange server, but
> I've never been near one and I'm not sure anyone else here is
> near one either.  Someone with access to that will have to
> deal with it.  You're elected.

Heh, I already took it offline with Mark and he's made a few fixes
already and I've sent him a patch that should work around the problem
I'm seeing (that is, if it doesn't break everything else in the
process). If you want, I can send it to the list.

> > Also, this is my first time using python, so I'm sorry if I'm missing
> > something really simple here.
>
> No, you did a great job of faking it <wink>.

Beginner's luck, I assure you ;-)

> > ...
> > 3) messages sent from one exchange account to another (ie, never going
> > over SMTP) have no headers. This may be a problem since the parser can
> > never infer the sender or any other metadata about the message. It
> > might be useful to have a special tag that says that the message has
> > no headers, since such email is very probably ham. Alternatively, some
> > SMTP headers could be faked up from the various MAPI properties.
>
> By default, the tokenizer code ignores most header fields.
> It would be good to simulate a few, especially Subject and
> From.  Sticking something like NOHEADERS in the synthesized
> Subject header would suffice to teach the classifier that
> NOHEADERS-in-a-Subject-header is a strong ham clue, and
> there's really no need to get fancier than that.

My patch tries to fake up the subject, from, to && cc fields. However,
there's no easy way to get an smtp address from an X.400 address (there
may not even be one), so I just put the display names (eg, "Tim Peters")
in those cases. If they're not parsed then it doesn't really matter, I
guess the subject is the most important bit. I also added an
"X-Exchange-Message: true" header for these messages, so I guess people
can add that to their options if they want an extra ham bonus.
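
For anyone curious, the idea looks roughly like the sketch below.
This is not the actual patch: the dictionary keys stand in for
whatever MAPI properties are available and are purely hypothetical,
as is the function name.

```python
from email import message_from_string


def synthesize_headers(props):
    """Build header text for a message that never went over SMTP.

    props is a plain dict; its keys are hypothetical stand-ins for
    MAPI properties such as the sender's display name.
    """
    lines = []
    for header, key in (("From", "sender_name"), ("To", "display_to"),
                        ("Cc", "display_cc"), ("Subject", "subject")):
        value = props.get(key)
        if value:
            # Collapse whitespace so an embedded line break cannot
            # smuggle in an extra header line.
            lines.append("%s: %s" % (header, " ".join(value.split())))
    # Marker header for header-less Exchange mail, as described above.
    lines.append("X-Exchange-Message: true")
    return "\n".join(lines) + "\n\n"


text = synthesize_headers({"sender_name": "Tim Peters",
                           "subject": "Lunch?\r\nX-Evil: 1"})
msg = message_from_string(text + "body\n")
print(msg.items())
```

Display names are used as-is where no SMTP address exists, which is
good enough if those fields aren't tokenized anyway.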

> > 4) for some reason, my outlook is prefixing the headers of SMTP mail
> > with the string "Microsoft Mail Internet Headers Version 2.0\r\n", and
> > this is causing every SMTP message to throw an exception during
> > parsing (for example, when doing a 'show clues'):
> ...
> >   File "C:\Python22\spam\spambayes\email\Parser.py", line 107, in _parseheaders
> >     raise Errors.HeaderParseError(
> > email.Errors.HeaderParseError: Not a header, not a continuation:
> > ``Microsoft Mail Internet Headers Version 2.0''
>
> That would be an error!  The format of header lines is
> specified by a public standard, and as the error msg said,
> that specific line is neither a valid header line nor a valid
> continuation of a preceding header line.

Yeah, I'm not sure why they do this. It's not normally a problem because
these headers are never used by any SMTP transport once they've gone
through the MTA, but it's still a pain, and it'll probably change in the
future, breaking everything...

> > This string is also shown in the 'options' dialog for the message (on
> > both OutlookXP and Outlook2K) so I think it's something that exchange
> > server adds to the message, ugh.
>
> Sounds very likely; I haven't seen this.
>
> > Here's a patch that fixes this for me and at least allows me to train
> > on a full set of messages:
> >
> > Index: email/Parser.py
>
> The email pkg is a part of standard Python, and we (speaking
> as a Python developer here) won't warp it to accept
> non-standard headers.  If it's necessary to worm around this
> in the Outlook client, it should be easy to do so by fiddling
> Outlook2000\msgstore.py's _GetMessageText().  For example,
> this is untested but almost certainly close to working:
>
>     if headers.startswith("Microsoft Mail"):
>         headers = "X-MS-Mail-Gibberish: " + headers

So close to working, in fact, that it works, so that's what I did ;-)
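
As a standalone illustration of the workaround (a sketch, not the
msgstore.py change itself; the helper name and sample message are
invented), turning the banner line into a harmless header lets the
standard email package parse the rest normally:

```python
from email import message_from_string


def tame_exchange_headers(headers):
    """Turn the Exchange banner line into a well-formed header line."""
    if headers.startswith("Microsoft Mail"):
        headers = "X-MS-Mail-Gibberish: " + headers
    return headers


raw = ("Microsoft Mail Internet Headers Version 2.0\r\n"
       "From: someone@example.com\r\n"
       "Subject: hello\r\n"
       "\r\n")
msg = message_from_string(tame_exchange_headers(raw))
print(msg["X-MS-Mail-Gibberish"])
print(msg["Subject"])
```

The banner text survives as the value of the made-up header, so
nothing is thrown away.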


Great stuff guys. Thanks for killing my spam for me.
Piers.

From rob@hooft.net  Sat Nov  2 16:17:10 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 02 Nov 2002 17:17:10 +0100
Subject: [Spambayes] spambayes.org
Message-ID: <3DC3FA86.8090305@hooft.net>

I just reserved spambayes.org

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sat Nov  2 16:42:56 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 02 Nov 2002 17:42:56 +0100
Subject: [Spambayes] An alternate use
References: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
Message-ID: <3DC40090.5050109@hooft.net>

Tim Peters wrote:
> [T. Alexander Popiel]
> 
>>7. Most mailing lists are single-topic, and anything off-topic
>>   is unwanted.
> 
> 
> Eh -- probably.  I started with the mailing-list version of
> comp.lang.python, and there's a huge amount of traffic there that never
> mentions Python.  The variety of ham on that group is quite amazing.  But it
> contains almost no advertising beyond conference announcements, and I still
> expect that accounts for the breathtaking results I get on my c.l.py tests
> (2 mistakes out of 34,000 msgs, where one "mistake" is saying that a quote
> of a full Nigerian-scam spam is itself spam).
> 
> 
>>So, what I propose is that we specifically target mailing list
>>managers (mailman and ecartis being the two obvious first
>>targets) for spambayes integration.  I see two main modes for
>>this: just adding headers for the less intrusive, and actually
>>rejecting or forcing moderation for the heavily policed.
> 
> 
> That's actually what started this project:  Barry Warsaw is GNU Mailman's
> author, and he asked me to look into adapting Graham's scheme for
> incorporation into Mailman.  Barry has been pretty much missing in action
> here since then, but I expect him to take it up again.

So, we'd have to make mailing lists keep a spam-archive as well? Or do 
we deliver spambayes with a pre-cooked spam archive to get started with 
new mailing lists?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From Tim@mail.powweb.com  Sat Nov  2 16:47:21 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 10:47:21 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
Message-ID: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>

Ok, I've got the pop3proxy up and running on my machine.  Very simple to get running.  I don't have a trained database (the real challenge) at this point, and it's 
adding the x-hammie-disposition header with value of 'no'.  I presume that this means that the classifier thinks this is NOT ham?  So if there's no database, then it 
assumes everything is spam?  Or am I reading the meaning of the header backwards?

- Tim
www.fourstonesExpressions.com 


From richie@entrian.com  Sat Nov  2 18:16:05 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 18:16:05 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
References: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
Message-ID: <p558sukad4e6c567aguok61belr91p1l84@4ax.com>

Hi Tim,

> adding the x-hammie-disposition header with value of 'no'.

'No' means it thinks it's ham - the header means "Is it spam?"  At the
moment the header added by pop3proxy.py is always "Yes" or "No" - I'll add
the new "Unsure" value when I get the chance.

> I don't have a trained database (the real challenge) at this point

Use hammie.py to train it - the usage message should tell you everything
you need to know, except how to create the mbox files or directories of
email messages to feed into it.  Hopefully your email client will export
messages into one of those formats...

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sat Nov  2 18:31:08 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 18:31:08 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <NMPJGE83YSNHZXXJGC79795VSEALJDB.3dc41848@riven>
References: <p558sukad4e6c567aguok61belr91p1l84@4ax.com>
	<NMPJGE83YSNHZXXJGC79795VSEALJDB.3dc41848@riven>
Message-ID: <hc68suopu88obv82ma7h9qej2vqfu1kf6q@4ax.com>

Hi Tim,

> ... make the proxy listen on different ports.  I've modified the 
> code to do that, was a simple mod.  Do you want the mod?

Yes please!

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Sat Nov  2 17:57:39 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 02 Nov 2002 12:57:39 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> Ok, I've got the pop3proxy up and running on my machine.  Very
> simple to get running.

Good!  I haven't had time to try it yet, so I won't be much help, but I'm
glad it ran easily for you.

> I don't have a trained database (the real challenge)

The difficulty of bootstrapping a database is generally overstated, and
especially by those who haven't yet done it <wink>.  Train on everything you
get for a few days.  I predict you'll find it gets most things right after
just a dozen msgs of each kind.  But it will also make howling mistakes
until you've trained on much more than that.  Even so, don't take the
classifications too seriously at the start, and it should be very helpful
quickly.

> at this point, and it's adding the x-hammie-disposition header with
> value of 'no'.  I presume that this means that the classifier thinks
> this is NOT ham?

More accurately, that the score fell below the value of spam_cutoff you've
set, and if you didn't set one yet, the default value of

spam_cutoff: 0.90

The relevant code appears to be in pop3proxy BayesProxy.onRetr():

            prob = self.bayes.spamprob(tokenizer.tokenize(message))
            if prob > options.spam_cutoff:
                disposition = "Yes"
            else:
                disposition = "No "

> So if there's no database, then it assumes everything is spam?

There's always a database, but at the start it's empty.  If there are no
words in the database, that's not a special case to the code, the math
simply works out to give a score of 0.5 to every msg then (which makes
sense:  in the absence of any evidence at all, it has no reason to favor any
specific conclusion).  Whatever you set ham_cutoff and spam_cutoff to be,
0.5 should definitely be in your Unsure category.  However, it doesn't look
like pop3proxy is paying attention to ham_cutoff yet, nor is it currently
capable of generating an "I'm lost -- help me!" Unsure disposition.  Someone
needs to teach it about the middle ground.
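
A minimal sketch of what the three-way decision might look like.  The
0.90 spam cutoff is the default quoted above; the 0.20 ham cutoff, the
function name, and the handling of scores exactly on a cutoff are
assumptions for illustration only.

```python
SPAM_CUTOFF = 0.90  # default quoted above
HAM_CUTOFF = 0.20   # assumed value, for illustration only


def disposition(prob, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam score to the header value: Yes, No, or Unsure."""
    if prob > spam_cutoff:
        return "Yes"
    if prob < ham_cutoff:
        return "No"
    return "Unsure"  # the middle ground, including an untrained 0.5


for score in (0.95, 0.5, 0.05):
    print(score, disposition(score))
```

With any sane pair of cutoffs, the 0.5 score from an empty database
lands in Unsure.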

> Or am I reading the meaning of the header backwards?

No, you're reading it right.


From richie@entrian.com  Sat Nov  2 18:43:49 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 18:43:49 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
References: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
	<LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
Message-ID: <k578su4c8lsfvdvt67qui31593grs4ko5q@4ax.com>

Hi Tim,

> it doesn't look
> like pop3proxy is paying attention to ham_cutoff yet, nor is it currently
> capable of generating an "I'm lost -- help me!" Unsure disposition.  Someone
> needs to teach it about the middle ground.

I'll do this.

-- 
Richie Hindle
richie@entrian.com


From Tim@mail.powweb.com  Sat Nov  2 19:51:02 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 13:51:02 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
Message-ID: <ZV7473FB055174GCQPFEPJIJI2XXSHE.3dc42ca6@riven>

Ok, so Tim says I'm not reading it backwards, Richie says I am...  I think the x-hammie-disposition header should be ham|spam|unsure versus 'yes|no|unsure'.  
This is much clearer, not much chance for interpretive errors...  and furthermore, the header itself should be x-spambayes-disposition, because this says 
clearly where the header came from...  I can make that change, too, if the collective wills it... but if I'm gonna make many changes, it might be reasonable to 
bring me up to speed on the cvs checkin thing...

- Tim

11/2/2002 11:57:39 AM, Tim Peters <tim.one@comcast.net> wrote:

>[Tim@mail.powweb.com]
>> Ok, I've got the pop3proxy up and running on my machine.  Very
>> simple to get running.
>
>Good!  I haven't had time to try it yet, so I won't be much help, but I'm
>glad it ran easily for you.
>
>> I don't have a trained database (the real challenge)
>
>The difficulty of bootstrapping a database is generally overstated, and
>especially by those who haven't yet done it <wink>.  Train on everything you
>get for a few days.  I predict you'll find it gets most things right after
>just a dozen msgs of each kind.  But it will also make howling mistakes
>until you've trained on much more than that.  Even so, don't take the
>classifications too seriously at the start, and it should be very helpful
>quickly.
>
>> at this point, and it's adding the x-hammie-disposition header with
>> value of 'no'.  I presume that this means that the classifier thinks
>> this is NOT ham?
>
>More accurately, that the score fell below the value of spam_cutoff you've
>set, and if you didn't set one yet, the default value of
>
>spam_cutoff: 0.90
>
>The relevant code appears to be in pop3proxy BayesProxy.onRetr():
>
>            prob = self.bayes.spamprob(tokenizer.tokenize(message))
>            if prob > options.spam_cutoff:
>                disposition = "Yes"
>            else:
>                disposition = "No "
>
>> So if there's no database, then it assumes everything is spam?
>
>There's always a database, but at the start it's empty.  If there are no
>words in the database, that's not a special case to the code, the math
>simply works out to give a score of 0.5 to every msg then (which makes
>sense:  in the absence of any evidence at all, it has no reason to favor any
>specific conclusion).  Whatever you set ham_cutoff and spam_cutoff to be,
>0.5 should definitely be in your Unsure category.  However, it doesn't look
>like pop3proxy is paying attention to ham_cutoff yet, nor is it currently
>capable of generating an "I'm lost -- help me!" Unsure disposition.  Someone
>needs to teach it about the middle ground.
>
>> Or am I reading the meaning of the header backwards?
>
>No, you're reading it right.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From richie@entrian.com  Sat Nov  2 21:02:35 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 21:02:35 +0000
Subject: [Spambayes] 
 pop3proxy.py now supports 'Unsure' and can run on arbitrary ports
Message-ID: <k3e8su0kci04g1jb19s2u0tnnt2k0nkd9h@4ax.com>

Hi all,

pop3proxy.py now supports the 'Unsure' value for X-Hammie-Disposition.  If
you're using pop3proxy and filtering on 'No' values, you'll now get fewer
hits because some emails that used to be 'No' will be 'Unsure'.
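
As a rough illustration of the three-way decision (a sketch only, not the
actual pop3proxy code): the function name is hypothetical, the 0.90 value is
the spam_cutoff default quoted earlier in this thread, and the 0.20
ham_cutoff is an assumed value chosen purely for the example.

```python
# Hypothetical helper, not the real pop3proxy implementation.
# spam_cutoff=0.90 is the default quoted in the thread;
# ham_cutoff=0.20 is an assumption made for this sketch.
def disposition(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability to a three-way header value."""
    if prob > spam_cutoff:
        return "Yes"      # confidently spam
    if prob < ham_cutoff:
        return "No"       # confidently ham
    return "Unsure"       # the middle ground, including 0.5


# An untrained database scores every message 0.5, which lands in Unsure.
print(disposition(0.5))
```

With cutoffs like these, a filter that matches only 'No' stops catching
messages whose score sits between the two cutoffs.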

Also, it can now listen on the port of your choice (thanks to Tim Stone)
meaning you can run many proxies on the same machine (and also run it as
non-root on Unix systems).

Finally, it's less anal about correcting for the size of the added header -
it no longer adds trailing spaces to the header to make the message up to
the size reported by the LIST command.  If this breaks anything I'll eat my
head.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sat Nov  2 21:03:38 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 21:03:38 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <ZV7473FB055174GCQPFEPJIJI2XXSHE.3dc42ca6@riven>
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<ZV7473FB055174GCQPFEPJIJI2XXSHE.3dc42ca6@riven>
Message-ID: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>

Hi Tim,

> Ok, so Tim says I'm not reading it backwards, Richie says I am...

Some misunderstanding I think - the header means "Is it spam?"  But you're
right, 'Yes' / 'No' is less clear (unless we rename the header to something
that makes it clear) than 'Spam' / 'Ham'.  Have we collectively decided
that 'Ham' is the official word for non-Spam?  Someone pointed out a while
ago that it's a little impolite towards Hormel to imply that Spam is the
opposite of ham... though that might be a little hyper-sensitive.  Someone
else (I should check my references but I'm lazy 8-) suggested that our use
of the word 'Ham' is a useful USP.  I vote for keeping it.

I also agree that the header should have a new name - Hammie was the first
front-end to the spambayes project, and other front-ends have since
inherited the header, which is a bit daft (sorry Neale!).  I'd like to drop
the techie word 'disposition' as well - how about:

X-Spambayes-Judgement: Spam / Unsure / Ham
X-Spambayes-Is-Spam: Yes / Unsure / No
X-Spambayes-Looks-Like-Spam: Yes / Unsure / No

If we're going to change this, we should make sure we get it right first
(albeit second) time.  That includes deciding whether there are optional
extra details that can go into the header, or whether there's an optional
extra header to carry those details.  I think there *should* be optional
extra details, probably in a separate header - it's one of the cool things
about SpamAssassin.  I vote to drop all extra details from the main header,
then decide later on whether there should be an extra header.
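
To make the split concrete, here is a minimal sketch using Python's
standard email package.  It uses the first candidate name above purely as a
placeholder, and "X-Spambayes-Details" is a made-up name for the optional
extra header; neither is a decided project convention.

```python
from email import message_from_string


def tag_message(raw, verdict, details=None):
    """Add a verdict header, plus an optional separate details header."""
    msg = message_from_string(raw)
    # Main header carries the verdict only: Spam / Unsure / Ham.
    msg["X-Spambayes-Judgement"] = verdict
    if details is not None:
        # Hypothetical header name, used here only for illustration.
        msg["X-Spambayes-Details"] = details
    return msg.as_string()


print(tag_message("Subject: hi\n\nbody\n", "Unsure"))
```

Keeping the verdict in its own header means client-side filter rules can
match on one short, fixed value regardless of what the details look like.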

We ought to sort this out soon, because more and more people are starting
to use the software, and we're going to affect them all if/when we rename
headers.

-- 
Richie Hindle
richie@entrian.com


From Tim@mail.powweb.com  Sat Nov  2 21:26:57 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 15:26:57 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>
Message-ID: <8771YW62YTA8WTA5WSSPHCZUF0HCWVR.3dc44321@riven>

Richie,

>X-Spambayes-Judgement: Spam / Unsure / Ham
>X-Spambayes-Is-Spam: Yes / Unsure / No
>X-Spambayes-Looks-Like-Spam: Yes / Unsure / No

I vote for the first.  It contains the most information: This is a judgement, made by spambayes, that says this email is <_>.  This is all that should go in this 
header, unless there is something we can do to make the header less forgeable by spammers, which I doubt.

Further information in other headers might be very useful.  It certainly is in SpamAssassin.  What information spambayes might be able to share... the stats 
dudes probably have a better handle on that than me.

11/2/2002 3:03:38 PM, Richie Hindle <richie@entrian.com> wrote:

>Hi Tim,
>
>> Ok, so Tim says I'm not reading it backwards, Richie says I am...
>
>Some misunderstanding I think - the header means "Is it spam?"  But you're
>right, 'Yes' / 'No' is less clear (unless we rename the header to something
>that makes it clear) than 'Spam' / 'Ham'.  Have we collectively decided
>that 'Ham' is the official word for non-Spam?  Someone pointed out a while
>ago that it's a little impolite towards Hormel to imply that Spam is the
>opposite of ham... though that might be a little hyper-sensitive.  Someone
>else (I should check my references but I'm lazy 8-) suggested that our use
>of the word 'Ham' is a useful USP.  I vote for keeping it.
>
>I also agree that the header should have a new name - Hammie was the first
>front-end to the spambayes project, and other front-ends have since
>inherited the header, which is a bit daft (sorry Neale!).  I'd like to drop
>the techie word 'disposition' as well - how about:
>
>X-Spambayes-Judgement: Spam / Unsure / Ham
>X-Spambayes-Is-Spam: Yes / Unsure / No
>X-Spambayes-Looks-Like-Spam: Yes / Unsure / No
>
>If we're going to change this, we should make sure we get it right first
>(albeit second) time.  That includes deciding whether there are optional
>extra details that can go into the header, or whether there's an optional
>extra header to carry those details.  I think there *should* be optional
>extra details, probably in a separate header - it's one of the cool things
>about SpamAssassin.  I vote to drop all extra details from the main header,
>then decide later on whether there should be an extra header.
>
>We ought to sort this out soon, because more and more people are starting
>to use the software, and we're going to affect them all if/when we rename
>headers.
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
- Tim
www.fourstonesExpressions.com 


From Tim@mail.powweb.com  Sat Nov  2 21:36:53 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 15:36:53 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <ZV7473FB055174GCQPFEPJIJI2XXSHE.3dc42ca6@riven>
Message-ID: <YSQKTO06WVSNYUHGTNQM1TOM1UGFD9DC.3dc44575@riven>

>>The difficulty of bootstrapping a database is generally overstated, and
>>especially by those who haven't yet done it <wink>.  Train on everything you
>>get for a few days. 

The problem I have is that my windoze based opera mailer does not store mail in textual format in a separate file system artifact for each email.  There is a 
limited functionality for storing a particular mail in a text file, but I have to do that manually one at a time.  Once this is done, then neiltrain.py will work perfectly 
well, but that's still an enormous amount of work.  This will probably be a typical kind of problem.  I know netscape and mozilla do much the same thing.
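
For mailers whose single mail file happens to be in mbox format (an
assumption; it is not confirmed here for Opera), the standard mailbox module
can pull the messages apart without manual exporting.  A minimal sketch,
with a hypothetical function name:

```python
import mailbox


def iter_raw_messages(mbox_path):
    """Yield each message in an mbox file as a string.

    Assumes the file at mbox_path is in mbox format.
    """
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            yield msg.as_string()
    finally:
        box.close()
```

Each yielded string could then be written to its own file or handed
straight to a trainer.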

I'm going to try to figure out a better way to do it.  The idea of an smtp proxy that recognizes forwards to ham@ and spam@ is very attractive...

11/2/2002 1:51:02 PM, Tim Stone, Four Stones Expressions <tim@fourstonesExpressions.com> wrote:

>Ok, so Tim says I'm not reading it backwards, Richie says I am...  I think the x-hammie-disposition header should be ham|spam|unsure versus 'yes|no|unsure'.  
>This is much clearer, not much chance for interpretive errors...  and furthermore, the header itself should be x-spambayes-disposition, because this says 
>clearly where the header came from...  I can make that change, too, if the collective wills it... but if I'm gonna make many changes, it might be reasonable to 
>bring me up to speed on the cvs checkin thing...
>
>- Tim
>
>11/2/2002 11:57:39 AM, Tim Peters <tim.one@comcast.net> wrote:
>
>>[Tim@mail.powweb.com]
>>> Ok, I've got the pop3proxy up and running on my machine.  Very
>>> simple to get running.
>>
>>Good!  I haven't had time to try it yet, so I won't be much help, but I'm
>>glad it ran easily for you.
>>
>>> I don't have a trained database (the real challenge)
>>
>>The difficulty of bootstrapping a database is generally overstated, and
>>especially by those who haven't yet done it <wink>.  Train on everything you
>>get for a few days.  I predict you'll find it gets most things right after
>>just a dozen msgs of each kind.  But it will also make howling mistakes
>>until you've trained on much more than that.  Even so, don't take the
>>classifications too seriously at the start, and it should be very helpful
>>quickly.
>>
>>> at this point, and it's adding the x-hammie-disposition header with
>>> value of 'no'.  I presume that this means that the classifier thinks
>>> this is NOT ham?
>>
>>More accurately, that the score fell below the value of spam_cutoff you've
>>set, and if you didn't set one yet, the default value of
>>
>>spam_cutoff: 0.90
>>
>>The relevant code appears to be in pop3proxy BayesProxy.onRetr():
>>
>>            prob = self.bayes.spamprob(tokenizer.tokenize(message))
>>            if prob > options.spam_cutoff:
>>                disposition = "Yes"
>>            else:
>>                disposition = "No "
>>
>>> So if there's no database, then it assumes everything is spam?
>>
>>There's always a database, but at the start it's empty.  If there are no
>>words in the database, that's not a special case to the code, the math
>>simply works out to give a score of 0.5 to every msg then (which makes
>>sense:  in the absence of any evidence at all, it has no reason to favor any
>>specific conclusion).  Whatever you set ham_cutoff and spam_cutoff to be,
>>0.5 should definitely be in your Unsure category.  However, it doesn't look
>>like pop3proxy is paying attention to ham_cutoff yet, nor is it currently
>>capable of generating an "I'm lost -- help me!" Unsure disposition.  Someone
>>needs to teach it about the middle ground.
>>
>>> Or am I reading the meaning of the header backwards?
>>
>>No, you're reading it right.
>>
>>
>- Tim
>www.fourstonesExpressions.com 
>
>
>
- Tim
www.fourstonesExpressions.com 


From vanhorn@whidbey.com  Sat Nov  2 21:46:19 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Sat, 02 Nov 2002 13:46:19 -0800
Subject: [Spambayes] x-hammie-disposition in pop3proxy
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>
Message-ID: <3DC447AB.D64D6CE6@whidbey.com>

Richie Hindle wrote:

> X-Spambayes-Judgement: Spam / Unsure / Ham
> X-Spambayes-Is-Spam: Yes / Unsure / No
> X-Spambayes-Looks-Like-Spam: Yes / Unsure / No

I know we have a long tradition of spelling errors behind us, such as dropping
an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please,
only one "E" in "judgment."

Van

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From guido@python.org  Sat Nov  2 22:41:59 2002
From: guido@python.org (Guido van Rossum)
Date: Sat, 02 Nov 2002 17:41:59 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: Your message of "Sat, 02 Nov 2002 13:46:19 PST."
             <3DC447AB.D64D6CE6@whidbey.com> 
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>  
	<3DC447AB.D64D6CE6@whidbey.com> 
Message-ID: <200211022241.gA2Mfxq07985@pcp02138704pcs.reston01.va.comcast.net>

> > X-Spambayes-Judgement: Spam / Unsure / Ham
> > X-Spambayes-Is-Spam: Yes / Unsure / No
> > X-Spambayes-Looks-Like-Spam: Yes / Unsure / No
> 
> I know we have a long tradition of spelling errors behind us, such
> as dropping an "R" from "referrer" in Apache logs, but I'd hate to
> start a new one! Please, only one "E" in "judgment."

But it's not a spelling error!

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Tim@mail.powweb.com  Sat Nov  2 22:48:21 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 16:48:21 -0600
Subject: [Spambayes] SMTP proxy
Message-ID: <TSC7A8741POF042MLNIYX1WVTD9NJLH.3dc45635@riven>

Ok, I have a ***very*** rudimentary SMTP proxy for the purpose of training a spambayes database from mailers like the mozilla mailer, which keeps mail in a 
single file, and gives you very little facility for extracting individual mails.

The proxy is based on code from Lee Smithson (http://sourceforge.net/cvs/?group_id=31674), and requires the DNS module (http://sourceforge.net/cvs/?group_id=31674).

You run the proxy, naming the port you want it to listen on.  Then you point your mailer to localhost:<port>.  You're all set at that point.  I haven't tested it very 
much, and it doesn't appear to handle error conditions particularly well (at all).

To train, simply forward or redirect a mail to spam@localspambayes.trn or to ham@localspambayes.trn (these addresses are hardcoded at the moment...)  The 
proxy recognizes these addresses and executes a learn and updates probabilities in a database that's named as an argument when the proxy is started up.  I'm 
assuming that update probabilities preserves the existing information in the database... I couldn't tell if this was the case from neiltrain.py.  If not, then this stuff 
won't work yet...
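
The address check itself is simple.  A minimal sketch of the idea (not the
proxy's actual code; the function and table names are made up for
illustration, only the two addresses come from the description above):

```python
# The two hardcoded training addresses described above.
TRAINING_ADDRESSES = {
    "spam@localspambayes.trn": "spam",
    "ham@localspambayes.trn": "ham",
}


def training_label(rcpt):
    """Return 'spam' or 'ham' for a training address, else None."""
    # Tolerate the <angle brackets> of an SMTP RCPT TO argument and
    # compare case-insensitively.
    return TRAINING_ADDRESSES.get(rcpt.strip().strip("<>").lower())


print(training_label("<spam@localspambayes.trn>"))
```

A recipient that maps to None is ordinary outgoing mail and would be passed
along untouched.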

So, now my question is does the project want this stuff?  If so, where should I send it?

- Tim
www.fourstonesExpressions.com 


From richie@entrian.com  Sat Nov  2 22:59:30 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 02 Nov 2002 22:59:30 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <3DC447AB.D64D6CE6@whidbey.com>
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
Message-ID: <jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com>

Hi Van,

> > X-Spambayes-Judgement: Spam / Unsure / Ham
> 
> I know we have a long tradition of spelling errors behind us, such as dropping
> an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please,
> only one "E" in "judgment."

Is this a British English vs. American English thing?  My Concise Oxford
says:

judgement n. (also judgment) 
  1 the critical faculty; discernment (an error of judgement).
  2 good sense.
  [snip]

listing 'judgement' first and 'judgment' as the variant.  The American
Heritage Dictionary lists 'judgment' first and 'judgement' as the variant.
Princeton's WordNet agrees with the Oxford.  Merriam-Webster agrees with
American Heritage.  My girlfriend (the ultimate authority on most things)
agrees with the Oxford.

I really hate to say this, but we should probably use the common American
spelling (even if it's wrong 8-)

-- 
Richie Hindle
richie@entrian.com


From guido@python.org  Sat Nov  2 23:01:22 2002
From: guido@python.org (Guido van Rossum)
Date: Sat, 02 Nov 2002 18:01:22 -0500
Subject: [Spambayes] Spam at hackers conference
Message-ID: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net>

At the "Hackers" conference (a cool west coast event by invitation
only) there was a session on spam.  A few things to note:

- The term "ham" is now generally accepted :-)

- People are still at the Paul Graham level of Bayesian filtering;
  I wish I had a blurb about the work done here on chi-square.

- Combining different approaches (e.g. blacklists, whitelists,
  Bayesian) seems to make people more comfortable.

- The name of Bill Yerazunis was mentioned as someone who has done
  good spam work.  Paul Graham seems to agree:
  http://www.paulgraham.com/wsy.html ; one idea of his takes groups of
  5 words and does various permutations (including leaving out some)
  and then hashes on the result; very good results apparently.  (Maybe
  the URL above has more info?)
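
One reading of that last idea, sketched below, is a sliding window of five
words in which the newest word is always kept and each older word is either
kept or replaced by a placeholder, with every resulting phrase hashed into a
feature.  This is an interpretation of the description above, not
Yerazunis's actual code; the function name, the "<skip>" placeholder and the
use of CRC32 as the hash are all assumptions.

```python
from itertools import product
import zlib


def window_features(words, size=5):
    """Yield one hashed feature per keep/skip pattern in each window."""
    for end in range(len(words)):
        window = words[max(0, end - size + 1):end + 1]
        newest, older = window[-1], window[:-1]
        # Every older word is independently kept or skipped.
        for mask in product((False, True), repeat=len(older)):
            kept = [w if keep else "<skip>" for w, keep in zip(older, mask)]
            phrase = " ".join(kept + [newest])
            yield zlib.crc32(phrase.encode("utf-8"))


features = list(window_features("buy cheap pills online now".split()))
print(len(features))
```

A full window of five words produces 2**4 = 16 features under this reading,
so the feature count grows much faster than with single-word tokens.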

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org  Sat Nov  2 23:19:17 2002
From: guido@python.org (Guido van Rossum)
Date: Sat, 02 Nov 2002 18:19:17 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: Your message of "Fri, 01 Nov 2002 18:18:51 EST."
             <15811.3035.967754.435766@slothrop.zope.com> 
References: <15810.64929.812472.459643@slothrop.zope.com>
	<LNBBLJKPBEHFEDALKOLCMECHCFAB.tim.one@comcast.net>  
	<15811.3035.967754.435766@slothrop.zope.com> 
Message-ID: <200211022319.gA2NJHt08300@pcp02138704pcs.reston01.va.comcast.net>

>   >> The pop proxy is great for people who use pop, but lots of people
>   >> don't.
> 
>   TP> Name 362.  Ha!
> 
> Guido and at least 361 other people <wink>.

Um, I get all my mail via pop (and fetchmail).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From Tim@mail.powweb.com  Sat Nov  2 23:18:51 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 17:18:51 -0600
Subject: [Spambayes] Spam at hackers conference
Message-ID: <UOPL86TPRP091Z2Y31C8B77VTCAA9SP.3dc45d5b@riven>

I've *always* suspected that spambayes in combination with other technology would present a very powerful anti-spam arsenal.  But spambayes by itself is so 
good, that it may not really require supplemental technology.  I say *always* because I've only been in this game for a couple weeks... ;)  so what do I 
REALLY know?

- Tim

11/2/2002 5:01:22 PM, Guido van Rossum <guido@python.org> wrote:

>At the "Hackers" conference (a cool west coast event by invitation
>only) there was a session on spam.  A few things to note:
>
>- The term "ham" is now generally accepted :-)
>
>- People are still at the Paul Graham level of Bayesian filtering;
>  I wish I had a blurb about the work done here on chi-square.
>
>- Combining different approaches (e.g. blacklists, whitelists,
>  Bayesian) seems to make people more comfortable.
>
>- The name of Bill Yerazunis was mentioned as someone who has done
>  good spam work.  Paul Graham seems to agree:
>  http://www.paulgraham.com/wsy.html ; one idea of his takes groups of
>  5 words and does various permutations (including leaving out some)
>  and then hashes on the result; very good results apparently.  (Maybe
>  the URL above has more info?)
>
>--Guido van Rossum (home page: http://www.python.org/~guido/)
>
- Tim
www.fourstonesExpressions.com 


From mhammond@skippinet.com.au  Sat Nov  2 23:51:34 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 3 Nov 2002 10:51:34 +1100
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <UOPL86TPRP091Z2Y31C8B77VTCAA9SP.3dc45d5b@riven>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEPOHHAA.mhammond@skippinet.com.au>

> I've *always* suspected that spambayes in combination with other
> technology would present a very powerful anti-spam arsenal.  But
> spambayes by itself is so
> good, that it may not really require supplemental technology.

I'm finding that too.  My email had 2 different problems - Spam, and
attempted worm payload (Klez et al).

As soon as I had an Outlook plugin working, I hacked up a trivial worm
detector - way before I had the spambayes stuff working.  I was very very
happy with the results - worm problem almost gone!

Then bayes came along.  I made real attempts to keep these worms out of my
spam corpora, as I thought they would mess up Bayes (eg, they often had
"pythonwin" in the subject).

But regardless of how careful I am, Bayes *still* defines them as Spam.  My
worm filter and Bayes are battling over who gets to move the mail message.
No matter how careful I am about keeping these worms from my Spam folder,
Bayes just keeps on knowing they are junk.

This mirrors what Tim has been saying - it seems likely that a single
classifier, over *all* of your mail (including mailing list etc) will be
pretty much all you need.

And-client-software-that-doesn't-keep-crapping-out <wink>

Mark.


From vanhorn@whidbey.com  Sun Nov  3 00:01:47 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Sat, 02 Nov 2002 16:01:47 -0800
Subject: [Spambayes] x-hammie-disposition in pop3proxy
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
	<jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com>
Message-ID: <3DC4676B.F5ED02CE@whidbey.com>

Richie,

Well, you're right. The COD does list it with that spelling first. The RHUD (Random
House Unabridged Dictionary), which is always open right next to my desk, shows
"Also, esp. Brit, judgement" after eight definitions and I hadn't read that far.

Normally I feel pretty odd, having 18 dictionaries on my shelf, 8 of them English.
It's good to be in a group where someone else cares!

Van

Richie Hindle wrote:

> Hi Van,
>
> > > X-Spambayes-Judgement: Spam / Unsure / Ham
> >
> > I know we have a long tradition of spelling errors behind us, such as dropping
> > an "R" from "referrer" in Apache logs, but I'd hate to start a new one! Please,
> > only one "E" in "judgment."
>
> Is this a British English vs. American English thing?  My Concise Oxford
> says:
>
> judgement n. (also judgment)
>   1 the critical faculty; discernment (an error of judgement).
>   2 good sense.
>   [snip]
>
> listing 'judgement' first and 'judgment' as the variant.  The American
> Heritage Dictionary lists 'judgment' first and 'judgement' as the variant.
> Princeton's WordNet agrees with the Oxford.  Merriam-Webster agrees with
> American Heritage.  My girlfriend (the ultimate authority on most things)
> agrees with the Oxford.
>
> I really hate to say this, but we should probably use the common American
> spelling (even if it's wrong 8-)
>
> --
> Richie Hindle
> richie@entrian.com

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From richie@entrian.com  Sun Nov  3 00:07:03 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 00:07:03 +0000
Subject: [Spambayes] SMTP proxy
In-Reply-To: <TSC7A8741POF042MLNIYX1WVTD9NJLH.3dc45635@riven>
References: <TSC7A8741POF042MLNIYX1WVTD9NJLH.3dc45635@riven>
Message-ID: <0ip8sug6mg229optc7fkebjn10j7tktqu5@4ax.com>

Hi Tim,

> I have a ***very*** rudimentary SMTP proxy for the purpose of training
> a spambayes database [...] does the project want this stuff?  If so,
> where should I send it?

Yay!  I'd love to see it - please send your code (either to me or to the
list).

Your comments raise a couple of questions, but I'll wait until I see the
code before asking them.  One thing worth mentioning to anyone joining the
coding team on this project is the Python coding standard at
http://www.python.org/peps/pep-0008.html - if all your code is full of code
( with spaces ( inside parentheses ) ) ( like mine used to be ) then I have
a script which can help (written as a result of the other Tim kindly
pointing me at the style guide when I first submitted code).

-- 
Richie Hindle
richie@entrian.com


From mhammond@skippinet.com.au  Sun Nov  3 00:20:07 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 3 Nov 2002 11:20:07 +1100
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <3DC4676B.F5ED02CE@whidbey.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEABHIAA.mhammond@skippinet.com.au>

> "Also, esp. Brit, judgement" after eight definitions and I hadn't
> read that far.
>
> Normally I feel pretty odd, having 18 dictionaries on my shelf, 8
> of them English.
> It's good to be in a group where someone else cares!

Well, us colonials all gave up caring what the yanks did to the language
long ago <wink>.  I have one dictionary on my desk; the Macquarie, the
official Australian dictionary.

It is listed as:

judgment=judgement

with no further comment on the alternative spellings.

But-the-final-authority-is-that-Outlook's-spell-checker-likes-them-both-too
<wink>

Mark.


From richie@entrian.com  Sun Nov  3 00:21:23 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 00:21:23 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <3DC4676B.F5ED02CE@whidbey.com>
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
	<jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com> <3DC4676B.F5ED02CE@whidbey.com>
Message-ID: <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>

Hi Van,

> It's good to be in a group where someone else cares!

Definitely.  But aren't most technical groups like that?  Being a language
geek is a direct consequence of being a geek of any kind, isn't it?  8-)

Back to the topic: the spelling 'judgment' looks simply wrong to me (and to
my girlfriend, the nineteenth dictionary).  I suspect it never gets used in
British English.  However, Google lists 2,150,000 hits for 'judgement' and
5,800,000 hits for 'judgment', which implies that the latter is in more
common use worldwide.  The question is, does 'judgement' look as wrong to
American eyes as 'judgment' does to British ones?  Judging (ha ha) by your
initial response, I'd guess it does.  (Pardon me if you're not American!)

Or maybe the *real* question is, shall we call the header
X-Spambayes-Classification?

-- 
Richie Hindle
richie@entrian.com


From Tim@mail.powweb.com  Sun Nov  3 00:30:56 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 18:30:56 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
Message-ID: <YUXR1W3ZWS04FC9453KBAQNA6QKWTBA.3dc46e40@riven>

Well, judgment looks wrong to me... How about x-hammie-classified-as?

- TimS

11/2/2002 6:21:23 PM, Richie Hindle <richie@entrian.com> wrote:

>Hi Van,
>
>> It's good to be in a group where someone else cares!
>
>Definitely.  But aren't most technical groups like that?  Being a language
>geek is a direct consequence of being a geek of any kind, isn't it?  8-)
>
>Back to the topic: the spelling 'judgment' looks simply wrong to me (and to
>my girlfriend, the nineteenth dictionary).  I suspect it never gets used in
>British English.  However, Google lists 2,150,000 hits for 'judgement' and
>5,800,000 hits for 'judgment', which implies that the latter is in more
>common use worldwide.  The question is, does 'judgement' look as wrong to
>American eyes as 'judgment' does to British ones?  Judging (ha ha) by your
>initial response, I'd guess it does.  (Pardon me if you're not American!)
>
>Or maybe the *real* question is, shall we call the header
>X-Spambayes-Classification?
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
- Tim
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Sun Nov  3 00:30:53 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sat, 02 Nov 2002 16:30:53 -0800
Subject: [Spambayes] x-hammie-disposition in pop3proxy 
In-Reply-To: Message from "G. Armour Van Horn" <vanhorn@whidbey.com> 
   of "Sat, 02 Nov 2002 13:46:19 PST." <3DC447AB.D64D6CE6@whidbey.com> 
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>  <3DC447AB.D64D6CE6@whidbey.com> 
Message-ID: <20021103003053.44B27F49B@cashew.wolfskeep.com>

In message:  <3DC447AB.D64D6CE6@whidbey.com>
             "G. Armour Van Horn" <vanhorn@whidbey.com> writes:
>Richie Hindle wrote:
>
>> X-Spambayes-Judgement: Spam / Unsure / Ham
>
>I know we have a long tradition of spelling errors behind us, such as
>dropping an "R" from "referrer" in Apache logs, but I'd hate to start
>a new one! Please, only one "E" in "judgment."

Actually, Merriam-Webster lists both forms as valid.

- Alex

From B-Morgan@concentric.net  Sun Nov  3 00:31:59 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Sat, 2 Nov 2002 17:31:59 -0700
Subject: [Spambayes] SMTP proxy
In-Reply-To: <TSC7A8741POF042MLNIYX1WVTD9NJLH.3dc45635@riven>
Message-ID: <NABBJOOEOFODEALNMJAJIECHHBAA.B-Morgan@concentric.net>

Tim,

Please submit the SMTP proxy to the project.  I think this is a good
interface for training.  I'm also following the popfile Sourceforge project
and they have a useable interface using HTML (on an alternate port) which is
a reasonable alternative.

I do have a question on the SMTP proxy.  Can it be configured to pass
everything it doesn't capture on to the "normal" proxy (where "normal" is
specified somehow)?  If not, how would it be configured in say, Outlook?

Keep up the good work!

Regards,

Brad


From B-Morgan@concentric.net  Sun Nov  3 00:36:27 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Sat, 2 Nov 2002 17:36:27 -0700
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
Message-ID: <NABBJOOEOFODEALNMJAJAECIHBAA.B-Morgan@concentric.net>

> Or maybe the *real* question is, shall we call the header
> X-Spambayes-Classification?

"X-Spambayes-Classification: spam, ham, or unsure" makes perfect sense to
me.  There are enough words in common between British, American, (and
Australian) English that we can use <G>.

Regards,

Brad


From popiel@wolfskeep.com  Sun Nov  3 00:40:02 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sat, 02 Nov 2002 16:40:02 -0800
Subject: [Spambayes] x-hammie-disposition in pop3proxy 
In-Reply-To: Message from Richie Hindle <richie@entrian.com> 
	<ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com> 
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
	<jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com> <3DC4676B.F5ED02CE@whidbey.com>
	<ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com> 
Message-ID: <20021103004002.92222F49B@cashew.wolfskeep.com>

In message:  <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
             Richie Hindle <richie@entrian.com> writes:
>Hi Van,
>
>> It's good to be in a group where someone else cares!
>
>Definitely.  But aren't most technical groups like that?  Being a language
>geek is a direct consequence of being a geek of any kind, isn't it?  8-)

Not a direct consequence, but there is a high correlation.
I've never run chi-square on it, though. ;-)

>The question is, does 'judgement' look as wrong to American eyes as
>'judgment' does to British ones?

Not to this American's eyes... but I tend to go for the British versions
of many words, so it might just be an affectation.

>Or maybe the *real* question is, shall we call the header
>X-Spambayes-Classification?

I like that one.

- Alex

From Tim@mail.powweb.com  Sun Nov  3 00:47:38 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 18:47:38 -0600
Subject: [Spambayes] SMTP proxy
In-Reply-To: <NABBJOOEOFODEALNMJAJIECHHBAA.B-Morgan@concentric.net>
Message-ID: <XUBAHEOFE3YMJC71YSO3X74SM632WNH.3dc4722a@riven>

Brad, I've sent the code to Richie, as I don't have checkin privileges (sp?)
on the Spambayes project.  Gosh, you guys have got me all worried about
spelling...

To answer your question: yes, it passes mail through right now, but there is
some configuration-related work that really needs to be done before I'd
consider it ready for primetime.  When you run the proxy, you tell it what
port to listen on and give it the IP address of a DNS server.  Normal outgoing
mail is processed by doing a DNS lookup on the domain in the To: address,
grabbing the SMTP server name from the lookup result (a bit of a mystery to
me), and connecting to that server.  This is kinda not right, imo.  I think
the outgoing SMTP server name should be specifiable as a startup option, as
should the port to send on.  That way, you can specify localhost:<port> if you
have another proxy running, which will allow you to chain them.  In my case,
this is exactly what I have at the moment.  So... that'll be coming, soon I
hope...
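
A minimal sketch of how such a startup option might be parsed, assuming the
value is given as "host" or "host:port".  The name parse_upstream and the
default port of 25 are illustrative assumptions, not taken from the proxy
code itself.

```python
def parse_upstream(spec, default_port=25):
    """Split 'host' or 'host:port' into a (host, port) tuple."""
    host, sep, port = spec.rpartition(":")
    if not sep:
        # No colon at all: the whole string is the host name.
        return spec, default_port
    if not host or not port.isdigit():
        raise ValueError("expected host or host:port, got %r" % spec)
    return host, int(port)


# Chaining to another proxy on the same machine (hypothetical port number).
print(parse_upstream("localhost:8025"))
# Falling back to the default SMTP port.
print(parse_upstream("smtp.example.com"))
```

With a helper like this, the proxy could skip the DNS lookup entirely whenever
an upstream server is named explicitly.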

- TimS

11/2/2002 6:31:59 PM, "Brad Morgan" <B-Morgan@concentric.net> wrote:

>Tim,
>
>Please submit the SMTP proxy to the project.  I think this is a good
>interface for training.  I'm also following the popfile Sourceforge project
>and they have a useable interface using HTML (on an alternate port) which is
>a reasonable alternative.
>
>I do have a question on the SMTP proxy.  Can it be configured to pass
>everything it doesn't capture on to the "normal" proxy (where "normal" is
>specified somehow)?  If not, how would it be configured in say, Outlook?
>
>Keep up the good work!
>
>Regards,
>
>Brad
>
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From Tim@mail.powweb.com  Sun Nov  3 00:48:48 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 18:48:48 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy 
In-Reply-To: <20021103004002.92222F49B@cashew.wolfskeep.com>
Message-ID: <XQMLI5375MLYWC86542QMSQQOZV8GB.3dc47270@riven>

X-Spambayes-Classification

Perfect.

- TimS

11/2/2002 6:40:02 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
>             Richie Hindle <richie@entrian.com> writes:
>>Hi Van,
>>
>>> It's good to be in a group where someone else cares!
>>
>>Definitely.  But aren't most technical groups like that?  Being a language
>>geek is a direct consequence of being a geek of any kind, isn't it?  8-)
>
>Not a direct consequence, but there is a high correlation.
>I've never run chi-square on it, though. ;-)
>
>>The question is, does 'judgement' look as wrong to American eyes as
>>'judgment' does to British ones?
>
>Not to this American's eyes... but I tend to go for the British versions
>of many words, so it might just be an affectation.
>
>>Or maybe the *real* question is, shall we call the header
>>X-Spambayes-Classification?
>
>I like that one.
>
>- Alex
>
- Tim
www.fourstonesExpressions.com 


From Tim@mail.powweb.com  Sun Nov  3 00:54:29 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sat, 02 Nov 2002 18:54:29 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy 
In-Reply-To: <XQMLI5375MLYWC86542QMSQQOZV8GB.3dc47270@riven>
Message-ID: <1T985KFB9MK09SRD8PLJG1X05SNSOIG.3dc473c5@riven>

Has there been any thought given to additional classifications, beyond
ham|unsure|spam?  Like ham|probablyham|unsure|probablyspam|spam, with
corresponding cutoffs specified in Options?  I don't know if that's
interesting to anybody at all...

I could see "X-Spambayes-Classification: probablyspam" being useful for
marking a range of mail that should be checked manually...
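
A rough sketch of what a five-way split could look like, assuming four
cutoffs on a score between 0.0 and 1.0.  The cutoff values and the names
CUTOFFS, LABELS, classify and header_line are made up for illustration;
nothing like them exists in Options today.

```python
import bisect

# Hypothetical cutoffs separating the five bands, in increasing order.
CUTOFFS = [0.10, 0.40, 0.60, 0.90]
LABELS = ["ham", "probablyham", "unsure", "probablyspam", "spam"]


def classify(score, cutoffs=CUTOFFS, labels=LABELS):
    """Map a score in [0.0, 1.0] to one of len(cutoffs) + 1 labels."""
    # bisect_right counts how many cutoffs are <= score, which is
    # exactly the index of the band the score falls into.
    return labels[bisect.bisect_right(cutoffs, score)]


def header_line(score):
    """Build the header text for a given score."""
    return "X-Spambayes-Classification: %s" % classify(score)


print(header_line(0.75))
```

A score exactly equal to a cutoff lands in the higher band here; that is an
arbitrary choice.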

- TimS

11/2/2002 6:48:48 PM, Tim@mail.powweb.com, Stone@mail.powweb.com, Four Stones Expressions <tim@fourstonesExpressions.com> wrote:

>X-Spambayes-Classification
>
>Perfect.
>
>- TimS
>
>11/2/2002 6:40:02 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:
>
>>In message:  <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
>>             Richie Hindle <richie@entrian.com> writes:
>>>Hi Van,
>>>
>>>> It's good to be in a group where someone else cares!
>>>
>>>Definitely.  But aren't most technical groups like that?  Being a language
>>>geek is a direct consequence of being a geek of any kind, isn't it?  8-)
>>
>>Not a direct consequence, but there is a high correlation.
>>I've never run chi-square on it, though. ;-)
>>
>>>The question is, does 'judgement' look as wrong to American eyes as
>>>'judgment' does to British ones?
>>
>>Not to this American's eyes... but I tend to go for the British versions
>>of many words, so it might just be an affectation.
>>
>>>Or maybe the *real* question is, shall we call the header
>>>X-Spambayes-Classification?
>>
>>I like that one.
>>
>>- Alex
>>
>- Tim
>www.fourstonesExpressions.com 
>
>
- Tim
www.fourstonesExpressions.com 


From vanhorn@whidbey.com  Sun Nov  3 03:12:33 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Sat, 02 Nov 2002 19:12:33 -0800
Subject: [Spambayes] x-hammie-disposition in pop3proxy
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
	<jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com> <3DC4676B.F5ED02CE@whidbey.com>
	<ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
Message-ID: <3DC49421.8EAA85FE@whidbey.com>

Richie,

I have not found any great correlation between language precision and technical
competence, other than that leaders of most technical communities are both
highly intelligent and well educated. It's amazing how many times I see "loose"
when the obvious meaning is "lose" in online forums. I have a lot of problems
with homophones, but lose and loose don't even rhyme.

The British spelling of judgement would go unnoticed by the vast majority of my
fellow citizens of the distressingly ill-educated United States. Of course,
adding or dropping vowels really isn't that big a deal in English, a language
that can be accurately read with an amazing quantity of missing vowels.

As to other Americans, south of us they speak Spanish and wouldn't care, north
of us they have so much British spelling to deal with they wouldn't notice.
(They might notice if you used colour and labor in the same sentence, I
suspect.)

As to the final choice of the name, the image of a stern black-robed jurist
behind a high podium is a lot more appealing to me than an entomologist with a
magnifier or a librarian choosing a Dewey Decimal System code for a book. So I
vote for Judgement over Classification.

As to Outlook, mentioned in the previous message, to my disgust Microsoft Word
accepts both spellings when the US English dictionary is loaded. I really need
to move to a program that allows correcting the factory dictionary.

Van

Richie Hindle wrote:

> Hi Van,
>
> > It's good to be in a group where someone else cares!
>
> Definitely.  But aren't most technical groups like that?  Being a language
> geek is a direct consequence of being a geek of any kind, isn't it?  8-)
>
> Back to the topic: the spelling 'judgment' looks simply wrong to me (and to
> my girlfriend, the nineteenth dictionary).  I suspect it never gets used in
> British English.  However, Google lists 2,150,000 hits for 'judgement' and
> 5,800,000 hits for 'judgment', which implies that the latter is in more
> common use worldwide.  The question is, does 'judgement' look as wrong to
> American eyes as 'judgment' does to British ones?  Judging (ha ha) by your
> initial response, I'd guess it does.  (Pardon me if you're not American!)
>
> Or maybe the *real* question is, shall we call the header
> X-Spambayes-Classification?
>
> --
> Richie Hindle
> richie@entrian.com

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From tim.one@comcast.net  Sun Nov  3 03:15:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 02 Nov 2002 22:15:43 -0500
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEJCCFAB.tim.one@comcast.net>

[Guido]
> At the "Hackers" conference (a cool west coast event by invitation
> only) there was a session on spam.  A few things to note:
>
> - The term "ham" is now generally accepted :-)

So where are my royalties <wink>?

> - People are still at the Paul Graham level of Bayesian filtering;
>   I wish I had a blurb about the work done here on chi-square.

If you had the source code, there are accurate ('tho bare bones)
explanations in Options.py and classifier.py.

> - Combining different approaches (e.g. blacklists, whitelists,
>   Bayesian) seems to make people more comfortable.

I doubt a blacklist is going to be worth the bother with this scheme, but a
whitelist may be.  MarkH taught the Outlook client how to traverse "old"
Outlook .pst files, and running my current classifier over the year 2000's
email put about 2 dozen personal msgs from rare correspondents into my
Unsure category (which is fine), but 3 into the Spam category (which is not
fine).  OTOH, they're *such* rare correspondents that I never would have
thought to whitelist them anyway.  Indeed, since it turns out I never
responded to these msgs anyway <sigh>, the world would be exactly the same
if I had never gotten them.

> - The name of Bill Yerazunis was mentioned as someone who has done
>   good spam work.  Paul Graham seems to agree:
>   http://www.paulgraham.com/wsy.html ; one idea of his takes groups of
>   5 words and does various permutations (including leaving out some)
>   and then hashes on the result; very good results apparently.  (Maybe
>   the URL above has more info?)

His project is at:

    http://crm114.sourceforge.net/

The source code is easier to read than his attempt to explain it.  I
mentioned this late last week indirectly, in connection with using
many-to-one hashing as a database reduction gimmick; CRM114 carries that to
extremes, and has to, else it would consume gigabytes (or even terabytes).

Bill tokenizes a msg.  Looks like he splits on whitespace, and preserves
case, but that's not "the gimmick".  The tokens are all hashed, and the rest
of the scheme works with their hash codes.  So the output of tokenization is
a list of N hash codes (where N is the # of tokens in the msg).

Then a sliding 5-word window marches across the list of hashes, one 5-token
slice at a time.  At *each* of the N-4 window positions, 16 hashes-of-hashes
(HOH) are computed, via 16 different hash functions.  Each HOH folds in the
hash code of the rightmost token in the window, and then all subsets of the
preceding 4 token hashes are folded in too; 2**4 is the # of possible
subsets, and so that's where the 16 comes from.

This gives 16 numbers at each window position, which are used to index
mmap'ed files of one-byte ham and spam counts.  Bill simply keeps a running
total of the ham and spam counts, and whichever total is higher at the end
wins.
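
A minimal sketch of the scheme as described above, not CRM114's actual code.
The hash functions here are zlib.crc32 stand-ins, plain dicts replace the
mmap'ed count files, and the bucket count is the 1,000,000 figure mentioned
later in this message; all function names are hypothetical.

```python
import zlib
from itertools import combinations

NBUCKETS = 1000000  # assumed table size


def token_hash(token):
    # Stand-in for the real token hash.
    return zlib.crc32(token.encode("utf-8"))


def window_buckets(hashes):
    """Yield 16 bucket indexes for one window of 5 token hashes."""
    *earlier, last = hashes
    for keep in range(len(earlier) + 1):
        # Every subset of the 4 earlier hashes, tagged with its position
        # so that skipped slots produce different keys.
        for subset in combinations(enumerate(earlier), keep):
            key = "%d|" % last + ",".join("%d:%d" % pv for pv in subset)
            yield zlib.crc32(key.encode("ascii")) % NBUCKETS


def message_buckets(tokens, width=5):
    """Slide the window across the message, one token at a time."""
    hashes = [token_hash(t) for t in tokens]
    for start in range(len(hashes) - width + 1):
        yield from window_buckets(hashes[start:start + width])


def train(counts, tokens):
    for bucket in message_buckets(tokens):
        # Clamp at 255 to mimic a one-byte counter.
        counts[bucket] = min(counts.get(bucket, 0) + 1, 255)


def verdict(ham_counts, spam_counts, tokens):
    """Keep running totals; the higher total wins (ties go to ham here)."""
    ham = spam = 0
    for bucket in message_buckets(tokens):
        ham += ham_counts.get(bucket, 0)
        spam += spam_counts.get(bucket, 0)
    return "spam" if spam > ham else "ham"
```

A six-word message yields two window positions and therefore 32 bucket
indexes.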

We currently compute "exact" word 1-gram stats.  All of Bill's stats are
fuzzy because of multiple layers of many-to-one mappings.  Ignoring that,
and also ignoring some glitches at the start and end of the window
positions, he's effectively capturing:

+ All word 1-grams.
+ All word 2-grams.
+ All word 3-grams.
+ All word 4-grams.
+ All word 5-grams.
+ All word 4-grams taken from 5-word slices but ignoring a word.
+ All word 3-grams taken from 5-word slices but ignoring 2 words.
+ All word 2-grams taken from 5-word slices but ignoring 3 words.

Example:

    the earnings potential is truly staggering

generates a HOH for each of

    the earnings potential is truly
    the earnings potential truly
    the earnings is truly
    the potential is truly
    earnings potential is truly
    [and 6 more skipping 2 words of the first 4]
    [and 4 more skipping 3 words of the first 4]
    truly
    [and 16 more for the 5 words starting at "earnings"]
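
The enumeration above can be reproduced with a few lines of Python.  This
sketch works on the words themselves rather than on hashes, purely to make
the 16 subsets visible; window_phrases is a hypothetical helper name.

```python
from itertools import combinations


def window_phrases(window):
    """Return every word subset for one window, always keeping the last word."""
    *earlier, last = window
    phrases = []
    # Start with all 4 earlier words, then drop 1, 2, 3, and finally all 4.
    for keep in range(len(earlier), -1, -1):
        for subset in combinations(earlier, keep):
            phrases.append(" ".join(subset + (last,)))
    return phrases


words = "the earnings potential is truly staggering".split()
first = window_phrases(words[0:5])    # window ending at "truly"
second = window_phrases(words[1:6])   # window ending at "staggering"

for phrase in first[:5]:
    print(phrase)
print(len(first), len(second))
```

Each window contributes 1 + 4 + 6 + 4 + 1 = 16 phrases, which is the 2**4
count of subsets of the four preceding words.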

There are lots of unknowns in this scheme.  It's clear that it will learn
very quickly at the start.  It's unclear how it will do over time.  Things
working against it (not saying they can't be wormed around, am saying they
will need to be wormed around, whether or not the need has become apparent
yet):

1. The spam and ham counts are clamped at one byte, and will eventually
   saturate.

2. 1,000,000 buckets isn't much over time, given the prolific rate of
   HOH generation (16*N HOH's for an N-word msg).  The scheme clearly
   will work best if bucket collisions are rare, but throw just a few
   thousand HOHs at 1,000,000 buckets and collisions are certain.

3. By construction, there's extreme correlation wrt the haminess
   and spaminess of the generated HOH's.  This makes a good combining
   scheme a puzzle.  Graham-combining gets in trouble due to the
   bogus word-independence assumption even for unigrams.  Overlapping
   bigrams suffer much worse correlation.  Overlapping trigrams or
   higher, forget it.  Even chi-combining gets in trouble if the
   tokenizer produces "too many" highly correlated one-grams.
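
The collision claim in point 2 can be checked with the usual birthday
approximation.  This is a back-of-the-envelope sketch that assumes the HOHs
land uniformly at random in the buckets, which real hash functions only
approximate; p_collision is a hypothetical helper name.

```python
import math


def p_collision(items, buckets=1000000):
    """Approximate chance that at least two of `items` share a bucket."""
    return 1.0 - math.exp(-items * (items - 1) / (2.0 * buckets))


for n in (1000, 3000, 5000):
    print(n, round(p_collision(n), 4))
```

Under that assumption the chance of at least one collision is roughly 39% at
1,000 HOHs, about 99% at 3,000, and effectively 100% at 5,000.  At 16 HOHs per
window position, a single 200-word message already generates about 3,000.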

In my earlier experiments with word bigrams, they did worse than what we're
doing now, and there seemed to be solid reasons for that (like "Aahz rocks"
is neutral but "Aahz" on its own is a strong c.l.py ham clue).

In a later experiment grabbing both bigrams and unigrams, at the time I saw
a significant drop in FN rate (along with a large boost in database size).
We've since pushed unigrams to the point where I see better error rates than
I got in that experiment, but I haven't run that experiment again.  It
stands to reason that we're missing *some* useful info.

Bill appears to have stats only for his own email; if there's been wider
testing, I haven't bumped into results.  I'm getting error rates at least as
good on my own email, and better on the c.l.py test.  Bill needed a lot less
training.

Fiddling our codebase to try something like it wouldn't be hard.


From tim.one@comcast.net  Sun Nov  3 03:20:28 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 02 Nov 2002 22:20:28 -0500
Subject: [Spambayes] Why I added src=cid: etc
Message-ID: <LNBBLJKPBEHFEDALKOLCIEJDCFAB.tim.one@comcast.net>

This is typical of the kind of email I'm getting a lot of lately.  Without
mining the HTML, there's almost nothing to look at, not even a word in the
Subject line.  (Of course, if we weren't throwing the HTML tags away, the
classifier would have learned this stuff on its own.)

Spam Score: 0.999492


'*H*'                          0.000694711
'*S*'                          0.999679
'header:Received:4'            0.151395
'header:Return-Path:1'         0.621969
'header:Message-ID:1'          0.787093
'virus:width=0'                0.842427
'message-id:@fwd04.sul.t-online.com' 0.844828
'virus:height=0'               0.855208
'from:email addr:t-online.de'  0.908163
'from:email name:520018173831-0001' 0.908163
'virus:src=cid:'               0.978469
'virus:</iframe'               0.988432
'virus:<iframe'                0.98951
Message Stream:


Received: from cpimssmtpa49.msn.com ([10.48.181.223]) by
	cpimsstra13.email.msn.com with Microsoft SMTPSVC(5.0.2195.4905);
	Sat, 2 Nov 2002 18:35:05 -0800
Received: from mailout02.sul.t-online.com ([194.25.134.17]) by
	cpimssmtpa49.msn.com with Microsoft SMTPSVC(5.0.2195.4905);
	Sat, 2 Nov 2002 18:21:57 -0800
Received: from fwd04.sul.t-online.de
	by mailout02.sul.t-online.com with smtp
	id 188AQ4-0003mh-04; Sun, 03 Nov 2002 03:23:20 +0100
Received: from Vdjmego (520018173831-0001@[217.4.99.73]) by
	fwd04.sul.t-online.com
	with smtp id 188APQ-11Ip3QC; Sun, 3 Nov 2002 03:22:40 +0100
From: 520018173831-0001@t-online.de (soa)
To: tim_one@msn.com
Subject:
MIME-Version: 1.0
Date: Sun, 3 Nov 2002 03:22:40 +0100
Message-ID: <188APQ-11Ip3QC@fwd04.sul.t-online.com>
X-Sender: 520018173831-0001@t-dialin.net
Return-Path: 520018173831-0001@t-online.de
X-OriginalArrivalTime: 03 Nov 2002 02:21:58.0083 (UTC)
	FILETIME=[C8F85130:01C282DF]


<HTML><HEAD></HEAD><BODY>
<iframe src=cid:O5YGK7EwZ95X79h15JF height=0 width=0>
</iframe>
<FONT></FONT></BODY></HTML>

From anthony@interlink.com.au  Sun Nov  3 04:05:33 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Sun, 03 Nov 2002 15:05:33 +1100
Subject: [Spambayes] x-hammie-disposition in pop3proxy 
In-Reply-To: <3DC49421.8EAA85FE@whidbey.com> 
Message-ID: <200211030405.gA345Xo00791@localhost.localdomain>


>>> "G. Armour Van Horn" wrote
> The British spelling of judgement would go unnoticed by the vast
> majority of my fellow citizens of the distressingly ill-educated United
> States. Of course, adding or dropping vowels really isn't that big
> a deal in English, a language that can be accurately read with an
> amazing quantity of missing vowels.

So maybe it should be
X-SpmBys-Jdgmnt
?


From skip@pobox.com  Sat Nov  2 14:14:56 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 2 Nov 2002 08:14:56 -0600
Subject: [Spambayes] An alternate use 
In-Reply-To: <20021102062939.3CD89F5AC@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
        <20021102062939.3CD89F5AC@cashew.wolfskeep.com>
Message-ID: <15811.56800.291193.255713@montanaro.dyndns.org>


    >>> In the case of adding headers, we'll want to avoid collisions with
    >>> personal use of spambayes, too.  I suggest tagging the
    >>> X-Spambayes-Disposition header (or whatever we call it) with some
    >>> identifier for which classifier generated the rating, so that
    >>> multiple X-Spambayes-Disposition lines are distinguishable.
    >>> Something like:
    >>> 
    >>> X-Spambayes-Disposition: Spam by spambayes@python.org
    >>> X-Spambayes-Disposition: Unsure by pennmush@pennmush.org
    >>> 
    >>> Personal classifiers could leave off the 'by' section.
    >>> 
    >>> Heck, make it so that X-Spambayes-Disposition lines are turned into
    >>> words similar to the mailer lines, and then personal classifiers can
    >>> use the judgements of list classifiers as clues.
    >> 
    >> Easy to spoof, and I'm sure spammers would pick up on that quickly.

    Alex> Yes, it would be easy to spoof, unless compared with routing
    Alex> information... but doing that sort of comparison is beyond the
    Alex> sorting rule capabilities of something like Outlook (and Outlook
    Alex> is sadly one of the best GUI tools in that arena).  I'm not even
    Alex> sure procmail is up to the task without help from a custom
    Alex> program.

I was using a spoof-proof mechanism from procmail before I disabled
SpamAssassin.  I inserted my own header using formail:

    :0H
    * ! ^X-SA-Host:
    {
      :0fw
      | spamc | $FORMAIL -a "X-SA-Host: `hostname --fqdn`"
    }

which says, "if there is no X-SA-Host header present, run spamc, add a
header and include the fully qualified hostname".  If an X-SA-Host header is
present it tells me spamc had already been run on this message (I was
running SA on two different machines at the time).  That way I wasn't
relying on SA's own headers to decide whether or not to run it.
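
For anyone who doesn't read procmail, here is roughly the same logic as a
Python sketch.  The run_filter argument is a hypothetical callable standing in
for spamc, and tag_once is an illustrative name; only the email and socket
modules from the standard library are used.

```python
import email
import socket


def tag_once(raw_message, run_filter, header="X-SA-Host"):
    """Run the filter only if our own marker header is absent."""
    msg = email.message_from_string(raw_message)
    if msg[header] is not None:
        # Marker already present: some other machine has processed this.
        return raw_message
    filtered = email.message_from_string(run_filter(raw_message))
    # Record which host did the work, like `hostname --fqdn` in the recipe.
    filtered[header] = socket.getfqdn()
    return filtered.as_string()


raw = "Subject: hi\n\nbody\n"
once = tag_once(raw, lambda text: text)    # identity filter for the demo
twice = tag_once(once, lambda text: text)  # second pass leaves it untouched
print(once == twice)
```

The decision to run the filter depends only on the marker header, never on
headers the filter itself writes.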

Skip


From skip@pobox.com  Sat Nov  2 17:30:22 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 2 Nov 2002 11:30:22 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
References: <URPKKE1U751VCBMXSGEGDXTPMOI8461.3dc40199@riven>
Message-ID: <15812.2990.423954.983990@montanaro.dyndns.org>


    Tim> Ok, I've got the pop3proxy up and running on my machine.  Very
    Tim> simple to get running.  I don't have a trained database (the real
    Tim> challenge) at this point, and it's adding the x-hammie-disposition
    Tim> header with value of 'no'.  I presume that this means that the
    Tim> classifier thinks this is NOT ham?  So if there's no database, then
    Tim> it assumes everything is spam?  Or am I reading the meaning of the
    Tim> header backwards?

Correct, you're reading it backwards.  "no" means "I think it's ham" and
"yes" means "I think it's spam".  "unsure" means ...

"no" and "yes" are interpreted the same as SpamAssassin's use of these
words.  Perhaps this suggests that we need a different header?  SA uses

    X-Spam-Status: yes

which reads in the obvious fashion.  I still think we need to leave
"X-Spam-*" to the SA folks to avoid ambiguity, but maybe we can use

    X-Ham-Status: yes         to mean "it's ham"
    X-Ham-Status: no          to mean "it's spam"

Just a thought.

Skip

From skip@pobox.com  Sat Nov  2 00:04:34 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 18:04:34 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
Message-ID: <15811.5778.454998.906629@montanaro.dyndns.org>

>>>>> "Tim" == Tim  <Tim@mail.powweb.com> writes:

    Tim> This proposal has a lot of attractions.  Forwarding to ham@ and
    Tim> spam@ would be a bit of a pain at first, but it would work for
    Tim> existing bodies of mail.  Training would be MUCH simpler with this
    Tim> method, and would not require some fancy-schmancy installation or
    Tim> configuration glorp.

The more you ask people to type, the more mistakes they will make.  I'm
still amazed at how many mistakes I've made, not because I mentally mistook
the nature of an email, but because I simply saved it to the wrong file.

Skip

From skip@pobox.com  Sat Nov  2 00:51:40 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 1 Nov 2002 18:51:40 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15811.7287.50962.651569@slothrop.zope.com>
References: <15811.3035.967754.435766@slothrop.zope.com>
        <LCEPIIGDJPKCOIHOBJEPEELFHHAA.mhammond@skippinet.com.au>
        <15811.7287.50962.651569@slothrop.zope.com>
Message-ID: <15811.8604.210277.923442@montanaro.dyndns.org>


    Jeremy> This part of the code doesn't work that well for my mail
    Jeremy> folders.  The code to move messages from folder to folder needs
    Jeremy> to be written in elisp.  I'm not sure how important that is.

You could try Pymacs...

Skip

From skip@pobox.com  Sun Nov  3 04:23:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 2 Nov 2002 22:23:47 -0600
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJCCFAB.tim.one@comcast.net>
References: <200211022301.gA2N1MJ08093@pcp02138704pcs.reston01.va.comcast.net>
        <LNBBLJKPBEHFEDALKOLCGEJCCFAB.tim.one@comcast.net>
Message-ID: <15812.42195.737956.831082@montanaro.dyndns.org>


    >> - Combining different approaches (e.g. blacklists, whitelists,
    >> Bayesian) seems to make people more comfortable.

    Tim> I doubt a blacklist is going to be worth the bother with this
    Tim> scheme, but a whitelist may be.

I doubt it.  There is just too much email spoofing going on to trust any
addresses that absolutely.  When using SA, I rarely used its whitelist
facility, and only for odd email addresses whose automailings it always
classified as spam.  For instance, I get a bit of mail from American
Airlines letting me know when the airfare between Chicago and Albany
changes.  As you might imagine, it's very spammy looking.  The only way I
could convince SA to leave it alone was to whitelist it.  With Spambayes,
it's never a problem.

Skip

From tim.one@comcast.net  Sun Nov  3 05:47:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 00:47:37 -0500
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJCCFAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEJNCFAB.tim.one@comcast.net>

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
[Tim, sketches the CRM114 algorithm]
> ...
> Fiddling our codebase to [try] something like it wouldn't be hard.

Proof attached.  Like the docs say, nothing is sacred here, and if that
algorithm works better, great, we can go home early <wink>.

The attached patches classifier.py to do CRM114 HOH generation and scoring
by default.  The token hash is Python's string hash, which is a better hash
than CRM114 uses.  The 16 HOH hashes used here are, I believe, identical to
the ones CRM114 uses; I grit my teeth at this because they don't appear to
be good HOH functions, but let's let that pass.

This runs very much slower and requires a lot more memory than what we're
using now.  OTOH, the memory use is bounded no matter how much training data
there is, due to layers of many-to-one hash mappings.  If this scheme
becomes serious here, recoding the scoring in C would seem necessary for
both speed and memory efficiency (I'm already playing obscure speed and
memory reduction tricks here, but they don't help enough).  The patch
doesn't change tokenization at all, although I believe Bill preserves case
when tokenizing, neither skips short words nor meta-tokenizes long
words, and doesn't do any of our "fancy" tokenization gimmicks (a note on
the project site suggests that he'll start decoding base64, because that's
been a problem with the scheme; we are decoding base64, of course).  The
patch also doesn't clamp counts to 1-byte values, although I doubt that
played a role here (later: unclear!).

If you try this, set ham_cutoff and spam_cutoff to 0.5 (later: also unclear
what to do here).  It's just comparing raw counts, and the bigger count
wins.  The score returned here is

    S/(S+H)

where

    S is the sum of the ~16*N HOH spamcounts
    H is the sum of the ~16*N HOH hamcounts

So < 0.5 means S was smaller, and > 0.5 means S was larger.
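
As a small worked sketch of that formula: crm_score is a hypothetical name,
and returning 0.5 when both sums are zero is an assumption, not something
stated here.  The two sample calls use the *S* and *H* totals printed for the
two high-scoring ham messages quoted later in this message.

```python
def crm_score(spam_total, ham_total):
    """Return S/(S+H), or 0.5 when there is no evidence either way."""
    total = spam_total + ham_total
    if total == 0:
        return 0.5  # assumed neutral value for a message with no counts
    return spam_total / total


print(crm_score(173900, 59619))
print(crm_score(182843, 80982))
```

The results agree with the prob values shown for those two messages (about
0.7447 and 0.6930).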

On a python.org email test I was running anyway, the results weren't
stellar:

filename:      base1       crm
ham:spam:   2741:948  2741:948
fp total:          5         2
fp %:           0.18      0.07
fn total:          2       271
fn %:           0.21     28.59
unsure t:         66         0
unsure %:       1.79      0.00
real cost:    $65.20   $291.00
best cost:    $21.40   $177.00
h mean:         0.84     18.28
h sdev:         6.21      6.17
s mean:        98.05     64.63
s sdev:         9.10     23.53
mean diff:     97.21     46.35
k:              6.35      1.56
This isn't a big test, but it bloated to 100MB and took so long I killed it
once suspecting a hang (it wasn't hung, so I got to start over again
<wink>).

However,

1. If there's a usable middle ground here, setting both cutoffs to 0.5
   can't reflect that.  Still, the run was done with nbuckets=200, which
   gave the automated histogram analysis a lot of resolution to play
   with, and the best-cost crm value was $177.00, 8x worse than the best-
   cost "before" value (deduced from the same nbuckets value).

2. It occurs to me that *because* it's just scoring by comparing raw
   counts, it's probably crucial to train on an equal number of ham
   and spam.  That there was 3x as much ham in this test made it much
   easier to get high raw hamcounts than high raw spamcounts.  That may
   (or may not) explain the bulk of the huge FN rate.
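
To illustrate point 2 with a deliberately artificial case: take a feature
that appears in the same fraction of ham and spam, so it carries no
information.  The 10% rate and the name neutral_score are assumptions chosen
only for the arithmetic.

```python
def neutral_score(n_ham, n_spam, rate=0.10):
    """Score of a feature seen in the same fraction of ham and spam."""
    ham_count = rate * n_ham
    spam_count = rate * n_spam
    return spam_count / (spam_count + ham_count)


print(round(neutral_score(2741, 948), 3))    # the unbalanced training set
print(round(neutral_score(2000, 2000), 3))   # a balanced training set
```

With 2741 ham and 948 spam, a completely neutral feature scores about 0.257,
well under a 0.5 cutoff, so raw counts lean toward ham before any real
evidence is considered.  With equal training sets the same feature scores 0.5.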

OK, doing a 10-fold cross-validation run across 2000 random ham and 2000
random spam, but the same random sets for "before" and "after":

filename:     before       crm
ham:spam:  2000:2000 2000:2000
fp total:          1      1604
fp %:           0.05     80.20
fn total:          0         0
fn %:           0.00      0.00
unsure t:         20         0
unsure %:       0.50      0.00
real cost:    $14.00 $16040.00
best cost:     $2.00   $228.00
h mean:         0.55     53.54
h sdev:         4.50      5.30
s mean:        99.91     71.40
s sdev:         1.64      6.84
mean diff:     99.36     17.86
k:             16.18      1.47

Well, that was a disaster.  My guess:  since virtually all ham contains
strong spam words, 0.5 is a lousy value for spam_cutoff.
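
The "k" row in both tables is consistent with the gap between the spam and
ham means divided by the sum of the two standard deviations.  That reading is
inferred from the printed numbers, not taken from the test driver's source,
and k_stat is a hypothetical name.

```python
def k_stat(h_mean, h_sdev, s_mean, s_sdev):
    """Separation of the two score distributions, in units of summed sdevs."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# (h mean, h sdev, s mean, s sdev) as printed in the two tables above.
runs = {
    "base1":  (0.84, 6.21, 98.05, 9.10),
    "crm #1": (18.28, 6.17, 64.63, 23.53),
    "before": (0.55, 4.50, 99.91, 1.64),
    "crm #2": (53.54, 5.30, 71.40, 6.84),
}
for name, stats in runs.items():
    print(name, round(k_stat(*stats), 2))
```

A small k means the ham and spam score distributions overlap heavily, which
is why no single cutoff separates them well in the crm columns.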

For crm:

-> <stat> Ham scores for all runs: 2000 items; mean 53.54; sdev 5.30
-> <stat> min 24.6294; median 54.3869; max 74.4693
-> <stat> percentiles: 5% 43.7643; 25% 51.1123; 75% 56.8845; 95% 60.2207

-> <stat> Spam scores for all runs: 2000 items; mean 71.40; sdev 6.84
-> <stat> min 50; median 69.805; max 96.6838
-> <stat> percentiles: 5% 63.4695; 25% 66.6684; 75% 74.2597; 95% 86.1775

-> best cost for all runs: $228.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.61 & 0.695
->     fp 1; fn 17; unsure ham 63; unsure spam 942
->     fp rate 0.05%; fn rate 0.85%; unsure rate 25.1%
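
The $228.00 figure follows directly from the per-error costs listed above.
A short check, with total_cost as a hypothetical helper name:

```python
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20


def total_cost(fp, fn, unsure):
    """Cost of a run given counts of each kind of error."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


# 1 fp, 17 fn, and 63 + 942 unsure messages at the best-cost cutoffs.
best = total_cost(fp=1, fn=17, unsure=63 + 942)
print("$%.2f" % best)
```

That is $10 for the one false positive, $17 for the false negatives, and $201
for the 1005 unsure messages, so the unsure pile accounts for nearly all of
the cost.  The same function gives $65.20 for the base1 column of the first
table (5 fp, 2 fn, 66 unsure).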

The highest-scoring ham is one our unigram scheme would never call spam;
since 0.75 is very near the spam 75th-percentile score, you'd have to call
about 25% of all spam "unsure" to avoid calling this spam (and, indeed, the
automated histogram analysis found its best-cost value at an unsure rate of
25.1%):

"""
Data/Ham/Set4/128466.txt
prob = 0.744693151307
prob('*H*') = 59619
prob('*S*') = 173900

Received: from [80.17.80.215] (helo=veronika.quadrante.com)
        by mail.python.org with smtp (Exim 3.21 #1)
        id 16dCZB-0000Op-00
        for python-list@python.org; Tue, 19 Feb 2002 10:52:29 -0500
Received: (qmail 29664 invoked by uid 64014); 19 Feb 2002 16:14:13 -0000
Received: from abottoni@quadrante.com by veronika
        by uid 64011 with qmail-scanner-1.10 (uvscan: v4.1.40/v4121. .
        Clear:0. Processed in 0.341367 secs); 19 Feb 2002 16:14:13 -0000
Received: from unknown (HELO backup.quadrante.com) (80.17.80.210)
  by 80.17.80.215 with SMTP; 19 Feb 2002 16:14:13 -0000
Message-Id: <5.1.0.14.0.20020219163858.00a901a8@veronika.quadrante.com>
X-Sender: abottoni@veronika.quadrante.com
X-Mailer: QUALCOMM Windows Eudora Version 5.1
Date: Tue, 19 Feb 2002 16:56:25 +0100
To: python-list@python.org
From: Alessandro Bottoni <abottoni@quadrante.com>
Subject: Python-based "Portal System"?
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"; format=flowed
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.0.8 (101270)
Precedence: bulk
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=subscribe>
List-Id: General discussion list for the Python programming language
        <python-list.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>

Most likely, all of you know a number of open source, pre-built "portal
systems", like ezPublish (http://developer.ez.no), PHPNuke
(www.phpnuke.org), PostNuke (http://www.postnuke.com), Midgard
(http://www.midgard-project.org/) and so on.

Does anybody know if exists a Portal System like those, written in Python?

Thanks in advance

Alessandro Bottoni

PS: I know about Zope (http://www.zope.org) and WebWare
(http://webware.sourceforge.net), already...
"""

The 2nd-highest-scoring ham is due to our own Skip, and is at least as
mysterious:

"""
Data/Ham/Set2/146718.txt
prob = 0.693046527054
prob('*H*') = 80982
prob('*S*') = 182843

Received: from exim by mail.python.org with spamc (Exim 4.02)
        id 17DRZM-0005Xn-00
        for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400
Received: from 12-248-41-177.client.attbi.com ([12.248.41.177])
        by mail.python.org with esmtp (Exim 4.02)
        id 17DRZM-0005Xg-00
        for python-list@python.org; Thu, 30 May 2002 11:10:28 -0400
Received: (from skip@localhost)
        by 12-248-41-177.client.attbi.com (8.11.6/8.11.6) id g4UFAPD25155;
        Thu, 30 May 2002 10:10:25 -0500
From: Skip Montanaro <skip@pobox.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Message-ID: <15606.16608.916236.657101@12-248-41-177.client.attbi.com>
Date: Thu, 30 May 2002 10:10:24 -0500
To: "David LeBlanc" <whisper@oz.net>
Cc: "Jeff Shannon" <jeff@ccvcorp.com>, <python-list@python.org>
Subject: RE: Crashing IDLE
In-Reply-To: <GCEDKONBLEFPPADDJCOEAEHODFAA.whisper@oz.net>
References: <MPG.175f041fd9efc1f09896e3@news.nwlink.com>
        <GCEDKONBLEFPPADDJCOEAEHODFAA.whisper@oz.net>
X-Mailer: VM 6.96 under 21.4 (patch 6) "Common Lisp" XEmacs Lucid
Reply-To: skip@pobox.com
X-Spam-Status: No, hits=-4.4 required=5.0 tests=IN_REP_TO version=2.20
X-Spam-Level:
Sender: python-list-admin@python.org
Errors-To: python-list-admin@python.org
X-BeenThere: python-list@python.org
X-Mailman-Version: 2.0.11 (101270)
Precedence: bulk
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=subscribe>
List-Id: General discussion list for the Python programming language
        <python-list.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
        <mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>

    David> I would consider that a bug - "pass" should be checking for
    David> ctrl-c and other events imo. It sure strikes me as a point for
    David> relinquishing control.

It will relinquish control to another thread and sense KeyboardInterrupt.
If your app is not threaded though, Tk will never get control so it can
process its event queue.  That's what fills up.

--
Skip Montanaro (skip@pobox.com - http://www.mojam.com/)
Boycott Netflix - they spam - http://www.musi-cal.com/~skip/netflix.html

"""

Perhaps CRM114's one-byte count clamps are needed to prevent insane scores
(a form of bias acting against the extreme HOH correlation), or perhaps one
of the hash reductions mapped "control" to "big penis", or ... who knows?
If someone wants to pursue this (I've seen enough <wink>), it would be a lot
more interesting now to download CRM114 and run it the way the author
intended.

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: crm.patch
Type: application/octet-stream
Size: 11203 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021103/5cf1f41f/crm.exe

---------------------- multipart/mixed attachment--

From tim.one@comcast.net  Sun Nov  3 07:19:36 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 02:19:36 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKDCFAB.tim.one@comcast.net>

[TimS]
> Ok, so Tim says I'm not reading it backwards, Richie says I am...

That's because I was reading your question backwards.  Sorry!  You were
right the first time:  you were reading it backwards.


From tim.one@comcast.net  Sun Nov  3 07:31:39 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 02:31:39 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <1T985KFB9MK09SRD8PLJG1X05SNSOIG.3dc473c5@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEKFCFAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> Has there been any thought given to additional classifications,
> beyond ham|unsure|spam?

No; you could get "a score" with 17 decimal digits of precision, about 1 of
which is meaningful.

> Like, ham|probablyham|unsure|probablyspam|spam, with
> corresponding cutoffs specified in Options?  I don't know if
> that's interesting to anybody at all...
>
> I could see X-Spambayes-Classification: probablyspam being useful
> as a range of mail that should be checked manually...

That's what Unsure is for.  If you don't check Unsure msgs, you'll be sorry.
They split about half-and-half between ham and spam for me, and if the
system *could* have made a better jugmint about them, it would have.

If you do have the score, we've gotten mixed reports here about whether
sorting Unsure msgs by score is helpful.  I find that it is in my email, but
there are many exceptions (ham closer to high end of the Unsure range, and
spam closer to the low end).
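
A minimal sketch of the three-way split described above.  The function
names and the default cutoff values are assumptions made for illustration,
not values taken from any Spambayes front end:

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0.0, 1.0] to 'ham', 'unsure' or 'spam'.

    The default cutoffs are illustrative assumptions, not tuned values.
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0.0 and 1.0")
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


def sort_unsure(scored_msgs, ham_cutoff=0.20, spam_cutoff=0.90):
    """Return the Unsure (score, msg_id) pairs, highest score first."""
    unsure = [(score, msg_id) for score, msg_id in scored_msgs
              if classify(score, ham_cutoff, spam_cutoff) == "unsure"]
    return sorted(unsure, reverse=True)
```

Sorting only orders the review queue.  As noted above, ham can still sit
near the top of the Unsure range and spam near the bottom, so every Unsure
msg needs a look either way.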


From rob@hooft.net  Sun Nov  3 07:32:01 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 03 Nov 2002 08:32:01 +0100
Subject: [Spambayes] Spambayes Header Format
Message-ID: <3DC4D0F1.5000509@hooft.net>

Lots of discussion about the Spambayes header, but nobody takes any 
concrete initiatives. For me, the proposal...

    X-Spambayes-Classification: {Ham|Unsure|Spam}

...looks very good. But obviously this will break backward 
compatibility.  And since I'm only using hammie.py and procmail, I can 
only change those parts and test them. To make everything work together 
again we'd have to make a concerted effort. Better now than later?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim.one@comcast.net  Sun Nov  3 07:44:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 02:44:43 -0500
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <UOPL86TPRP091Z2Y31C8B77VTCAA9SP.3dc45d5b@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEKICFAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> I've *always* suspected that spambayes in combination with other
> technology would present a very powerful anti-spam arsenal.  But
> spambayes by itself is so good, that it may not really require
> supplemental technology.  I say *always* because I've only been in
> this game for a couple weeks... ;)  so what do I REALLY know?

I don't know what to do about opt-in advertising, apart from the obvious:
keep an eye out for it in your Spam folder, and train on it as Ham whenever
it shows up there.  This is effective.

Very brief msgs from rare correspondents seem also to be a problem, because
lots of spam is also very brief (believe it or not <wink>).

python.org has a very specific problem:  the various mailing lists have
*-request addresses, for administrivia.  Greg currently whitelists the snot
out of those recipients in SpamAssassin, else a significant percentage of
that traffic would be considered spam.  *This* code appears to be less
willing to call it spam than unfiddled SpamAssassin, but it's still the
major source of FPs in my python.org mail tests.  The kind of FP here has
the single word "unsubscribe" or "help" or "confirm 1534232" buried under
10KB of employer-generated HTML disclaimers, or is sent as a reply to a spam
or conference announcement the poster found objectionable, quoted in full.
Making things worse, "subscribe" and "unsubscribe" are themselves
high-spamprob words.

The FP rate is still very low even with that, but every non-trivial scheme
has non-zero error rates, and that has to be realized.


From tim.one@comcast.net  Sun Nov  3 07:52:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 02:52:18 -0500
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <15812.42195.737956.831082@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEKICFAB.tim.one@comcast.net>

[Skip Montanaro, on whitelists]
> I doubt it.  There is just too much email spoofing going on to trust any
> addresses that absolutely.  When using SA, I rarely used its whitelist
> facility, and only for odd email addresses whose automailings it always
> classified as spam.  For instance, I get a bit of mail from American
> Airlines letting me know when the airfare between Chicago and Albany
> changes.  As you might imagine, it's very spammy looking.

Yup, I get the same kind of thing from Expedia (what's up with that?  does
Microsoft own my soul too <wink>?), and it rated as solid spam before
training on it.  Now it's solid ham, in part because the specific routes it
always tells me about have become recognized as strong ham words.

> The only way I could convince SA to leave it alone was to whitelist it.
> With Spambayes, it's never a problem.

OTOH, I talk here about my sisters, and email from them is often brief, and
initially scored as Unsure.  I could whitelist them without problem.
They're not computer geeks, and one of them has never gotten spam:  she has
no web or mailing-list presence at all, and uses a small regional ISP.
Nobody is going to guess her address, unless they break into the ISP's
database.  If they do, then maybe a whitelist will dump a spam or two in my
Inbox.  BFD.

OTOH, after training on msgs from my sisters, my classifier also scores them
as ham now anyway.  I'm having more trouble when esmokes.com changes the
brands of cancer sticks on sale this week <wink>.


From jbublitz@nwinternet.com  Sun Nov  3 07:47:32 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Sat, 02 Nov 2002 23:47:32 -0800 (PST)
Subject: [Spambayes] Spam at hackers conference
In-Reply-To: <15812.42195.737956.831082@montanaro.dyndns.org>
Message-ID: <XFMail.021102234732.jbublitz@nwinternet.com>

On 03-Nov-02 Skip Montanaro wrote:
 
>> > - Combining different approaches (e.g. blacklists, whitelists,
>> > Bayesian) seems to make people more comfortable.
 
>> Tim> I doubt a blacklist is going to be worth the bother with
>> this scheme, but a whitelist may be.
 
> I doubt it.  There is just too much email spoofing going on to
> trust any addresses that absolutely.  When using SA, I rarely
> used its whitelist facility, and only for odd email addresses
> whose automailings it always classified as spam.  For instance,
> I get a bit of mail from American Airlines letting me know when
> the airfare between Chicago and Albany changes.  As you might
> imagine, it's very spammy looking.  The only way I could
> convince SA to leave it alone was to whitelist it.  With
> Spambayes, it's never a problem.

You may be correct that from a purely technical point of view
Spambayes doesn't really need a whitelist (although the fp rate is
still non-zero), but there are some other considerations.

From my personal point of view, I spend a lot of money to get
certain email sent to me, and missing some email could be very
costly ($10 could be off by orders of magnitude in the worst case).
For those reasons alone, I want a whitelist.

I also recently saw someplace (/.?) an article about a woman suing
an ISP who cut off her email because of non-payment. She's suing
because she missed an email from a potential employer for a
possible high paying job. If I were in a position similar to that
ISP (potential liability), I think "due diligence" would require
that I make every effort to make sure valid mail got through -
hence a more deterministic method in combination with a statistical
method (in combination with review in my case).

A convenient whitelist option seems to me to make it a more
attractive package. I'd want whitelisted mail to go into the
training database too.
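
A minimal sketch of a whitelist sitting in front of the statistical score,
with whitelisted mail also flagged for ham training.  The deliver()
function, its return values and the spam_cutoff default are hypothetical
names chosen for this example, not part of hammie or the proxy:

```python
from email.utils import parseaddr


def deliver(from_header, score, whitelist, spam_cutoff=0.90):
    """Return (destination, train_as_ham) for one message.

    whitelist is a set of lower-case addresses.  spam_cutoff is an
    illustrative assumption.
    """
    address = parseaddr(from_header)[1].lower()
    if address in whitelist:
        # The deterministic rule wins, and the msg is also marked so the
        # caller can add it to the ham training data.
        return "inbox", True
    if score > spam_cutoff:
        return "spam", False
    return "inbox", False
```

This trusts the From header, so it inherits the spoofing concern Skip
raised: anyone who forges a whitelisted address gets through.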

Jim


From tim.one@comcast.net  Sun Nov  3 08:12:00 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 03:12:00 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <15811.7287.50962.651569@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKKCFAB.tim.one@comcast.net>

[Jeremy Hylton, on the Folder parts of MarkH's training interface]
> ...
> This part of the code doesn't work that well for my mail folders.  The
> code to move messages from folder to folder needs to be written in
> elisp.  I'm not sure how important that is.

Whatever a general-purpose training class may look like, it seems to need
two concepts:  "a msg", and "a collection of msgs", the latter to remember,
e.g., which msgs have been trained as ham, and which as spam.  Mark views
collections as folders because that's actually how they're set up in the
Outlook client, but a "virtual folder" makes sense too.  In your case you
may have just two folders, Ham and Spam, which exist only in cyberspace, as
a way for the training class to keep track of the state of your training.
Mark's MoveTo() is then just a way to record the classification a msg should
have.

> ...
>             # It's important not to commit a transaction until
>             # after update_probabilities is called in update().
>             # Otherwise some new entries will cause scoring to fail.

I'm not sure what that's about, but I probably fixed it late last week
(Outlook has lots of threads, and it was possible there for scoring to occur
in parallel with training; WordInfo records are now created with the
unknown-word spamprob by default instead of with None, so that an attempt to
score a brand-new word is effectively ignored instead of raising an
exception).
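
A rough sketch of why a neutral default makes a brand-new word harmless.
The 0.5 value, the min_strength threshold and the helper names are
assumptions for illustration; this is not the classifier's actual code:

```python
UNKNOWN_WORD_PROB = 0.5  # assumed neutral probability for unseen words


class WordInfo:
    """Per-word training record, created with a neutral spamprob."""

    def __init__(self, spamprob=UNKNOWN_WORD_PROB):
        self.spamcount = 0
        self.hamcount = 0
        # A record that exists but has not been through a probability
        # update yet still has a usable (neutral) value, never None.
        self.spamprob = spamprob


def significant_probs(probs, min_strength=0.1):
    """Keep only probabilities far enough from 0.5 to count as clues."""
    return [p for p in probs if abs(p - 0.5) >= min_strength]
```

A scorer that filters clues by distance from 0.5 drops the new word on its
own, so a scoring thread that races ahead of training sees no exception
and no bias.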


From tim.one@comcast.net  Sun Nov  3 08:49:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 03:49:04 -0500
Subject: [Spambayes] Something to test
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKMCFAB.tim.one@comcast.net>

This little patch arranges to create "noheader:HEADERNAME" tokens for
headers in options.safe_headers that *don't* appear in a msg's headers.  On
my fat c.l.py test it's a small theoretical improvement:  best-cost falls
from $26.80 to $22.00, by knocking down the score of the second-worst
hopeless FP just enough so that redeeming it *could* be traded away for an
increase in the Unsure rate.  That's not realistic, though (the spam_cutoff
value needed to redeem that FP is no longer insane, but is still
*unreasonably* high).

I'm keener on it because it eliminated a few difficult FP without changing
cutoffs, in three smaller tests on different test data.  I haven't run a
test where it hurt yet, and it has helped several times.

This captures the useful (in my data) part of what Anthony's tokenization of
Reply-To accomplished, without needing to tokenize the Reply-To content (the
thing that helped me there was that tokenizing Reply-To inadvertently
generated a token for the *absence* of a Reply-To header, and that's a ham
clue in my data, provided that the classifier can see it; one effect of the
patch is to generate a "noheader:reply-to" token when no Reply-To is found
in the headers; other effects include that the lack of an Organization
header becomes a spam clue in my data; sometimes more than one of these
cooperate to help push a difficult case to "the right side" of a cutoff).
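
The idea can be sketched on its own, outside the tokenizer, using the
standard email package.  SAFE_HEADERS here is a small hypothetical stand-in
for options.safe_headers, and header_count_tokens() is a name invented for
this example:

```python
from email import message_from_string

# Hypothetical stand-in for options.safe_headers (lower-case names).
SAFE_HEADERS = frozenset(["reply-to", "organization", "x-mailer"])


def header_count_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield count tokens for safe headers, plus one token per absent one."""
    x2n = {}
    for name in msg.keys():
        if name.lower() in safe_headers:
            x2n[name] = x2n.get(name, 0) + 1
    for name, count in sorted(x2n.items()):
        yield "header:%s:%d" % (name, count)
    # The new part: a token for every safe header that never appeared.
    present = {name.lower() for name in x2n}
    for name in sorted(safe_headers - present):
        yield "noheader:" + name


msg = message_from_string(
    "From: someone@example.com\n"
    "X-Mailer: ExampleMailer 1.0\n"
    "Subject: hello\n"
    "\n"
    "body text\n"
)
tokens = list(header_count_tokens(msg))
```

For the sample msg, tokens holds one header: token for X-Mailer and
noheader: tokens for organization and reply-to.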


Index: tokenizer.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
retrieving revision 1.60
diff -c -r1.60 tokenizer.py
*** tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
--- tokenizer.py        3 Nov 2002 08:31:44 -0000
***************
*** 1178,1183 ****
--- 1178,1185 ----
                      x2n[x] = x2n.get(x, 0) + 1
          for x in x2n.items():
              yield "header:%s:%d" % x
+         for x in options.safe_headers - Set([k.lower() for k in x2n]):
+             yield "noheader:" + x

      def tokenize_body(self, msg, maxword=options.skip_max_word_size):
          """Generate a stream of tokens from an email Message.


From richie@entrian.com  Sun Nov  3 10:48:21 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 10:48:21 +0000
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <3DC4D0F1.5000509@hooft.net>
References: <3DC4D0F1.5000509@hooft.net>
Message-ID: <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com>

Hi Rob,

> Lots of discussion about the Spambayes header, but nobody takes any 
> concrete initiatives. For me, the proposal...
> 
>     X-Spambayes-Classification: {Ham|Unsure|Spam}
> 
> ...looks very good. But obviously this will break backward 
> compatibility.  And since I'm only using hammie.py and procmail, I can 
> only change those parts and test them. To make everything work together 
> again we'd have to make a concerted effort. Better now than later?

I agree.  I volunteer to make the edit, and to combine any duplicated code
("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two
places, for instance).

Let's give it a couple of days and see whether there are any violent
objections or better suggestions, then I'll make the edit.  Is that OK with
everyone?

-- 
Richie Hindle
richie@entrian.com


From mhammond@skippinet.com.au  Sun Nov  3 11:26:04 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 3 Nov 2002 22:26:04 +1100
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <3DC4D0F1.5000509@hooft.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKECBHIAA.mhammond@skippinet.com.au>

> Lots of discussion about the Spambayes header, but nobody takes any
> concrete initiatives. For me, the proposal...
>
>     X-Spambayes-Classification: {Ham|Unsure|Spam}

Something I find cute about "Yes", "No", "Unsure" is that it sorts naturally.

And-my-brain-even-processes-it-naturally-ly.

Mark.


From rob@hooft.net  Sun Nov  3 12:18:11 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 03 Nov 2002 13:18:11 +0100
Subject: [Spambayes] Spambayes Header Format
References: <3DC4D0F1.5000509@hooft.net>
	<3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com>
Message-ID: <3DC51403.6030208@hooft.net>

Richie Hindle wrote:
> Hi Rob,
> 
> 
>>Lots of discussion about the Spambayes header, but nobody takes any 
>>concrete initiatives. For me, the proposal...
>>
>>    X-Spambayes-Classification: {Ham|Unsure|Spam}
>>
>>...looks very good. But obviously this will break backward 
>>compatibility.  And since I'm only using hammie.py and procmail, I can 
>>only change those parts and test them. To make everything work together 
>>again we'd have to make a concerted effort. Better now than later?
> 
> 
> I agree.  I volunteer to make the edit, and to combine any duplicated code
> ("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two
> places, for instance).

Other todo items and ideas:
  - Make the "Yes, Unsure, No" items into Options, keeping the defaults
    the same as in the past for a few days.
  - Add the debugging info optionally as X-Spambayes-Info

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From Tim@mail.powweb.com  Sun Nov  3 12:47:42 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sun, 03 Nov 2002 06:47:42 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEKFCFAB.tim.one@comcast.net>
Message-ID: <TNQNJDICVUGF5Z8607KF3ZSN32KFOMSQ.3dc51aee@riven>

I agree.. it was a dumb idea.  Hopefully I've exhausted my quota of those... ;)

- TimS

11/3/2002 1:31:39 AM, Tim Peters <tim.one@comcast.net> wrote:

>[Tim@mail.powweb.com]
>> Has there been any thought given to additional classifications,
>> beyond ham|unsure|spam?
>
>No; you could get "a score" with 17 decimal digits of precision, about 1 of
>which is meaningful.
>
>> Like, ham|probablyham|unsure|probablyspam|spam, with
>> corresponding cutoffs specified in Options?  I don't know if
>> that's interesting to anybody at all...
>>
>> I could see X-Spambayes-Classification: probablyspam being useful
>> as a range of mail that should be checked manually...
>
>That's what Unsure is for.  If you don't check Unsure msgs, you'll be sorry.
>They split about half-and-half between ham and spam for me, and if the
>system *could* have made a better jugmint about them, it would have.
>
>If you do have the score, we've gotten mixed reports here about whether
>sorting Unsure msgs by score is helpful.  I find that it is in my email, but
>there are many exceptions (ham closer to high end of the Unsure range, and
>spam closer to the low end).
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From Tim@mail.powweb.com  Sun Nov  3 13:25:24 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sun, 03 Nov 2002 07:25:24 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEKKCFAB.tim.one@comcast.net>
Message-ID: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven>

I agree with the need for a general purpose training class that accepts a
single message or collection of messages.  In addition it should optionally
remember existing training, or create a new training database.

>a collection of msgs", the latter to remember,
>e.g., which msgs have been trained as ham, and which as spam.

Remembering is an interesting idea, but what real purpose does it serve aside from making testing easier?

- TimS


11/3/2002 2:12:00 AM, Tim Peters <tim.one@comcast.net> wrote:

>[Jeremy Hylton, on the Folder parts of MarkH's training interface]
>> ...
>> This part of the code doesn't work that well for my mail folders.  The
>> code to move messages from folder to folder needs to be written in
>> elisp.  I'm not sure how important that is.
>
>Whatever a general-purpose training class may look like, it seems to need
>two concepts:  "a msg", and "a collection of msgs", the latter to remember,
>e.g., which msgs have been trained as ham, and which as spam.  Mark views
>collections as folders because that's actually how they're set up in the
>Outlook client, but a "virtual folder" makes sense too.  In your case you
>may have just two folders, Ham and Spam, which exist only in cyberspace, as
>a way for the training class to keep track of the state of your training.
>Mark's MoveTo() is then just a way to record the classification a msg should
>have.
>
>> ...
>>             # It's important not to commit a transaction until
>>             # after update_probabilities is called in update().
>>             # Otherwise some new entries will cause scoring to fail.
>
>I'm not sure what that's about, but I probably fixed it late last week
>(Outlook has lots of threads, and it was possible there for scoring to occur
>in parallel with training; WordInfo records are now created with the
>unknown-word spamprob by default instead of with None, so that an attempt to
>score a brand-new word is effectively ignored instead of raising an
>exception).
>
- Tim
www.fourstonesExpressions.com 


From rob@hooft.net  Sun Nov  3 14:00:04 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 03 Nov 2002 15:00:04 +0100
Subject: [Spambayes] Spambayes Header Format
References: <3DC4D0F1.5000509@hooft.net>
	<3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> <3DC51403.6030208@hooft.net>
Message-ID: <3DC52BE4.6010602@hooft.net>

I wrote:

> Other todo items and ideas:
>  - Make the "Yes, Unsure, No" items into Options, keeping the defaults
>    the same as in the past for a few days.
>  - Add the debugging info optionally as X-Spambayes-Info

I just did some of this, and some other plans I had:

  * Added options "header_spam_string", "header_unsure_string",
    "header_ham_string". Defaults are set to "Yes", "Unsure", "No".
  * Added options header_score_digits and header_score_logarithm. The
    first is an integer telling hammie in how many digits it should show
    the score. If the second option is set to "True", scores of 1.00 or
    0.00 are augmented by a logarithmic "one-ness" or "zero-ness" score
    (basically it shows the "number of zeros" or "number of nines" next
    to the score value).
  * Added support for a debugging header using the boolean
    hammie_debug_header option and the string hammie_debug_header_name
  * Changed hammie.py to use all of the new options

Please note that I've tried to make this backward compatible where I 
thought that was essential (hope my thoughts are exhaustive). If the 
pop3 proxy and the outlook plugin are adapted to use the same options as 
hammie, we can change the defaults at any point without breaking the 
interaction (only procmail recipes and other external clients will need 
to be adapted).
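
A sketch of how the options listed above could combine into a single
header value.  The parameter names mirror the option names, but the
function itself, the cutoff defaults, the "disposition; score" layout and
the exact form of the logarithmic suffix are assumptions, not a copy of
what hammie.py does:

```python
import math


def format_disposition(score, ham_cutoff=0.20, spam_cutoff=0.90,
                       spam_string="Yes", unsure_string="Unsure",
                       ham_string="No", score_digits=2,
                       score_logarithm=False):
    """Build a header value such as 'Yes; 0.97' from a spam score."""
    if score > spam_cutoff:
        disp = spam_string
    elif score < ham_cutoff:
        disp = ham_string
    else:
        disp = unsure_string
    text = "%s; %.*f" % (disp, score_digits, score)
    if score_logarithm:
        shown = round(score, score_digits)
        if shown == 1.0 and score < 1.0:
            # Roughly the number of leading nines in the score.
            text += " (%d)" % int(-math.log10(1.0 - score))
        elif shown == 0.0 and score > 0.0:
            # Roughly the number of leading zeros in the score.
            text += " (%d)" % int(-math.log10(score))
    return text
```

Keeping the three strings as parameters means the defaults can later move
from Yes/Unsure/No to Spam/Unsure/Ham in one place, which is the point of
sharing the options between hammie, the proxy and the plugin.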

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tdickenson@devmail.geminidataloggers.co.uk  Sun Nov  3 15:27:08 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Sun, 3 Nov 2002 15:27:08 +0000
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
Message-ID: <200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk>

> Forwarding to ham@ and spam@ would
> be a bit of a pain at first, but it would work for existing bodies of mail.
>  Training would be MUCH simpler with this method, and would not require
> some fancy-schmancy installation or configuration glorp.

Forwarding to spam@ or ham@ has some disadvantages because the forwarding
process destroys some information. Most mail clients don't forward headers.


From richie@entrian.com  Sun Nov  3 16:41:59 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 16:41:59 +0000
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk>
References: <5272PJ65XVL62EDJF62XSFB2WBA091X.3dc30805@riven>
	<200211031527.08809.tdickenson@devmail.geminidataloggers.co.uk>
Message-ID: <hdkasuojmfn1k762sa55frabququkv0a8r@4ax.com>

Hi Toby,

> Forwarding to spam@ or ham@ has some disadvantages because the forwarding 
> process destroys some information. Most mail clients don't forward headers.

The inbound part (pop3proxy, hammie, the Outlook stuff, whatever) could
cache the messages, then the SMTP proxy could compare the forwarded
messages with the cache (somehow - there'd be no Message-Id to compare) to
find the original to train against.

You're right - losing headers will make a difference, even with the fairly
minimal header tokenising we currently do.  When I added the Unsure
classification to pop3proxy, I tested it by forwarding a bunch of spams to
myself and they all came out Unsure where they had been Yes before - at
first I thought it was a bug, but then a couple of genuine spams rolled in
and were classified correctly.
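
One possible reading of "somehow", as a sketch only: fingerprint the body
text and look the forwarded copy up in the cache.  Everything here
(body_fingerprint, MessageCache, the normalisation rules) is hypothetical,
and a real forwarded msg may carry an added preamble or signature that
defeats an exact match:

```python
import hashlib
import re


def body_fingerprint(body):
    """Hash a body after dropping quote markers, whitespace and case."""
    lines = []
    for line in body.splitlines():
        # Strip leading "> " quoting added by the forwarding client.
        lines.append(re.sub(r"^(\s*>)+\s?", "", line))
    squeezed = re.sub(r"\s+", "", "".join(lines)).lower()
    return hashlib.sha256(squeezed.encode("utf-8")).hexdigest()


class MessageCache:
    """Remember full inbound msgs, keyed by a fingerprint of the body."""

    def __init__(self):
        self._by_fingerprint = {}

    def remember(self, raw_message, body):
        self._by_fingerprint[body_fingerprint(body)] = raw_message

    def find_original(self, forwarded_body):
        """Return the cached original msg, or None if nothing matches."""
        return self._by_fingerprint.get(body_fingerprint(forwarded_body))
```

Training would then run against the cached original, headers included,
rather than against the stripped-down forwarded copy.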

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sun Nov  3 16:42:14 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 16:42:14 +0000
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <3DC49421.8EAA85FE@whidbey.com>
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
	<92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
	<jal8su81627i476j0e0ibs28u5j17djf7b@4ax.com> <3DC4676B.F5ED02CE@whidbey.com>
	<ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com> <3DC49421.8EAA85FE@whidbey.com>
Message-ID: <cnhasuklddqb8d5ult2jeieh2t8j8gp6ol@4ax.com>

Hi Van,

> As to the final choice of the name, the image of a stern black-robed jurist
> behind a high podium is a lot more appealing to me than an entomologist with a
> magnifier or a librarian choosing a Dewey Decimal System code for a book. So I
> vote for Judgement over Classification.

I think 'judgement' is the better word too, but I also think the risk of
seeming to have mis-spelled it outweighs the benefits.  Plus, the word
'classify' is already strongly associated with what we're doing.

-- 
Richie Hindle
richie@entrian.com


From popiel@wolfskeep.com  Sun Nov  3 17:01:15 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 03 Nov 2002 09:01:15 -0800
Subject: [Spambayes] Spambayes Header Format 
In-Reply-To: Message from Richie Hindle <richie@entrian.com> 
	<3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> 
References: <3DC4D0F1.5000509@hooft.net>
	<3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> 
Message-ID: <20021103170116.12994F57D@cashew.wolfskeep.com>

In message:  <3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com>
             Richie Hindle <richie@entrian.com> writes:
>Hi Rob,
>
>> Lots of discussion about the Spambayes header, but nobody takes any 
>> concrete initiatives. For me, the proposal...
>> 
>>     X-Spambayes-Classification: {Ham|Unsure|Spam}
>> 
>> ...looks very good. But obviously this will break backward 
>> compatibility.  And since I'm only using hammie.py and procmail, I can 
>> only change those parts and test them. To make everything work together 
>> again we'd have to make a concerted effort. Better now than later?
>
>I agree.  I volunteer to make the edit, and to combine any duplicated code
>("if prob > spam_cutoff: disp = 'Yes'" currently appears in at least two
>places, for instance).
>
>Let's give it a couple of days and see whether there are any violent
>objections or better suggestions, then I'll make the edit.  Is that OK with
>everyone?

Sounds good to me.  I was going to second this header proposal anyway,
once I got through the stack of mail waiting for me this morning...

- Alex

From popiel@wolfskeep.com  Sun Nov  3 17:14:08 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 03 Nov 2002 09:14:08 -0800
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: Message from Tim@mail.powweb.com, Stone@mail.powweb.com, Four
	Stones Expressions <tim@fourstonesExpressions.com> 
	<3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> 
References: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven> 
Message-ID: <20021103171408.3A986F57D@cashew.wolfskeep.com>

In message:  <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven>
             Tim@mail.powweb.com writes:
>
>Remembering (training) is an interesting idea, but what real purpose
>does it serve aside from making testing easier?

Remembering helps in the following scenario:

Mail is trained on as it is received, reinforcing whatever judgement
spambayes already made.  Then, if a mistake is made, the mistaken
message is untrained from the remembered category and trained into
the new category.

Remembering the training type in association with the message
itself (instead of inferring it from what folder it's in or some
such) makes it simpler to implement incremental training along
these lines.  Heck, it helps even if the training isn't automatic,
because it keeps you from having to train from scratch any time
a training error is discovered.
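
A minimal sketch of that scenario.  The Trainer class and its plain
per-category word counts are stand-ins invented for this example, not the
project's classifier:

```python
class Trainer:
    """Track how each msg was trained so a mistake can be reversed."""

    def __init__(self):
        self.word_counts = {"ham": {}, "spam": {}}
        self.trained_as = {}  # msg_id -> "ham" or "spam"
        self._tokens = {}     # msg_id -> tokens used when it was trained

    def _adjust(self, category, tokens, delta):
        counts = self.word_counts[category]
        for tok in set(tokens):
            new = counts.get(tok, 0) + delta
            if new:
                counts[tok] = new
            else:
                counts.pop(tok, None)

    def train(self, msg_id, tokens, category):
        """Train msg_id as category, untraining any earlier category."""
        if category not in self.word_counts:
            raise ValueError("category must be 'ham' or 'spam'")
        previous = self.trained_as.get(msg_id)
        if previous == category:
            return  # already trained this way, nothing to do
        if previous is not None:
            # Undo the remembered training before applying the new one.
            self._adjust(previous, self._tokens[msg_id], -1)
        self._adjust(category, tokens, +1)
        self.trained_as[msg_id] = category
        self._tokens[msg_id] = list(tokens)
```

Correcting a mistake is then one call, train(msg_id, tokens, "spam"), with
no need to rebuild the database from scratch.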

- Alex

From popiel@wolfskeep.com  Sun Nov  3 17:25:11 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 03 Nov 2002 09:25:11 -0800
Subject: [Spambayes] An alternate use 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15811.56800.291193.255713@montanaro.dyndns.org> 
References: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
	<20021102062939.3CD89F5AC@cashew.wolfskeep.com>
	<15811.56800.291193.255713@montanaro.dyndns.org> 
Message-ID: <20021103172511.6984AF57D@cashew.wolfskeep.com>

In message:  <15811.56800.291193.255713@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>I was using a spoof-proof mechanism from procmail before I disabled
>SpamAssassin.  I inserted my own header using formail:
>
>    :0H
>    * ! ^X-SA-Host:
>    {
>      :0fw
>      | spamc | $FORMAIL -a "X-SA-Host: `hostname --fqdn`"
>    }
>
>which says, "if there is no X-SA-Host header present, run spamc, add a
>header and include the fully qualified hostname".  If an X-SA-Host header is
>present it tells me spamc had already been run on this message (I was
>running SA on two different machines at the time).  That way I wasn't
>relying on SA's own headers to decide whether or not to run it.

This is not spoof-proof; it's merely relying on no one else inserting
an X-SA-Host header.  If any mail comes in with that header already
on it, you don't run SpamAssassin.  Even if you made the rule pay
attention to the hostname in the header, there's nothing preventing
someone from inserting a header with the right hostname.

The two obvious methods for making it reasonably spoof-proof are
comparing with routing information (and making sure that your
mail daemon (and all the upstream mail daemons that you trust)
reject mail from hosts that lie about their identity), or putting
a cryptographic signature in the header (signing the body + whatever
classification headers you're trusting because of the signature).
Verifying either of these methods is beyond the abilities of most
end-user filters.
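
For illustration, the signature variant could look roughly like the
sketch below.  It assumes a secret key shared only by the machines
doing the filtering, and uses an HMAC over the classification string
plus the body.  The function names and the idea of carrying the hex
digest in a header are assumptions for the example, not anything
SpamAssassin or spambayes provides:

```python
import hashlib
import hmac


def sign(key: bytes, body: bytes, classification: str) -> str:
    """Return a hex MAC covering the classification and the body."""
    mac = hmac.new(key, digestmod=hashlib.sha256)
    mac.update(classification.encode("utf-8"))
    mac.update(b"\n")  # separator so the two fields cannot run together
    mac.update(body)
    return mac.hexdigest()


def verify(key: bytes, body: bytes, classification: str, signature: str) -> bool:
    """Check a signature taken from a (hypothetical) header value."""
    expected = sign(key, body, classification)
    # compare_digest avoids leaking where the strings first differ.
    return hmac.compare_digest(expected, signature)
```

A sender who doesn't know the key can still insert the header, but
can't produce a value that verifies.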

- Alex

From tim.one@comcast.net  Sun Nov  3 18:28:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 13:28:18 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <hdkasuojmfn1k762sa55frabququkv0a8r@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEMHCFAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> You're right - losing headers will make a difference, even with the fairly
> minimal header tokenising we currently do.  When I added the Unsure
> classification to pop3proxy, I tested it by forwarding a bunch of spams to
> myself and they all came out Unsure where they had been Yes before - at
> first I thought it was a bug, but then a couple of genuine spams rolled in
> and were classified correctly.

There's indeed a *lot* of info in the headers we look at by default.  About
a full day of work went into deciding on each one of those, and finding the
most helpful way to tokenize each.  Alas, most of that work went into
discovering which headers didn't improve results, or gave great results for
bogus reasons.  OTOH, at the start we didn't look at headers *at all* in
this project (it took a long time to sort out the problems with headers in
mixed-source corpora), so we worked harder than other projects at tokenizing
the body in effective ways too.

Here's the tokenization generator:

    def tokenize(self, obj):
        msg = self.get_message(obj)

        for tok in self.tokenize_headers(msg):
            yield tok
        for tok in self.tokenize_body(msg):
            yield tok

If we comment out either loop, the classifier will see only the headers or
only the body.
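
A self-contained sketch of the same experiment, with switches instead
of commented-out loops.  ToyTokenizer and its very naive splitting
rules are made up for illustration; the real tokenizer does far more
work on both headers and bodies:

```python
from email import message_from_string


class ToyTokenizer:
    """Hypothetical tokenizer with header and body passes that can be disabled."""

    def __init__(self, use_headers=True, use_body=True):
        self.use_headers = use_headers
        self.use_body = use_body

    def tokenize_headers(self, msg):
        # Prefix each word with its header name so it stays distinct.
        for name, value in msg.items():
            for word in value.split():
                yield "%s:%s" % (name.lower(), word.lower())

    def tokenize_body(self, msg):
        payload = msg.get_payload()
        if isinstance(payload, str):  # ignore multipart in this sketch
            for word in payload.split():
                yield word.lower()

    def tokenize(self, text):
        msg = message_from_string(text)
        if self.use_headers:
            yield from self.tokenize_headers(msg)
        if self.use_body:
            yield from self.tokenize_body(msg)
```

For example, list(ToyTokenizer(use_body=False).tokenize(text)) yields
only the header-derived tokens for the message in text.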

Here are results from doing that, on the same randomized set of 2000 ham +
2000 spam from my c.l.py test, with ham_cutoff=0.2 and spam_cutoff=0.8, and
also using the "generate tokens for the absence of key header lines too"
patch I posted in the wee hours.  "before" is looking at both hdrs and body,
"hdr" looking only at headers (no bodies), and "body" looking only at bodies
(no headers):

filename:      before        hdr       body
ham:spam:   2000:2000  2000:2000  2000:2000
fp total:           1          0          5
fp %:            0.05       0.00       0.25
fn total:           0          0          1
fn %:            0.00       0.00       0.05
unsure t:          20         29         62
unsure %:        0.50       0.72       1.55
real cost:     $14.00      $5.80     $63.40
best cost:      $2.00      $1.60     $10.40
h mean:          0.55       0.66       1.68
h sdev:          4.50       3.46       8.02
s mean:         99.91      99.40      99.56
s sdev:          1.64       3.46       4.46
mean diff:      99.36      98.74      97.88
k:              16.18      14.27       7.84

A higher spam_cutoff would have helped the body column a lot, but it's clear
we're getting an enormous amount of useful info out of the handful of header
lines we look at by default; indeed, the hdr column is marginally better
than the before column!
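
For anyone checking the "real cost" row by hand: the numbers are
consistent with charging $10 per FP, $1 per FN and $0.20 per Unsure.
Those weights are inferred from the table rather than stated in this
message, so treat them as an assumption.  A small sketch:

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Total cost in dollars; the default weights are inferred, not quoted."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# Rows from the 2000:2000 table as (fp, fn, unsure).
runs = {"before": (1, 0, 20), "hdr": (0, 0, 29), "body": (5, 1, 62)}
for name, (fp, fn, unsure) in runs.items():
    print("%-7s $%.2f" % (name, real_cost(fp, fn, unsure)))
```

With these weights a single FP costs as much as 50 Unsures, which is
why the body column looks so much worse than the other two.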

In the body column, the FN was one of those brief "Paul, it was great to see
you today.  The proposal will be ready tomorrow.  Heidi." spams.  The only
real spam clues in those are in the headers.  The FP are harder to
characterize, a mix of conference announcements, one-liner "unsubscribe"
thingies, and thoroughly off-topic posts.  By default they get redeemed
because the headers contain clues that they came from a real person, and
weren't posted using spammer software that leaves behind strange
capitalization (BTW, "MiME-Version:", with the lowercase i, turned out to be
one of the highest-spamprob words in my personal email classifier too -- wasn't
unique to BruceG's spam).

Using twice as much test data makes a mildly interesting point:

filename:      before        hdr       body
ham:spam:   4000:4000  4000:4000  4000:4000
fp total:           1          0          4
fp %:            0.03       0.00       0.10
fn total:           0          0          1
fn %:            0.00       0.00       0.03
unsure t:          28         71        114
unsure %:        0.35       0.89       1.43
real cost:     $15.60     $14.20     $63.80
best cost:      $2.40      $3.80     $20.00
h mean:          0.36       0.63       1.44
h sdev:          3.28       3.68       6.89
s mean:         99.93      99.44      99.64
s sdev:          1.42       3.40       4.07
mean diff:      99.57      98.81      98.20
k:              21.19      13.96       8.96

The h and s means & sdevs in the hdr column barely budge, but in the body
column obviously "improve".  That suggests there's more variability in the
bodies (than in the headers) of both ham and spam.

Bottom line:  the header info is vital in this scheme for best results, but
you could get a useful classifier out of headers alone or bodies alone!


From tim.one@comcast.net  Sun Nov  3 18:51:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 13:51:48 -0500
Subject: [Spambayes] An alternate use
In-Reply-To: <3DC40090.5050109@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEMJCFAB.tim.one@comcast.net>

[Tim]
> That's actually what started this project:  Barry Warsaw is GNU
> Mailman's author, and he asked me to look into adapting Graham's
> scheme for incorporation into Mailman. ...

[Rob Hooft]
> So, we'd have to make mailing lists keep a spam-archive as well? Or do
> we deliver spambayes with a pre-cooked spam archive to get started with
> new mailing lists?

That will remain unclear until someone sets up relevant experiments and
people measure results.  I'm counting on Barry to drive that.  Seeding a
mailing-list classifier with ham may also be a puzzle.  I suspect, but don't
know, that training several times on the initial list introduction post will
do well at that -- most lists have "a topic" <wink>, and a good list intro
is bound to mention many words characteristic of that topic.

For python.org use, I expect we'll share a single spam corpus across all
non-personal email carried by that site.

One of the reasons I keep the default header analysis as
platform-independent as I can is so that it won't be a nightmare to *try* to
share spam stats.  I haven't tried to do this, though.

A hint of potential:  where w is the WordInfo dict from my fat c.l.py test:

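"""
import pickle

# Keep only words with a high spamprob that were seen at least 10 times.
d = {}
for k, r in w.items():
    if r.spamprob > 0.95 and r.spamcount + r.hamcount >= 10:
        d[k] = r

# Save the reduced dict using binary pickle protocol 1.
with open('reduced.pik', 'wb') as f:
    pickle.dump(d, f, 1)
"""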

Of the 327,439 words in the full dict, 10,559 pass that rather demanding
test for "strong spamness" (high spamprob and not close to being a hapax).
Seeding a classifier with those *may* work well, although the probabilities
will get recomputed in the new classifier, and it's unclear (to me) how to
fiddle the spamcounts and hamcounts in the inherited words so that they
don't dominate the first year of a list's life.


From tim.one@comcast.net  Sun Nov  3 19:27:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 03 Nov 2002 14:27:08 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <20021101003712.GA28132@rmunnlfs>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEMNCFAB.tim.one@comcast.net>

[Robin Munn]
> ...
> something that could integrate into, say, Outlook Express and add
> a "Block this junk mail" button (which adds the message to the spam
> corpus) to the E-mail reading interface.

AFAIK, Outlook Express has no hooks at all for programmers -- it's a closed
end-user app (as opposed to Outlook, which is highly programmable).  OTOH,
OE's file format is relatively easy to reverse-engineer (again unlike
Outlook's), which gives some hope for a separate process to watch what the
user does indirectly.  I doubt there's any way to filter incoming mail in OE
short of having the user (1) redirect to a pop3 proxy, and (2) set up an OE
rule to look for a header injected by the proxy.

> ...
> 1. It must integrate into the user's email client as seamlessly as
> possible. This means researching the plugin API of Outlook, Eudora,
> Pegasus Mail, Mozilla, et al.

If you're interested in the masses, you could make life easier by
restricting this to clients actually used by the masses <wink>.

> 2. The algorithm and filtering component must also run in the background
> without any user intervention required after the initial install. This
> means being able to install as a Windows NT service or into the StartUp
> folder of Windows 9x.

The current Outlook 2000 client runs as an in-process server, via COM.  That
means it starts up automatically whenever the user starts Outlook, and
closes itself down when the user quits Outlook.  IOW, services and startup
groups aren't required for Outlook integration.  They may be for OE, but
nobody here has shown a sign of looking at an OE approach (apart from
Richie Hindle, who had OE in mind when he wrote pop3proxy -- although this
may be news to him <wink>).

> 3. There *MUST* be good documentation. We all know the user is going to
> run the installer program before reading the documentation, but we must
> include a "How to train your filter to recognize junk mail" document
> that the installer displays after finishing installation. This means
> actually writing said documentation. :-)

OTOH, the masses don't read docs.  In a previous life I worked at a company
doing commercial software "for the masses", and doing usability testing for
mass use is extremely time-consuming and expensive.  The masses don't see
what techies see when they look at a UI, they don't read docs, and they do
very surprising things.  One of my sisters suffered a network outage at
work, and, in frustration, picked up her keyboard and slammed it on her
desk.  The network happened to come back up again then.  I won't say any
more, apart from noting that she has an ongoing problem with broken
keyboards <wink>.


From richie@entrian.com  Sun Nov  3 19:35:56 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 19:35:56 +0000
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEMHCFAB.tim.one@comcast.net>
References: <hdkasuojmfn1k762sa55frabququkv0a8r@4ax.com>
	<LNBBLJKPBEHFEDALKOLCKEMHCFAB.tim.one@comcast.net>
Message-ID: <a7uasu48krhqntat1ea4bnd5tgac582ehf@4ax.com>

Hi Tim,

> There's indeed a *lot* of info in the headers we look at by default.

[Snip very interesting experiment]

> Bottom line:  the header info is vital in this scheme for best results, but
> you could get a useful classifier out of headers alone or bodies alone!

That last fact could be very useful, but I'm not sure I know how yet. 8-)

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sun Nov  3 19:44:08 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 03 Nov 2002 19:44:08 +0000
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEMNCFAB.tim.one@comcast.net>
References: <20021101003712.GA28132@rmunnlfs>
	<LNBBLJKPBEHFEDALKOLCAEMNCFAB.tim.one@comcast.net>
Message-ID: <gvuasu08h4eukdvba187ulhf052f996bvf@4ax.com>


[Tim]
> Richie Hindle, who had OE in mind when he wrote pop3proxy -- although this
> may be news to him <wink>

You and your time machine are quite right.

> the masses don't read docs.

I've yet to test this theory, but this is one reason I'd like to use HTML
as the 'GUI toolkit' for the UI of the POP3 proxy.  The docs can be tied so
closely to the UI that people won't even realise they're reading them...

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Sun Nov  3 21:25:39 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 3 Nov 2002 15:25:39 -0600
Subject: [Spambayes] An alternate use 
In-Reply-To: <20021103172511.6984AF57D@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEEKCFAB.tim.one@comcast.net>
        <20021102062939.3CD89F5AC@cashew.wolfskeep.com>
        <15811.56800.291193.255713@montanaro.dyndns.org>
        <20021103172511.6984AF57D@cashew.wolfskeep.com>
Message-ID: <15813.37971.977432.564454@montanaro.dyndns.org>


    >> If an X-SA-Host header is present it tells me spamc had already been
    >> run on this message ...

    Alex> This is not spoof-proof; it's merely relying on no one else
    Alex> inserting an X-SA-Host header.

Well, yeah, but I think the odds of some spammer deciding to crack into my
mailbox by inserting X-SA-Host are slim.  On the other hand, spammers are
clearly already faking SpamAssassin headers.

Skip

From Tim@mail.powweb.com  Sun Nov  3 22:37:43 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sun, 03 Nov 2002 16:37:43 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <hdkasuojmfn1k762sa55frabququkv0a8r@4ax.com>
Message-ID: <LGYS1UGCUS41YPNXSHC2Z21LJA6NK51.3dc5a537@riven>

Yeah, forward generally loses headers... My mailer has a redirect function, which sends the entire thing, headers and all... much better for this kind of thing.

So this leaves us back at the question of training a database with mailers that do not provide for the export of mail into file system artifacts.  Most mailers
have only a forward function, which lops off most of the headers... the smtpproxy could use the mail cached by the pop3proxy, assuming it is running... which
makes me believe that perhaps the pop3proxy and smtpproxy should be different threads in the same process.  That way, users don't have to have two
processes running, and the two sides of the equation can more easily keep themselves in sync.

- TimS

11/3/2002 10:41:59 AM, Richie Hindle <richie@entrian.com> wrote:

>Hi Toby,
>
>> Forwarding to spam@ or ham@ has some disadvantages because the forwarding 
>> process destroys some information. Most mail clients dont forward headers. 
>
>The inbound part (pop3proxy, hammie, the Outlook stuff, whatever) could
>cache the messages, then the SMTP proxy could compare the forwarded
>messages with the cache (somehow - there'd be no Message-Id to compare) to
>find the original to train against.
>
>You're right - losing headers will make a difference, even with the fairly
>minimal header tokenising we currently do.  When I added the Unsure
>classification to pop3proxy, I tested it by forwarding a bunch of spams to
>myself and they all came out Unsure where they had been Yes before - at
>first I thought it was a bug, but then a couple of genuine spams rolled in
>and were classified correctly.
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From skip@pobox.com  Sun Nov  3 22:56:29 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sun, 3 Nov 2002 16:56:29 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
        <3DC447AB.D64D6CE6@whidbey.com>
        <3DC4676B.F5ED02CE@whidbey.com>
        <ibq8susf6iep85oa89viu4l2os3in5gbpt@4ax.com>
Message-ID: <15813.43421.554220.439152@montanaro.dyndns.org>


    Richie> Or maybe the *real* question is, shall we call the header
    Richie> X-Spambayes-Classification?

I suggested X-Ham-Status.  I believe "ham" and "status" are spelled the same
in most dialects of English.

Skip

From piersh@friskit.com  Sun Nov  3 23:21:33 2002
From: piersh@friskit.com (Piers Haken)
Date: Sun, 3 Nov 2002 15:21:33 -0800
Subject: [Spambayes] Email client integration -- what's needed?
Message-ID: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com>

it would seem to me that instead of trying to cram these message store
capabilities into a protocol that just doesn't support it (POP3), why
not use a protocol that does (IMAP4)?

i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that
retrieves messages from a 'real' mail server via POP3, scans the
incoming messages, classifies them, then puts them in the corresponding
folders. the IMAP server can then reclassify/retrain on the messages
when they are moved between folders (just as the outlook plugin does).

piers.

From seant@iname.com  Mon Nov  4 01:00:35 2002
From: seant@iname.com (Sean True)
Date: Sun, 3 Nov 2002 20:00:35 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com>
Message-ID: <MJEHLHJKGINLONDMMKNEIEKNHFAA.seant@iname.com>

> it would seem to me that instead of trying to cram these message store
> capabilities into a protocol that just doesn't support it (POP3), why
> not use a protocol that does (IMAP4)?
>
> i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that
> retrieves messages from a 'real' mail server via POP3, scans the
> incoming messages, classifies them, then puts them in the corresponding
> folders. the IMAP server can then reclassify/retrain on the messages
> when they are moved between folders (just as the outlook plugin does).
>

I like this idea, with one significant reservation -- I'm now trusting an
external IMAP mail store with my mail, all 2 GB of it. That makes me queasy,
a little, and underlines the problematic of implementing a mail store,
instead of just an MTA.

-- Sean


From anthony@interlink.com.au  Mon Nov  4 01:04:07 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 12:04:07 +1100
Subject: [Spambayes] Email client integration -- what's needed? 
In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297504@zeus.sfhq.friskit.com> 
Message-ID: <200211040104.gA4147M06387@localhost.localdomain>


>>> "Piers Haken" wrote
> it would seem to me that instead of trying to cram these message store
> capabilities into a protocol that just doesn't support it (POP3), why
> not use a protocol that does (IMAP4)?
> 
> i'd suggest writing a simple IMAP 'proxy' (possibly single-user) that
> retrieves messages from a 'real' mail server via POP3, scans the
> incoming messages, classifies them, then puts them in the corresponding
> folders. the IMAP server can then reclassify/retrain on the messages
> when they are moved between folders (just as the outlook plugin does).

The problems with this approach are to do with the complexity of IMAP.
It's a heavyweight protocol, and lots of different IMAP clients each
abuse it in slightly different ways.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From piersh@friskit.com  Mon Nov  4 02:09:51 2002
From: piersh@friskit.com (Piers Haken)
Date: Sun, 3 Nov 2002 18:09:51 -0800
Subject: [Spambayes] Email client integration -- what's needed?
Message-ID: <9891913C5BFE87429D71E37F08210CB9297505@zeus.sfhq.friskit.com>

Might it be possible, in the case where you're already using an IMAP
message store, to write an IMAP client which connects to that store with
the express purpose of filtering the email? It could somehow watch for
messages incoming and being moved between folders and perform the
filtering/retraining based on those events.

I just see problems trying to use POP3, which is essentially a message
transfer protocol, to perform functions which should be applied to a
message store. The alternative is to write handlers for all the
different kinds of message stores...

Piers.

From Tim@mail.powweb.com  Mon Nov  4 03:04:44 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sun, 03 Nov 2002 21:04:44 -0600
Subject: [Spambayes] Need some clarification on the training database
Message-ID: <8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc@riven>

Ok, so help me out... neiltrain does its thing then writes to a cdb.  But hammie appears to expect that the training database be a pickle.  It's not adding up, and 
when I start the pop3proxy pointing at the database I trained using the code I lifted from neiltrain.py, it pukes (of course)... 

- TimS


From Tim@mail.powweb.com  Mon Nov  4 03:51:45 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Sun, 03 Nov 2002 21:51:45 -0600
Subject: [Spambayes] Need some clarification on the training database
In-Reply-To: <8285NLPL5YTTQJGXTAXU3WA8OB2.3dc5e3cc@riven>
Message-ID: <QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1@riven>

Never mind... I figured it out.  Ok, so I have the pop3proxy looking at the training database that my smtpproxy creates when I send it spam and ham via 
redirect, which preserves headers in my mailer.  Unfortunately, it also adds my headers, so I'm training spambayes to recognize mail from myself to myself as 
spam or ham, depending on how many of each I send... ;)

But this isn't a good general solution, because most people use mailers that only offer forward, which may strip off many of the original headers.  It would be a 
simple thing to make the pop3proxy store incoming mails, but this is kinda silly because then the mail may be being stored in three places: the server, the 
mailer, and the pop3proxy.  Another solution might be for the proxy to package up the headers and add them to the mail as an attachment, that the smtpproxy 
could recognize...

All that to say "I'll live to fight tomorrow"

- TimS

- Tim
www.fourstonesExpressions.com 


From B-Morgan@concentric.net  Mon Nov  4 05:17:58 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Sun, 3 Nov 2002 22:17:58 -0700
Subject: [Spambayes] Need some clarification on the training database
In-Reply-To: <QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1@riven>
Message-ID: <NABBJOOEOFODEALNMJAJCEDPHBAA.B-Morgan@concentric.net>

> But this isn't a good general solution, because most people use mailers
> that only offer forward, which may strip off many of the original
> headers.  It would be a simple thing to make the pop3proxy store incoming
> mails, but this is kinda silly because then the mail may be being stored
> in three places: the server, the mailer, and the pop3proxy.  Another
> solution might be for the proxy to package up the headers and add them to
> the mail as an attachment, that the smtpproxy could recognize...

The copy that the pop3proxy keeps doesn't need to be kept long term.  Only
long enough to be told it got something wrong.  After a while, it can clear
its cache under the assumption that no news is good news.
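
A rough sketch of such a short-lived cache, assuming messages are keyed
by some identifier the proxy assigns.  ExpiringCache, its method names,
and the injectable clock are all hypothetical; nothing here is taken
from pop3proxy or popfile:

```python
import time


class ExpiringCache:
    """Hypothetical message cache that forgets entries after max_age seconds."""

    def __init__(self, max_age_seconds, clock=time.time):
        self.max_age = max_age_seconds
        self.clock = clock  # injectable so tests need not sleep
        self._items = {}    # key -> (time stored, message)

    def store(self, key, message):
        self._items[key] = (self.clock(), message)

    def expire(self):
        # No news is good news: drop anything older than max_age.
        cutoff = self.clock() - self.max_age
        stale = [k for k, (stored_at, _) in self._items.items()
                 if stored_at < cutoff]
        for k in stale:
            del self._items[k]

    def get(self, key):
        self.expire()
        entry = self._items.get(key)
        return entry[1] if entry else None
```

The max_age value is the window in which the user can still say "you
got that one wrong"; after that the original is gone and can no longer
be retrained.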

Look at the popfile interface (another Sourceforge project).  They save the
messages and use an HTTP interface to interact with the proxy.  Not a bad
solution.

Regards,

Brad


From tim.one@comcast.net  Mon Nov  4 06:21:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 01:21:45 -0500
Subject: [Spambayes] Need some clarification on the training database
In-Reply-To: <QOIDLHRPNK62FBRPA9SM54US7504UR65.3dc5eed1@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEPCCFAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> Never mind... I figured it out.  Ok, so I have the pop3proxy
> looking at the training database that my smtpproxy creates when I
> send it spam and ham via redirect, which preserves headers in my
> mailer.  Unfortunately, it also adds my headers, so I'm training
> spambayes to recognize mail from myself to myself as spam or ham,
> depending on how many of each i send... ;)

At least that part shouldn't matter much.  The spamprob guesses are based on
the percentages of spam and ham in which a word appears.  If a particular
header clue about you appears in 100% of the forwarded spam, and 100% of the
forwarded ham, it will get spamprob 0.5 regardless of the absolute numbers
involved.  That is, it will be 100% neutral.  This can still distort scores,
but I expect in a very minor way, and via pulling them toward Unsure rather
than toward either side.
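
To make the arithmetic concrete, here is a simplified sketch of a
per-word spamprob based only on the two percentages.  It deliberately
leaves out any adjustment the real classifier makes for rarely seen
words, and the function name and arguments are invented for the
example:

```python
def spamprob(spamcount, hamcount, nspam, nham):
    """Simplified per-word estimate from the fraction of spam and ham containing it."""
    spamratio = spamcount / nspam  # fraction of trained spam with the word
    hamratio = hamcount / nham     # fraction of trained ham with the word
    if spamratio + hamratio == 0:
        return 0.5  # never seen: treat as neutral in this sketch
    return spamratio / (spamratio + hamratio)


# A header clue present in every forwarded spam and every forwarded ham
# is neutral no matter how lopsided the absolute counts are.
print(spamprob(30, 500, 30, 500))
```

Only the ratios enter the formula, so sending ten times as much ham as
spam through the same redirect doesn't tilt that clue either way.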


From rob@hooft.net  Mon Nov  4 06:26:21 2002
From: rob@hooft.net (Rob Hooft)
Date: Mon, 04 Nov 2002 07:26:21 +0100
Subject: [Spambayes] counterweight: it really works!
Message-ID: <3DC6130D.40508@hooft.net>

Hmmm. I trained hammie on my private account yesterday night (already 
running about a week at work), and found this in my spam folder this 
morning:

======================
Subject: ***  We want to finance/buy your business..Pres. please !
Mime-Version: 1.0
Content-type: text/plain; charset="iso-8859-1"
Message-Id: <20021104021425.D097A77809@temoleh.chem.uu.nl>
Date: Mon,  4 Nov 2002 03:14:25 +0100 (CET)
X-Spam-Status: No, hits=0.1 required=5.0
         tests=PLING
         version=2.31
X-Spam-Level:
X-Hammie-Disposition: Yes; 1.00 (8)
======================

Just to remind everyone that this software really works! Its spambayes 
score deviates from 1.0 only by about 10**-8, but SA didn't see much.

The setup I had to make for my private account is really quite involved. 
My private mail domain arrives on a workstation with ample space, but 
without a  pop or imap server and without an adequate backup policy for 
storing E-mail (it used to be forwarded by a postfix virtual map). On 
the other hand, I read it from an IMAP server that doesn't have python2 
installed, nor do I have enough space there for a hammie.db. I ended up 
with a .procmailrc on the workstation that ends by forwarding the 
non-spam messages to the IMAP server using "sendmail -i"...

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From anthony@interlink.com.au  Mon Nov  4 06:27:47 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 17:27:47 +1100
Subject: [Spambayes] Something to test 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKMCFAB.tim.one@comcast.net> 
Message-ID: <200211040627.gA46Rm108104@localhost.localdomain>


>>> Tim Peters wrote
> This little patch arranges to create "noheader:HEADERNAME" tokens for
> headers in options.safe_headers that *don't* appear in a msg's headers.  On
> my fat c.l.py test it's a small theoretical improvement:  best-cost falls
> from $26.80 to $22.00, by knocking down the score of the second-worst
> hopeless FP just enough so that redeeming it *could* be traded away for an
> increase in the Unsure rate.  That's not realistic, though (the spam_cutoff
> value needed to redeem that FP is no longer insane, but is still
> *unreasonably* high).
> 
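
For readers who haven't looked at the patch, the idea can be sketched
as below.  The SAFE_HEADERS list is an assumed subset chosen to match
the clues mentioned in this message, not the actual contents of
options.safe_headers, and the function name is made up:

```python
from email import message_from_string

# Assumed subset for illustration; the real list lives in the options.
SAFE_HEADERS = ["subject", "to", "mime-version", "errors-to"]


def missing_header_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield a 'noheader:NAME' token for each listed header the message lacks."""
    present = {name.lower() for name in msg.keys()}
    for name in safe_headers:
        if name.lower() not in present:
            yield "noheader:" + name.lower()


msg = message_from_string("Subject: hi\nTo: a@b.c\n\nbody\n")
print(list(missing_header_tokens(msg)))
```

Each such token then gets its own spamprob like any other word, which
is how the absence of a header can count for or against a message.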

filename:       before       after
ham:spam:   11192:1826  11192:1826
fp total:            0           1
fp %:             0.00        0.01
fn total:            7           8
fn %:             0.38        0.44
unsure t:          106         107
unsure %:         0.81        0.82
real cost:      $28.20      $39.40
best cost:      $28.20      $30.40
h mean:           0.63        0.42
h sdev:           4.19        4.19
s mean:          98.68       98.63
s sdev:           7.74        7.95
mean diff:       98.05       98.21
k:                8.22        8.09

The additional fp was a mail-out from Nettwerk (that I've signed up
for, but which are _incredibly_ spammy) that went from 0.956 to 0.964,
where my spam cutoff is 0.96. The noheader: errors-to was the killer
clue that pushed it over the edge. The spam situation is considerably
worse. The additional false negative was something that went from 0.467 
to 0.431 (ham_cutoff 0.45). The damage came from
  prob('noheader:mime-version') = 0.245329
(It was a very short spam)
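
To spell out how small score shifts flip the outcome, here is a sketch
of the three-way decision using my cutoffs.  The function name and the
handling of scores exactly on a cutoff are assumptions for the example:

```python
def disposition(score, ham_cutoff=0.45, spam_cutoff=0.96):
    """Map a score in [0, 1] to ham, unsure or spam."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# The two messages discussed above, before and after the patch.
for before, after in [(0.956, 0.964), (0.467, 0.431)]:
    print(before, disposition(before), "->", after, disposition(after))
```

Both messages sat in the unsure band before the patch; one drifted up
past spam_cutoff and the other down past ham_cutoff.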

One fn went from 0.27 to 0.029, due to:
  prob('noheader:subject') = 0.0042591
  prob('noheader:to') = 0.0652536
  prob('noheader:mime-version') = 0.245329

It made pretty much all of my fn's at least slightly worse, if not
much worse.


For what it's worth the "Iron Citadel" comp.lang.python spam is 
currently showing up as a 0.0057 ham, prob('*H*')=1, prob('*S*')=0.0115174
This is far and away the worst spam I've seen for some time.

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From tim.one@comcast.net  Mon Nov  4 06:37:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 01:37:49 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <MJEHLHJKGINLONDMMKNEIEKNHFAA.seant@iname.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEPDCFAB.tim.one@comcast.net>

[Sean True]
> I like this idea, with one significant reservation -- I'm now trusting an
> external IMAP mail store with my mail, all 2 GB of it.

You realize that Outlook has a hard 2GB limit on .pst files, right?  They
keep fixing this in service packs whenever a new version of Outlook comes
out, most recently for Outlook 2002:

    http://support.microsoft.com/default.aspx?scid=KB;EN-US;q304863

"The fix" this time around is to display

   Task 'Microsoft Exchange Server - Receiving' reported error
       (0x8004060C): 'Unknown Error 0x8004060C'

instead of silently refusing to accept new email(!) when 2GB is approached.
For earlier versions of Outlook, MS now makes a tool available that
truncates a too-big .pst file, with no hope of recovering the lost data.
The good news is that they say you can at least start Outlook again then.

> That makes me queasy, a little, and underlines the problematic of
> implementing a mail store, instead of just an MTA.

I expect that Python's "large file" support on Windows is more reliable than
Outlook's (well, the latter is guaranteed to fail in nasty ways, but that's
not what I mean by reliable <wink>).


From anthony@interlink.com.au  Mon Nov  4 06:40:17 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 17:40:17 +1100
Subject: [Spambayes] Re: [Spambayes-checkins] website background.ht,1.1,1.2 
In-Reply-To: <E188atk-0004Ex-00@usw-pr-cvs1.sourceforge.net> 
Message-ID: <200211040640.gA46eIe08260@localhost.localdomain>


JFYI - I'd like corrections and updates to this. I'm attempting to 
channel Tim (always an error-prone task) and I've undoubtedly got
stuff wrong. 


>>> "Anthony Baxter" wrote
> Update of /cvsroot/spambayes/website
> In directory usw-pr-cvs1:/tmp/cvs-serv16178
> 
> Modified Files:
> 	background.ht 
> Log Message:
> A bit of a potted history here. I probably have a bunch of things here
> that need to be cleaned up and made more obvious, but hey, it's a start.
> 
> 
> Index: background.ht
> ===================================================================
> RCS file: /cvsroot/spambayes/website/background.ht,v
> retrieving revision 1.1
> retrieving revision 1.2
> diff -C2 -d -r1.1 -r1.2
> *** background.ht	19 Sep 2002 23:39:24 -0000	1.1
> --- background.ht	4 Nov 2002 06:39:42 -0000	1.2
> ***************
> *** 15,18 ****
> --- 15,67 ----
>   <p><i>more links? mail anthony at interlink.com.au</i></p>
>   
> + <h2>Overall Approach</h2>
> + <b>Please note that I (Anthony) am writing this based on memory and
> + limited understanding of some of the subtler points of the maths. Gentle
> + corrections are welcome, or even encouraged.</b>
> + <h3>Tokenizing</h3>
> + <p>The architecture of the spambayes system has a couple of distinct 
> + parts. The first, and most obvious, is the <i>tokenizer</i>. This takes
> + a mail message and breaks it up into a series of tokens. At the moment
> + it splits words out of the text parts of a message, there's a variety
> + of header tokenization that goes on as well. The code in tokenizer.py
> + and the comments in the Tokenizer section of Options.py contain more 
> + information about various approaches to tokenizing.</p>
> + 
> + <h3>Combining and Scoring</h3>
> + <p>The next part of the system is the scoring and combining part. This
> + is where the hairy mathematics and statistics come in. </p>
> + <p>Initially we started with Paul Graham's original combining scheme - 
> + this has a number of "magic numbers" and "fuzz factors" built into it. 
> + The Graham combining scheme has a number of problems, aside from the
> + magic in the internal fudge factors - it tends to produce scores of 
> + either 1 or 0, and there's a very small middle ground in between - it 
> + doesn't often claim to be "unsure", and gets it wrong because of this. 
> + There's a number of discussions back and forth between Tim Peters and 
> + Gary Robinson on this subject in the mailing list archives - I'll try 
> + and put links to the relevant threads at some point.</p>
> + <p>Gary produced a number of alternative approaches to combining and
> + scoring word probabilities. The initial one, after much back and forth
> + in the mailing list, is in the code today as 'gary_combining'. A couple
> + of other approaches, using the Central Limit Theorem, were also tried.
> + They produced interesting output - but histograms of the ham and spam
> + distributions had a disturbingly large overlap in the middle. There was
> + also an issue with incremental training and untraining of messages that
> + made it harder to use in the "real world". These two central limit 
> + approaches were dropped after Tim, Gary and Rob Hooft produced a combining
> + scheme using chi-squared probabilities. This is now the default combining
> + scheme. </p>
> + <p>The chi-squared approach produces two numbers - a "ham probability" ("*H*")
> + and a "spam probability" ("*S*"). A typical spam will have a high *S*
> + and low *H*, while a ham will have high *H* and low *S*. In the case where
> + the message looks entirely unlike anything the system's been trained on,
> + you can end up with a low *H* and low *S* - this is the code saying "I don't
> + know what this message is". So at the end of the processing, you end up 
> + with three possible results - "Spam", "Ham", or "Unsure". It's possible to
> + tweak the high and low cutoffs for the Unsure window - this trades off 
> + unsure messages vs possible false positives or negatives.</P>
> + 
> + <h3>Training</h3>
> + <p>TBD</p>
> + 
>   <h2>Mailing list archives</h2>
>   <p>There's a lot of background on what's been tried available from
> 
> 
> 
> _______________________________________________
> Spambayes-checkins mailing list
> Spambayes-checkins@python.org
> http://mail.python.org/mailman/listinfo/spambayes-checkins
> 

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From rob@hooft.net  Mon Nov  4 07:09:39 2002
From: rob@hooft.net (Rob Hooft)
Date: Mon, 04 Nov 2002 08:09:39 +0100
Subject: [Spambayes] Re: [Spambayes-checkins] website
	background.ht,1.1,1.2
References: <200211040640.gA46eIe08260@localhost.localdomain>
Message-ID: <3DC61D33.20602@hooft.net>

Anthony Baxter wrote:
>>+ and a "spam probability" ("*S*"). A typical spam will have a high *S*
>>+ and low *H*, while a ham will have high *H* and low *S*. In the case where
>>+ the message looks entirely unlike anything the system's been trained on,
>>+ you can end up with a low *H* and low *S* - this is the code saying "I don't
>>+ know what this message is". 

Some messages can even have both a high *H* and a high *S*, telling you 
basically that the message looks very much like ham, but also very much 
like spam. In this case spambayes is also unsure where the message 
should be classified, and the final score will be near 0.5.
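
A minimal sketch of how the two numbers fold into one score and a
three-way verdict. The formula (S - H + 1) / 2 is consistent with the
score dumps posted in this thread (e.g. H=0.999899 and S=0.0886315 give
0.0443661); the function names and the 0.20/0.80 cutoffs are only
illustrative assumptions:

```python
def combined_score(h, s):
    """Fold the ham (*H*) and spam (*S*) probabilities into one score."""
    return (s - h + 1.0) / 2.0

def classify(h, s, ham_cutoff=0.20, spam_cutoff=0.80):
    """Return 'Ham', 'Spam' or 'Unsure' for the given *H* and *S*."""
    score = combined_score(h, s)
    if score < ham_cutoff:
        return "Ham"
    if score > spam_cutoff:
        return "Spam"
    return "Unsure"
```

Both "low H, low S" and "high H, high S" land near 0.5 and so come out
as Unsure, which is the behaviour described above.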

>>+ So at the end of the processing, you end up 
>>+ with three possible results - "Spam", "Ham", or "Unsure". It's possible to
>>+ tweak the high and low cutoffs for the Unsure window - this trades off 
>>+ unsure messages vs possible false positives or negatives.


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim.one@comcast.net  Mon Nov  4 07:37:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 02:37:32 -0500
Subject: [Spambayes] Something to test
In-Reply-To: <200211040627.gA46Rm108104@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEPJCFAB.tim.one@comcast.net>

Just a quickie (I realized I have to sleep sometime <wink>):

[Anthony Baxter]
> ...
>
> For what it's worth the "Iron Citadel" comp.lang.python spam is
> currently showing up as a 0.0057 ham, prob('*H*')=1, prob('*S*')=0.0115174
> This is far and away the worst spam I've seen for some time.

It's great spam.  I had to read quite a bit of it before I realized it was
spam.  If the "real purpose" of this kind of project is to alter spammers'
behavior, then that's a glimpse of the future:  soft-sell spam with lots of
detail talk and almost no blatant advertising.  It scores 0.33 for me today
(H=0.69 and S=0.35), but the still-low S value got "so high" only because of
hapaxes (words unique in this msg).

My lowest-scoring spam to date *still* scores under 0.05 (even after
training on it):

"""
Spam Score: 0.0443661

'*H*'                          0.999899
'*S*'                          0.0886315
'url:python-list'              0.0111408  python.org
'url:mailman'                  0.0116596  python.org
'url:python'                   0.0143819  python.org
'url:listinfo'                 0.0150747  python.org
'header:X-Complaints-to:1'     0.0185989  original
'url:org'                      0.0444069  python.org
'header:Errors-to:1'           0.0452392  python.org
'header:Organization:1'        0.0737601  original
'header:Return-path:1'         0.0757934  python.org
'header:Message-id:1'          0.0792662  original
'url:mail'                     0.0882956  python.org
'header:MIME-version:1'        0.0986049  original
'header:Reply-to:1'            0.15973    original
'header:Received:4'            0.162073   original
'subject:new'                  0.327577   original
'url:com'                      0.60446    original
'from:email addr:infonie.fr'   0.844828   hapaxes from here
'from:email name:bmcc'         0.844828
'message-id:@infonie.fr'       0.844828
'paix.'                        0.844828
'url:keyrouz'                  0.844828
'voix'                         0.844828
'x-mailer:mozilla 4.7 (macintosh; i; ppc)' 0.844828   to here
'peace.'                       0.908163   original

Message Stream:

Return-path: <python-list-admin@python.org>
Path:

news.baymountain.com!uunet!ash.uu.net!dfw.uu.net!sac.uu.net!lore.csc.com!nnt
p.abs.net!news.maxwell.syr.edu!newsfeed.icl.net!newsfeed.fjserv.net!news.tel
e.dk!news.tele.dk!small.news.tele.dk!news-fra1.dfn.de!newsfeed.hanau.net!fr.
clara.net!heighliner.fr.clara.net!news.tiscali.fr!not-for-mail
Received: from bright01.icomcast.net (bright01-qfe0.icomcast.net
[172.20.4.8])
 by msgstore01.icomcast.net
 (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
 with ESMTP id <0H4V0021NFMT35@msgstore01.icomcast.net> for
 tim.one@ims-ms-daemon (ORCPT tim.one@comcast.net); Thu,
 31 Oct 2002 19:20:53 -0500 (EST)
Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155])for
	<@msgstore01.icomcast.net:tim.one@comcast.net>; Thu,
	31 Oct 2002 19:21:13 -0500 (EST)
Received: from mail.python.org (mail.python.org [12.155.117.29])
 by mtain03.icomcast.net
 (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
 with ESMTP id <0H4V009DCFN3ZE@mtain03.icomcast.net> for tim.one@comcast.net
 (ORCPT tim.one@comcast.net); Thu, 31 Oct 2002 19:21:03 -0500 (EST)
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
	by mail.python.org with esmtp (Exim 4.05)	id 187PYc-0005rG-00; Thu,
	31 Oct 2002 19:21:02 -0500
X-Trace: news2adm.tiscali.fr. 1036109620 28050 172.29.129.3
 (1 Nov 2002 00:13:40 GMT)
Date: Fri, 01 Nov 2002 01:17:34 +0100
From: bmcc@infonie.fr
Subject: new
Sender: python-list-admin@python.org
To: python-list@python.org
Errors-to: python-list-admin@python.org
Reply-to: bmcc@infonie.fr
Message-id: <3DC1C81F.66B97F91@infonie.fr>
Organization: Guest of TISCALI - FRANCE
X-Complaints-to: abuse@libertysurf.fr
MIME-version: 1.0
X-Mailer: Mozilla 4.7 (Macintosh; I; PPC)
Content-type: text/plain; charset=us-ascii
Content-transfer-encoding: 7bit
NNTP-posting-date: Fri, 1 Nov 2002 00:13:40 +0000 (UTC)
X-Accept-Language: en
Precedence: bulk
X-BeenThere: python-list@python.org
X-NNTP-Posting-Host: dyn-212-232-61-200.ppp.tiscali.fr
Newsgroups: comp.lang.python
Lines: 3
NNTP-posting-host: news3adm.tiscali.fr
X-Mailman-Version: 2.0.13 (101270)
List-Post: <mailto:python-list@python.org>
List-Subscribe: <http://mail.python.org/mailman/listinfo/python-list>,
	<mailto:python-list-request@python.org?subject=subscribe>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/python-list>,
	<mailto:python-list-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/python-list/>
List-Help: <mailto:python-list-request@python.org?subject=help>
List-Id: General discussion list for the Python programming language
 <python-list.python.org>
Xref: news.baymountain.com comp.lang.python:187745


The voice of peace.
La voix de la paix.
http://www.keyrouz.com
--
http://mail.python.org/mailman/listinfo/python-list
"""

I've currently got 1,947 spam in my personal classifier.  2 of them score as
ham (that was one of 'em; the other was similar, and also got huge boosts
from having come via python.org), and 16 score as Unsure (ham cutoff at
0.20, spam cutoff at 0.80).

It's delightful that "peace." is a high spamprob word, eh <wink>?
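
One way to see where the 0.844828 shared by every hapax in that listing
comes from: it matches Gary Robinson's adjustment of a raw word
probability toward a prior, assuming a prior of 0.5 and a strength of
0.45. Those two constants are an assumption here (they are not stated in
this message), and the function name is made up for illustration:

```python
def adjusted_spamprob(raw_prob, n, strength=0.45, prior=0.5):
    """Shrink a raw spam probability toward `prior`.

    raw_prob -- observed spam fraction for the word (1.0 if seen only in spam)
    n        -- number of messages the word has been seen in
    strength, prior -- assumed values, see the note above
    """
    return (strength * prior + n * raw_prob) / (strength + n)

hapax = adjusted_spamprob(1.0, 1)   # word seen once, in a single spam
```

With so little evidence behind it (n = 1), the word cannot get anywhere
near 1.0, which is why a handful of hapaxes only nudges *S* upward.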


From tim.one@comcast.net  Mon Nov  4 07:59:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 02:59:26 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEKFCFAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEPLCFAB.tim.one@comcast.net>

[Tim]
> ...
> If you do have the score, we've gotten mixed reports here about whether
> sorting Unsure msgs by score is helpful.  I find that it is in my
> email, but there are many exceptions (ham closer to high end of the
> Unsure range, and spam closer to the low end).

I forgot something there:  sorting by score is *extremely* helpful in a
different context:  after a batch of training, I score the ham and spam
training sets themselves, then sort them by score (Outlook is very good at
this -- sorted displays, and grouped displays, on arbitrary columns, is
built in).  Misclassified msgs reliably end up at "the wrong end" of the
display, and that makes recovery easy.
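
The same trick works outside Outlook. A rough sketch, assuming you
already have (message_id, score) pairs for each training set; the
function name and the sample data are hypothetical:

```python
def most_suspicious(scored, trained_as_spam, n=3):
    """Return the n messages whose score most contradicts their label.

    scored -- iterable of (message_id, score) pairs, score in [0.0, 1.0]
    trained_as_spam -- True for the spam set, False for the ham set
    """
    # Suspicious spam scores low; suspicious ham scores high.
    ordered = sorted(scored, key=lambda pair: pair[1],
                     reverse=not trained_as_spam)
    return ordered[:n]

ham = [("h1", 0.01), ("h2", 0.97), ("h3", 0.00)]
spam = [("s1", 0.99), ("s2", 0.04), ("s3", 1.00)]
check_ham = most_suspicious(ham, trained_as_spam=False, n=1)
check_spam = most_suspicious(spam, trained_as_spam=True, n=1)
```

Anything that surfaces this way is worth a second look by eye before
retraining.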


From anthony@interlink.com.au  Mon Nov  4 09:50:30 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 04 Nov 2002 20:50:30 +1100
Subject: [Spambayes] A couple of small tokenizer experiments.
Message-ID: <200211040950.gA49oU809201@localhost.localdomain>

First experiment was to make the URL tokenizer look for the string
'mailman' in the URL. If it was found, simply push the clue "url: Mailman URL"
onto the clue-pile. This was an attempt to remove the many many related 
clues that get bolted onto the occasional spam that makes it past Greg to
the python.org mailservers. It's something of a violation of "stupid beats
smart", but I'd noticed that the mailman footer from spam via mailman lists
was always providing a bunch of clues that were making life harder.

--- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
+++ tokenizer.py        4 Nov 2002 06:59:37 -0000
@@ -931,6 +931,11 @@
         new_text.append(text[i : start])
         new_text.append(' ')
 
+        if guts.find('mailman') != -1:
+            pushclue("url: Mailman URL")
+            i = end
+            break
+
         pushclue("proto:" + proto)
         # Lose the trailing punctuation for casual embedding, like:
         #     The code is at http://mystuff.org/here?  Didn't resolve.

This produced an improvement in unsure and in fn, but made a couple of 
high-scoring hams a bit worse. Nothing that can't be fixed by tweaking
the spam-cutoff number:

filename:  before-nomailma
                   after-nomailman
ham:spam:  11192:1826     
                   11192:1826
fp total:        0       2 
fp %:         0.00    0.02 
fn total:        7       5 
fn %:         0.38    0.27 
unsure t:      108     104 
unsure %:     0.83    0.80 
real cost:  $28.60  $45.80 
best cost:  $28.00  $27.60 
h mean:       0.62    0.67 
h sdev:       4.27    4.52 
s mean:      98.69   98.89 
s sdev:       7.69    6.96 
mean diff:   98.07   98.22 
k:            8.20    8.56 

from the tails of the files, before gave
-> achieved at ham & spam cutoffs 0.48 & 0.96
->     fp 0; fn 8; unsure ham 30; unsure spam 70
->     fp rate 0%; fn rate 0.438%; unsure rate 0.768%
after gave best effort of
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.98
->     fp 0; fn 6; unsure ham 31; unsure spam 77
->     fp rate 0%; fn rate 0.329%; unsure rate 0.83%
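
For reference, the cost figures in these tables follow directly from the
per-error prices printed above. A small sketch (the function name is
mine, not something from the test driver):

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Dollar cost of a run: each kind of error has a fixed price."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

before = total_cost(fp=0, fn=7, unsure=108)       # "real cost" before
after = total_cost(fp=2, fn=5, unsure=104)        # "real cost" after
best_after = total_cost(fp=0, fn=6, unsure=31 + 77)
```

These reproduce the $28.60, $45.80 and $27.60 entries in the table
above: the two new fps cost $20 on their own, which is why real cost
rises even though fn and unsure both improve.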

The top end of the spam histogram is enlightening (I think):
before:
 90    3 *
 91    2 *
 92    8 *
 93    2 *
 94   11 *
 95    1 *
 96    6 *
 97   12 *
 98   18 *
 99 1712 ************************************************************
after:
 90    0 
 91    5 *
 92    5 *
 93    5 *
 94    5 *
 95    7 *
 96    5 *
 97    7 *
 98   21 *
 99 1722 ************************************************************

So this really did some nice work for a kinda ugly hack :)
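
A standalone sketch of the effect, for anyone who doesn't want to read
the diff in context. This is not the tokenizer's real URL code: the
regexes and the function name are simplified stand-ins, and unlike the
patch it lowercases the URL itself before testing for 'mailman':

```python
import re

# Simplified stand-ins for the tokenizer's URL handling.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
PIECE_SPLIT_RE = re.compile(r"[;?:@&=+,$.]")

def url_clues(text):
    """Return URL clues, collapsing any Mailman URL into a single clue."""
    clues = []
    for proto, guts in URL_RE.findall(text):
        if "mailman" in guts.lower():
            # One clue instead of url:mailman, url:listinfo, url:python-list...
            clues.append("url: Mailman URL")
            continue
        clues.append("proto:" + proto.lower())
        for piece in guts.split("/"):
            for chunk in PIECE_SPLIT_RE.split(piece):
                if chunk:
                    clues.append("url:" + chunk.lower())
    return clues
```

So a list footer contributes one clue rather than half a dozen strongly
hammy ones, while ordinary URLs are still broken into pieces as before.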

Next I tried tokenizing the To: line.  I parsed it properly, then 
decoded the real name and split the words. I also added a token for
the RHS and LHS of the email @ sign.


--- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
+++ tokenizer.py        4 Nov 2002 09:26:12 -0000
@@ -5,6 +5,8 @@
 
 import email
 import email.Message
+import email.Header
+import email.Utils
 import email.Errors
 import re
 import math
@@ -1099,6 +1110,17 @@
             count = 0
             for addrs in msg.get_all(field, []):
                 count += len(addrs.split(','))
+                for rname,ename in email.Utils.getaddresses([addrs]):
+                    if rname:
+                        d = email.Header.decode_header(rname)[0]
+                        rname,rcharset = d
+                        for w in rname.split():
+                            yield field+'realname: '+w
+                        if rcharset is not None:
+                            yield field+'charset: '+rcharset
+                    if ename:
+                        for w in ename.split('@'):
+                            yield field+'email: '+w
             if count > 0:
                 yield '%s:2**%d' % (field, round(log2(count)))

filename:  after-nomailman
                   after-tocctok
ham:spam:  11192:1826     
                   11192:1826
fp total:        2       2
fp %:         0.02    0.02
fn total:        5       6
fn %:         0.27    0.33
unsure t:      104      83
unsure %:     0.80    0.64
real cost:  $45.80  $42.60
best cost:  $27.60  $25.40
h mean:       0.67    0.60
h sdev:       4.52    4.19
s mean:      98.89   98.96
s sdev:       6.96    6.94
mean diff:   98.22   98.36
k:            8.56    8.84

The "best cost" data, before:
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.49 & 0.98
->     fp 0; fn 6; unsure ham 31; unsure spam 77
->     fp rate 0%; fn rate 0.329%; unsure rate 0.83%
and after:
-> best cost for all runs: $25.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.98
->     fp 0; fn 6; unsure ham 27; unsure spam 70

This, to me, seems a clear win - in particular, I see stuff like:
prob('toemail: connect.com.au') = 0.998359
prob('toemail: arb') = 0.998992
(this is an old old email address that gets mostly spam now).
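
A self-contained version of the same idea, written against the Python 3
email package (email.utils / email.header rather than the email.Utils /
email.Header names in the patch). The function name and the example
address are made up; like the patch, it only decodes the first chunk of
an encoded real name:

```python
from email.header import decode_header
from email.utils import getaddresses

def address_tokens(field, header_value):
    """Yield realname, charset and email tokens for one address header."""
    for realname, addr in getaddresses([header_value]):
        if realname:
            part, charset = decode_header(realname)[0]
            if isinstance(part, bytes):
                try:
                    part = part.decode(charset or "ascii", "replace")
                except LookupError:          # unknown charset name
                    part = part.decode("latin-1")
            for word in part.split():
                yield field + "realname: " + word
            if charset is not None:
                yield field + "charset: " + charset
        if addr:
            # Left- and right-hand side of the @ become separate tokens.
            for piece in addr.split("@"):
                yield field + "email: " + piece

tokens = list(address_tokens("to", "Some User <user@example.com>"))
```

The domain and local-part tokens are what produce clues like the
'toemail: ...' ones quoted above.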


The final test was to decode the Subject header if it's encoded, and
tokenize that, rather than the encoded form.

--- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
+++ tokenizer.py        4 Nov 2002 09:45:25 -0000
@@ -1071,6 +1078,10 @@
         # especially significant in this context.  Experiment showed a small
         # but real benefit to keeping case intact in this specific context.
         x = msg.get('subject', '')
+        # Subject decoding.
+        x, subjcharset = email.Header.decode_header(x)[0]
+        if subjcharset is not None:
+            yield 'subjectcharset:' + subjcharset
         for w in subject_word_re.findall(x):
             for t in tokenize_word(w):
                 yield 'subject:' + t

filename:  after-tocctok2 
                   after-subjdecode
ham:spam:  11192:1826     
                   11192:1826
fp total:        2       1
fp %:         0.02    0.01
fn total:        6       6
fn %:         0.33    0.33
unsure t:       83      87
unsure %:     0.64    0.67
real cost:  $42.60  $33.40
best cost:  $25.40  $24.00
h mean:       0.60    0.59
h sdev:       4.19    4.18
s mean:      98.96   98.92
s sdev:       6.94    7.05
mean diff:   98.36   98.33
k:            8.84    8.76

Tails of the runs:
before:
-> best cost for all runs: $25.40
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.98
->     fp 0; fn 6; unsure ham 27; unsure spam 70
->     fp rate 0%; fn rate 0.329%; unsure rate 0.745%
after:
-> best cost for all runs: $24.00
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.47 & 0.97
->     fp 0; fn 6; unsure ham 27; unsure spam 63
->     fp rate 0%; fn rate 0.329%; unsure rate 0.691%
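
The subject patch boils down to something like the following, again as
a Python 3 sketch rather than the real tokenizer code. WORD_RE is a
plain whitespace splitter standing in for subject_word_re (whose
definition isn't shown here), and the function name is hypothetical:

```python
import re
from email.header import decode_header

WORD_RE = re.compile(r"\S+")  # stand-in for the tokenizer's subject_word_re

def subject_tokens(raw_subject):
    """Decode an RFC 2047 subject (first chunk only), then tokenize it."""
    part, charset = decode_header(raw_subject)[0]
    if isinstance(part, bytes):
        try:
            part = part.decode(charset or "ascii", "replace")
        except LookupError:              # unknown charset name
            part = part.decode("latin-1")
    if charset is not None:
        yield "subjectcharset:" + charset
    for word in WORD_RE.findall(part):
        yield "subject:" + word

plain = list(subject_tokens("Buy now"))
encoded = list(subject_tokens("=?iso-8859-1?q?Gro=DFe_Angebote?="))
```

The point is that the classifier sees the actual words plus a charset
clue, instead of one opaque "=?iso-8859-1?q?..." blob.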

Remember that the best before these 3 patches was:
->     fp 0; fn 8; unsure ham 30; unsure spam 70
->     fp rate 0%; fn rate 0.438%; unsure rate 0.768%

So this (to me) seems a bunch of definite wins. But Tim is free
to disagree :)

My remaining 6 fns are:

A Brazilian spam-ish thing: (*H* 0.633859 *S* 0.20342 = 0.28478)

-----------------
>From angel@rjnet.com.br  Sat Sep 28 09:35:45 2002
Return-Path: <angel@rjnet.com.br>
Received: from localhost (localhost.localdomain [127.0.0.1])
        by localhost.localdomain (8.11.6/8.11.6) with ESMTP id g8RNZhh05864
        for <anthony@localhost>; Sat, 28 Sep 2002 09:35:44 +1000
Received: from mail.interlink.com.au [203.9.111.130]
        by localhost with POP3 (fetchmail-5.9.0)
        for anthony@localhost (single-drop); Sat, 28 Sep 2002 09:35:44 +1000 (EST)
Received: from mediterraneo.rjnet.com.br (root@[200.152.115.30])
        by valdez.interlink.com.au (8.11.6/8.11.2) with ESMTP id g8RNZJc28230
        for <anthony@interlink.com.au>; Sat, 28 Sep 2002 09:35:20 +1000
Received: from locutus.rjnet.com.br (root@locutus.rjnet.com.br [200.222.31.10])
        by mediterraneo.rjnet.com.br (8.11.4/8.11.4) with ESMTP id g8RNNc801901;
        Fri, 27 Sep 2002 20:23:38 -0300
Received: from localhost ([200.222.39.21])
        by locutus.rjnet.com.br (8.11.2/8.11.2) with ESMTP id g8RMqEN00464;
        Fri, 27 Sep 2002 19:52:14 -0300
Date: Fri, 27 Sep 2002 19:52:14 -0300
From: Liliane Andrade Angel <angel@rjnet.com.br>
Message-Id: <200209272252.g8RMqEN00464@locutus.rjnet.com.br>

DATA
-----------------
I plan to try something like tokenizing the oldest three received lines (to
hopefully avoid the previous issues with mail.python.org blowing numbers to
hell) to see if that will help this one.

The "iron citadel" python-list spam (*H* 0.999999, *S* 0.038123 = 0.01906)

A base64d MP3 spam sent via zope-dev 
(*H* 0.993904, *S* 0.187868 = 0.0969820429397)
which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
also the various mailman type clues (although that's better with the 
first patch, above)

Someone spamming Linux CDs via a list at 4thought
(*H* 1, *S* 0.207177 = 0.103588442478)

A short porn spam sent via python-list
(*H* 0.817004, *S* 0.618399 = 0.400697521022)

A weird German spam for some sort of expert systems (in English).
(*H* 0.997132, *S* 0.84965 = 0.426259133645)


From bkc@murkworks.com  Mon Nov  4 15:07:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 04 Nov 2002 10:07:15 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEMNCFAB.tim.one@comcast.net>
References: <20021101003712.GA28132@rmunnlfs>
Message-ID: <3DC64611.30897.3CB37091@localhost>

On 3 Nov 2002 at 14:27, Tim Peters wrote:

> AFAIK, Outlook Express has no hooks at all for programmers -- it's a closed


How about this?

http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using_message_filtering_to_manage_messages.asp


I think this is new information released under the DOJ settlement. 


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From msergeant@startechgroup.co.uk  Mon Nov  4 15:17:42 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 04 Nov 2002 15:17:42 +0000
Subject: [Spambayes] counterweight: it really works!
References: <3DC6130D.40508@hooft.net>
Message-ID: <3DC68F96.6070809@startechgroup.co.uk>

Rob Hooft said the following on 04/11/02 06:26:
> Hmmm. I trained hammie on my private account yesterday night (already 
> running about a week at work), and found this in my spam folder this 
> morning:
> 
> ======================
> Subject: ***  We want to finance/buy your business..Pres. please !
> Mime-Version: 1.0
> Content-type: text/plain; charset="iso-8859-1"
> Message-Id: <20021104021425.D097A77809@temoleh.chem.uu.nl>
> Date: Mon,  4 Nov 2002 03:14:25 +0100 (CET)
> X-Spam-Status: No, hits=0.1 required=5.0
>          tests=PLING
>          version=2.31
> X-Spam-Level:
> X-Hammie-Disposition: Yes; 1.00 (8)
> ======================
> 
> Just to remind everyone that this software really works! Its spambayes 
> score deviates from 1.0 only by about 10**-8, but SA didn't see much.

Please don't compare to 4 months old SpamAssassin's. Upgrade if you want 
to compare. Thanks.

Matt.


From msergeant@startechgroup.co.uk  Mon Nov  4 15:40:42 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 04 Nov 2002 15:40:42 +0000
Subject: [Spambayes] Why I added src=cid: etc
References: <LNBBLJKPBEHFEDALKOLCIEJDCFAB.tim.one@comcast.net>
Message-ID: <3DC694FA.7000905@startechgroup.co.uk>

Tim Peters said the following on 03/11/02 03:20:
> This is typical of the kind of email I'm getting a lot of lately.  Without
> mining the HTML, there's almost nothing to look at, not even a word in the
> Subject line.  (Of course, if we weren't throwing the HTML tags away, the
> classifier would have learned this stuff on its own.)

It's a virus though. Why don't you just get a gateway scanner (like the 
one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces 
viruses with a 5xx return code) which uses clamav[3]? It's optimised for 
catching viruses, so you can focus on just catching spam (lets face it, 
the techniques are slightly different).

[1] http://use.perl.org/~Matts/journal/ # down at the moment so I can't 
find the specific journal entry - but it was fairly recently and is 
obvious because it's about 50 lines of perl
[2] http://www.develooper.com/code/qpsmtpd/
[3] http://clamav.elektrapro.com/

I'm down from about 20 viruses a day (because my address ends up in a 
lot of web caches) to zero. And I'm very happy about it ;-)


From tim@fourstonesforum.com  Sat Nov  2 18:24:08 2002
From: tim@fourstonesforum.com (Tim Stone Four Stones Forum)
Date: Sat, 02 Nov 2002 12:24:08 -0600
Subject: [Spambayes] x-hammie-disposition in pop3proxy
In-Reply-To: <p558sukad4e6c567aguok61belr91p1l84@4ax.com>
Message-ID: <NMPJGE83YSNHZXXJGC79795VSEALJDB.3dc41848@riven>

Kewl, Richie.  Ok, so the next thing is I have to run three of these things.  I can do that if I can make the proxy listen on different ports.  I've modified the 
code to do that, was a simple mod.  Do you want the mod?

11/2/2002 12:16:05 PM, Richie Hindle <richie@entrian.com> wrote:

>Hi Tim,
>
>> adding the x-hammie-disposition header with value of 'no'.
>
>'No' means it thinks it's ham - the header means "Is it spam?"  At the
>moment the header added by pop3proxy.py is always "Yes" or "No" - I'll add
>the new "Unsure" value when I get the chance.
>
>> I don't have a trained database (the real challenge) at this point
>
>Use hammie.py to train it - the usage message should tell you everything
>you need to know, except how to create the mbox files or directories of
>email message to feed into it.  Hopefully your email client will export
>messages into one of those formats...
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Mon Nov  4 16:17:56 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 11:17:56 -0500
Subject: [Spambayes] Why I added src=cid: etc
In-Reply-To: <3DC694FA.7000905@startechgroup.co.uk>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEBOCGAB.tim.one@comcast.net>

[Tim]
> This is typical of the kind of email I'm getting a lot of
> lately.  Without mining the HTML, there's almost nothing to
> look at, not even a word in the Subject line.  (Of course, if we
> weren't throwing the HTML tags away, the classifier would have
> learned this stuff on its own.)

[Matt Sergeant]
> It's a virus though. Why don't you just get a gateway scanner (like the
> one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces
> viruses with a 5xx return code) which uses clamav[3]?

Because <wink>, like *most* of the world, I'm just running "the email stuff"
that came with my Windows box here.  Not one user in a thousand knows beans
beyond that.

> It's optimised for catching viruses, so you can focus on just catching
> spam (lets face it, the techniques are slightly different).

Yes.  Greg Ward and Neil Schemenauer here have each written their own virus
detectors too, and Greg's stops essentially all viruses from getting beyond
python.org.  The ones I'm getting come from other accounts, but somewhere
along the line the actual virus payload has been stripped out, leaving just
the little HTML trigger.

I wouldn't recommend this project's code for virus/worm detection, although
anecdotal reports here (not controlled experiments) have been that it works
for that purpose too.


From tim.one@comcast.net  Mon Nov  4 16:32:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 11:32:23 -0500
Subject: [Spambayes] counterweight: it really works!
In-Reply-To: <3DC68F96.6070809@startechgroup.co.uk>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECBCGAB.tim.one@comcast.net>

[Matt Sergeant, to Rob Hooft]
> Please don't compare to 4 months old SpamAssassin's. Upgrade if you want
> to compare. Thanks.

I expect Rob is typical of single-user SpamAssassin clients, though:  they
download it once, and watch it deteriorate.  I've seen many other reports of
that too.  That makes it a fine comparison for people "like that".  This
codebase doesn't need upgrading, but does need ongoing training on a user's
own email.  Given that, I dare say it appears to work at least as well for
spam detection as an up-to-date SA (can't say about my personal email, as I
don't run SA here; on python.org's list email, I know it works at least as
well, as we've run controlled tests on that -- but I don't know how often
GregW upgrades the SA running at python.org).


From papaDoc@videotron.ca  Mon Nov  4 16:38:44 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Mon, 04 Nov 2002 11:38:44 -0500
Subject: [Spambayes] x-hammie-disposition in pop3proxy
References: <LNBBLJKPBEHFEDALKOLCCEGLCFAB.tim.one@comcast.net>
 <92d8suolml4uj5cl334lrpvneo6qiid2h0@4ax.com> <3DC447AB.D64D6CE6@whidbey.com>
 <200211022241.gA2Mfxq07985@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3DC6A294.7020400@videotron.ca>

Hi,

>>>X-Spambayes-Judgement: Spam / Unsure / Ham
>>>X-Spambayes-Is-Spam: Yes / Unsure / No
>>>X-Spambayes-Looks-Like-Spam: Yes / Unsure / No
>>>      
>>>
>>I know we have a long tradition of spelling errors behind us, such
>>as dropping an "R" from "referrer" in Apache logs, but I'd hate to
>>start a new one! Please, only one "E" in "judgment."
>>    
>>
>
>But it's not a spelling error!
>
In French you should remove the "d"   if you want no spelling error ;-)

papaDoc


From Tim@mail.powweb.com  Mon Nov  4 16:39:54 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 10:39:54 -0600
Subject: [Spambayes] counterweight: it really works!
Message-ID: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>

The thing that SB has that SA doesn't is the ongoing ability to train a database according to the USER'S definition of spam.  SA has some configurability, but 
who actually does that?  Who wants to download updates?  SB lets me say "This is spam.  Learn from this" or "This is ham..."  I'm not going back.  :)

- TimS

11/4/2002 10:32:23 AM, Tim Peters <tim.one@comcast.net> wrote:

>[Matt Sergeant, to Rob Hooft]
>> Please don't compare to 4 months old SpamAssassin's. Upgrade if you want
>> to compare. Thanks.
>
>I expect Rob is typical of single-user SpamAssassin clients, though:  they
>download it once, and watch it deteriorate.  I've seen many other reports of
>that too.  That makes it a fine comparison for people "like that".  This
>codebase doesn't need upgrading, but does need ongoing training on a user's
>own email.  Given that, I dare say it appears to work at least as well for
>spam detection as an up-to-date SA (can't say about my personal email, as I
>don't run SA here; on python.org's list email, I know it works at least as
>well, as we've run controlled tests on that -- but I don't know how often
>GregW upgrades the SA running at python.org).
>
>
- Tim
www.fourstonesExpressions.com 


From msergeant@startechgroup.co.uk  Mon Nov  4 16:37:15 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 04 Nov 2002 16:37:15 +0000
Subject: [Spambayes] counterweight: it really works!
References: <LNBBLJKPBEHFEDALKOLCEECBCGAB.tim.one@comcast.net>
Message-ID: <3DC6A23B.5000904@startechgroup.co.uk>

Tim Peters said the following on 04/11/02 16:32:
> [Matt Sergeant, to Rob Hooft]
> 
>>Please don't compare to 4 months old SpamAssassin's. Upgrade if you want
>>to compare. Thanks.
> 
> 
> I expect Rob is typical of single-user SpamAssassin clients, though:  they
> download it once, and watch it deteriorate.  I've seen many other reports of
> that too.  That makes it a fine comparison for people "like that".  This
> codebase doesn't need upgrading, but does need ongoing training on a user's
> own email.  Given that, I dare say it appears to work at least as well for
> spam detection as an up-to-date SA (can't say about my personal email, as I
> don't run SA here; on python.org's list email, I know it works at least as
> well, as we've run controlled tests on that -- but I don't know how often
> GregW upgrades the SA running at python.org).

It's the same though. SA needs constant training too - it just happens 
to occur somewhere other than on your own box, and involve some human 
intervention (though luckily not from the user).

Matt.


From msergeant@startechgroup.co.uk  Mon Nov  4 16:35:36 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 04 Nov 2002 16:35:36 +0000
Subject: [Spambayes] Why I added src=cid: etc
References: <LNBBLJKPBEHFEDALKOLCKEBOCGAB.tim.one@comcast.net>
Message-ID: <3DC6A1D8.6040507@startechgroup.co.uk>

Tim Peters said the following on 04/11/02 16:17:
> [Tim]
> 
>>This is typical of the kind of email I'm getting a lot of
>>lately.  Without mining the HTML, there's almost nothing to
>>look at, not even a word in the Subject line.  (Of course, if we
>>weren't throwing the HTML tags away, the classifier would have
>>learned this stuff on its own.)
> 
> [Matt Sergeant]
> 
>>It's a virus though. Why don't you just get a gateway scanner (like the
>>one I wrote [1] for qpsmtpd [2] which plugs into qmail and bounces
>>viruses with a 5xx return code) which uses clamav[3]?
> 
> Because <wink>, like *most* of the world, I'm just running "the email stuff"
> that came with my Windows box here.  Not one user in a thousand knows beans
> beyond that.

Ah Windows eh. I didn't realise anyone still used that. ;-) <sympathy/>

>>It's optimised for catching viruses, so you can focus on just catching
>>spam (let's face it, the techniques are slightly different).
> 
> Yes.  Greg Ward and Neil Schemenauer here have each written their own virus
> detectors too, and Greg's stops essentially all viruses from getting beyond
> python.org.  The ones I'm getting come from other accounts, but somewhere
> along the line the actual virus payload has been stripped out, leaving just
> the little HTML trigger.
> 
> I wouldn't recommend this project's code for virus/worm detection, although
> anecdotal reports here (not controlled experiments) have been that it works
> for that purpose too.

Yeah, I've got some neat results just from classifying file extensions. 
The double extension ones are especially good ;-)

Matt.


From tim.one@comcast.net  Mon Nov  4 16:48:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 11:48:46 -0500
Subject: [Spambayes] Why I added src=cid: etc
In-Reply-To: <3DC6A1D8.6040507@startechgroup.co.uk>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECFCGAB.tim.one@comcast.net>

[Matt Sergeant, on virus/worm detection]
> Yeah, I've got some neat results just from classifying file extensions.
> The double extension ones are especially good ;-)

GregW's is a bit of Perl that scans for file extensions, and my work account
does some double-extension detection.  They're very effective but Draconian.
For example, I maintain the Python Windows distribution, and the former
prevents users from sending me .exe files directly; the latter prevented a
coworker two weeks ago from sending error-log files because he named them
"xyz.good.log" and "xyz.bad.log".  It seems a very curious thing to me that
email admins seem generally happy to accept false positives when it comes to
suspected virus/worm stuff.  Then again, it's not a very surprising thing
<wink>.
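
A rough sketch of the kind of extension screening described above, and of why it misfires on names like "xyz.good.log". The function names and the list of risky extensions are assumptions made up for illustration; neither GregW's Perl filter nor the work-account filter is shown in this thread.

```python
import os

# Hypothetical set of extensions treated as dangerous; a real filter's
# list would differ.
RISKY_EXTENSIONS = {".exe", ".scr", ".pif", ".bat", ".vbs", ".com"}


def extensions(filename):
    """Return every dot-separated suffix of a filename, lowercased."""
    suffixes = []
    root, ext = os.path.splitext(filename.lower())
    while ext:
        suffixes.insert(0, ext)
        root, ext = os.path.splitext(root)
    return suffixes


def naive_double_extension(filename):
    """Flag any attachment with two or more extensions (the Draconian rule)."""
    return len(extensions(filename)) >= 2


def risky_double_extension(filename):
    """Flag only names whose final extension is risky and follows another."""
    exts = extensions(filename)
    return len(exts) >= 2 and exts[-1] in RISKY_EXTENSIONS
```

Under these assumptions the naive rule rejects "xyz.good.log" while the narrower rule lets it through and still catches something like "photo.jpg.exe". Whether the narrower rule is strict enough is a policy question the sketch does not settle.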


From msergeant@startechgroup.co.uk  Mon Nov  4 16:46:03 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 04 Nov 2002 16:46:03 +0000
Subject: [Spambayes] counterweight: it really works!
References: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>
Message-ID: <3DC6A44B.4070509@startechgroup.co.uk>

Tim@mail.powweb.com said the following on 04/11/02 16:39:
> The thing that SB has that SA doesn't is the ongoing ability to train a database according to the USER'S definition of spam.  SA has some configurability, but 
> who actually does that?  Who wants to download updates?  SB lets me say "This is spam.  Learn from this" or "This is ham..."  I'm not going back.  :)

FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't 
officially released yet, but then neither is spambayes [grin]) using the 
Robinson algorithm. I'm hoping to get the chi-squared algorithm in there 
too, but I had some trouble with it producing weird results for me (I 
tried to post something to this list about it but it vanished into the 
ether, so I'll try again shortly).

Ultimately I think what people will find is that statistical classifiers 
are a good part of an overall strategy, but not necessarily the end of 
the story in spam detection (which is a shame). SpamAssassin is a pretty 
mature product these days, with some really neat technology going on in 
there like the auto-whitelist, and it's really great that we can all 
learn from each other this way.

Matt.


From tim.one@comcast.net  Mon Nov  4 16:56:36 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 11:56:36 -0500
Subject: [Spambayes] counterweight: it really works!
In-Reply-To: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCIECHCGAB.tim.one@comcast.net>

[TimS]
> The thing that SB has that SA doesn't is the ongoing ability to
> train a database according to the USER'S definition of spam.
> SA has some configurability, but who actually does that?  Who wants
> to download updates?

Email administrators do both, and *because* the SB code needs to learn about
ham as well as spam, and opt-in marketing email is so user-specific, it
remains a puzzle how to use this code for, e.g., an email admin serving
1,000 accounts.  The SB code appears quite capable of handling python.org's
mailing list traffic with a lot less bother and resource consumption than SA
requires, but I still don't think it will work well if we fold in the
personal email carried by python.org too.

> SB lets me say "This is spam.  Learn from this" or "This is ham..."  I'm
> not going back.  :)

Provided we can make ongoing training easy enough, I expect single-user
installations will enjoy training SB more than downloading SA, and that it
will work better for them (e.g., there are *some* kinds of spam I want to
see, and training my classifier to accept that kind was, like everything
else, just a matter of putting examples in my ham folder).


From Tim@mail.powweb.com  Mon Nov  4 17:09:14 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 11:09:14 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <gvuasu08h4eukdvba187ulhf052f996bvf@4ax.com>
Message-ID: <YTWUSPVFDOIRLPLFBOL1WA8YUIEYQO.3dc6a9ba@riven>

It occurs to me that there are a couple of issues floating around here 
regarding integration with email clients, and they're related, 
although the relationship hasn't been brought forward yet.

Issue #1 Using proxies (POP3 and SMTP) for integration into the mailers 
that the masses use.

Issue #2 Putting a user interface onto the pop3proxy


So here goes:

I've been running an SMTP proxy that recognizes mail being sent to 
special addresses and does a train using the message, rather than send 
it to the actual SMTP server.  This works very well, though it 
requires a rather arduous training regimen to get it all started. 
::sigh::  Richie's pop3proxy then picks up the training to classify 
incoming mail, which is filtered into spam by my mailer's standard 
filtering mechanism.  There are two problems with this approach.  
First, it's manual, message-by-message, which means *re-training* is 
kinda out of the question.  The second is that the 'Forward' or 
'Redirect' function of most mailers strips at least some of the 
headers in which valuable clues can be found.

So... some have proposed that we make the pop3proxy so that it caches 
incoming mail.  This cache could be used by the smtpproxy to recover 
the original headers, using a unique id that the pop3proxy embedded in 
the mail somewhere.  Or... we could give the pop3proxy a user 
interface that allows users to select mail in the cache to do training 
on.  This approach eliminates the need for the smtpproxy in the first 
place, and allows a corpus to be built up by the proxy for retraining 
purposes.  While the ui for a caching pop3 proxy might be a bit of a 
challenge, I think this approach bears some examination.

Arguments for:

* Simpler overall system
* Allows the building of an easily usable corpus for average mail 
users like me
* Headers are maintained exactly as they were received, before a 
mailer has the chance to get in and mess 'em up

Arguments against:

* A new user interface that is not a normal part of a user's everyday 
existence
* Now documentation will have to include "User's Guide" as well as 
"Install Guide"
* Some ongoing cache maintenance... expiry of cached messages, etc.
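
The caching idea above could be sketched roughly as follows. This is only an illustration of the proposal, not code from pop3proxy: the header name, the class, and the one-week expiry are all assumptions.

```python
import time
import uuid
from email import message_from_string

# Hypothetical header name; the proposal only says the proxy would embed
# "a unique id" in the mail somewhere.
CACHE_ID_HEADER = "X-Spambayes-Cache-Id"


class MessageCache:
    """Keep untouched copies of incoming mail so they can be trained on later."""

    def __init__(self, max_age_seconds=7 * 24 * 3600):
        self.max_age_seconds = max_age_seconds
        self._entries = {}  # cache id -> (arrival time, original raw text)

    def add(self, raw_message, now=None):
        """Store the original text and return (id, copy tagged with that id)."""
        now = time.time() if now is None else now
        cache_id = uuid.uuid4().hex
        self._entries[cache_id] = (now, raw_message)
        tagged = message_from_string(raw_message)
        tagged[CACHE_ID_HEADER] = cache_id
        return cache_id, tagged.as_string()

    def original(self, cache_id):
        """Return the message exactly as received, or None if unknown or expired."""
        entry = self._entries.get(cache_id)
        return entry[1] if entry else None

    def expire(self, now=None):
        """Drop entries older than max_age_seconds; return how many were dropped."""
        now = time.time() if now is None else now
        stale = [key for key, (arrived, _) in self._entries.items()
                 if now - arrived > self.max_age_seconds]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A training UI (or an smtpproxy) would look up `original(cache_id)` so that training sees the headers as received, whatever the mailer's Forward function did to them afterwards.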

Other considerations?

P.S.   How's this, Skip?

- TimS


From neale@woozle.org  Mon Nov  4 17:58:06 2002
From: neale@woozle.org (Neale Pickett)
Date: 04 Nov 2002 09:58:06 -0800
Subject: [Spambayes] Database reduction
In-Reply-To: <15809.55847.349091.23441@montanaro.dyndns.org>
References: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
	<w53k7jyf8ni.fsf@woozle.org>
	<15809.55847.349091.23441@montanaro.dyndns.org>
Message-ID: <w53u1ixdxox.fsf@woozle.org>

So then, Skip Montanaro <skip@pobox.com> is all like:

>     Neale> When pickling a Bayes object, the pickler is smart enough not to
>     Neale> repeatedly say "this is a wordinfo object" but rather, I assume,
>     Neale> "this is of type 2", only having to name type 2 once.  However,
>     Neale> hammie pickles each wordinfo individually, keyed by a string.
>     Neale> This makes for fast lookups, but giant databases.
> 
> You can always define your own __getstate__ and __setstate__ methods for the
> Wordinfo class which processes a more compact form of the object's state.
> Or am I misunderstanding what you said?

Perhaps a picture would be worth 1K words:

    >>> import classifier
    >>> w = classifier.WordInfo('aoeu', 2)
    >>> import pickle
    >>> w
    WordInfo"('aoeu', 0, 0, 0, 2)"
    >>> pickle.dumps(w, 1)
    'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__builtin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02tq\x05bq\x06.'

In case it isn't obvious yet, here's the problem:

    >>> len(pickle.dumps(w, 1))
    102
    >>> len(`w`)
    30

So, at least for hammie, you can get a 66% reduction in database size
by *not* pickling WordInfo types.  Tim calls this "administrative pickle
bloat", which is the coolest jargon term I've heard all year.

As I understand it, things which pickle the Bayes object avoid this
overhead from some pickler optimizations along the lines of "if we've
already seen this type, just give it a number and stop referring to it
by name."  Thus, I suppose the proper way to get this reduction in
hammie would be to extend the pickler to recognize WordInfo types,
right?  If so, I'll add that code in.

Neale

From tim.one@comcast.net  Mon Nov  4 18:53:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 13:53:26 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: <w53u1ixdxox.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEDDCGAB.tim.one@comcast.net>

[Neale Pickett]
> Perhaps a picture would be worth 1K words:
>
>     >>> import classifier
>     >>> w = classifier.WordInfo('aoeu', 2)
>     >>> import pickle
>     >>> w
>     WordInfo"('aoeu', 0, 0, 0, 2)"
>     >>> pickle.dumps(w, 1)
>
> 'ccopy_reg\n_reconstructor\nq\x00(cclassifier\nWordInfo\nq\x01c__b
> uiltin__\nobject\nq\x02Ntq\x03R(U\x04aoeuq\x04K\x00K\x00K\x00K\x02
> tq\x05bq\x06.'
>
> In case it isn't obvious yet, here's the problem:
>
>     >>> len(pickle.dumps(w, 1))
>     102
>     >>> len(`w`)
>     30

OTOH,

>>> cPickle.dumps(w.__getstate__(), 1)
'(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.'
>>> len(_)
19
>>>

which is shorter than your string repr.  This isn't typical because 2 is an
absurd spamprob (it's > 1, and is an int instead of a double); the savings
would be greater with a real spamprob (which will consume about 19 bytes in
a string repr, but about 8 in a pickle).

> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types.  Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.

Glad you liked it <wink>.  If you pickle the states instead, you'll save a
lot of space.  The state is a plain tuple.  On the other end, you have to
construct a WordInfo object and pass the unpickled tuple to its __setstate__
method.
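
A minimal sketch of that round trip in current Python. The `WordInfo` class here is a hypothetical stand-in with guessed field names, not the project's class, and `pack`, `unpack` and `sizes` are made-up helper names.

```python
import pickle


class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo (field names are guesses)."""

    def __init__(self, word, spamprob):
        self.word = word
        self.spamcount = 0
        self.hamcount = 0
        self.killcount = 0
        self.spamprob = spamprob

    def __getstate__(self):
        return (self.word, self.spamcount, self.hamcount,
                self.killcount, self.spamprob)

    def __setstate__(self, state):
        (self.word, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state


def pack(info):
    """Pickle only the state tuple, so no class or module name is stored."""
    return pickle.dumps(info.__getstate__(), protocol=2)


def unpack(blob):
    """Rebuild a WordInfo from a state pickle without calling __init__."""
    info = WordInfo.__new__(WordInfo)
    info.__setstate__(pickle.loads(blob))
    return info


def sizes(info):
    """Return (bytes for the whole object, bytes for the state alone)."""
    return len(pickle.dumps(info, protocol=2)), len(pack(info))
```

Calling `sizes(WordInfo("aoeu", 0.3))` from a script should show the state pickle as the smaller of the two, since the whole-object pickle also has to name the module and class.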

> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."

Yes, but a Pickler does this automatically.  You're using convenience
functions, which is why you get no savings.  Here's pickle.dumps():

def dumps(object, bin = 0):
    file = StringIO()
    Pickler(file, bin).dump(object)
    return file.getvalue()

It creates a brand new Pickler every time you call dumps, so nothing can be
remembered from one call to the next.  Avoiding that is clumsy in this
context, but possible:

>>> f = StringIO.StringIO()
>>> p = cPickle.Pickler(f, 1)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'ccopy_reg\n_reconstructor\nq\x01(cclassifier\nWordInfo\nq\x02c__builtin__\n
object\nq\x03NtRq\x04(U\x04abdeq\x05K\x00K\x00K\x00G?\xd3333333tb.'
>>> f.truncate(0)
>>> p.dump(w)
<cPickle.Pickler object at 0x007EE020>
>>> f.getvalue()
'h\x04.'
>>>

In this case, by reusing the Pickler, the second time dumping w created a
3-byte pickle:  the Pickler maintains its own internal dict remembering
everything it pickled in the past.  This can be a real data burden of its
own, though.  See the docs for ways to clear a Pickler's dict (called the
pickle "memo" in the docs).
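
The same effect can be reproduced in current Python with `pickle.Pickler` and `io.BytesIO`. This is a sketch under the assumption that a plain state tuple is being dumped; `dump_with` is a made-up helper name.

```python
import io
import pickle


def dump_with(pickler, buffer, obj):
    """Dump obj through an existing Pickler and return just the new bytes."""
    buffer.seek(0)
    buffer.truncate()
    pickler.dump(obj)
    return buffer.getvalue()


buffer = io.BytesIO()
pickler = pickle.Pickler(buffer, protocol=2)
state = ("aoeu", 0, 0, 0, 0.3)

first = dump_with(pickler, buffer, state)    # full pickle
second = dump_with(pickler, buffer, state)   # memo hit: a short back-reference
pickler.clear_memo()                         # forget everything seen so far
third = dump_with(pickler, buffer, state)    # full pickle again
```

The short second pickle is only a reference into the memo, so it cannot be loaded on its own. That is part of why this route is awkward for a database that stores each record under its own key.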

I'd avoid all that and pickle the states, but that's just me.


From Tim@mail.powweb.com  Mon Nov  4 19:08:48 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 13:08:48 -0600
Subject: [Spambayes] My first results with pop3proxy and smtpproxy
Message-ID: <F0GF5Z8207WHZU0VPIEOJPOIE721.3dc6c5c0@riven>

I've trained using the smtpproxy and a few dozen spams that I hadn't 
deleted and hadn't been contaminated by SA before I got involved with 
spambayes (basically SA mistakes).  Even given the small size of the 
corpus, it is doing an amazingly great job classifying inbound mail.  
It even correctly classified one of those "here's another funny 
story" infernal mails that gets forwarded three hundred times, and I 
hadn't trained it on anything like that.  I have to say that a corpus 
of thousands really isn't turning out to be a necessity for spambayes 
to be useful to me.

One other observation... my strong tendency *IS* to train this thing 
only when it makes a mistake.  Skip et al. have warned beaucoup times 
about not doing this... train on a reasonable smattering of both, even 
if they're correctly classified, and train often.  **BUT** if this is 
my tendency and I understand the system, then this will likely be a 
real problem when the masses get started using it.  How to ensure that 
mistakes-only training isn't the norm?  Beats me.  But we've either 
gotta figure out how to make sure that the teeming masses don't make 
this error, or we've gotta figure out how to make the system tolerate 
this error reasonably well.

- Tim
www.fourstonesExpressions.com 


From guido@python.org  Mon Nov  4 19:27:08 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 14:27:08 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
Message-ID: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>

This smells like a clever spam, disguised as a zope-announce message
I sent.  SA scored it -2.8:

    X-Spam-Status: No, hits=-2.8 required=5.0 tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,QUOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE

Wonder if SB will do any better...

--Guido van Rossum (home page: http://www.python.org/~guido/)

------- Forwarded Message

Date:    Mon, 04 Nov 2002 03:18:02 +0000
From:    "Lindsey Carter" <smileylindsey72001@hotmail.com>
To:      guido@python.org
Subject: Re: [Zope-Annce] New zope.org development

Hey, thanks for writing me back here.  You are the only guy so far, but
I only answered 3 ads, you don't have much competition in our area! I'm
sure you got a few emails, but I hope you take the time to get to know
me better.  What are you doing this weekend?  Maybe we can get together
for coffee or ice cream?  =o)  I bet you are waiting to see my pics? I
will get them to you.  I tried attaching them in this email but they
wouldn't fit so click here
http://www.my-homepages.net/lindseyspage/index.html  I'd rather have
picked out my best pics and emailed them to you, but it's probably
better that you see all these.  I hope I'm not too shocking to you?
There is more about me on my homepage and the pics really make for an
interesting conversation (my specialty).  If you like what you see
I'll be waiting to hear from you,
I hope you approve,
xoxo
Lindsey
PS.  pistachio ice cream is my fav.


>From: Guido van Rossum <guido@python.org>
>To: zope-announce@lists.zope.org
>CC: zope3-dev@zope.org, zope@zope.org, zope-web@zope.org
>Subject: [Zope-Annce] New zope.org development
>Date: Thu, 31 Oct 2002 17:10:16 -0500
>
>There have been complaints about zope.org.  The complaints
>include things like "the design is ugly", "the navigation is
>difficult", "it's too slow", and "search results are not
>useful."
>
>For a long time there have been plans to convert the site to
>the current version of Zope using CMF, and various false
>starts have been made, but so far the site is still running
>software that's best described as "FrankenZope 2.3"...
>
>I opened my big mouth, and now I'm responsible for fixing
>this. :-)
>
>Actually, a new plan was already in place, and all I have to
>do is coach its execution.  Zope Corporation has retained a
>highly skilled Zope developer, Sidnei da Silva, to do the
>work.  The advantage over Zope Corporation developers doing
>this (as has been tried in the past) is that Sidnei isn't
>likely to be pre-empted by higher-priority customer work:
>for him, this *is* customer work, the customer being Zope
>Corporation.  At the same time, because Sidnei is being
>paid, the expectation is that the plan will be carried out
>at a steady pace, as opposed to simply "letting the
>community sort it out."
>
>Sidnei's plan includes the following pieces:
>
>- use Zope 2.6 with CMF 1.3 for the new site
>
>- use a new skin design, the winner of the zope.org contest
>
>- use the new ZCTextIndex search engine
>
>- migrate all existing users and as much content as practical
>   to the new site
>
>(There's more, but we'd be getting into detail territory.)
>
>The project goals include minimizing the amount of new code
>and content to be created, in order to minimize the risks of
>failure.  We also strive to make future maintenance of the
>site simpler, both at the sysadmin level (process and
>resource control) and at the webmaster level (content
>control).  All these goals are designed to make sure that
>the site can be kept current once the upgrade is in place.
>
>Time-wise, Sidnei expects a preview version of the new site
>to go up as new.zope.org within a month, and the final
>version to go live (replacing the old www.zope.org) within
>2-3 months after that.  Sidnei expects to be asking
>community help with some tasks; he'll post about this
>himself.
>
>In addition, we're also hoping to hire Sidnei as part-time
>webmaster for zope.org, starting next week.  Having a steady
>webmaster will help the site stay accurate and up-to-date.
>
>--Guido van Rossum (home page: http://www.python.org/~guido/)
>
>_______________________________________________
>Zope-Announce maillist  -  Zope-Announce@zope.org
>http://lists.zope.org/mailman/listinfo/zope-announce
>
>   Zope-Announce for Announcements only - no discussions
>
>(Related lists -
>  Users: http://lists.zope.org/mailman/listinfo/zope
>  Developers: http://lists.zope.org/mailman/listinfo/zope-dev )


_________________________________________________________________
Unlimited Internet access -- and 2 months free!  Try MSN. 
http://resourcecenter.msn.com/access/plans/2monthsfree.asp

------- End of Forwarded Message


From jeremy@alum.mit.edu  Mon Nov  4 19:32:51 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 4 Nov 2002 14:32:51 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15814.52067.105184.561839@slothrop.zope.com>

>>>>> "GvR" == Guido van Rossum <guido@python.org> writes:

  GvR> This smells like a clever spam, disguised as a zope-announce
  GvR> message I sent.  SA scored it -2.8:

  GvR>     X-Spam-Status: No, hits=-2.8 required=5.0
  GvR>     tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,QUOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE

  GvR> Wonder if SB will do any better...

I got the same spam and SB was sure it was ham.  It was responding to
my ZODB announcement.  As a result the mail contained a bunch of good
ham indicators from my original announcement.  As I recall, the fact
that it not only contained my announcement but had some of the words
quoted really nailed it.  That is, ">release" is a better ham
indicator than "release".
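
Why ">release" and "release" end up as different clues can be seen with a toy whitespace tokenizer. This is an illustration only, not the project's tokenizer; `tokens` and its `strip_quoted` switch are hypothetical.

```python
def tokens(body, strip_quoted=False):
    """Split a message body on whitespace, optionally skipping quoted lines."""
    for line in body.splitlines():
        if strip_quoted and line.lstrip().startswith(">"):
            continue  # drop text quoted from an earlier message
        for word in line.split():
            yield word.lower()
```

With plain whitespace splitting, the first word of every quoted line keeps its ">" and so becomes a separate token from the unquoted word. A spam that quotes a real announcement inherits those tokens along with the rest of the quoted text.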

Jeremy


From tim.one@comcast.net  Mon Nov  4 19:39:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 14:39:31 -0500
Subject: [Spambayes] My first results with pop3proxy and smtpproxy
In-Reply-To: <F0GF5Z8207WHZU0VPIEOJPOIE721.3dc6c5c0@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEDLCGAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> I've trained using the smtpproxy and a few dozen spams that I hadn't
> deleted and hadn't been contaminated by SA before I got involved with
> spambayes (basically SA mistakes).

You don't have to worry about that:  by default, the tokenizer ignores all
header lines SA may have had anything to do with, so it doesn't matter
whether SA has added headers or not.

> Even given the small size of the corpus, it is doing an amazingly great
> job classifying inbound mail.

Good!

> It even correctly classified one of those "here's another funny
> story" infernal mails that gets forwarded three hundred times, and I
> hadn't trained it on anything like that.

Ya, but that would be ham to somebody else.  Train accordingly <wink>.

> I have to say that a corpus of thousands really isn't turning out to
> be a necessity for spambayes to be useful to me.

Indeed, old tests show that it is, on average, *useful* after training on a
single ham and a single spam:  it gets significantly more right than wrong
after that much.  So long as *none* of your ham looks like advertising or
random chatter, a few hundred of each may be fine for you.
Fraction-of-a-percent error rate improvements are important for high-volume
uses (like python.org, which handles more email in a day than most people get
in a year).

> One other observation... my strong tendency *IS* to train this thing
> only when it makes a mistake.

That's a UI problem.  A good UI would deduce what's ham and spam by watching
what you do to your email, and train on a random sampling of it.  The
Outlook client may be the only one making real progress in that direction so
far.
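
One way such a UI might choose what to train on, sketched with made-up names. The tuple layout and the 10% sampling rate are assumptions, and always including mistakes is an added design choice rather than something stated above.

```python
import random


def pick_for_training(messages, sample_rate=0.1, rng=None):
    """Yield (message, is_spam) pairs to train on.

    messages is an iterable of (message, is_spam, was_misclassified)
    tuples, where is_spam is the verdict deduced from what the user did.
    """
    rng = rng or random.Random()
    for message, is_spam, was_misclassified in messages:
        # Learn from every mistake, and also from a random slice of the
        # mail that was already classified correctly.
        if was_misclassified or rng.random() < sample_rate:
            yield message, is_spam
```

Setting `sample_rate=0.0` degenerates into mistakes-only training, which is the habit being warned against.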

> Skip et al. have warned beaucoup times about not doing this...

That would be me.

> train on a reasonable smattering of both, even if they're correctly
> classified, and train often.

The things I call ham would shock you <wink -- but I do get several
categories of difficult ham>.

> **BUT** if this is  my tendency and I understand the system, then this
> will likely be a  real problem when the masses get started using it.
>  How to ensure that mistakes-only training isn't the norm?  Beats me.
> But we've either gotta figure out how to make sure that the teeming
> masses don't make this error, or we've gotta figure out how to make
> the system tolerate this error reasonably well.

It can't tolerate it -- it can only learn what it's been taught, and
reliance on hapaxes is both vital over the short term and brittle over the
long term; ongoing training is needed to prevent hapaxes from becoming a
liability over time.  *Most* spam is dead easy to recognize, though, as is
most ham.  The errors occur in atypical cases.


From guido@python.org  Mon Nov  4 19:42:38 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 14:42:38 -0500
Subject: [Spambayes] deployment for mailman lists
Message-ID: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>

I just realized that the deployment parameters for mailman lists are
entirely different than for individual users.  This may be obvious
already, but I don't recall reading it here.

- Mailing lists have a tendency to have a clear focus, which is
  recorded in the list archives.  This makes for near-ideal training,
  unless in the past a lot of spam made it into the archives (they
  should be manually checked first).

- Integration into Mailman means that there's only one setup to be
  concerned about, rather than the gazillions of different ways
  ordinary users receive their email.

- The person who administers the list can be assumed to be a little
  bit more clueful than an ordinary user.

- An obvious default policy with tunable parameters presents itself:
  ham goes to the list, spam is dropped (or bounced), and unsure goes
  into the moderator's queue.
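
The default policy in the last point might look like this in code. It is a sketch only: `route`, the cutoff values, and the action names are hypothetical, and nothing here is Mailman's API.

```python
def route(score, ham_cutoff=0.2, spam_cutoff=0.9, spam_action="drop"):
    """Map a spam score in [0, 1] to a list action.

    Returns "deliver", "moderate", or whatever spam_action is set to.
    """
    if not 0.0 <= ham_cutoff <= spam_cutoff <= 1.0:
        raise ValueError("need 0 <= ham_cutoff <= spam_cutoff <= 1")
    if score < ham_cutoff:
        return "deliver"      # ham goes to the list
    if score >= spam_cutoff:
        return spam_action    # spam is dropped, bounced, or held
    return "moderate"         # unsure goes to the moderator's queue
```

The two cutoffs and `spam_action` are the tunable parameters; a cautious list admin could set `spam_action="moderate"` so that nothing is discarded unseen.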

(Of course, having this integrated into Mailman also gives Mailman a
leg up against the competition.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From jeremy@alum.mit.edu  Mon Nov  4 19:43:12 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 4 Nov 2002 14:43:12 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <15814.52067.105184.561839@slothrop.zope.com>
References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
	<15814.52067.105184.561839@slothrop.zope.com>
Message-ID: <15814.52688.115970.206304@slothrop.zope.com>

On the other hand, the message you forwarded got scored 0.494 with
both *H* and *S* > 0.98.  I'm quite puzzled, though, about how my
training data is getting used.  I looked back at the spam that came
through (now in my spam training set) and see that it got scored
0.000.  It now gets scored 1.000, but for reasons that don't really
make sense to me.
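
A score near 0.5 with both *H* and *S* high is what chi-squared combining gives when a message carries strong evidence in both directions. The sketch below is written from the general scheme (Fisher's method applied once to the spam evidence and once to the ham evidence). It is an assumption about what was being computed here, not code from the project, and the function names are made up.

```python
import math


def chi2q(x2, dof):
    """Survival function of the chi-squared distribution for even dof."""
    assert dof % 2 == 0
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def combine(probs):
    """Return (S, H, score) for per-word spam probabilities in (0, 1)."""
    n = len(probs)
    spam_evidence = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_evidence = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(spam_evidence, 2 * n)
    H = 1.0 - chi2q(ham_evidence, 2 * n)
    return S, H, (S - H + 1.0) / 2.0
```

Feeding it ten strongly spammy clues plus ten strongly hammy ones, for example `combine([0.99] * 10 + [0.01] * 10)`, pushes both S and H close to 1 and leaves the final score close to 0.5.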

Here's a snippet of the spammish words from the detailed scoring:

>available 0.844827586207
>cc: 0.844827586207
>corp. 0.844827586207
>dickenson 0.844827586207
>persistence 0.844827586207
>pure 0.844827586207
>released 0.844827586207
>source 0.844827586207
>unexpected 0.844827586207
>windows 0.844827586207
>zeo 0.844827586207
>zodb 0.844827586207
>zope 0.844827586207
[zope-annce] 0.844827586207
approve, 0.844827586207
area! 0.844827586207
behavior.) 0.844827586207
beta 0.844827586207
btrees 0.844827586207
compiler, 0.844827586207
conflict 0.844827586207
cream? 0.844827586207
email addr:zope.org, 0.844827586207
emails, 0.844827586207
fav. 0.844827586207
from:"lindsey 0.844827586207
from:carter" 0.844827586207
from:email name:<smileylindsey72001 0.844827586207

It seems like I'm still doing something wrong with pspam and training
but I don't know what.  The odd thing is that I tend to get good
results, the osaf lists aside.
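
Every clue in the listing above has the same probability, 0.844827586207. That is the value a Robinson-style Bayesian adjustment gives for a word seen exactly once, in spam only, if the prior is 0.5 with strength 0.45. Those two constants are assumptions here (the thread does not state them) and `adjusted_prob` is a made-up name, but the arithmetic matches the listing to all twelve printed digits.

```python
def adjusted_prob(raw_prob, n, prior=0.5, strength=0.45):
    """Shrink a raw per-word spam probability toward a prior.

    raw_prob is the observed spam fraction for the word and n the number
    of messages it was seen in; prior and strength are assumed defaults.
    """
    return (strength * prior + n * raw_prob) / (strength + n)
```

Here `adjusted_prob(1.0, 1)` works out to (0.225 + 1) / 1.45. If that reading is right, each listed word is a hapax that entered the database only when this one message was trained as spam, which would explain the jump to 1.000.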

Jeremy


From skip@pobox.com  Mon Nov  4 19:53:27 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 4 Nov 2002 13:53:27 -0600
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15814.53303.926055.735822@montanaro.dyndns.org>


    Guido> - An obvious default policy with tunable parameters presents
    Guido>   itself: ham goes to the list, spam is dropped (or bounced), and
    Guido>   unsure goes into the moderator's queue.

I would argue that spam should by default go into the moderator's queue as
well.  The default should never be to drop or bounce a message.  Either way,
you run the risk that legitimate mail gets lost.

Skip

From piersh@friskit.com  Mon Nov  4 20:12:20 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 4 Nov 2002 12:12:20 -0800
Subject: [Spambayes] My first results with pop3proxy and smtpproxy
Message-ID: <9891913C5BFE87429D71E37F08210CB9297506@zeus.sfhq.friskit.com>

> > One other observation... my strong tendency *IS* to train this thing
> > only when it makes a mistake.
>
> That's a UI problem.  A good UI would deduce what's ham and
> spam by watching what you do to your email, and train on a
> random sampling of it.  The Outlook client may be the only
> one making real progress in that direction so far.

The outlook plugin positively rocks in this respect. That's kinda why I
was suggesting taking the IMAP route since you'd easily (!?) be able to
correct classification errors using whichever IMAP-enabled client UI you
prefer (outlook, OE, mozilla, opera, the list goes on...) and it would
not be something new that users would have to learn.

As an aside: I've been using spambayes with great success since we got
the outlook plugin to play nicely with exchange. As you probably know,
exchange is a centralized message store, and as such, you can have
multiple clients connected to the same store at the same time. I
generally have up to 3 copies of outlook running at any one time. Two at
work, one at home. I run spambayes at home. This morning, when I got to
work, I saw that SB had marked some ham as 'unsure', so I moved it back
into my inbox using an outlook client on a machine where SB was NOT
running. I then connected to my home machine (the one running SB) and
noticed that, even though I had moved the message on a different
machine, the SB plugin had noticed the move and retrained the database.
Very cool.

Piers.
From tim.one@comcast.net  Mon Nov  4 20:07:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:07:23 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEECCGAB.tim.one@comcast.net>

[Guido]
> This smells like a clever spam,

Click on the link to Lindsey's webpage if you have lingering doubts.

> disguised as a zope-announce message I sent.  SA scored it -2.8:
>
>     X-Spam-Status: No, hits=-2.8 required=5.0
> tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,Q
> UOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE
>
> Wonder if SB will do any better...

Absolutely:  SB never gives negative scores <wink>.

Barry once floated the idea of trying to strip quoted text in the tokenizer,
but nobody (AFAIK) tried that.  Short of something like that, I expect the
best you can hope for is that this will end up in your Unsure category.  I
believe that QUOTED_EMAIL_TEXT means SA gave it a *ham* boost for containing
quoted email.  The "Re:" in the subject line is a clue of that sort for SB
too, along with various words starting w/ ">".
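
A minimal sketch of the quoted-text stripping idea, for illustration only.
The thread says nobody had tried it, so the helper name and the rule used
here (drop any line whose first non-blank character is ">") are assumptions
and not part of the spambayes tokenizer.

```python
def strip_quoted_lines(body):
    """Return body with lines that look like quoted email removed.

    Hypothetical helper: a line counts as quoted if its first
    non-blank character is '>'.
    """
    kept = [line for line in body.splitlines()
            if not line.lstrip().startswith(">")]
    return "\n".join(kept)
```

Running this before tokenizing would keep tokens such as ">release" from
ever being generated, which removes them as ham clues in replies to real
list traffic as well.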


From tim.one@comcast.net  Mon Nov  4 20:13:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:13:30 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEEECGAB.tim.one@comcast.net>

[Guido]
> I just realized that the deployment parameters for mailman lists are
> entirely different than for individual users.  This may be obvious
> already, but I don't recall reading it here.

You really don't read much of this list <0.9 wink>.

> - Mailing lists have a tendency to have a clear focus, which is
>   recorded in the list archives.  This makes for near-ideal training,
>   unless in the past a lot of spam made it into the archives (they
>   should be manually checked first).

Yes & no.  Zope lists have a clear focus, but c.l.py is all over the map,
from Alex Martelli discussing the right kind of water to use when preparing
pasta, to debates about Microsoft's place in the world.  You really
can't imagine what a sprawling zoo c.l.py is until you've stared at 20,000
randomly selected msgs for months.

But what they all have in common is ALMOST NO ADVERTISING.  I believe that
makes them much easier than personal email, barring the one-word subscribe
etc thingies attached to mountains of employer-generated disclaimers.

> - Integration into Mailman means that there's only one setup to be
>   concerned about, rather than the gazillions of different ways
>   ordinary users receive their email.

Yup.

> - The person who administers the list can be assumed to be a little
>   bit more clueful than an ordinary user.

Ditto.

> - An obvious default policy with tunable parameters presents itself:
>   ham goes to the list, spam is dropped (or bounced), and unsure goes
>   into the moderator's queue.

Yup.  But since there *is* a non-zero FP rate, and always will be, dropping
email is probably not a politically acceptable option (even if the FP rate
is lower than the chance of a mail-transport screwup losing the mail).
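
The three-way policy under discussion can be sketched as a small routing
function.  This is an illustration, not Mailman or spambayes code: the
function name, the action strings, and the 0.20/0.90 cutoffs are all
assumptions.  The spam action is a required argument because the thread
does not agree on what the default should be.

```python
def route(score, spam_action, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to a list-delivery action.

    spam_action is what the list admin chose for spam:
    "hold" (moderator queue), "bounce", or "drop".
    The cutoffs are illustrative tunable parameters.
    """
    if spam_action not in ("hold", "bounce", "drop"):
        raise ValueError("unknown spam action: %r" % (spam_action,))
    if score < ham_cutoff:
        return "deliver"      # ham goes to the list
    if score > spam_cutoff:
        return spam_action    # spam gets the admin's configured action
    return "hold"             # unsure goes to the moderator's queue
```

For example, route(0.99, "hold") keeps spam in the moderator's queue, while
route(0.99, "bounce") rejects it.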

> (Of course, having this integrated into Mailman also gives Mailman a
> leg up against the competition.)

I believe that's not lost on Barry either <wink>.


From Tim@mail.powweb.com  Mon Nov  4 20:19:53 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 14:19:53 -0600
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEECCGAB.tim.one@comcast.net>
Message-ID: <WVVROVP1VM5ZGE2W8XVKFVUMGP.3dc6d669@riven>

If we were to run a similar bayesian analysis of the pages that spam 
links point to, and used that information as another set of clues for 
classification, would that have made a difference in this instance, 
and in general?  By that I mean, once a mail has been classified as 
spam, we could look at the pages that the mail points to and make a
similar wordlist type classification.  This classification could be 
used in Unsure instances by looking at the pages the mail points to 
and then applying the webpage wordlist bayes classification to it.  If 
it's a probable spam-pointed-to-page, then the mail is probably 
spam...  at least that could weigh (heavily) into the statistics for 
the words in the mail itself....

- TimS

11/4/2002 2:07:23 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Guido]
>> This smells like a clever spam,
>
>Click on the link to Lindsey's webpage if you have lingering doubts.
>
>> disguised as a zope-announce message I sent.  SA scored it -2.8:
>>
>>     X-Spam-Status: No, hits=-2.8 required=5.0
>> tests=BODY_PYTHON_ZOPE,CLICK_BELOW,FROM_BIGISP,FROM_ENDS_IN_NUMS,Q
>> UOTED_EMAIL_TEXT,SPAM_PHRASE_03_05,SUBJ_PYTHON_ZOPE
>>
>> Wonder if SB will do any better...
>
>Absolutely:  SB never gives negative scores <wink>.
>
>Barry once floated the idea of trying to strip quoted text in the tokenizer,
>but nobody (AFAIK) tried that.  Short of something like that, I expect the
>best you can hope for is that this will end up in your Unsure category.  I
>believe that QUOTED_EMAIL_TEXT means SA gave it a *ham* boost for containing
>quoted email.  The "Re:" in the subject line is a clue of that sort for SB
>too, along with various words starting w/ ">".
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Mon Nov  4 20:24:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:24:48 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <15814.52688.115970.206304@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEEFCGAB.tim.one@comcast.net>

[Jeremy Hylton]
> On the other hand, the message you forwarded got scored 0.494 with
> both *H* and *S* > 0.98.  I'm quite puzzled, though, about how my
> training data is getting used.  I looked back at the spam that came
> through (now in my spam training set) and see that it got scored
> 0.000.  It now gets scored 1.000, but for reasons that don't really
> make sense to me.

An endless string of hapaxes.  This is what mistake-based training can be
*expected* to do over time:  swing wildly from near 0 to near 1 (or vice
versa).

> Here's a snippet of the spamish word from the detailed scoring:
>
> >available 0.844827586207

Every word with that spamprob is a hapax (unique to this msg).  The
Bayesian-adjusted spamprob for a word is

     s*x + n*p
     ---------
        s+n

where, for a spam hapax, p=1.0 and n=1.  s and x are taken from Options.py
unless you've overridden them; the defaults are s=0.45 and x=0.5.  Plug
those all in and you get

>>> (.45 * 0.5 + 1 * 1.0) / (.45 + 1)
0.84482758620689669
>>>

for a spam hapax.
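
The same arithmetic as a small function, as a sketch.  The function name is
made up for illustration; the defaults s=0.45 and x=0.5 are the ones quoted
above.

```python
def adjusted_spamprob(p, n, s=0.45, x=0.5):
    """Bayesian-adjusted spamprob: (s*x + n*p) / (s + n).

    p is the raw spam probability estimated for the word, n is the
    number of trained messages the word has been seen in, x is the
    prior for an unknown word, and s is the strength of that prior.
    """
    return (s * x + n * p) / (s + n)
```

adjusted_spamprob(1.0, 1) gives the spam-hapax value shown above, and
adjusted_spamprob(0.0, 1) gives its mirror image for a ham hapax, about
0.155.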

> >cc: 0.844827586207
> >corp. 0.844827586207
> >dickenson 0.844827586207
> >persistence 0.844827586207
> >pure 0.844827586207
> >released 0.844827586207
> >source 0.844827586207
> >unexpected 0.844827586207
> >windows 0.844827586207
> >zeo 0.844827586207
> >zodb 0.844827586207
> >zope 0.844827586207
> [zope-annce] 0.844827586207
> approve, 0.844827586207
> area! 0.844827586207
> behavior.) 0.844827586207
> beta 0.844827586207
> btrees 0.844827586207
> compiler, 0.844827586207
> conflict 0.844827586207
> cream? 0.844827586207
> email addr:zope.org, 0.844827586207
> emails, 0.844827586207
> fav. 0.844827586207
> from:"lindsey 0.844827586207
> from:carter" 0.844827586207
> from:email name:<smileylindsey72001 0.844827586207

So they're *all* spam hapaxes, trained on exactly once, in that email.

In Guido's example under my classifier, I get a dozen stronger-than-hapax
spam clues, thanks to training regularly on correctly classified spam too:

'url:asp'                      0.856992
'skin'                         0.88473
'skilled'                      0.908163
'>do'                          0.908163
'url:index'                    0.911483
'shocking'                     0.921667
'ice'                          0.934783
'ads,'                         0.958716
'emails,'                      0.965116
'part-time'                    0.97619
'area!'                        0.987106
'pics'                         0.991159

> It seems like I'm still doing something wrong with pspam and training
> but I don't know what.  The odd thing is that I tend to get good
> results,

Most spam is easy to detect even from hapaxes.  That's what makes
mistake-based training tempting, I'm afraid.

> the osaf lists aside.

What's an osaf list?


From guido@python.org  Mon Nov  4 20:29:14 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 15:29:14 -0500
Subject: [Spambayes] counterweight: it really works!
In-Reply-To: Your message of "Mon, 04 Nov 2002 16:46:03 GMT."
             <3DC6A44B.4070509@startechgroup.co.uk> 
References: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>  
            <3DC6A44B.4070509@startechgroup.co.uk> 
Message-ID: <200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net>

> FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't 
> officially released yet, but then neither is spambayes [grin]) using the 
> Robinson algorithm. I'm hoping to get the chi-squared algorithm in there 
> too, but I had some trouble with it producing weird results for me (I 
> tried to post something to this list about it but it vanished into the 
> ether, so I'll try again shortly).

Cool!  What do you do for training of your Robinson classifier?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net  Mon Nov  4 20:50:34 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 15:50:34 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <15814.53303.926055.735822@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEEMCGAB.tim.one@comcast.net>

[Skip Montanaro]
> I would argue that spam should by default go into the moderator's
> queue as well.  The default should never be to drop or bounce a
> message.  Either way, you run the risk that legitimate mail gets lost.

That will never fly at python.org:  there's too much spam coming in for
anyone to deal with (or so I've been told -- I believe I get more spam at
home, but only an infinitesimal percentage of the virus email python.org
gets).  Indeed, Greg bounces lots of spam at SMTP *connect* time, without
analyzing it any deeper than seeing that it uses a character set that's out
of favor.  A Reject response should make its way back to the sender then.
If not, ya, the email is lost.  Put a specific price on that, because it's a
tradeoff, and "never" is too costly in other ways to tolerate.


From jeremy@alum.mit.edu  Mon Nov  4 20:44:50 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 4 Nov 2002 15:44:50 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEEFCGAB.tim.one@comcast.net>
References: <15814.52688.115970.206304@slothrop.zope.com>
	<LNBBLJKPBEHFEDALKOLCKEEFCGAB.tim.one@comcast.net>
Message-ID: <15814.56386.841563.8206@slothrop.zope.com>

I've been training, of late, on a growing sample of my incoming
email.  At the moment just a few hundred of each ham and spam.  It has
done moderately well.  Apparently the Carter spam used to trigger on
words in the old archives I was using -- and the new smaller training
database just doesn't have many occurrences of those words.

The osaf lists are for Kapor et al.'s new PIM.  I've got 24 messages
from those lists in my ham training set, but it hasn't been enough to
get the scores reliably below 0.1.

Jeremy


From guido@python.org  Mon Nov  4 20:44:35 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 15:44:35 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: Your message of "04 Nov 2002 09:58:06 PST."
             <w53u1ixdxox.fsf@woozle.org> 
References: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
	<w53k7jyf8ni.fsf@woozle.org> <15809.55847.349091.23441@montanaro.dyndns.org>  
	<w53u1ixdxox.fsf@woozle.org> 
Message-ID: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net>

> In case it isn't obvious yet, here's the problem:
> 
>     >>> len(pickle.dumps(w, 1))
>     102
>     >>> len(`w`)
>     30
> 
> So, at least for hammie, you can get a 66% reduction in database size
> by *not* pickling WordInfo types.  Tim calls this "administrative pickle
> bloat", which is the coolest jargon term I've heard all year.
> 
> As I understand it, things which pickle the Bayes object avoid this
> overhead from some pickler optimizations along the lines of "if we've
> already seen this type, just give it a number and stop referring to it
> by name."  Thus, I suppose the proper way to get this reduction in
> hammie would be to extend the pickler to recognize WordInfo types,
> right?  If so, I'll add that code in.

I'm aware that pickling new-style class instances is inefficient, due
to the gross hack employed.  I'll try to find time to do something
about this in Python 2.3.

You could also experiment with adding a custom __reduce__ method
and/or custom __getstate__ and __setstate__ methods.  Or pickle tuples
instead of WordInfo instances.  Or make WordInfo a classic class
(classic class instances are pickled more efficiently).
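
A sketch of the "pickle tuples instead of WordInfo instances" option.  The
two-field WordInfo below is a stand-in (the real class may carry other
fields), and dump_wordinfo/load_wordinfo are hypothetical helper names.

```python
import pickle


class WordInfo(object):
    """Stand-in for the real WordInfo; field names are assumed."""
    __slots__ = ("spamcount", "hamcount")

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount


def dump_wordinfo(w):
    # Store only the field values; no class or module name is written.
    return pickle.dumps((w.spamcount, w.hamcount), 2)


def load_wordinfo(data):
    # Rebuild the instance from the stored tuple.
    spamcount, hamcount = pickle.loads(data)
    return WordInfo(spamcount, hamcount)
```

Comparing len(dump_wordinfo(w)) with len(pickle.dumps(w, 2)) for the same
record shows how much of each stored value is per-record class
bookkeeping.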

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org  Mon Nov  4 20:58:38 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 15:58:38 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: Your message of "Mon, 04 Nov 2002 14:32:51 EST."
             <15814.52067.105184.561839@slothrop.zope.com> 
References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
	            <15814.52067.105184.561839@slothrop.zope.com> 
Message-ID: <200211042058.gA4Kwc922052@pcp02138704pcs.reston01.va.comcast.net>

> I got the same spam and SB was sure it was ham.  It was responding to
> my ZODB announcement.  As a result the mail contained a bunch of good
> ham indicators from my original announcement.  As I recall, the fact
> that it not only contained my announcement but had some of the words
> quoted really nailed it.  That is, ">release" is a better ham
> indicator than "release".

Yup.  At least one spammer got clever... :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org  Mon Nov  4 21:01:18 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 16:01:18 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: Your message of "Mon, 04 Nov 2002 14:43:12 EST."
             <15814.52688.115970.206304@slothrop.zope.com> 
References: <200211041927.gA4JR8h21174@pcp02138704pcs.reston01.va.comcast.net>
	<15814.52067.105184.561839@slothrop.zope.com>  
	<15814.52688.115970.206304@slothrop.zope.com> 
Message-ID: <200211042101.gA4L1IY22088@pcp02138704pcs.reston01.va.comcast.net>

> On the other hand, the message you forwarded got scored 0.494 with
> both *H* and *S* > 0.98.  I'm quite puzzled, though, about how my
> training data is getting used.  I looked back at the spam that came
> through (now in my spam training set) and see that it got scored
> 0.000.  It now gets scored 1.000, but for reasons that don't really
> make sense to me.
> 
> Here's a snippet of the spamish word from the detailed scoring:
> 
> >available 0.844827586207
> >cc: 0.844827586207
> >corp. 0.844827586207
> >dickenson 0.844827586207
> >persistence 0.844827586207
> >pure 0.844827586207
> >released 0.844827586207
> >source 0.844827586207
> >unexpected 0.844827586207
> >windows 0.844827586207
> >zeo 0.844827586207
> >zodb 0.844827586207
> >zope 0.844827586207
> [zope-annce] 0.844827586207
> approve, 0.844827586207
> area! 0.844827586207
> behavior.) 0.844827586207
> beta 0.844827586207
> btrees 0.844827586207
> compiler, 0.844827586207
> conflict 0.844827586207
> cream? 0.844827586207
> email addr:zope.org, 0.844827586207
> emails, 0.844827586207
> fav. 0.844827586207
> from:"lindsey 0.844827586207
> from:carter" 0.844827586207
> from:email name:<smileylindsey72001 0.844827586207

At least the last 4 are probably unique to this particular spam, so
you must've trained on it.  That should explain why it's now
considered spam.  Unfortunately you've also made zope-announce posts
look more spammy! :-(

--Guido van Rossum (home page: http://www.python.org/~guido/)

From skip@pobox.com  Mon Nov  4 21:03:27 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 4 Nov 2002 15:03:27 -0600
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEEMCGAB.tim.one@comcast.net>
References: <15814.53303.926055.735822@montanaro.dyndns.org>
        <LNBBLJKPBEHFEDALKOLCEEEMCGAB.tim.one@comcast.net>
Message-ID: <15814.57503.637984.11424@montanaro.dyndns.org>

>>>>> "Tim" == Tim Peters <tim.one@comcast.net> writes:

    Tim> [Skip Montanaro]
    >> I would argue that spam should by default go into the moderator's
    >> queue as well.  The default should never be to drop or bounce a
    >> message.  Either way, you run the risk that legitimate mail gets
    >> lost.

    Tim> That will never fly at python.org: there's too much spam coming in
    Tim> for anyone to deal with (or so I've been told -- I believe I get
    Tim> more spam at home, but only an infinitesimal percentage of the
    Tim> virus email python.org gets).  

It's fine to give moderators the ability to twiddle these settings.  The
person managing the mailing list can check the "delete spam" box.  All I'm
saying is that the default should not be to delete anything.

Skip

From guido@python.org  Mon Nov  4 21:06:41 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 16:06:41 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 13:53:27 CST."
             <15814.53303.926055.735822@montanaro.dyndns.org> 
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	            <15814.53303.926055.735822@montanaro.dyndns.org> 
Message-ID: <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>

>     Guido> - An obvious default policy with tunable parameters presents
>     Guido>   itself: ham goes to the list, spam is dropped (or
>     Guido>   bounced), and unsure goes into the moderator's queue.
> 
> I would argue that spam should by default go into the moderator's
> queue as well.  The default should never be to drop or bounce a
> message.  Either way, you run the risk that legitimate mail gets
> lost.

For most mailing lists, I disagree.  It's not like you're going to
miss an important message from your boss or from a potential customer
or employer when a false positive is bounced from the
dangerous-hobbies-involving-jello list.

Given the amount of spam that most lists get, and the clumsiness (I
believe Barry agrees with this assessment :-) of the Mailman
moderation API, putting all spam in the moderation queue by default
would be a bad idea.  I agree that it should be possible to configure
it this way if you really want, but I don't think it should be the
default.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net  Mon Nov  4 21:06:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 16:06:53 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <15814.56386.841563.8206@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEEPCGAB.tim.one@comcast.net>

[Jeremy Hylton]
> I've been training, of late, on a growing sample of my incoming
> email.

Good -- I knew I could browbeat you into that <wink>.

> At the moment just a few hundred of each ham and spam.  It has done
> moderately well.  Apparently the Carter spam used to trigger on
> words in the old archives I was using -- and the new smaller training
> database just doesn't have many occurrences of those words.

Expiring words over time is something that should be done with ongoing
training too ("database pruning").  There's been no progress on that,
though.

> The osaf lists are for Kapor et al.'s new PIM.  I've got 24 messages
> from those lists in my ham training set, but it hasn't been enough to
> get the scores reliably below 0.1.

With just a few hundred training msgs, that's very surprising to me, and
especially since the one example I've seen scored very solidly as ham under
my classifier (which had not been trained on any of these things).  Could
there be a persistence glitch such that training isn't "taking hold"?

I just looked, and noticed that _remove_msg() didn't do the

    self.wordinfo[word] = record

bit at the end which may be needed to tell a persistent DB that the content
of *record* changed.  Then untraining a msg would screw things up, by
decrementing the nspam or nham count but not reducing the word counts to
match.  I'll check in a fix for that now.  Maybe there are other places
"like that".

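Why the reassignment matters can be seen with any store that hands back
copies of its values.  The sketch below uses the standard-library shelve
module as a stand-in, since the thread does not say which persistent DB is
involved; the record layout and function name are likewise assumptions.

```python
import os
import shelve
import tempfile


def untrain_demo():
    """Show that mutating a fetched record is not saved until reassigned."""
    with tempfile.TemporaryDirectory() as tmp:
        with shelve.open(os.path.join(tmp, "wordinfo")) as db:
            db["zope"] = {"spamcount": 1, "hamcount": 0}

            record = db["zope"]        # an unpickled copy of the stored value
            record["spamcount"] -= 1   # changes only the in-memory copy
            before_reassign = db["zope"]["spamcount"]

            db["zope"] = record        # the step _remove_msg() was missing
            after_reassign = db["zope"]["spamcount"]
    return before_reassign, after_reassign
```

untrain_demo() returns (1, 0): the stored count only changes once the
record is assigned back into the mapping.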

From jeremy@alum.mit.edu  Mon Nov  4 21:08:09 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 4 Nov 2002 16:08:09 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net>
References: <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
	<15809.55847.349091.23441@montanaro.dyndns.org>
	<w53u1ixdxox.fsf@woozle.org>
	<200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15814.57785.966040.687158@slothrop.zope.com>

I'd find it convenient if Bayes was a classic class, too, so that I
can more easily use ExtensionClass-based persistence.

Jeremy


From jeremy@alum.mit.edu  Mon Nov  4 21:10:25 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 4 Nov 2002 16:10:25 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEEPCGAB.tim.one@comcast.net>
References: <15814.56386.841563.8206@slothrop.zope.com>
	<LNBBLJKPBEHFEDALKOLCEEEPCGAB.tim.one@comcast.net>
Message-ID: <15814.57921.614104.945895@slothrop.zope.com>

>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

  TP> I just looked, and noticed that _remove_msg() didn't do the

  TP>     self.wordinfo[word] = record

  TP> bit at the end which may be needed to tell a persistent DB that
  TP> the content of *record* changed.  Then untraining a msg would
  TP> screw things up, by decrementing the nspam or nham count but not
  TP> reducing the word counts to match.  I'll check in a fix for that
  TP> now.  Maybe there are other places "like that".

Actually, the pspam code ended up making WordInfo objects back into
independent persistent objects just so that I don't have to worry
about these sorts of issues.  So this is not the problem now (although
it may have been a week or two ago).

Jeremy


From guido@python.org  Mon Nov  4 21:13:23 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 16:13:23 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 15:13:30 EST."
             <LNBBLJKPBEHFEDALKOLCAEEECGAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEEECGAB.tim.one@comcast.net> 
Message-ID: <200211042113.gA4LDNM22178@pcp02138704pcs.reston01.va.comcast.net>

> Yup.  But since there *is* a non-zero FP rate, and always will be,
> dropping email is probably not a politically acceptable option (even
> if the FP rate is lower than the chance of a mail-transport screwup
> losing the mail).

That should be up to the list admin though.  The list admin (or his
sysadmin) has the choice to install other software that drops or
rejects spam anyway (as we do for python.org).

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net  Mon Nov  4 21:13:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 16:13:43 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: <15814.57785.966040.687158@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEFCCGAB.tim.one@comcast.net>

[Jeremy Hylton]
> I'd find it convenient if Bayes was a classic class, too, so that I
> can more easily use ExtensionClass-based persistence.

Fine by me -- check it in.  I only want to keep WordInfo instances
lightweight (via __slots__).


From bkc@murkworks.com  Mon Nov  4 21:20:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 04 Nov 2002 16:20:15 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <WVVROVP1VM5ZGE2W8XVKFVUMGP.3dc6d669@riven>
References: <LNBBLJKPBEHFEDALKOLCAEECCGAB.tim.one@comcast.net>
Message-ID: <3DC69D7B.16642.3E08EE54@localhost>

On 4 Nov 2002 at 14:19, Tim@mail.powweb.com, Stone@mail.powweb.co wrote:


> If we were to run a similar bayesian analysis of the pages that spam 
> links point to, and used that information as another set of clues for

I would not do this. Spammers could use a webbug-like technique to validate your 
email address this way.

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From tim.one@comcast.net  Mon Nov  4 21:26:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 16:26:17 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <200211042101.gA4L1IY22088@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEFECGAB.tim.one@comcast.net>

[Guido]
> ...
> At least the last 4 are probably unique to this particular spam, so
> you must've trained on it.

Read my reply -- *all* the words here were hapaxes.  No exceptions.

> That should explain why it's now considered spam.  Unfortunately you've
> also made zope-announce posts look more spammy! :-(

As soon as he trains on just one ham from zope-announce, the spamprob will
fall to 0.5.  Scoring relying on hapaxes is brittle, despite the instant
gratification it supplies; the correct cure is to train over a random
sampling of all your email regularly, and whether or not it's been correctly
classified.  I got a dozen stronger-than-hapax spam clues out of your email
example (all from the spam part of it), because I keep training even on spam
that scores 1.0 and ham that scores 0.0; this moves spamprobs out of the
brittle hapax range into a reflection of what email *really* looks like.
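
A sketch of the fall to 0.5, under two assumptions that are not spelled out
in this message: the raw probability is the word's spam ratio divided by
the sum of its spam and ham ratios, and equal numbers of ham and spam have
been trained.  The function and parameter names are illustrative.

```python
def word_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Adjusted spamprob for one word from its training counts.

    spamcount/hamcount: trained spam/ham messages containing the word.
    nspam/nham: total trained spam/ham messages (both must be > 0,
    and the word must have been seen at least once).
    """
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    p = spamratio / (spamratio + hamratio)   # raw estimate
    n = spamcount + hamcount                 # evidence for the word
    return (s * x + n * p) / (s + n)         # pull toward prior x
```

With 300 ham and 300 spam trained, word_spamprob(1, 0, 300, 300) is the
0.8448 hapax value, and after a single ham containing the same word,
word_spamprob(1, 1, 300, 300) is 0.5.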


From Tim@mail.powweb.com  Mon Nov  4 21:27:15 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 15:27:15 -0600
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New zope.org
	development
In-Reply-To: <3DC69D7B.16642.3E08EE54@localhost>
Message-ID: <97YT946ZRM1VNH05NH3X86H95NRMIH.3dc6e633@riven>

For sure, there are a number of considerations that might make such a 
proposal impractical.  But my question was more theoretical in nature. 
So... practicalities aside, would an analysis of this nature be 
useful?

- TimS

11/4/2002 3:20:15 PM, "Brad Clements" <bkc@murkworks.com> wrote:

>On 4 Nov 2002 at 14:19, Tim@mail.powweb.com, Stone@mail.powweb.co wrote:
>
>
>> If we were to run a similar bayesian analysis of the pages that spam
>> links point to, and used that information as another set of clues for
>
>I would not do this. Spammers could use a webbug-like technique to validate your
>email address this way.
>
>Brad Clements,                bkc@murkworks.com   (315)268-1000
>http://www.murkworks.com                          (315)268-9812 Fax
>AOL-IM: BKClements
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From skip@pobox.com  Mon Nov  4 21:36:00 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 4 Nov 2002 15:36:00 -0600
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
        <15814.53303.926055.735822@montanaro.dyndns.org>
        <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15814.59456.141076.98902@montanaro.dyndns.org>


    Guido> For most mailing lists, I disagree.  It's not like you're going
    Guido> to miss an important message from your boss or from a potential
    Guido> customer or employer when a false positive is bounced from the
    Guido> dangerous-hobbies-involving-jello list.

Perhaps not, but Mailman and Spambayes could hardly get worse PR than if
valid messages began to simply disappear.  We all know there are any of a
number of reasons why ham can get misclassified.  All I'm saying is make the
default setting for new groups in this yet-to-be Mailman+Spambayes tool be
to forward spam to the moderator.  Mailman can say to the moderator, "This
message looks like spam.  If you would rather I delete such messages, here's
how you do it, and here are the implications."  If mail just disappears,
there is no place to hang that little warning message.

I don't understand why this seems to be such a difficult point to make.  The
readers of this list are so obviously far from the normal user and/or list
moderator that our personal experience as people who read and moderate
technical mailing lists just doesn't apply.  I manage a very active
non-technical mailing list using Mailman.  Most of the people wouldn't know
a Python script if it bit 'em in the ass.  The other people who help me
moderate the list are substantially less computer-savvy than I am.  Trust me
on this.  They wouldn't know how to disable the "delete spam" feature if
they were to somehow figure out why mail was disappearing.

    Guido> Given the amount of spam that most lists get, and the clumsiness
    Guido> (I believe Barry agrees with this assessment :-) of the Mailman
    Guido> moderation API, putting all spam in the moderation queue by
    Guido> default would be a bad idea.  I agree that it should be possible
    Guido> to configure it this way if you really want, but I don't think it
    Guido> should be the default.

That is yet another argument for not deleting mail (and probably an argument
for fixing the moderation interface).  If you save spam you can tell them
precisely where in the moderation interface to go to make the change.  If
the interface is poor, it may well be hard for the moderator to figure out
where to go to stop the bleeding.

Skip


From tim.one@comcast.net  Mon Nov  4 21:37:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 16:37:42 -0500
Subject: [Spambayes] "Lindsey Carter": Re: [Zope-Annce] New
 zope.orgdevelopment
In-Reply-To: <97YT946ZRM1VNH05NH3X86H95NRMIH.3dc6e633@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEFHCGAB.tim.one@comcast.net>

[Tim@mail.powweb.com, on chasing URLs]
> For sure, there are a number of considerations that might make such a
> proposal impractical.  But my question was more theoretical in nature.
> So... practicalities aside, would an analysis of this nature be
> useful?

Maybe, but it's hard to ignore the practicalities.  BTW, I expect a URL that
doesn't resolve would be a great spam clue -- lots of spam sites get shut
down within hours.
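
A minimal sketch of how such a clue might be generated, assuming
Python 3.  The token names "url:resolved" and "url:unresolved" and the
function url_resolution_tokens are hypothetical, not part of the
project.  The resolver is passed in as a parameter so the lookup can be
stubbed out, since doing live DNS lookups while scoring is one of the
practicalities in question.

```python
import re
import socket
from urllib.parse import urlsplit

# Deliberately loose URL matcher; good enough for a sketch.
URL_RE = re.compile(r"https?://[^\s<>\"']+", re.IGNORECASE)


def url_resolution_tokens(text, resolve=socket.gethostbyname):
    """Yield one hypothetical token per URL, based on whether its host resolves."""
    for url in URL_RE.findall(text):
        host = urlsplit(url).hostname
        if not host:
            continue
        try:
            resolve(host)
        except OSError:  # socket.gaierror is a subclass of OSError
            yield "url:unresolved"
        else:
            yield "url:resolved"
```

Calling list(url_resolution_tokens(body_text)) would then give extra
tokens to feed to the classifier alongside the usual word tokens.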


From pje@telecommunity.com  Mon Nov  4 21:51:22 2002
From: pje@telecommunity.com (Phillip J. Eby)
Date: Mon, 04 Nov 2002 16:51:22 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: <15814.57785.966040.687158@slothrop.zope.com>
References: <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net>
 <LNBBLJKPBEHFEDALKOLCIEDPCDAB.tim.one@comcast.net>
 <15809.55847.349091.23441@montanaro.dyndns.org>
 <w53u1ixdxox.fsf@woozle.org>
 <200211042044.gA4KiZE21941@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <5.1.0.14.0.20021104165053.02620be0@mail.telecommunity.com>

At 04:08 PM 11/4/02 -0500, Jeremy Hylton wrote:
>I'd find it convenient if Bayes was a classic class, too, so that I
>can more easily use ExtensionClass-based persistence.

You mean you're not using ZODB 4 yet?  For shame.  :)


From guido@python.org  Mon Nov  4 22:01:52 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 17:01:52 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 15:36:00 CST."
             <15814.59456.141076.98902@montanaro.dyndns.org> 
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<15814.53303.926055.735822@montanaro.dyndns.org>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>  
	<15814.59456.141076.98902@montanaro.dyndns.org> 
Message-ID: <200211042201.gA4M1qe22562@pcp02138704pcs.reston01.va.comcast.net>


From mhammond@skippinet.com.au  Mon Nov  4 22:14:15 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 5 Nov 2002 09:14:15 +1100
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <3DC64611.30897.3CB37091@localhost>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMELKHIAA.mhammond@skippinet.com.au>

> > AFAIK, Outlook Express has no hooks at all for programmers --
> it's a closed
>
>
> How about this?
>
> http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using
> _message_filtering_to_manage_messages.asp
>
> I think this is new information released under the DOJ settlement.

This is simply MAPI documentation, and what the existing Outlook plugin is
using.  (Actually, for the "new message" hook we are using the Outlook model
rather than the documentation you pointed at, but it won't be long until we
move to the MAPI system, I bet <wink>).

I don't think the info is DOJ related - my July 2000 MSDN CD has the same
article.

Unfortunately, I see nothing here that indicates this works for Outlook
Express.  If we can use MAPI with Outlook Express, the plugin should not be
hard to port at all.

Mark.


From guido@python.org  Mon Nov  4 22:11:47 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 17:11:47 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 15:36:00 CST."
             <15814.59456.141076.98902@montanaro.dyndns.org> 
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<15814.53303.926055.735822@montanaro.dyndns.org>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>  
	<15814.59456.141076.98902@montanaro.dyndns.org> 
Message-ID: <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net>

>     Guido> For most mailing lists, I disagree.  It's not like you're
>     Guido> going to miss an important message from your boss or from
>     Guido> a potential customer or employer when a false positive is
>     Guido> bounced from the dangerous-hobbies-involving-jello list.

[Skip]
> Perhaps not, but Mailman and Spambayes could hardly get worse PR
> than if valid messages began to simply disappear.  We all know there
> are any of a number of reasons why ham can get misclassified.  All
> I'm saying is make the default setting for new groups in this
> yet-to-be Mailman+Spambayes tool be to forward spam to the
> moderator.  Mailman can say to the moderator, "This message looks
> like spam.  If you would rather I delete such messages, here's how
> you do it, and here are the implications."  If mail just disappears,
> there is no place to hang that little warning message.
> 
> I don't understand why this seems to be such a difficult point to
> make.  The readers of this list are so obviously far from the normal
> user and/or list moderator that our personal experience as people
> who read and moderate technical mailing lists just doesn't apply.  I
> manage a very active non-technical mailing list using Mailman.  Most
> of the people wouldn't know a Python script if it bit 'em in the
> ass.  The other people who help me moderate the list are
> substantially less computer-savvy than I am.  Trust me on this.
> They wouldn't know how to disable the "delete spam" feature if they
> were to somehow figure out why mail was disappearing.

But the key is that *you* are the list's main administrator and in
charge of the initial setup.  So *you* should set it up to minimize
your pain (which includes constant worries about lost mail due to
false positives in the spam filter).

I believe that while Mailman is relatively easy to set up, it requires
(at least) typical mail admin skills, and a mail admin already has in
his/her head ideas about the cost of lost mail.  You seem to have been
burned by this, and as a consequence I believe you're on the
conservative side.  As long as the consequences are clear when a list
admin chooses to enable spam filtering, I think the default should be
for convenience, not for liability.

>     Guido> Given the amount of spam that most lists get, and the
>     Guido> clumsiness (I believe Barry agrees with this assessment
>     Guido> :-) of the Mailman moderation API, putting all spam in
>     Guido> the moderation queue by default would be a bad idea.  I
>     Guido> agree that it should be possible to configure it this way
>     Guido> if you really want, but I don't think it should be the
>     Guido> default.
> 
> That is yet another argument for not deleting mail (and probably an
> argument for fixing the moderation interface).  If you save spam you
> can tell them precisely where in the moderation interface to go to
> make the change.  If the interface is poor, it may well be hard for
> the moderator to figure out where to go to stop the bleeding.

There's no way you can design a web moderation interface to deal well
with manually moderating 200 spams per day.  IMO if you show *all*
spam in the moderation interface, the kind of non-techie moderator
that you describe is *more* likely to make mistakes (rejecting ham or
approving spam) than in the default that I propose.

You've made this same (or a very similar) point many times, and while
I agree with you that it's bad to delete spam in many setups, I
strongly disagree in this case.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net  Mon Nov  4 22:17:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 17:17:01 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
Message-ID: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>

AFAIK, nobody enables retain_pure_html_tags anymore.  In the very early days
of the project, it was the only choice.  If it's enabled, it's virtually the
same as saying "any msg whatsoever using HTML or XML is spam, even if it's
just a plain-text msg discussing HTML examples".  That was *almost*
appropriate for the early c.l.py tests, because HTML msgs are so hated on
tech mailing lists.

The algorithms have since improved to the point where it does more harm than
good even on my python.org tests, so I can't imagine a good use for it
anymore.

There's still a world of info we're missing inside HTML decorations, but
retaining all of it will never work (the presence or absence of assorted
HTML decorations violates the word-independence assumption to an extreme;
it's a bogus assumption anyway, but there's no hope of recovering from the
abuse HTML tags heap on it).


From bkc@murkworks.com  Mon Nov  4 22:26:37 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 04 Nov 2002 17:26:37 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMELKHIAA.mhammond@skippinet.com.au>
References: <3DC64611.30897.3CB37091@localhost>
Message-ID: <3DC6AD09.3250.3E45B0C3@localhost>

On 5 Nov 2002 at 9:14, Mark Hammond wrote:

> > http://msdn.microsoft.com/library/en-us/mapi/html/_mapi1book_using
> > _message_filtering_to_manage_messages.asp
> >
> > I think this is new information released under the DOJ settlement.

> I don't think the info is DOJ related - my July 2000 MSDN CD has the same
> article.
> 

The DOJ info page at Microsoft, which I had a really hard time finding, listed
"Outlook Express APIs", and that URL pointed to these pages.

Hmm. Who knows. :-(


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From neale@woozle.org  Mon Nov  4 22:36:37 2002
From: neale@woozle.org (Neale Pickett)
Date: 04 Nov 2002 14:36:37 -0800
Subject: [Spambayes] Database reduction
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEDDCGAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOEDDCGAB.tim.one@comcast.net>
Message-ID: <w53iszdc68a.fsf@woozle.org>

So then, Tim Peters <tim.one@comcast.net> is all like:

> OTOH,
> 
> >>> cPickle.dumps(w.__getstate__(), 1)
> '(U\x04aoeuq\x01K\x00K\x00K\x00K\x02t.'
> >>> len(_)
> 19
> >>>
> 
> which is shorter than your string repr.  This isn't typical because 2
> is an absurd spamprob (it's > 1, and is an int instead of a double);
> the savings would be greater with a real spamprob (which will consume
> about 19 bytes in a string repr, but about 8 in a pickle).

Right.  I had some code in hammie to pickle the tuple instead of the
object itself, but I thought it was a pretty gnarly kludge at the time.
In any case, some variation on this seems obviously the right way to go.

> [ Tim magic regarding pickle hacks ]

> I'd avoid all that and pickle the states, but that's just me.

I'm inclined to agree with you.  If I do this, though, we have to all
agree on a convention: if you need to modify a wordinfo object, you
*must* write it back to the dictionary.  Otherwise hammie will never
know it changed.  I was bitten by this a few times at first, and I
haven't played with the code enough to know if any of this has crept
back in.
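
The write-back convention can be sketched with a toy store that, like a
dbm-backed dictionary, hands back a fresh unpickled copy on every read.
PickledStore and the record layout are illustrative assumptions, not
hammie's actual classes (Python 3).

```python
import pickle


class PickledStore:
    """Toy stand-in for a dbm-backed dict: values live as pickled bytes."""

    def __init__(self):
        self._db = {}

    def __getitem__(self, word):
        # Every read unpickles a new object; the store keeps no reference to it.
        return pickle.loads(self._db[word])

    def __setitem__(self, word, record):
        self._db[word] = pickle.dumps(record, protocol=2)


store = PickledStore()
store["viagra"] = {"spamcount": 1, "hamcount": 0}

record = store["viagra"]
record["spamcount"] += 1                       # mutates only the in-memory copy
before_writeback = store["viagra"]["spamcount"]

store["viagra"] = record                       # the explicit write-back
after_writeback = store["viagra"]["spamcount"]
```

Until the record is assigned back, the store still holds the old count,
which is the failure mode described above.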

Would it be out of line to alter WordInfo to be immutable, to encourage
folks to write it back to the dictionary?

Neale

From skip@pobox.com  Mon Nov  4 22:38:27 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 4 Nov 2002 16:38:27 -0600
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net>
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
        <15814.53303.926055.735822@montanaro.dyndns.org>
        <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
        <15814.59456.141076.98902@montanaro.dyndns.org>
        <200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <15814.63203.981010.604877@montanaro.dyndns.org>


    Guido> But the key is that *you* are the list's main administrator and
    Guido> in charge of the initial setup.  So *you* should set it up to
    Guido> minimize your pain (which includes constant worries about lost
    Guido> mail due to false positives in the spam filter).

Correct, but regardless of my abilities in this particular case, the
*default* for new mailing lists - those created by ~mailman/bin/newlist -
should be to not delete the spam.  The administrator of the site has to run
that.  The moderator of the list (who generally won't have shell access to
the machine running Mailman) will then get her chance to go through and
fiddle the bits.

    Guido> I believe that while Mailman is relatively easy to set up, it
    Guido> requires (at least) typical mail admin skills, and a mail admin
    Guido> already has in his/her head ideas about the cost of lost mail.
    Guido> You seem to have been burned by this, and as a consequence I
    Guido> believe you're on the conservative side.  As long as the
    Guido> consequences are clear when a list admin chooses to enable spam
    Guido> filtering, I think the default should be for convenience, not for
    Guido> liability.

It has nothing to do with getting burned, I just have relevant current
experience dealing with less technical lists.  There are tons of
non-technical folks out there running Mailman-managed mailing lists.
Consider that many hosting companies like Hostway make this available to
their customers.  Every other mail-handling tool I've ever seen (sendmail,
fetchmail, procmail, etc) goes to great lengths to avoid losing mail.  Why
shouldn't Mailman?

    Guido> There's no way you can design a web moderation interface to deal
    Guido> well with manually moderating 200 spams per day.  IMO if you show
    Guido> *all* spam in the moderation interface, the kind of non-techie
    Guido> moderator that you describe is *more* likely to make mistakes
    Guido> (rejecting ham or approving spam) than in the default that I
    Guido> propose.

I'm not saying that you have to design an interface to deal with moderating
200 spams a day.  I'm also not saying it's a one-time-only setting.  Still,
by making the default for held spam messages be "discard" instead of
"defer", Mailman could make it a one-click operation to delete all 200 with
one "Submit All Data" click from the moderation interface.  I haven't used
Mailman 2.1 yet, but I think that was something Barry had hoped to make a
configuration option as well.

    Guido> You've made this same (or a very similar) point many times, and
    Guido> while I agree with you that it's bad to delete spam in many
    Guido> setups, I strongly disagree in this case.

Only because you seem to continually misunderstand what I'm saying.  I am
*only* saying it's bad to delete spam by default when the list is first
created.  Let the list moderator decide, "I can't handle all this crap,
please delete it for me".

I see two scenarios:

    1.  An existing mailing list is converted to a new Mailman+Spambayes
        setup.  The moderator is either (a) thankful that all the spam which
        had previously shown up on the list is now somewhere he can deal
        with it, or (b) he was already doing something to deflect most/all
        the spam, so doesn't see much of it in the moderation interface.

    2. A brand new mailing list is set up with Mailman+Spambayes.  As a new
       list, it should not be getting 200 spams per day.  The moderator will
       have time to figure out how to change the settings on the list to
       delete spam instead of hold it.

I just don't understand why you have a hard time understanding that
out-of-the-box Mailman+Spambayes should not delete spam.  It's a one-click
change for Greg or Barry, or whoever controls python-list.  Why not err on
the side of caution?

Skip


From skip@pobox.com  Mon Nov  4 22:39:49 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 4 Nov 2002 16:39:49 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <3DC6AD09.3250.3E45B0C3@localhost>
References: <3DC64611.30897.3CB37091@localhost>
        <3DC6AD09.3250.3E45B0C3@localhost>
Message-ID: <15814.63285.147666.504828@montanaro.dyndns.org>

    Mark> I don't think the info is DOJ related - my July 2000 MSDN CD has
    Mark> the same article.

    Brad> On the DOJ info page at Microsoft, which I had a really hard time
    Brad> finding. It listed "Outlook Express APIs" and that URL pointed to
    Brad> these pages.

Maybe they were just trying to convince DOJ they were complying with the
proposed settlement.

Skip


From tim.one@comcast.net  Mon Nov  4 22:41:48 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 17:41:48 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMELKHIAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEGICGAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> Unfortunately, I see nothing here that indicates this works for Outlook
> Express.  If we can use MAPI with Outlook Express, the plugin
> should not be hard to port at all.

Noting that, according to

    http://support.microsoft.com/default.aspx?scid=KB;EN-US;q192119

at least OE4 physically replaced the system Mapi32.dll with its own version
that spoke only Simple MAPI, when OE was selected as the default email
program.  As a result, all "real" MAPI and CDO apps failed.  Things may have
improved since then (OE is at version 6 now, I believe), but OE4 looks
hopeless for this reason.  Rummaging around, I get the *impression* that
they've solved the conflicting DLL problem, but that OE is still restricted
to Simple MAPI.


From guido@python.org  Mon Nov  4 22:54:24 2002
From: guido@python.org (Guido van Rossum)
Date: Mon, 04 Nov 2002 17:54:24 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 16:38:27 CST."
             <15814.63203.981010.604877@montanaro.dyndns.org> 
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<15814.53303.926055.735822@montanaro.dyndns.org>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
	<15814.59456.141076.98902@montanaro.dyndns.org>
	<200211042211.gA4MBlg22610@pcp02138704pcs.reston01.va.comcast.net>  
	<15814.63203.981010.604877@montanaro.dyndns.org> 
Message-ID: <200211042254.gA4MsOc22974@pcp02138704pcs.reston01.va.comcast.net>

>     Guido> But the key is that *you* are the list's main
>     Guido> administrator and in charge of the initial setup.  So
>     Guido> *you* should set it up to minimize your pain (which
>     Guido> includes constant worries about lost mail due to false
>     Guido> positives in the spam filter).
> 
> Correct, but regardless of my abilities in this particular case, the
> *default* for new mailing lists - those created by
> ~mailman/bin/newlist - should be to not delete the spam.  The
> administrator of the site has to run that.  The moderator of the
> list (who generally won't have shell access to the machine running
> Mailman) will then get her chance to go through and fiddle the bits.

The default should be not to enable spambayes filtering at all, since
there's no way to set up the training data to begin with.

>     Guido> I believe that while Mailman is relatively easy to set
>     Guido> up, it requires (at least) typical mail admin skills, and
>     Guido> a mail admin already has in his/her head ideas about the
>     Guido> cost of lost mail.  You seem to have been burned by this,
>     Guido> and as a consequence I believe you're on the conservative
>     Guido> side.  As long as the consequences are clear when a list
>     Guido> admin chooses to enable spam filtering, I think the
>     Guido> default should be for convenience, not for liability.
> 
> It has nothing to do with getting burned, I just have relevant
> current experience dealing with less technical lists.  There are
> tons of non-technical folks out there running Mailman-managed
> mailing lists.  Consider that many hosting companies like Hostway
> make this available to their customers.  Every other mail-handling
> tool I've ever seen (sendmail, fetchmail, procmail, etc) goes to
> great lengths to avoid losing mail.  Why shouldn't Mailman?

See above.  Enabling spam filtering should be an explicit step.  The
UI should clarify the consequences and show the configuration
settings.  But the default configuration settings *once spam filtering
is enabled* should be to bounce (not drop) spam scoring higher than
the top of the "uncertain" region.  Example UI:

   [ ] Enable Bayesian spam filtering [help link]

       [ 95 ] Spam cutoff score
       [  5 ] Ham cutoff score

       Disposition for messages scoring at least spam cutoff:
       (x)  Bounce
       ( )  Discard
       ( )  Moderate

       Disposition for messages scoring between ham and spam cutoff:
       ( )  Moderate
       (x)  Approve

       <more config options, in particular where to get the ham
       training data>
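
The settings in that example UI could map onto code roughly as follows.
This is only a sketch in Python 3: FilterConfig, disposition() and the
action strings are hypothetical names, not Mailman or Spambayes APIs,
and the handling of scores exactly on a cutoff is an assumption.

```python
from dataclasses import dataclass


@dataclass
class FilterConfig:
    """Defaults mirror the example UI; all names here are hypothetical."""
    enabled: bool = False            # filtering is off until explicitly enabled
    spam_cutoff: float = 95.0
    ham_cutoff: float = 5.0
    spam_action: str = "bounce"      # or "discard" / "moderate"
    unsure_action: str = "approve"   # or "moderate"


def disposition(score, config):
    """Return the action for a message with the given 0-100 spam score."""
    if not config.enabled:
        return "approve"
    if score >= config.spam_cutoff:
        return config.spam_action
    if score > config.ham_cutoff:    # the "uncertain" region
        return config.unsure_action
    return "approve"
```

With the defaults, disposition(99, FilterConfig(enabled=True)) bounces
the message, and nothing is filtered while enabled is False.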

>     Guido> There's no way you can design a web moderation interface
>     Guido> to deal well with manually moderating 200 spams per day.
>     Guido> IMO if you show *all* spam in the moderation interface,
>     Guido> the kind of non-techie moderator that you describe is
>     Guido> *more* likely to make mistakes (rejecting ham or
>     Guido> approving spam) than in the default that I propose.
> 
> I'm not saying that you have to design an interface to deal with
> moderating 200 spams a day.  I'm also not saying it's a
> one-time-only setting.  Still, by making the default for held spam
> messages be "discard" instead of "defer", Mailman could make it a
> one-click operation to delete all 200 with one "Submit All Data"
> click from the moderation interface.  I haven't used Mailman 2.1
> yet, but I think that was something Barry had hoped to make a
> configuration option as well.

And that's exactly what I fear -- mixing the spam and unsure messages
in a single moderation queue will increase mistakes.

>     Guido> You've made this same (or a very similar) point many
>     Guido> times, and while I agree with you that it's bad to delete
>     Guido> spam in many setups, I strongly disagree in this case.
> 
> Only because you seem to continually misunderstand what I'm saying.
> I am *only* saying it's bad to delete spam by default when the list
> is first created.  Let the list moderator decide, "I can't handle
> all this crap, please delete it for me".

OK, then we agree.  I say spam filtering shouldn't be enabled at all
when the list is created -- after all you have no ham training data!

> I see two scenarios:
> 
>     1.  An existing mailing list is converted to a new
>         Mailman+Spambayes setup.  The moderator is either (a)
>         thankful that all the spam which had previously shown up on
>         the list is now somewhere he can deal with it, or (b) he was
>         already doing something to deflect most/all the spam, so
>         doesn't see much of it in the moderation interface.

Depends on whether whatever he was doing before can be ported to the
MM setup.

>     2. A brand new mailing list is set up with Mailman+Spambayes.  As
>        a new list, it should not be getting 200 spams per day.  The
>        moderator will have time to figure out how to change the
>        settings on the list to delete spam instead of hold it.
> 
> I just don't understand why you have a hard time understanding that
> out-of-the-box Mailman+Spambayes should not delete spam.  It's a
> one-click change for Greg or Barry, or whoever controls python-list.
> Why not err on the side of caution?

It was all a big misunderstanding.

--Guido van Rossum (home page: http://www.python.org/~guido/)

From tim.one@comcast.net  Mon Nov  4 23:18:34 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 18:18:34 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211042254.gA4MsOc22974@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEGPCGAB.tim.one@comcast.net>

[Guido]
> ...
> The default should be not to enable spambayes filtering at all, since
> there's no way to set up the training data to begin with.

Well, we don't know that yet.  Work on seeding classifiers has been minimal
so far.  I agree it shouldn't be enabled by default regardless.


From tim.one@comcast.net  Mon Nov  4 23:49:16 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 18:49:16 -0500
Subject: [Spambayes] Database reduction
In-Reply-To: <w53iszdc68a.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEHDCGAB.tim.one@comcast.net>

[Neale Pickett]
> Right.  I had some code in hammie to pickle the tuple instead of the
> object itself, but I thought it was a pretty gnarly kludge at the time.
> In any case, some variation on this seems obviously the right way to go.

If you use __getstate__() to get the tuple, there's nothing objectionable
about it:  it's the *purpose* of __getstate__/__setstate__ to get/set state
into/from tuples.  Objectionable would be to access the fields directly
yourself by name, since they may change over time.  There's a problem here,
though, in that only the Bayes class saves a PICKLE_VERSION identifier in
its pickles; changes in WordInfo structure can't be transparent to old
databases unless WordInfo pickles contained a version identifier too.
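
A sketch of what a versioned state tuple might look like, assuming
Python 3.  The class below is an illustrative stand-in: the field names,
STATE_VERSION, and the dumps_state/loads_state helpers are assumptions
for the example, not the project's code.

```python
import pickle


class WordInfo:
    """Illustrative stand-in; field names and STATE_VERSION are assumptions."""

    STATE_VERSION = 1

    def __init__(self, atime, spamcount=0, hamcount=0, killcount=0, spamprob=0.5):
        self.atime = atime
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.killcount = killcount
        self.spamprob = spamprob

    def __getstate__(self):
        # Lead with a version tag so old databases can be recognized later.
        return (self.STATE_VERSION, self.atime, self.spamcount,
                self.hamcount, self.killcount, self.spamprob)

    def __setstate__(self, state):
        version = state[0]
        if version != self.STATE_VERSION:
            raise ValueError("unsupported WordInfo state version: %r" % (version,))
        (self.atime, self.spamcount, self.hamcount,
         self.killcount, self.spamprob) = state[1:]


def dumps_state(info):
    """Pickle only the state tuple, not the object itself."""
    return pickle.dumps(info.__getstate__(), protocol=2)


def loads_state(data):
    """Rebuild a WordInfo from a pickled state tuple."""
    info = WordInfo.__new__(WordInfo)
    info.__setstate__(pickle.loads(data))
    return info
```

Callers never touch the fields by name when storing or loading, so a
later change in layout only needs a new version number and a branch in
__setstate__.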

>> I'd avoid all that and pickle the states, but that's just me.

> I'm inclined to agree with you.  If I do this, though, we have to all
> agree on a convention: if you need to modify a wordinfo object, you
> *must* write it back to the dictionary.  Otherwise hammie will never
> know it changed.  I was bitten by this a few times at first, and I
> haven't played with the code enough to know if any of this has crept
> back in.

I fixed one of those today.  The database still isn't getting updated with
the new word atimes during scoring, but I've ignored that because nobody has
made any use of atimes yet.

I have to say it's painful to do these redundant stores -- it generally
doubles the number of dict operations, and that's a speed drag.  However,
compared to I/O and tokenization times, it appears to be a minor drag at
worst.

> Would it be out of line to alter WordInfo to be immutable, to encourage
> folks to write it back to the dictionary?

I've done enough bending backwards for a subsystem I don't use <wink>.
There are only a handful of places these structs mutate.


From tim.one@comcast.net  Tue Nov  5 00:14:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:14:17 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <gvuasu08h4eukdvba187ulhf052f996bvf@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEHFCGAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> I've yet to test this theory, but this is one reason I'd like to use HTML
> as the 'GUI toolkit' for the UI of the POP3 proxy.  The docs can
> be tied so closely to the UI that people won't even realise they're
> reading them...

More, my sisters are fluent with their browsers.  One's favorite editor is
still Notepad, although she knows her way around Word far better than I do
(I'll pit my Notepad skills against anyone's, though <wink>).  So, based on
my sibling experience, the two UIs that work are those with no buttons to
push, and those with too many to push to even know where to start.  Let that
be your guiding principle <wink>.


From tim.one@comcast.net  Tue Nov  5 00:21:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:21:41 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <3097SJGMGMJ09SQTS98CAZWUTXWZX.3dc523c4@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEHHCGAB.tim.one@comcast.net>

[Tim]
> a collection of msgs", the latter to remember,
> e.g., which msgs have been trained as ham, and which as spam.

[Tim@mail.powweb.com]
> Remembering is an interesting idea, but what real purpose does it
> serve aside from making testing easier?

Not for testing.  Say a user discovers they made a mistake, and moves a
misclassified spam into their ham folder.  The action needed then is
two-fold:  *un*train the msg as spam, *re*train on it as ham.  That's too
many manual steps for a user to keep track of.  If a training class
remembers what it's done with each msg, though, you need merely inform it
that msg X has moved from thither to yon, and it can deduce everything
needed from that.  Likewise if the user drags a msg from their unsure folder
into a ham folder, or into a spam folder, etc.
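
A rough sketch of such a remembering trainer, in Python 3.  The class
name, its methods, and the simple per-word counters are assumptions made
for illustration; the real classifier keeps more state than this.
Untraining here just means subtracting the counts that training added.

```python
from collections import Counter


class RememberingTrainer:
    """Hypothetical trainer that records how each message was trained."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.msg_counts = Counter()
        self.history = {}  # msg_id -> (label, frozenset of tokens)

    def train(self, msg_id, tokens, label):
        if msg_id in self.history:
            self.untrain(msg_id)       # never count one message twice
        tokens = frozenset(tokens)
        self.word_counts[label].update(tokens)
        self.msg_counts[label] += 1
        self.history[msg_id] = (label, tokens)

    def untrain(self, msg_id):
        # Undo exactly what train() did for this message.
        label, tokens = self.history.pop(msg_id)
        self.word_counts[label].subtract(tokens)
        self.msg_counts[label] -= 1

    def moved(self, msg_id, new_label):
        """The client only reports the move; untrain + retrain is deduced."""
        label, tokens = self.history[msg_id]
        if label != new_label:
            self.train(msg_id, tokens, new_label)
```

A client would call trainer.moved("msg-42", "ham") when the user drags a
message into a ham folder, and nothing else.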

Life might be easier this way if a client supported attaching metadata to
msgs too (Outlook supports rich facilities for doing so; others are probably
restricted to injecting more headers).


From Tim@mail.powweb.com  Tue Nov  5 00:34:27 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 18:34:27 -0600
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEHHCGAB.tim.one@comcast.net>
Message-ID: <MLK75LHGJDIHIH95WDANJ8LITR1X.3dc71213@riven>

So exactly how does one 'untrain' given a particular message?

11/4/2002 6:21:41 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Tim]
>> a collection of msgs", the latter to remember,
>> e.g., which msgs have been trained as ham, and which as spam.
>
>[Tim@mail.powweb.com]
>> Remembering is an interesting idea, but what real purpose does it
>> serve aside from making testing easier?
>
>Not for testing.  Say a user discovers they made a mistake, and moves a
>misclassified spam into their ham folder.  The action needed then is
>two-fold:  *un*train the msg as spam, *re*train on it as ham.  That's too
>many manual steps for a user to keep track of.  If a training class
>remembers what it's done with each msg, though, you need merely inform it
>that msg X has moved from thither to yon, and it can deduce everything
>needed from that.  Likewise if the user drags a msg from their unsure folder
>into a ham folder, or into a spam folder, etc.
>
>Life might be easier this way if a client supported attaching metadata to
>msgs too (Outlook supports rich facilities for doing so; others are probably
>restricted to injecting more headers).
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Tue Nov  5 00:48:40 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 19:48:40 -0500
Subject: [Spambayes] Something to test
In-Reply-To: <200211040627.gA46Rm108104@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEHKCGAB.tim.one@comcast.net>

[Tim]
> This little patch arranges to create "noheader:HEADERNAME" tokens for
> headers in options.safe_headers that *don't* appear in a msg's headers.

This has been checked in now, disabled by default, under bool option name
record_header_absence.
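
In outline, the idea looks something like the sketch below (Python 3,
using the standard email package).  SAFE_HEADERS is a made-up sample
list and header_absence_tokens is a hypothetical name; this is not the
checked-in tokenizer code.

```python
from email import message_from_string

# Made-up sample; the real list comes from options.safe_headers.
SAFE_HEADERS = ("subject", "to", "mime-version", "errors-to")


def header_absence_tokens(msg, safe_headers=SAFE_HEADERS):
    """Yield a 'noheader:NAME' token for each safe header the msg lacks."""
    present = {name.lower() for name in msg.keys()}
    for name in safe_headers:
        if name.lower() not in present:
            yield "noheader:" + name.lower()


example = message_from_string("From: a@example.com\nSubject: hi\n\nbody\n")
missing = list(header_absence_tokens(example))
```

For the example message, which has only From and Subject headers, the
tokens cover the three remaining names in the sample list.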

[Anthony Baxter]

Thanks for testing!

> filename:    before  after
> ham:spam:  11192:1826  11192:1826
> fp total:        0       1
> fp %:         0.00    0.01
> fn total:        7       8
> fn %:         0.38    0.44
> unsure t:      106     107
> unsure %:     0.81    0.82
> real cost:  $28.20  $39.40
> best cost:  $28.20  $30.40
> h mean:       0.63    0.42
> h sdev:       4.19    4.19
> s mean:      98.68   98.63
> s sdev:       7.74    7.95
> mean diff:   98.05   98.21
> k:            8.22    8.09

Wow -- it cut your ham mean by a third <wink>.

> The additional fp was a mail-out from Nettwerk (that I've signed up
> for, but which are _incredibly_ spammy) that went from 0.956 to 0.964,
> where my spam cutoff is 0.96. The noheader: errors-to was the killer
> clue that pushed it over the edge. The spam situation is considerably
> worse. The additional false negative was something that went from 0.467
> to 0.431 (ham_cutoff 0.45). The damage came from
>   prob('noheader:mime-version') = 0.245329
> (It was a very short spam)

So, in all, it nudged two marginal msgs over the edge, but in the wrong
directions.  So I disabled it by default.  It helps python.org tests,
though, so it's an option now.
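
To make the "over the edge" cases concrete, here is a minimal sketch of
three-way classification using the cutoffs quoted above (ham_cutoff 0.45,
spam cutoff 0.96). The function name and the treatment of scores that land
exactly on a boundary are assumptions.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.96):
    """Map a score in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

# The Nettwerk mail-out (ham): unsure before, false positive after.
nettwerk = (classify(0.956), classify(0.964))
# The very short spam: unsure before, false negative after.
short_spam = (classify(0.467), classify(0.431))
```

Both msgs started in the unsure band and moved less than 0.04, which was
enough to cross a cutoff in each case.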

> One fn went from 0.27 to 0.029, due to:
>   prob('noheader:subject') = 0.0042591
>   prob('noheader:to') = 0.0652536

Those are bizarre.  From where do you get ham lacking Subject and To
headers?  In my personal classifier,

                   #h  #s  spamprob
'noheader:to'      10  95  0.884678455795
'noheader:subject'  2  16  0.858858950186

Is there some systematic reason why you've got lots of ham without key
header lines?  Your noheader:subject spamprob in particular is astonishingly
low.

>   prob('noheader:mime-version') = 0.245329
>
> It made pretty much all of my fn's at least slightly worse, if not
> much worse.

The lack of common headers in your ham is the mystery to me.  Try to figure
out why that is?  For example, perhaps you have some systematic source of
ham creating headers the email pkg can't parse.  In that case we fall back
to the raw body text, and don't get any header info at all.  But in that
case, we should learn *why* the email pkg is blowing up, and worm around it.

For the same reason your FN got worse, your FN would get better if these
things had the high spamprobs they were expected to have (and do have, in
all my tests; nobody else has reported on this experiment, alas).


From tim.one@comcast.net  Tue Nov  5 01:01:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 20:01:20 -0500
Subject: [Spambayes] Another great spam
Message-ID: <LNBBLJKPBEHFEDALKOLCAEHLCGAB.tim.one@comcast.net>

It's a trend!  This was a heavy spam day on my home box, with about 150
nailed by the classifier.  But this got scored as rock solid ham.

poets-should-be-shot-just-like-mimes-ly y'rs  - tim


Spam Score: 0.00738292


'*H*'                          0.985376
'*S*'                          0.000141353
'issue.'                       0.0155709
"i'd"                          0.0199122
'section,'                     0.0302013
'publications'                 0.0348837
'page.'                        0.0352476
'left-hand'                    0.0505618
'bay'                          0.0652174
'god,'                         0.0652174
'header:Organization:1'        0.0740827
'there,'                       0.074749
'(in'                          0.083982
'bay,'                         0.0918367
'bay.'                         0.0918367
'homepage,'                    0.0918367
'issue'                        0.120995
'scroll'                       0.143265
'bar).'                        0.155172
'listing.'                     0.155172
'poem'                         0.155172
'subject:Announcement'         0.155172
'leaves'                       0.17751
'archive'                      0.207696
'know'                         0.213762
'page'                         0.219117
'yours,'                       0.245748
'which'                        0.250988
'subject:/'                    0.263804
'thought'                      0.265127
'since'                        0.273836
'released'                     0.277373
'contents'                     0.281415
'web'                          0.290549
'some'                         0.305297
'hello,'                       0.306255
'almost'                       0.312699
'hope'                         0.316514
'two'                          0.327412
'link'                         0.331387
'wishes'                       0.334614
'once'                         0.369855
'about'                        0.37169
'already'                      0.372141
'wanted'                       0.374675
'that'                         0.374984
'doing'                        0.381985
'let'                          0.385857
'however,'                     0.386094
'2002'                         0.388204
'menu'                         0.388208
'changing'                     0.389442
'actually'                     0.398668
'url:com'                      0.608106
'gone!'                        0.612957
'header:MIME-Version:1'        0.613927
'give'                         0.616814
'bottom'                       0.620525
'here.'                        0.633211
'care,'                        0.650199
'way.'                         0.654826
'content'                      0.65579
'thank'                        0.661727
'noheader:errors-to'           0.663115
'header:Return-Path:1'         0.681022
'best'                         0.688081
'powerful'                     0.702059
'year'                         0.734858
'address:'                     0.742064
'click'                        0.749785
'publication'                  0.775658
'interest'                     0.780263
'header:Received:3'            0.78214
'online'                       0.785239
'color'                        0.81036
'amen'                         0.844828
'from:email name:<john'        0.844828
'message-id:@CAMPAIGN'         0.844828
'poetry'                       0.844828
'header:Reply-To:1'            0.901776

Message Stream:


Received: from cpimssmtpa18.msn.com ([10.48.181.50]) by
	cpimsstra13.email.msn.com with Microsoft SMTPSVC(5.0.2195.4905);
	Mon, 4 Nov 2002 15:47:03 -0800
X-MSN-Trace: {AE588338-A3E4-4F92-B125-D94A104CE310}
Received: from webpro.com ([208.135.62.103]) by cpimssmtpa18.msn.com with
	Microsoft SMTPSVC(5.0.2195.4905);	 Mon, 4 Nov 2002 15:33:02 -0800
Received: from CAMPAIGN [208.135.62.199] by webpro.com
  (SMTPD32-7.07) id A2F89700AA; Mon, 04 Nov 2002 18:30:00 -0500
From: <john@johnamen.com>
To: <tim_one@msn.com>
Subject: John Amen/A Publication Announcement
Date: MON, 04 NOV 2002 18:28:14 -0400
MIME-Version: 1.0
Reply-To: john@johnamen.com
Message-Id: <200211041830453.SM00816@CAMPAIGN>
X-RBL-Warning: SPAMHEADERS: This E-mail has headers consistent with spam
	[4000020e].
X-Note: This E-mail was scanned by Declude JunkMail (www.declude.com) for
	spam.
Organization: WEBPRO International
Return-Path: john@johnamen.com
X-OriginalArrivalTime: 04 Nov 2002 23:33:03.0155 (UTC)
	FILETIME=[84E88830:01C2845A]


Hello,

I hope you are doing well and enjoying the autumn.  The leaves are already
changing color here.  God, another year almost gone!

I wanted to let you know about three publications in which my poems are
currently appearing:

Two poems in Thunder Sandwich.
Web address:   http://www.thundersandwich.com
Once you get to the homepage, you'll be flashed to a contents page.  Two of
my poems are included in the poetry section.

One poem in Sidereality.
Web address:   http://www.sidereality.com
Once at the homepage, you can click on "contents" (in the left-hand menu
bar).  My poem is in the poetry section.

One poem in Poetry Bay.  A new issue of this publication has actually been
released since the one in which my poem appeared; however, I thought I'd
give you an archive link.
Web address:   http://www.poetrybay.com
Once there, you click on "Poetry Bay Online Magazine."  That will take you
to the current issue of Poetry Bay, which includes some powerful poems.  If
you scroll to the bottom of the page and click on "Summer 2002" in the
"Prior Versions" section, you'll link to the Summer 2002 issue.  My poem is
included in the content listing.

I thank you for your interest and hope that these poems move you in some
way.  Take care, and best wishes for the winter!

Yours,

John Amen


From tim.one@comcast.net  Tue Nov  5 01:06:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 20:06:58 -0500
Subject: [Spambayes] Email client integration -- what's needed?
In-Reply-To: <MLK75LHGJDIHIH95WDANJ8LITR1X.3dc71213@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEHMCGAB.tim.one@comcast.net>

[Tim@mail.powweb.com]
> So exactly how does one 'untrain' given a particular message?

Exactly the same way you trained that msg, except that instead of calling
the classifier's learn() method, you call its unlearn() method.  The
arglists are identical, and exactly the same stuff should be passed in both
cases.
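
A toy illustration of that symmetry follows. ToyClassifier is a made-up
stand-in that only keeps per-token counts; it is not the Spambayes
classifier, and the (tokens, is_spam) argument list is assumed for the sake
of the example.

```python
from collections import Counter

class ToyClassifier:
    """Keeps per-token ham/spam counts; nothing more."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()
        self.nspam = self.nham = 0

    def learn(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def unlearn(self, tokens, is_spam):
        # Same arguments as learn(); every count moves the other way.
        self._update(tokens, is_spam, -1)

    def _update(self, tokens, is_spam, delta):
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] += delta
            if counts[tok] == 0:
                del counts[tok]
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta

c = ToyClassifier()
tokens = ["poem", "click", "poem"]
c.learn(tokens, True)       # oops, that one was really ham
c.unlearn(tokens, True)     # undo it with identical arguments
c.learn(tokens, False)      # retrain on the right side
```

After the three calls the spam side is back to empty, which only works
because unlearn() saw exactly what learn() saw.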


From vanhorn@whidbey.com  Tue Nov  5 02:59:23 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Mon, 04 Nov 2002 18:59:23 -0800
Subject: [Spambayes] deployment for mailman lists
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3DC7340B.953FACD9@whidbey.com>

Guido van Rossum wrote:

> >     Guido> - An obvious default policy with tunable parameters presents
> >     Guido>   itself: ham goes to the list, spam is dropped (or
> >     Guido>   bounced), and unsure goes into the moderator's queue.
> >
> > I would argue that spam should by default go into the moderator's
> > queue as well.  The default should never be to drop or bounce a
> > message.  Either way, you run the risk that legitimate mail gets
> > lost.
>
> For most mailing lists, I disagree.  It's not like you're going to
> miss an important message from your boss or from a potential customer
> or employer when a false positive is bounced from the
> dangerous-hobbies-involving-jello list.
>
> Given the amount of spam that most lists get, and the clumsiness (I
> believe Barry agrees with this assessment :-) of the Mailman
> moderation API, putting all spam in the moderation queue by default
> would be a bad idea.  I agree that it should be possible to configure
> it this way if you really want, but I don't think it should be the
> default.

Although I find the 2.1b interface rather clumsier than 2.0.13 was, I
would certainly want the default to allow for moderation. About half the
lists I run are commercial announcement lists for employees. The risk there
isn't missing an important message from a potential employer, which should
be barred anyway, but missing one from the current employer, who is paying
for the list.

Also, as I had suggested earlier, I would want the list output to be fed into
ham@. As I understand it, it would be desirable to forward the spam to the
spam@ address to keep at least some training on the spam side going on. With
the combination of hammie running in front and manual moderation, a dedicated
hammie for a single mailing list would be spectacular over time.

Van

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From tim.one@comcast.net  Tue Nov  5 04:10:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 23:10:32 -0500
Subject: FW: RE: RE: [Spambayes] Email client integration -- what's needed?
Message-ID: <LNBBLJKPBEHFEDALKOLCMEIICGAB.tim.one@comcast.net>

Tim, all msgs to you bounce, with

  Recipient address: Tim@mail.powweb.com
  Reason: Remote SMTP server has rejected address
  Diagnostic code: smtp;550 <Tim@mail.powweb.com>: User unknown
  Remote system: dns;mail.powweb.com (TCP|24.153.64.230|22677|
                 63.251.213.34|25) (mail.powweb.com ESMTP Postfix)

-----Original Message-----
From: Tim Peters [mailto:tim.one@comcast.net]
Sent: Monday, November 04, 2002 11:02 PM
To: Tim@mail.powweb.com
Subject: RE: RE: RE: [Spambayes] Email client integration -- what's
needed?


> Ok, I'm totally with ya now.  Is anyone working on a general purpose
> training class?

Not seriously as such, although the Outlook client has steps in that
direction.  I expect people think the retraining steps are too trivial to
factor out, but I think that's a mistake:  while the general class should
indeed end up being simple, there are subtleties that should be captured
once and for all, and the very existence of a training class will help the
next person figure out how to proceed with the next client.  The current
Tester and TestDriver classes (esp. the former) have that flavor too:  their
existence has driven the creation of concrete test drivers, and supplied
just enough commonality so that post-test analysis tools have been
relatively easy to write.

> If not, I can take a crack at it...  The smtpproxy is kinda broken
> without it, because while it can train, it will need some kind of
> remembering in order to be able to untrain...

I think you're in a good position, then.  When the client allows tight
integration, it seems hard to abstract things enough for easy reusability;
with a nightmare client <0.7 wink>, it should be easier to picture an ideal.


From Tim@mail.powweb.com  Tue Nov  5 04:19:01 2002
From: Tim@mail.powweb.com (Tim@mail.powweb.com)
Date: Mon, 04 Nov 2002 22:19:01 -0600
Subject: FW: RE: RE: [Spambayes] Email client integration -- what's
	needed?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEIICGAB.tim.one@comcast.net>
Message-ID: <IFYT0JIYVVS73DA3YLGUSMIFERPZY72.3dc746b5@riven>

Tim, I don't know why it's doing that.  I'm sending from 
tim@fourstonesExpressions.com... but it comes out on the list with the  
invalid address...  Lemme do some more investigation... but it's 
probably my stinkin host... forever screwing me up.

Certainly a training class would have helped me get my head around the 
training side of things.  It's not really a trivial abstraction...

Ok, I'll take a crack at the training class, then... got some ideas, 
but could use a few suggestions on some remembering stuff... All we 
can really ever count on having is a few basic headers and the message 
body.  Somehow from whatever we have, we need to create a key that 
will be used to find a saved message.  I could hash the entire 
message, or use a checksum... ideas?
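
For what it's worth, the hash idea might look like the sketch below: digest
the raw message bytes and use the hex string as the lookup key. The
function name and the choice of SHA-256 are illustrative only, and a key
like this stops matching if anything rewrites the headers or body between
training and untraining.

```python
import hashlib

def message_key(raw_message: bytes) -> str:
    """Return a hex digest usable as a dict or database key."""
    return hashlib.sha256(raw_message).hexdigest()

remembered = {}          # key -> (raw message, was it trained as spam?)
raw = b"Subject: hello\r\n\r\nJust checking in.\r\n"
remembered[message_key(raw)] = (raw, False)

# Later, the same bytes lead back to the saved training record.
found = remembered[message_key(raw)]
```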

11/4/2002 10:10:32 PM, Tim Peters <tim.one@comcast.net> wrote:

>Tim, all msgs to you bounce, with
>
>  Recipient address: Tim@mail.powweb.com
>  Reason: Remote SMTP server has rejected address
>  Diagnostic code: smtp;550 <Tim@mail.powweb.com>: User unknown
>  Remote system: dns;mail.powweb.com (TCP|24.153.64.230|22677|
>                 63.251.213.34|25) (mail.powweb.com ESMTP Postfix)
>
>-----Original Message-----
>From: Tim Peters [mailto:tim.one@comcast.net]
>Sent: Monday, November 04, 2002 11:02 PM
>To: Tim@mail.powweb.com
>Subject: RE: RE: RE: [Spambayes] Email client integration -- what's
>needed?
>
>
>> Ok, I'm totally with ya now.  Is anyone working on a general purpose
>> training class?
>
>Not seriously as such, although the Outlook client has steps in that
>direction.  I expect people think the retraining steps are too trivial to
>factor out, but I think that's a mistake:  while the general class should
>indeed end up being simple, there are subtleties that should be captured
>once and for all, and the very existence of a training class will help the
>next person figure out how to proceed with the next client.  The current
>Tester and TestDriver classes (esp. the former) have that flavor too:  their
>existence has driven the creation of concrete test drivers, and supplied
>just enough commonality so that post-test analysis tools have been
>relatively easy to write.
>
>> If not, I can take a crack at it...  The smtpproxy is kinda broken
>> without it, because while it can train, it will need some kind of
>> remembering in order to be able to untrain...
>
>I think you're in a good position, then.  When the client allows tight
>integration, it seems hard to abstract things enough for easy reusability;
>with a nightmare client <0.7 wink>, it should be easier to picture an ideal.
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Tue Nov  5 04:59:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 04 Nov 2002 23:59:26 -0500
Subject: FW: RE: RE: [Spambayes] Email client integration -- what's
	needed?
In-Reply-To: <IFYT0JIYVVS73DA3YLGUSMIFERPZY72.3dc746b5@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEILCGAB.tim.one@comcast.net>

[TimS]
> Certainly a training class would have helped me get my head around the
> training side of things.  It's not really a trivial abstraction...

It will be if done right <wink> -- it's making it concrete that will be
non-trivial.

> Ok, I'll take a crack at the training class, then... got some ideas,
> but could use a few suggestions on some remembering stuff... All we
> can really ever count on having is a few basic headers and the message
> body.  Somehow from whatever we have, we need to create a key that
> will be used to find a saved message.  I could hash the entire
> message, or use a checksum... ideas?

I don't think the training class should know anything concrete about msgs.
Instead it should work with opaque message objects.  Off the top of my head,
msgs should support:

+ An arbitrary but consistent total ordering (so that they're usable
  as keys in B-Tree based persistent databases), and hashability (so that
  they're usable as keys in a dict).

+ A method to return a human-comprehensible name (perhaps an access
  path relative to the client's folder hierarchy -- but the training
  class shouldn't care).  Note that if these names are required to
  be unique strings, that can be exploited to give a consistent total
  ordering, and hashability (just compare or hash the string names).

+ A method to deliver a token stream, suitable for passing to the
  classifier.  I expect it would be most convenient to make msgs
  iterable, so they can be passed directly as-is to tokenize().

The existing msgs.Msg class does part of this stuff, but is a concrete
class, and geared toward testing.

A training class needs to specify a Msg interface (protocol, abstract base
class, however you like to think of these things), and clients need to
supply classes or factory functions that implement that interface (protocol,
whatever).

Right?  This is just OO design:  identify the objects and actors in the
domain, and model them with classes.  The client will have to supply
concrete versions that implement the interfaces the trainer requires.  The
trick is to define the trainer in such a way that it requires exactly enough
to get its job done, and clients have to implement at least that much (but
may implement more).
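
One way to picture the interface described above, as a sketch under
assumptions: the names Msg, Trainer, TextMsg, RecordingClassifier and
name(), and the learn()/unlearn() argument lists, are illustrative and not
existing Spambayes code. Unique string names supply the ordering and the
hash, iteration supplies the token stream, and the trainer remembers how
each msg was trained so it can untrain later.

```python
from abc import ABC, abstractmethod
from functools import total_ordering


@total_ordering
class Msg(ABC):
    """What the trainer needs from a message, and nothing more."""

    @abstractmethod
    def name(self):
        """Return a unique, human-comprehensible name."""

    @abstractmethod
    def __iter__(self):
        """Yield the tokens to hand to the classifier."""

    # Unique string names give a consistent total ordering and a hash.
    def __eq__(self, other):
        if not isinstance(other, Msg):
            return NotImplemented
        return self.name() == other.name()

    def __lt__(self, other):
        if not isinstance(other, Msg):
            return NotImplemented
        return self.name() < other.name()

    def __hash__(self):
        return hash(self.name())


class Trainer:
    """Remembers how each msg was trained so training can be undone."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.trained = {}                  # Msg -> True if trained as spam

    def train(self, msg, is_spam):
        if msg in self.trained:            # retraining: undo the old verdict
            self.untrain(msg)
        self.classifier.learn(list(msg), is_spam)
        self.trained[msg] = is_spam

    def untrain(self, msg):
        was_spam = self.trained.pop(msg)   # KeyError if never trained
        self.classifier.unlearn(list(msg), was_spam)


# Client side: concrete classes the trainer never needs to know about.
class TextMsg(Msg):
    def __init__(self, path, text):
        self.path, self.text = path, text

    def name(self):
        return self.path

    def __iter__(self):
        return iter(self.text.split())


class RecordingClassifier:
    """Stand-in classifier that just logs the calls it receives."""

    def __init__(self):
        self.calls = []

    def learn(self, tokens, is_spam):
        self.calls.append(("learn", tokens, is_spam))

    def unlearn(self, tokens, is_spam):
        self.calls.append(("unlearn", tokens, is_spam))


classifier = RecordingClassifier()
trainer = Trainer(classifier)
msg = TextMsg("Inbox/poetry/1", "poem click")
trainer.train(msg, True)     # first guess: spam
trainer.train(msg, False)    # corrected: unlearn as spam, learn as ham
```

The trainer touches only name(), iteration, comparison and hashing, so a
client with tight integration and a client with none can both plug in.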


From tim.one@comcast.net  Tue Nov  5 06:05:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 05 Nov 2002 01:05:41 -0500
Subject: [Spambayes] More mildly clever spam
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEHLCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEIOCGAB.tim.one@comcast.net>

I won't show the whole thing here.  It scored 0.62 for me (H=0.75, S=0.99),
so was Unsure, but looking at it was baffling:

    Our highly successful 24 year old multi-national company gives you
    an exclusive business that's guaranteed to make you an extra weekly
    etc etc

Despite the obvious spamicity, the clue list had words like sniper, emacs
and distros.  Turns out they were only visible in reverse video, thanks to
HTML trickery:

"""
Anyone, regardless of background, education or experience can easily<br>
make money with <b><i>BT Online </i>&copy;</b>. We provide everything  you
need.<br>
You can start making a Guaranteed extra income just 5 minutes from now!<br>
<br><font color=white size=1>Finally, I've always found that the RPMs Nvidia
supplies don't put all the files in all the right places. I strongly
recommend using the binary tarballs for the Nvidia kernel and GLX driver
instead. It's incredibly easy; you just unpack them wherever you please,
bust out a root shell and run make from the top level directories. It's
actually easier and faster than RPM, and even better, it always works. Just
make sure that the statements Load "glx" and Driver "nvidia" appear in
etc/X11/XF86Config under Section "Module" and Section "Device" respectively
before you re-boot (or make sure you know how to use emacs or vi, and make
sure you know the path to your XF86Config file -- different distros put it
in different places.)<br></FONT>
<b>"If you can check your email, you can make $$ with <i>BT  Online</i>
&copy;"</b>
"""

Etc on both sides.  There are snippets of news stories about the East Coast
snipers, tech postings, and business stories, spread evenly throughout the
msg.  The white-on-white text is actually used to space out spam paragraphs!
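
Why those words reach the clue list is easy to reproduce: anything that
strips tags and keeps the text sees the white-on-white filler just like the
visible pitch. The sketch below uses Python's html.parser to split the two.
It recognizes only the exact trick shown above (a font tag with
color=white), the class name is invented, and it is not how the Spambayes
tokenizer treats HTML.

```python
from html.parser import HTMLParser

class HiddenTextSplitter(HTMLParser):
    """Collect text inside <font color=white> separately from the rest."""

    def __init__(self):
        super().__init__()
        self.font_stack = []     # one bool per open font tag: hides text?
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            color = (dict(attrs).get("color") or "").lower()
            self.font_stack.append(color == "white")

    def handle_endtag(self, tag):
        if tag == "font" and self.font_stack:
            self.font_stack.pop()

    def handle_data(self, data):
        target = self.hidden if any(self.font_stack) else self.visible
        target.extend(data.split())

sample = ('make money with <b><i>BT Online</i></b>.<br>'
          '<font color=white size=1>make sure you know how to use emacs '
          'or vi</FONT><b>you can make $$</b>')
splitter = HiddenTextSplitter()
splitter.feed(sample)
splitter.close()
```

Here splitter.hidden ends up holding the emacs filler and splitter.visible
the sales pitch; a tag-stripping tokenizer would simply see both.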

I expect that the worst this gimmickery can do with our code is knock a spam
into Unsure territory.  Indeed, despite that there was a lot more hidden ham
than visible spam in this msg, it had 33 words with spamprobs above 0.90,
and it's darned hard to hit that many words with spamprobs below 0.10 by
luck.  This one was particularly lucky in including sniper news, since I
live in the snipers' target area, and have lots of ham about that from
friends & relatives over the last month.  I say "lucky" instead of clever
here because tim_one@msn.com was just one of 22 tim_xyz@msn.com addresses in
the To and Cc lines.

What's amazing me now is how very few spam I get that try to play tricks at
all!
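
An aside on the numbers at the top: a final score of 0.62 from H=0.75 and
S=0.99 is what (S - H + 1) / 2 gives, and the same expression reproduces
the 0.00738 score of the poetry spam from its '*H*' and '*S*' clues to
within rounding. The function name below is invented, and this sketches
only that last combining step, not the classifier code.

```python
def combine(H, S):
    """Fold the ham indicator H and spam indicator S into one score."""
    return (S - H + 1.0) / 2.0

unsure_spam = combine(H=0.75, S=0.99)              # the msg discussed here
poetry_spam = combine(H=0.985376, S=0.000141353)   # the poet's announcement
```

When H and S are both high, as they are here, the two indicators nearly
cancel and the score lands in the middle, which is how hidden ham text can
drag a spam into Unsure without making it look like ham.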


From rob@hooft.net  Tue Nov  5 07:09:32 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Tue, 05 Nov 2002 08:09:32 +0100
Subject: [Spambayes] counterweight: it really works!
References: <3DC6130D.40508@hooft.net> <3DC68F96.6070809@startechgroup.co.uk>
Message-ID: <3DC76EAC.4000807@hooft.net>

Matt Sergeant wrote:
> Rob Hooft said the following on 04/11/02 06:26:

>> Just to remind everyone that this software really works! Its spambayes 
>> score deviates from 1.0 only by about 10**-8, but SA didn't see much

> Please don't compare to 4 months old SpamAssassin's. Upgrade if you want 
> to compare. Thanks.

OK, sorry, maybe I should have updated. OTOH, that is part of the 
problem, isn't it?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Tue Nov  5 07:14:16 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Tue, 05 Nov 2002 08:14:16 +0100
Subject: [Spambayes] Why I added src=cid: etc
References: <LNBBLJKPBEHFEDALKOLCKEBOCGAB.tim.one@comcast.net>
	<3DC6A1D8.6040507@startechgroup.co.uk>
Message-ID: <3DC76FC8.5010109@hooft.net>

Matt Sergeant wrote:
[on viruses]
> 
> Yeah, I've got some neat results just from classifying file extensions. 
> The double extension ones are especially good ;-)
> 
> Matt.

2-line virusscanner in /etc/postfix/body_checks:

/^(Content-(Type|Disposition):.*|[[:space:]]*(file)?)name=("[^"]*|[^[:space:]]*)\.(exe|com|scr|pif|bat|lnk|dll|vbs|js)/
    REJECT
/^Content-Type:[[:space:]]*audio\// REJECT
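
To see which lines the two patterns catch without touching a mail server,
here is a Python translation, offered as a sketch. [[:space:]] becomes \s,
the IGNORECASE flag is an assumption about how the table is matched, and
the sample lines and the would_reject name are invented.

```python
import re

# Python translation of the two patterns ([[:space:]] becomes \s).
# IGNORECASE is an assumption about how the original table is matched.
BAD_ATTACHMENT = re.compile(
    r'^(Content-(Type|Disposition):.*|\s*(file)?)name=("[^"]*|\S*)'
    r'\.(exe|com|scr|pif|bat|lnk|dll|vbs|js)',
    re.IGNORECASE)
AUDIO_PART = re.compile(r'^Content-Type:\s*audio/', re.IGNORECASE)

def would_reject(line):
    """True if either pattern matches this single line."""
    return bool(BAD_ATTACHMENT.search(line) or AUDIO_PART.search(line))

samples = [
    'Content-Disposition: attachment; filename="invoice.pdf.exe"',
    '\tname="photo.jpg"',
    'Content-Type: audio/x-wav; name="song.wav"',
    'Content-Type: text/plain; name="notes.txt"',
]
verdicts = [would_reject(s) for s in samples]
```

The double-extension attachment and the audio part match; the JPEG and the
plain-text attachment do not.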


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie@entrian.com  Tue Nov  5 08:44:56 2002
From: richie@entrian.com (richie@entrian.com)
Date: Tue, 05 Nov 2002 08:44:56 +0000
Subject: [Spambayes] HTML user interface for spambayes
Message-ID: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>

Hi all,

Lots happening in the past couple of days!  To make sure we're not
duplicating effort, I'd like to let people know that I'm partway through
writing an HTML-based user interface for the spambayes database and the
POP3 proxy.

I should commit an initial version in the next couple of days.  At the
moment, it gives you:

 o A 'Word query' form where you can get information from the database
   about a specific word
 o Training by uploading message files to it (one at once at the moment,
   but I'll add support for mbox files)
 o Training by pasting an email into a form
 o The status of the pop3proxy - how many emails classified and so on,
   plus a shutdown button.

I had a look at POPFile, and their HTML user interface lets you
(re)classify recently-proxied messages very easily.  This is a great
idea, and along with Tim's SMTP proxy stuff should make the process of
(re)classification nice and simple.

One question: can we still untrain a message?  The code is still there,
but I have it in my head that some of Gary's ideas prevented untraining
from working, and that was why the CV tests needed to retrain from
scratch for each pass... or am I getting this totally wrong?  I hope I
am, because reclassifying (through the web or through SMTP) will need
to use the untraining stuff.

Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
integrating it with the pop3proxy (but I need to discuss that with you,
Tim, after looking in more detail at the code, hopefully tonight).

-- 
Richie Hindle
richie@entrian.com


From rob@hooft.net  Tue Nov  5 10:22:57 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Tue, 05 Nov 2002 11:22:57 +0100
Subject: [Spambayes] HTML user interface for spambayes
References: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>
Message-ID: <3DC79C01.4090303@hooft.net>

richie@entrian.com wrote:
> One question: can we still untrain a message?  The code is still there,
> but I have it in my head that some of Gary's ideas prevented untraining
> from working, and that was why the CV tests needed to retrain from
> scratch for each pass... or am I getting this totally wrong?  I hope I
> am, because reclassifying (through the web or through SMTP) will need
> to use the untraining stuff.

All the CV stuff that could not untrain has been removed from the code; 
with the score methods that are in there now, untrain will work fine.

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From msergeant@startechgroup.co.uk  Tue Nov  5 10:20:58 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 05 Nov 2002 10:20:58 +0000
Subject: [Spambayes] counterweight: it really works!
References: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>
	<3DC6A44B.4070509@startechgroup.co.uk>
	<200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3DC79B8A.9000409@startechgroup.co.uk>

Guido van Rossum said the following on 04/11/02 20:29:
>>FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't 
>>officially released yet, but then neither is spambayes [grin]) using the 
>>Robinson algorithm. I'm hoping to get the chi-squared algorithm in there 
>>too, but I had some trouble with it producing weird results for me (I 
>>tried to post something to this list about it but it vanished into the 
>>ether, so I'll try again shortly).
> 
> 
> Cool!  What do you do for training of your Robinson classifier?

At the moment it's manually trained (same way as spambayes), but we're 
looking into auto-training - feeding back current SA results into the 
corpus. That's a double-edged sword of course, but it'll be interesting 
research.

Matt.


From msergeant@startechgroup.co.uk  Tue Nov  5 10:27:40 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 05 Nov 2002 10:27:40 +0000
Subject: [Spambayes] counterweight: it really works!
References: <3DC6130D.40508@hooft.net> <3DC68F96.6070809@startechgroup.co.uk>
	<3DC76EAC.4000807@hooft.net>
Message-ID: <3DC79D1C.5040403@startechgroup.co.uk>

Rob W.W. Hooft said the following on 05/11/02 07:09:
> Matt Sergeant wrote:
> 
>>Rob Hooft said the following on 04/11/02 06:26:
> 
> 
>>>Just to remind everyone that this software really works! Its spambayes 
>>>score deviates from 1.0 only by about 10**-8, but SA didn't see much
> 
> 
>>Please don't compare to 4 months old SpamAssassin's. Upgrade if you want 
>>to compare. Thanks.
> 
> OK, sorry, maybe I should have updated. OTOH, that is part of the 
> problem, isn't it?

As I explained in another email, both spambayes (and other statistical 
solutions) and SpamAssassin need constant updating. It's just that 
spambayes is slightly easier, as the manual intervention you have to do 
isn't far off your regular reading of email (whereas SpamAssassin 
requires you to drop into a console and type:
perl -MCPAN -e 'install Mail::SpamAssassin'). Of course you could put the 
latter in a cron job, but most sensible people wouldn't trust it.

BTW: I'm not suggesting that SA would have caught that particular spam - 
it probably wouldn't have, I just hate to see invalid comparisons.

Matt.


From msergeant@startechgroup.co.uk  Tue Nov  5 10:29:04 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 05 Nov 2002 10:29:04 +0000
Subject: [Spambayes] Why I added src=cid: etc
References: <LNBBLJKPBEHFEDALKOLCKEBOCGAB.tim.one@comcast.net>
	<3DC6A1D8.6040507@startechgroup.co.uk> <3DC76FC8.5010109@hooft.net>
Message-ID: <3DC79D70.3030309@startechgroup.co.uk>

Rob W.W. Hooft said the following on 05/11/02 07:14:
> Matt Sergeant wrote:
> [on viruses]
> 
>>Yeah, I've got some neat results just from classifying file extensions. 
>>The double extension ones are especially good ;-)
>>
>>Matt.
> 
> 
> 2-line virusscanner in /etc/postfix/body_checks:
> 
> /^(Content-(Type|Disposition):.*|[[:space:]]*(file)?)name=("[^"]*|[^[:space:]]*)\.(exe|com|scr|pif|bat|lnk|dll|vbs|js)/
>     REJECT
> /^Content-Type:[[:space:]]*audio\// REJECT

Never REJECT on file extension. Only ever ACCEPT! This is the same rule 
as firewalling - never close off insecure ports, only open the ones you 
know are secure and/or needed.
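
The accept-only rule, sketched in Python for contrast with the reject list
quoted above. The extension set and the function name are invented for the
example; which types are safe to let through is a site decision.

```python
import os

# Hypothetical allowlist; anything not named here is refused.
ALLOWED_EXTENSIONS = {".txt", ".pdf", ".png", ".jpg"}

def accept_attachment(filename):
    """Accept only when the final extension is explicitly allowed."""
    ext = os.path.splitext(filename)[1].lower()
    return ext in ALLOWED_EXTENSIONS

results = {name: accept_attachment(name)
           for name in ["notes.txt", "invoice.pdf.exe", "setup.msi", "README"]}
```

Note that setup.msi is refused here even though no reject list in this
thread names it; that is the point of defaulting to "no".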

Matt.


From sjoerd@acm.org  Tue Nov  5 10:42:42 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Tue, 05 Nov 2002 11:42:42 +0100
Subject: [Spambayes] Something to test
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEKMCFAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCEEKMCFAB.tim.one@comcast.net> 
Message-ID: <200211051042.gA5AggI07324@indus.ins.cwi.nl>

On Sun, Nov 3 2002 Tim Peters wrote:

> Index: tokenizer.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/tokenizer.py,v
> retrieving revision 1.60
> diff -c -r1.60 tokenizer.py
> *** tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> --- tokenizer.py        3 Nov 2002 08:31:44 -0000
> ***************
> *** 1178,1183 ****
> --- 1178,1185 ----
>                       x2n[x] = x2n.get(x, 0) + 1
>           for x in x2n.items():
>               yield "header:%s:%d" % x
> +         for x in options.safe_headers - Set([k.lower() for k in x2n]):
> +             yield "noheader:" + x
> 
>       def tokenize_body(self, msg, maxword=options.skip_max_word_size):
>           """Generate a stream of tokens from an email Message.

Here are my results:

filename:     cv1s    cv2s
ham:spam:  11850:3360  11850:3360
fp total:        3       3
fp %:         0.03    0.03
fn total:        4       4
fn %:         0.12    0.12
unsure t:      103     100
unsure %:     0.68    0.66
real cost:  $54.60  $54.00
best cost:  $26.60  $25.80
h mean:       0.20    0.19
h sdev:       3.15    3.15
s mean:      99.29   99.28
s sdev:       5.94    5.95
mean diff:   99.09   99.09
k:           10.90   10.89

The difference between the two runs: 3 unsure messages got nailed
correctly, so it's a marginal improvement.
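
For readers new to these tables, the "mean diff" and "k" rows can be
reproduced from the four rows above them: mean diff is s mean minus h mean,
and k matches mean diff divided by the sum of the two standard deviations.
That reading is inferred from the figures posted in this thread, so take it
as an assumption about the table script rather than a description of it.

```python
def separation(h_mean, h_sdev, s_mean, s_sdev):
    """Return (mean diff, k) for one column of the table."""
    diff = s_mean - h_mean
    return round(diff, 2), round(diff / (h_sdev + s_sdev), 2)

cv1s = separation(0.20, 3.15, 99.29, 5.94)
cv2s = separation(0.19, 3.15, 99.28, 5.95)
```

Both columns come out at a mean diff of 99.09, with k of 10.90 and 10.89,
agreeing with the table.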

-- Sjoerd Mullender <sjoerd@acm.org>

From bkc@murkworks.com  Tue Nov  5 15:07:54 2002
From: bkc@murkworks.com (Brad Clements)
Date: Tue, 05 Nov 2002 10:07:54 -0500
Subject: [Spambayes] More mildly clever spam
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEIOCGAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCAEHLCGAB.tim.one@comcast.net>
Message-ID: <3DC797B2.10341.41DA65C8@localhost>

On 5 Nov 2002 at 1:05, Tim Peters wrote:

> I say "lucky" instead
> of clever here because tim_one@msn.com was just one of 22 tim_xyz@msn.com
> addresses in the To and Cc lines.

Is the tokenizer still not parsing/counting recipients?

> What's amazing me now is how very few spam I get that try to play tricks at
> all!

Someone, somewhere will put this into an automated spam tool, just wait.


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From guido@python.org  Tue Nov  5 15:23:40 2002
From: guido@python.org (Guido van Rossum)
Date: Tue, 05 Nov 2002 10:23:40 -0500
Subject: [Spambayes] counterweight: it really works!
In-Reply-To: Your message of "Tue, 05 Nov 2002 10:20:58 GMT."
             <3DC79B8A.9000409@startechgroup.co.uk> 
References: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>
	<3DC6A44B.4070509@startechgroup.co.uk>
	<200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net>  
	<3DC79B8A.9000409@startechgroup.co.uk> 
Message-ID: <200211051523.gA5FNec19110@odiug.zope.com>

> >>FWIW SpamAssassin now has a statistical classifier (in 2.50, which isn't 
> >>officially released yet, but then neither is spambayes [grin]) using the 
> >>Robinson algorithm. I'm hoping to get the chi-squared algorithm in there 
> >>too, but I had some trouble with it producing weird results for me (I 
> >>tried to post something to this list about it but it vanished into the 
> >>ether, so I'll try again shortly).
> > 
> > 
> > Cool!  What do you do for training of your Robinson classifier?
> 
> At the moment it's manually trained (same way as spambayes), but we're 
> looking into auto-training - feeding back current SA results into the 
> corpus. That's a double edged sword of course, but it'll be interesting 
> research.

With the manual training, do you distribute it pre-trained on a
standard set of email?  Or do you let the installer train it?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From guido@python.org  Tue Nov  5 15:41:29 2002
From: guido@python.org (Guido van Rossum)
Date: Tue, 05 Nov 2002 10:41:29 -0500
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: Your message of "Mon, 04 Nov 2002 18:59:23 PST."
             <3DC7340B.953FACD9@whidbey.com> 
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<15814.53303.926055.735822@montanaro.dyndns.org>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>  
	<3DC7340B.953FACD9@whidbey.com> 
Message-ID: <200211051541.gA5FfTs19274@odiug.zope.com>

> Although I find the 2.1b interface to be rather clumsier than 2.0.13
> was,

Please provide Barry with details!

> I would certainly want the default to allow for moderation. About
> half the lists I run are commercial, announcement lists for
> employees. It's not that you risk missing an important message from
> a potential employer, which should be barred, but from the current
> employer, who is paying for the list.

If they are closed for posting, you shouldn't need to turn on
additional spam filtering anyway.  Even if they aren't, I would find
it strange that such "internal" lists would get much spam -- if
they're internal, why do their addresses appear on the web?

--Guido van Rossum (home page: http://www.python.org/~guido/)

From msergeant@startechgroup.co.uk  Tue Nov  5 15:40:52 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 05 Nov 2002 15:40:52 +0000
Subject: [Spambayes] counterweight: it really works!
References: <LHPJECGCWVOTFE3X5ZF0JD3OITOJD.3dc6a2da@riven>
	<3DC6A44B.4070509@startechgroup.co.uk>
	<200211042029.gA4KTEc21789@pcp02138704pcs.reston01.va.comcast.net>
	<3DC79B8A.9000409@startechgroup.co.uk>
	<200211051523.gA5FNec19110@odiug.zope.com>
Message-ID: <3DC7E684.3090008@startechgroup.co.uk>

Guido van Rossum said the following on 05/11/02 15:23:

> With the manual training, do you distribute it pre-trained on a
> standard set of email?  Or do you let the installer train it?

We're planning to ship it pre-trained. Otherwise we lose some 
plug-and-playness.

Matt.


From Paul.Moore@atosorigin.com  Tue Nov  5 15:53:57 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 5 Nov 2002 15:53:57 -0000
Subject: [Spambayes] Outlook addin - initial impressions
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com>

(I hope this works, I'm posting from my work a/c, rather than the home
one that's subscribed to the list...)

I just grabbed the latest spambayes from CVS and set it up on my Outlook
installation. Unfortunately, I haven't been saving spam, but what the
heck, I thought, I'll train it on what I've got (320 good mails in my
inbox, and a measly 34 spam received today). Amazingly, even with this
tiny training set, the results are pretty good already. I've only
received a few more spams so far today (the US-based spammers haven't
woken up yet, I guess) but they've all been caught correctly.

One thing is not right, though. In the dialog for managing the filter,
the option to automatically run the filters is greyed out (as is the
"Advanced" button, but I took that as being "no advanced options yet").
So I can't set automatic filtering on, and I have to manually filter my
mail. FWIW, I do have a number of rules wizard entries which run to
filter out my mailing list traffic - I understand that the rules wizard
runs first and so only mails left by that will get checked by Spambayes,
but that's OK. I checked the trace output, and saw

-----------------------------------------------
Collecting Python Trace Output...
Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\default_bayes_database.pck'
Loaded message database from 'C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000default_message_database.pck'
Bayes database initialized with 34 spam and 320 good messages
Setting image to delete_as_spam.bmp
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
SpamAddin - OnAddInsUpdate None
SpamAddin - OnStartupComplete None
Traceback (most recent call last):
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 88, in OnInitDialog
    self.UpdateControlStatus()
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 131, in updateControlStatus
    self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
UnboundLocalError: local variable 'filter_status' referenced before assignment
win32ui: OnInitDialog() virtual handler (<bound method ManagerDialog.OnInitDialog of <dialogs.ManagerDialog.ManagerDialog instance at 0x04DDD5B8>>) raised an exception
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Traceback (most recent call last):
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 157, in OnButDoSomething
    self.UpdateControlStatus()
  File "C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\dialogs\ManagerDialog.py", line 131, in UpdateControlStatus
    self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
UnboundLocalError: local variable 'filter_status' referenced before assignment
win32ui: Error in Command Message handler for command ID 1029, Code 0
C:\Documents and Settings\UK03306.UKAO\Desktop\spambayes\Outlook2000\about.html
-----------------------------------------------

I'm guessing that the UnboundLocalError is not right...

But even fixing that (by setting filter_status to "" at the top of the
routine) didn't enable the "enable filtering" button. I can't see an
obvious answer to this. I'll try to find out a bit more, but I thought I
may as well let the list know, in case it's an easy problem for someone
familiar with the code...
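
The traceback pattern is the usual one for a local that is bound on only some
branches. The sketch below is purely illustrative: the function names, folder
arguments and status strings are hypothetical and are not taken from
ManagerDialog.py. It also shows why pre-setting the string to "" can silence
the error without enabling anything, if the enable decision is a separate
value.

```python
def status_buggy(spam_folder, unsure_folder):
    # filter_status is bound only when both folders are configured.
    if spam_folder and unsure_folder:
        filter_status = "Filtering to '%s' and '%s'" % (spam_folder, unsure_folder)
    return filter_status  # UnboundLocalError when a folder is missing


def status_fixed(spam_folder, unsure_folder):
    # Every path returns both a message to display and an enable flag,
    # so the caller never sees an unbound name or a stale decision.
    if not spam_folder:
        return "No spam folder selected", False
    if not unsure_folder:
        return "No unsure folder selected", False
    return "Filtering to '%s' and '%s'" % (spam_folder, unsure_folder), True
```

Calling status_buggy("Spam", None) raises UnboundLocalError, while
status_fixed("Spam", None) reports why filtering cannot be enabled.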

Anyway, thanks for the tool - I'm very impressed, even given that there
are still some rough edges to smooth out.

Paul.

From Paul.Moore@atosorigin.com  Tue Nov  5 16:28:56 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 5 Nov 2002 16:28:56 -0000
Subject: [Spambayes] Re: Outlook addin - initial impressions
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D88@UKDCX001.uk.int.atosorigin.com>

> (I hope this works, I'm posting from my work a/c, rather
> than the home one that's subscribed to the list...)

I see it did - thanks, Mr. Moderator, I've subscribed now :-)
And sorry for the horrible formatting...

Looking again at the notes, I wonder - is the problem of the
"filter" button not being enabled what is referred to as "Filtering
an Exchange Server public store appears to not work."? I was planning
on filtering my inbox, which is indeed on Exchange...

Assuming this is the issue, then can I offer myself as a guinea pig?
Is there anything useful I can do (test scripts I can run, output I
can report) to help diagnose the problem? [OK, so "provide a patch to
fix the problem" is probably more help, but I've not got that far yet
:-)]

Paul.


From tim.one@comcast.net  Tue Nov  5 17:04:54 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 05 Nov 2002 12:04:54 -0500
Subject: [Spambayes] More mildly clever spam
In-Reply-To: <3DC797B2.10341.41DA65C8@localhost>
Message-ID: <BIEJKCLHCIOIHAGOKOLHMEKODNAA.tim.one@comcast.net>

[Tim]
>> I say "lucky" instead of clever here because tim_one@msn.com was
>> just one of 22 tim_xyz@msn.com addresses in the To and Cc lines.

[Brad Clements]
> Is the tokenizer still not parsing/counting recipients?

It is counting To and Cc recipients now, and the spam in question did get
two strikes against it for "fat" recipient lists.  It probably wouldn't have
helped to tokenize the recipients.
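
Counting recipients, as opposed to tokenizing each address, can be sketched
roughly as follows. This is an illustration of the idea only: the token format
and the function name are assumptions, not the tokenizer's actual output.

```python
import math
from email import message_from_string
from email.utils import getaddresses


def recipient_count_tokens(msg):
    """Yield one token per To/Cc header family, bucketed by log2 of the count.

    Bucketing keeps the number of distinct tokens small, so a "fat"
    recipient list becomes a clue without memorizing exact counts.
    """
    for field in ("to", "cc"):
        addresses = getaddresses(msg.get_all(field, []))
        count = len([addr for _name, addr in addresses if addr])
        if count:
            yield "%s:2**%d" % (field, int(math.log2(count)))


# A message with 22 similar To addresses, like the spam described above.
raw = ("To: " + ", ".join("tim_%d@msn.com" % i for i in range(22)) +
       "\nCc: someone@example.com\n\nbody\n")
tokens = list(recipient_count_tokens(message_from_string(raw)))
```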

>> What's amazing me now is how very few spam I get that try to
>> play tricks at all!

> Someone, somewhere will put this into an automated spam tool, just wait.

It looked automated to me already.  It's another case where, if we weren't
throwing HTML tags away, the classifier would pick up on the HTML tricks
(like size=1 and color=white) by itself.  If it becomes a popular dodge, not
blinding the classifier to those would help; and/or the tokenizer could try
to figure which text "is invisible", and not let the classifier see that
stuff.  It's not like this trick was deep <wink>.
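
One way to guess which text "is invisible" is to track the enclosing font
tags while parsing. The sketch below is a rough heuristic under stated
assumptions: it only looks at size=1 and a few white color values on font
tags, it ignores the page background, and the class name is invented for
illustration.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Split text into visible and probably-hidden runs."""

    HIDDEN_COLORS = {"white", "#fff", "#ffffff"}

    def __init__(self):
        super().__init__()
        self.hidden_stack = []  # one bool per currently open <font> tag
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            a = {k: (v or "").strip().lower() for k, v in attrs}
            self.hidden_stack.append(
                a.get("size") == "1" or a.get("color") in self.HIDDEN_COLORS)

    def handle_endtag(self, tag):
        if tag == "font" and self.hidden_stack:
            self.hidden_stack.pop()

    def handle_data(self, data):
        # Text is hidden if any enclosing <font> looked hidden.
        target = self.hidden if any(self.hidden_stack) else self.visible
        target.append(data)


parser = VisibleTextExtractor()
parser.feed("Buy now! <font size=1 color=white>python zope guido</font> Cheap ink.")
parser.close()
```

The hidden run could either be dropped before tokenizing or turned into a
clue of its own; which works better would need testing.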


From tim.one@comcast.net  Tue Nov  5 17:27:40 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 05 Nov 2002 12:27:40 -0500
Subject: [Spambayes] HTML user interface for spambayes
In-Reply-To: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>
Message-ID: <BIEJKCLHCIOIHAGOKOLHIELDDNAA.tim.one@comcast.net>

[richie@entrian.com]
> ...
> One question: can we still untrain a message?

Yes, and I hope forever more:  all of the problematic "third training pass"
combining schemes were removed from the codebase.  Learning and unlearning
can be mixed freely under the combining schemes remaining.


From bkc@murkworks.com  Tue Nov  5 17:39:22 2002
From: bkc@murkworks.com (Brad Clements)
Date: Tue, 05 Nov 2002 12:39:22 -0500
Subject: [Spambayes] Capitol Steps spam song.. OT?
Message-ID: <3DC7BB31.17489.42651270@localhost>

Sorry if this is OT.

Anyone else hear the Capitol Steps Halloween special this year?  They had a
spam song; it wasn't too bad.

http://www.capsteps.com/radio/


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From vanhorn@whidbey.com  Tue Nov  5 17:42:47 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Tue, 05 Nov 2002 09:42:47 -0800
Subject: [Spambayes] deployment for mailman lists
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
	<15814.53303.926055.735822@montanaro.dyndns.org>
	<200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>  
	<200211051541.gA5FfTs19274@odiug.zope.com>
Message-ID: <3DC80317.B38952C4@whidbey.com>

Guido,

While the membership functions are more powerful, and probably a real boon
to those with really large lists, spreading those functions over three
pages is clumsier for dealing with small lists like mine. The old
interface for clearing deferred posts had everything on one page, now you
need to jump to a second page. I see what bigger lists gain, and am not
particularly complaining.

The lists do not get "a lot" of spam, but because they are in folks'
address books the address does get spread around. Also, some of the lists
here are definitely public. All of the lists here are unmoderated but
accept messages only from members. But spam and viruses do get submitted
and require manual intervention. Unfortunately, many of the posts that
require attention are "non member" submissions, most of which are valid. I
would welcome a way to eliminate the spam and virus entries and reduce the
number of trips to the admin interface.

Van

Guido van Rossum wrote:

> > Although I find the 2.1b interface to be rather clumsier than 2.0.13
> > was,
>
> Please provide Barry with details!
>
> > I would certainly want the default to allow for moderation. About
> > half the lists I run are commercial, announcement lists for
> > employees. It's not that you risk missing an important message from
> > a potential employer, which should be barred, but from the current
> > employer, who is paying for the list.
>
> If they are closed for posting, you shouldn't need to turn on
> additional spam filtering anyway.  Even if they aren't, I would find
> it strange that such "internal" lists would get much spam -- if
> they're internal, why do their addresses appear on the web?
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From skip@pobox.com  Tue Nov  5 18:11:53 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 12:11:53 -0600
Subject: [Spambayes] I thought bogus message structure problem was solved...
Message-ID: <15816.2537.764299.386817@montanaro.dyndns.org>


---------------------- multipart/mixed attachment
I just saw a message with no hammie header.  Looking at my procmail log file
I saw this traceback info:

    Tue Nov  5 11:49:47 2002
    Traceback (most recent call last):
      File "/Users/skip/local/bin/hammie.py", line 488, in ?
        main()
      File "/Users/skip/local/bin/hammie.py", line 472, in main
        filtered = h.filter(msg)
      File "/Users/skip/local/bin/hammie.py", line 269, in filter
        msg = email.message_from_string(msg)
      File "/Users/skip/local/lib/python2.3/email/__init__.py", line 52, in message_from_string
        return Parser(_class, strict=strict).parsestr(s)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 75, in parsestr
        return self.parse(StringIO(text), headersonly=headersonly)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 64, in parse
        self._parsebody(root, fp)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 228, in _parsebody
        msgobj = self.parsestr(part)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 75, in parsestr
        return self.parse(StringIO(text), headersonly=headersonly)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 62, in parse
        self._parseheaders(root, fp)
      File "/Users/skip/local/lib/python2.3/email/Parser.py", line 128, in _parseheaders
        raise Errors.HeaderParseError(
    email.Errors.HeaderParseError: Not a header, not a continuation: ``It’s Easier to Shop Online!''
    procmail: Program failure (1) of "/Users/skip/local/bin/hammie.py"
    procmail: Rescue of unfiltered data succeeded

The message structure is clearly bogus (attached for completeness).  I
thought someone had fixed this problem, but it appears it was only in other
contexts.  Looking around for ParseError I see that in a couple instances
MessageParseError (base for HeaderParseError) is trapped, as in this snippet
from mboxutils.py:

    def _factory(fp):
        # Helper for getmbox
        try:
            return email.message_from_file(fp)
        except email.Errors.MessageParseError:
            return ''

However, it seems like we ought to be able to come up with a better fallback
action than returning an empty string when classifying messages.  Is there a
way to simply treat the entire body as plain text even though the
Content-Type header says otherwise?
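
A fallback along those lines might look like the sketch below. It is written
against the current standard-library email package rather than the 2002 one,
and the function name and the injectable parse argument are assumptions made
for illustration. On a parse failure it keeps the simple one-line headers,
drops the MIME headers that can no longer be trusted, and stores everything
after the first blank line as a plain-text payload.

```python
import email
import email.errors
from email.message import Message

# Headers that describe the (bogus) structure; not copied to the fallback.
_UNTRUSTED = ("content-type", "content-transfer-encoding")


def parse_with_fallback(text, parse=email.message_from_string):
    """Return a Message; on a parse error, treat the body as plain text."""
    try:
        return parse(text)
    except (email.errors.MessageError, email.errors.MessageDefect):
        head, _, body = text.partition("\n\n")
        msg = Message()
        for line in head.splitlines():
            name, sep, value = line.partition(":")
            # Skip continuation lines, the mbox "From " line, and anything
            # else that does not look like a simple "Name: value" header.
            if not sep or not name or name[0].isspace() or " " in name:
                continue
            if name.lower() in _UNTRUSTED:
                continue
            msg[name] = value.strip()
        msg.set_payload(body)  # no Content-Type, so it defaults to text/plain
        return msg
```

A message handled this way still yields header and body tokens, instead of
being classified as an empty string.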

Skip


---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: OfficeManager <ink@gonetdeals.com>
Subject: F_R_E_E Shipping! Printer Ink Sale! Details Inside!
Date: Tue, 5 Nov 2002 12:44:16 -0500
Size: 225
Url: http://mail.python.org/pipermail/spambayes/attachments/20021105/a129a481/attachment.txt

---------------------- multipart/mixed attachment--

From jeremy@alum.mit.edu  Tue Nov  5 18:22:48 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 5 Nov 2002 13:22:48 -0500
Subject: [Spambayes] I thought bogus message structure problem was
	solved...
In-Reply-To: <15816.2537.764299.386817@montanaro.dyndns.org>
References: <15816.2537.764299.386817@montanaro.dyndns.org>
Message-ID: <15816.3192.941960.767496@slothrop.zope.com>

I've had similar problems with some digests for python bugs & patches
lists.  Barry says to try again with the latest version of the email
package.  He thinks it is fixed.

Jeremy


From fazal@majid.fm  Tue Nov  5 18:29:51 2002
From: fazal@majid.fm (Fazal Majid)
Date: Tue, 5 Nov 2002 10:29:51 -0800
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk mail"
	
Message-ID: <MPBBIKONHLCABGAGKJBNOEDGJCAA.fazal@majid.fm>

> Guido van Rossum  guido@python.org
> Fri Nov 1 00:56:13 2002
>
> If they don't know the word "spam", they don't need a spam filter yet.
>
> I agree that we need something better than "ham".  Non-spam works for
> me; "good mail" too.
>
> --Guido van Rossum (home page: http://www.python.org/~guido/)

While revulsion for spam may be universal, Jews, Muslims, Hindus and
Buddhists (combined, over 50% of the world's population) would not
necessarily think of "ham" as something desirable.

--
Fazal Majid                                     Mail:   fazal@majid.fm
1111 Jones St. Apt #1                           Voice: +1 415 359 0918
San Francisco, CA 94109                         PCS:   +1 818 231 2144
USA                                             http://www.majid.info/


From tim@fourstonesExpressions.com  Tue Nov  5 18:35:06 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 05 Nov 2002 12:35:06 -0600
Subject: [Spambayes] Terminology in user documentation: "spam" vs. "junk
	mail"
In-Reply-To: <MPBBIKONHLCABGAGKJBNOEDGJCAA.fazal@majid.fm>
Message-ID: <71SMZLILFSOGAJDRO5193JEUPUS4W4.3dc80f5a@riven>

Good point, Fazal.

- TimS

11/5/2002 12:29:51 PM, "Fazal Majid" <fazal@majid.fm> wrote:

>> Guido van Rossum  guido@python.org
>> Fri Nov 1 00:56:13 2002
>>
>> If they don't know the word "spam", they don't need a spam filter yet.
>>
>> I agree that we need something better than "ham".  Non-spam works for
>> me; "good mail" too.
>>
>> --Guido van Rossum (home page: http://www.python.org/~guido/)
>
>While revulsion for spam may be universal, Jews, Muslims, Hindus and
>Buddhists (combined, over 50% of the world's population) would not
>necessarily think of "ham" as something desirable.
>
>--
>Fazal Majid                                     Mail:   fazal@majid.fm
>1111 Jones St. Apt #1                           Voice: +1 415 359 0918
>San Francisco, CA 94109                         PCS:   +1 818 231 2144
>USA                                             http://www.majid.info/
>
>
- Tim
www.fourstonesExpressions.com 


From skip@pobox.com  Tue Nov  5 18:56:33 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 12:56:33 -0600
Subject: [Spambayes] deployment for mailman lists
In-Reply-To: <200211051541.gA5FfTs19274@odiug.zope.com>
References: <200211041942.gA4Jgc621320@pcp02138704pcs.reston01.va.comcast.net>
        <15814.53303.926055.735822@montanaro.dyndns.org>
        <200211042106.gA4L6fg22122@pcp02138704pcs.reston01.va.comcast.net>
        <3DC7340B.953FACD9@whidbey.com>
        <200211051541.gA5FfTs19274@odiug.zope.com>
Message-ID: <15816.5217.915721.290970@montanaro.dyndns.org>


    Guido> Even if they aren't, I would find it strange that such "internal"
    Guido> lists would get much spam -- if they're internal, why do their
    Guido> addresses appear on the web?

Everything leaks eventually, for lots of reasons.

Skip

From tim.one@comcast.net  Tue Nov  5 19:32:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 05 Nov 2002 14:32:04 -0500
Subject: [Spambayes] I thought bogus message structure problem was
	solved...
Message-ID: <64b6162160.6216064b61@icomcast.net>

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
[Skip Montanaro]
> I just saw a message with no hammie header.  Looking at my 
> procmail log file I saw this traceback info:
> ...
>      File "/Users/skip/local/bin/hammie.py", line 269, in filter
> 	msg = email.message_from_string(msg)

How about refactoring the code?  tokenizer.Tokenizer.get_message()
wrestles with all known problems getting an email object out of a
string, and all code should use it.

> ...
> However, it seems like we ought to be able to come up with a 
> better fallback action than returning an empty string when
> classifying messages.
>
> Is there a way to simply treat the entire body as plain text even
> though the Content-Type header says otherwise?

See the method referenced above.

---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: OfficeManager <ink@gonetdeals.com>
Subject: F_R_E_E Shipping! Printer Ink Sale! Details Inside!
Date: Tue, 05 Nov 2002 12:44:16 -0500
Size: 258
Url: http://mail.python.org/pipermail/spambayes/attachments/20021105/eeadc60e/attachment.txt

---------------------- multipart/mixed attachment--

From tim.one@comcast.net  Tue Nov  5 19:39:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 05 Nov 2002 14:39:05 -0500
Subject: [Spambayes] Terminology in user documentation: "spam" vs.
 "junk mail"
Message-ID: <5df3b5f93f.5f93f5df3b@icomcast.net>

[Fazal Majid]
> While revulsion for spam may be universal, Jews, Muslims, Hindus and
> Buddhists (combined, over 50% of the world's population) would not
> necessarily think of "ham" as something desirable.

They would if they ever ate spam <wink>.  If the Hormel Corporation and 
over half the world's population can't tolerate good-natured wordplay, 
then (a) I believe it, and (b) that's life.


From db3l@fitlinxx.com  Tue Nov  5 19:42:27 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 05 Nov 2002 14:42:27 -0500
Subject: [Spambayes] Re: Outlook addin - initial impressions
References: <16E1010E4581B049ABC51D4975CEDB885E2D88@UKDCX001.uk.int.atosorigin.com>
Message-ID: <r0hu1ivpzvg.fsf@ctwd0222.corp.fitlinxx.com>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> Looking again at the notes, I wonder - is the problem of the
> "filter" button not being enabled what is referred to as "Filtering
> an Exchange Server public store appears to not work."? I was planning
> on filtering my inbox, which is indeed on Exchange...

I think it's more a buglet when first setting up filters and not
specifying an action to perform.  If you leave the filter set to
Untouched (the default) and don't select folders this will happen just
due to how the code tries to build the string for displaying on the
main manager dialog.

A workaround is just to pick an action (Copy/Move) and target folders on
the filter - e.g., use the suggested setup of having Spam and
Possible Spam folders to move the mail into.  It's not having folders
selected that is bypassing the logic to enable the checkbox.

The other fix is to patch the code to set filter_status to something
useful to display, and also set ok_to_enable to True, so the checkbox
for filtering gets enabled.  In this case you'll get some later
complaints about the Untouched setting when it applies the filter but
all will work properly without moving/copying the message.

There were definitely problems doing ID lookups on an Exchange server
public store, but as of the changes in CVS from 11/2 or so it seems to
be working properly at least for me - I've been filtering my Exchange
server inbox since yesterday as well as using Exchange server folders
for both Spam and Possible Spam folders, with no burps yet.

-- David


From mhammond@skippinet.com.au  Tue Nov  5 21:58:00 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 6 Nov 2002 08:58:00 +1100
Subject: [Spambayes] Outlook addin - initial impressions
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEBAHJAA.mhammond@skippinet.com.au>

> One thing is not right, though. In the dialog for managing the
> filter, the option to automatically run the filters is greyed out
> (as is the "Advanced" button, but I took that as being "no
> advanced options yet"). So I can't set automatic filtering on,
> and I have to manually filter my mail. FWIW, I do have a number
> of rules wizard entries which run to filter out my mailing list
> traffic - I understand that the rules wizard runs first and so
> only mails left by that will get checked by Spambayes, but that's
> OK. I checked the trace output, and saw


> rDialog.py", line 131, in updateControlStatus
>     self.SetDlgItemText(IDC_FILTER_STATUS, filter_status)
> UnboundLocalError: local variable 'filter_status' referenced
> before assignment

> But even fixing that (by setting filter_status to "" at the top
> of the routine) didn't enable the "enable filtering" button. I

The problem would be that the code detected filtering can not be enabled,
but neglected to set the status to indicate why.

A quick check of the code shows there are 2 ways this can happen - if no
"Spam" or "Uncertain" folders are set up in the training dialog.  But if this
is not the case, we should work out what branch is being taken that wants
filtering to be disabled.

I checked in a patch that catches what I found - please see if it fixes your
problem.

> Anyway, thanks for the tool - I'm very impressed, even given that
> there are still some rough edges to smooth out.

Thanks!  Hopefully some of the people using this pre-alpha tool will be
willing to pick up the sand-paper <wink>

Mark.


From lists@morpheus.demon.co.uk  Tue Nov  5 21:50:03 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Tue, 05 Nov 2002 21:50:03 +0000
Subject: [Spambayes] Outlook addin - initial impressions
References: <16E1010E4581B049ABC51D4975CEDB885E2D86@UKDCX001.uk.int.atosorigin.com>
Message-ID: <n2m-g.lm47660k.fsf@morpheus.demon.co.uk>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> (I hope this works, I'm posting from my work a/c, rather than the home
> one that's subscribed to the list...)

That one got through (thanks, Mr. Moderator!) although I must
apologise for the dreadful formatting... But the followup which I
posted after subscribing doesn't seem to have :-(

But I've got more information since then...

> One thing is not right, though. In the dialog for managing the filter,
> the option to automatically run the filters is greyed out

Specifically, in the add-in manager dialog, the "Enable filtering"
checkbox is greyed out, so I can't enable filtering.

The DB has 82 good and 27 spam, so it's "big enough" to allow
filtering to be enabled (according to the code)

[... pause while Paul fiddles ...]

Ah! I hadn't wanted to move "Possible Spam", so I'd left that option
as "Leave untouched". The code doesn't enable the "enable" checkbox in
that case. But if I change the "Possible Spam" action to move possible
spam to "Inbox", and then reset it to "Untouched", the enable checkbox
is enabled!

So it's not as bad as I thought, but I guess it is a minor bug.

Paul.

-- 
This signature intentionally left blank

From richie@entrian.com  Tue Nov  5 22:25:41 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 05 Nov 2002 22:25:41 +0000
Subject: [Spambayes] HTML user interface for spambayes
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHIELDDNAA.tim.one@comcast.net>
References: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>
	<BIEJKCLHCIOIHAGOKOLHIELDDNAA.tim.one@comcast.net>
Message-ID: <ra9gsus1eoikvm56oshs8rpa9dvu4p0vtt@4ax.com>


[Richie]
> One question: can we still untrain a message?

[Tim and Rob]
> Yes

Good.  A POPFile-style HTML retraining interface is feasible then - that
goes from my 'maybe' list to my 'ooo yes please' list.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Tue Nov  5 22:22:31 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 05 Nov 2002 22:22:31 +0000
Subject: [Spambayes] HTML user interface for spambayes
In-Reply-To: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>
References: <E188zKS-0002cy-0U@anchor-post-39.mail.demon.net>
Message-ID: <1vagsug6n353cqupqd9ie47s2nl62lq2gj@4ax.com>

Hi,

The HTML-based user interface is now committed.  Quoting myself:

> it gives you:
>
>  o A 'Word query' form where you can get information from the database
>    about a specific word
>  o Training by uploading message files to it (one at once at the moment,
>    but I'll add support for mbox files)
>  o Training by pasting an email into a form
>  o The status of the pop3proxy - how many emails classified and so on,
>    plus a shutdown button.

"pop3proxy -b" (plus the existing options) will launch a web browser
containing the user interface (after loading the database, which is a bit
annoying on a slow machine).  Training saves the pickle (if you're using
one) every time, which is also a pain on slow machines - I'll sort that out
RSN.
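
One common way to avoid rewriting the pickle on every training event is to
batch the writes. The sketch below is only an illustration of that idea, not
the pop3proxy code: the class name, the batch size and the use of a plain
picklable object as the database are all assumptions.

```python
import pickle


class DeferredSaver:
    """Persist a trained object at most once per `every` training events."""

    def __init__(self, path, every=20):
        self.path = path
        self.every = every
        self.pending = 0  # training events since the last write

    def trained(self, database):
        """Call after each training event; writes only when a batch is full."""
        self.pending += 1
        if self.pending >= self.every:
            self.flush(database)

    def flush(self, database):
        """Write now if anything is pending, e.g. at shutdown."""
        if self.pending:
            with open(self.path, "wb") as fp:
                pickle.dump(database, fp, protocol=pickle.HIGHEST_PROTOCOL)
            self.pending = 0
```

The trade-off is that up to `every` - 1 training events are lost if the
process dies before flush() runs, so a shutdown handler should call it.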

I'm planning to add an HTML (re)training interface a la POPFile as well -
it keeps a cache of last (configurably) few messages and lets you
(re)classify them from your browser.  When time permits...

My plan of building the documentation and the UI together has not yet come
to fruition - I was more interested in making the UI work than in designing
it.  Apart from the viking helmet - that was essential.  8-)

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Wed Nov  6 00:05:07 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 06 Nov 2002 00:05:07 +0000
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <3DC52BE4.6010602@hooft.net>
References: <3DC4D0F1.5000509@hooft.net>
	<3fv9suk4fi0m7bgtm04258gmjvr0j3i046@4ax.com> <3DC51403.6030208@hooft.net>
	<3DC52BE4.6010602@hooft.net>
Message-ID: <lo7gsucdkikqfh13ogld0mfhgr1pc4redi@4ax.com>


Rob moved the names and values of the X-Hammie-Disposition header into the
Options - pop3proxy now respects this as well as hammie.  The defaults are
still the old "X-Hammie-Disposition" system.  pop3proxy doesn't put in any
of the extra header pieces; just plain old Yes/No/Unsure.

Are there any other pieces that have any of this hard-coded?  If not,
should we switch to X-Spambayes-Classification: Spam/Ham/Unsure?  A few
people have come up with objections or alternatives, but more have backed
this idea than any other (including leaving things as they are - or is
there a silent majority in favour of that option?).
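
For concreteness, stamping a message with the proposed header could look like
the sketch below. The header name and the Spam/Ham/Unsure values come from the
proposal above; the function name and the two cutoff values are placeholders
chosen for illustration, not settings taken from the Options.

```python
from email.message import Message


def add_classification_header(msg, score, ham_cutoff=0.20, spam_cutoff=0.90,
                              header_name="X-Spambayes-Classification"):
    """Label msg as Ham, Spam or Unsure based on a score in [0.0, 1.0]."""
    if score < ham_cutoff:
        label = "Ham"
    elif score >= spam_cutoff:
        label = "Spam"
    else:
        label = "Unsure"
    # Remove any earlier verdict so a re-filtered message carries only one.
    del msg[header_name]
    msg[header_name] = label
    return label


example = Message()
example["Subject"] = "F_R_E_E Shipping!"
add_classification_header(example, 0.97)
```

Keeping the header name as a parameter matches the idea of reading it from
the options instead of hard-coding it in each tool.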

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Wed Nov  6 01:31:53 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 19:31:53 -0600
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <lo7gsucdkikqfh13ogld0mfhgr1pc4redi@4ax.com>
References: <3DC4D0F1.5000509@hooft.net>
        <3DC51403.6030208@hooft.net>
        <3DC52BE4.6010602@hooft.net>
        <lo7gsucdkikqfh13ogld0mfhgr1pc4redi@4ax.com>
Message-ID: <15816.28937.523690.35204@montanaro.dyndns.org>


    Richie> If not, should we switch to X-Spambayes-Classification:
    Richie> Spam/Ham/Unsure?

I still like "X-Ham-Status: {yes,no,unsure}" but never saw any responses pro
or con to the idea, which I mentioned a couple times.  Did those messages
ever turn up on the list?  I think it reads well, uses the catchy "ham"
marketing term I like so well, provides a nice counter to SpamAssassin's
X-Spam-Status and doesn't exhibit any spelling differences across English
dialects.  The only potential problem is the religious dietary issue, which
I believe Tim addressed in an earlier message.

Skip

From skip@pobox.com  Wed Nov  6 01:46:44 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 19:46:44 -0600
Subject: [Spambayes] I thought bogus message structure problem was
        solved...
In-Reply-To: <64b6162160.6216064b61@icomcast.net>
References: <64b6162160.6216064b61@icomcast.net>
Message-ID: <15816.29828.224103.989829@montanaro.dyndns.org>


    Tim> How about refactoring the code?  tokenizer.Tokenizer.get_message()
    Tim> wrestles with all known problems getting an email object out of a
    Tim> string, and all code should use it.

Okay, that's in process.  Thx for the pointer.  Question: mboxcount used to
keep track of good and bad messages.  Using the approach found in the above
get_message() means no messages appear to be unparseable anymore without a
little extra dance after the call.  Does that matter to anyone?

Skip

From skip@pobox.com  Wed Nov  6 01:50:20 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 19:50:20 -0600
Subject: [Spambayes] I thought bogus message structure problem was
        solved...
In-Reply-To: <15816.29828.224103.989829@montanaro.dyndns.org>
References: <64b6162160.6216064b61@icomcast.net>
        <15816.29828.224103.989829@montanaro.dyndns.org>
Message-ID: <15816.30044.478724.248988@montanaro.dyndns.org>


    Skip> Question: mboxcount used to keep track of good and bad messages.
    Skip> Using the approach found in the above get_message() means no
    Skip> messages appear to be unparseable anymore without a little extra
    Skip> dance after the call.  Does that matter to anyone?

Never mind.  I figure if a message object is returned with both "to" and
"cc" fields == None, then it was unparseable.
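
A minimal sketch of that check using the standard email package.  The
function names are hypothetical, and the heuristic would also flag a
well-formed message that simply has neither header.

```python
from email.message import Message


def looks_unparseable(msg):
    """Heuristic from the thread: no To and no Cc header at all."""
    # Header lookup is case-insensitive and returns None when absent.
    return msg["to"] is None and msg["cc"] is None


def count_good_and_bad(messages):
    """Return (good, bad) counts for an iterable of Message objects."""
    good = bad = 0
    for msg in messages:
        if looks_unparseable(msg):
            bad += 1
        else:
            good += 1
    return good, bad
```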

Skip


From skip@pobox.com  Wed Nov  6 02:18:54 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 5 Nov 2002 20:18:54 -0600
Subject: [Spambayes] mboxutils.get_message()
Message-ID: <15816.31758.56856.328120@montanaro.dyndns.org>

For those of you who don't read spambayes-checkins, I got rid of all the
places which try to generate email.Message.Message objects and moved
tokenizer.Tokenizer.get_message() to mboxutils.get_message().  The latter is
now the best factory function to use.  It accepts Message objects, strings
and file-like objects as inputs.  If it encounters a message parsing error,
it just creates a new mail message and stuffs the current input's message
body in it as plain text then returns that.

I replaced _factory() functions in several files with get_message() as well.
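
For anyone who wants the shape of it without reading the checkin, here is a
rough sketch of such a factory written against the current standard email
package.  It is not the code in mboxutils; the name get_message_sketch is
made up, and today's parser is tolerant enough that the fallback branch may
seldom run.

```python
import email
from email.errors import MessageError
from email.message import Message


def get_message_sketch(obj):
    """Return a Message for a Message, a string or a file-like object."""
    if isinstance(obj, Message):
        return obj  # already parsed, hand it straight back
    if hasattr(obj, "read"):
        obj = obj.read()  # file-like object: slurp the text
    try:
        return email.message_from_string(obj)
    except MessageError:
        # Parsing failed: wrap the raw text in a bare message so that
        # callers always get something they can tokenize.
        msg = Message()
        msg.set_payload(obj)
        return msg
```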

Skip


From Paul.Moore@atosorigin.com  Wed Nov  6 09:12:33 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 6 Nov 2002 09:12:33 -0000
Subject: [Spambayes] Re: Outlook addin - initial impressions
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D89@UKDCX001.uk.int.atosorigin.com>

My messages seem to be passing each other in the ether...

From: Moore, Paul
> Looking again at the notes, I wonder - is the problem of the
> "filter" button not being enabled what is referred to as "Filtering
> an Exchange Server public store appears to not work."? I was planning
> on filtering my inbox, which is indeed on Exchange...

As I said in another post, this turned out to be a minor UI issue. Just
tested again at work, and filtering an Exchange inbox does indeed work
fine.

> Assuming this is the issue, then can I offer myself as a guinea pig?

This offer still stands, though. I'm happy to help if I can.

Paul.


_______________________________________________
Spambayes mailing list
Spambayes@python.org
http://mail.python.org/mailman/listinfo/spambayes

From Paul.Moore@atosorigin.com  Wed Nov  6 10:08:58 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 6 Nov 2002 10:08:58 -0000
Subject: [Spambayes] Outlook plugin - training
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com>

When the Outlook plugin filters mails, it classifies them as either
spam or potential spam, and can put them in appropriate folders.

In the spam/potential spam folders, there is a "Recover from Spam"
button available, and in other folders there is a "Delete as spam"
button. These buttons add the message to the training database as well
as taking the appropriate action.

One thing I don't see, however, is a means of confirming the
classifier's decisions as correct. A "yes, that is spam" button for
the spam folder, and a "yes, that's ham" button in non-spam folders.

As I'm starting from a very small message base, I worry that correct
classifications are still somewhat based on "luck", and training based
on correct decisions would help to increase both my and the
classifier's confidence level.

I can do this by regular retraining, but that has 2 disadvantages:
it's much clumsier than simply clicking on a "clever boy!" button, and
it relies on me not deleting messages until I do a training run. Much
of the ham I get is "read and forget", so I'd rather delete
immediately.

When I get a chance to dive into the code, I'll see how hard this
would be to implement.

Paul.

From piersh@friskit.com  Wed Nov  6 11:35:25 2002
From: piersh@friskit.com (Piers Haken)
Date: Wed, 6 Nov 2002 03:35:25 -0800
Subject: [Spambayes] Outlook plugin - training
Message-ID: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com>

I don't believe you need this. I think that the classifier automatically
trains on messages as they arrive (or at least on messages that it's
sure about). You only need to retrain if it has made a mistake, or if
it's unsure.

Piers.

From Paul.Moore@atosorigin.com  Wed Nov  6 12:00:56 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 6 Nov 2002 12:00:56 -0000
Subject: [Spambayes] Outlook plugin - training
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619928@UKDCX001.uk.int.atosorigin.com>

From: Piers Haken [mailto:piersh@friskit.com]

> I don't believe you need this. I think that the classifier
> automatically trains on messages as they arrive (or at least on
> messages that it's sure about). You only need to retrain if it
> has made a mistake, or if it's unsure.

If it does, it doesn't update the "Database has XXX good and XXX
spam" information in the manager dialog (at least not in all
cases) - I just got a correctly classified spam, and the reported
number of spams in the database hadn't changed.

But I'd be happy if it does train on what it classifies - then
all I do is correct any mistakes.

What does it do about "Potential Spam"? Train as if it were spam,
and then correct its assumption when I move it back to the inbox?
That probably makes most sense, given that the "Potential Spam"
folder gets a "Recover from Spam" button, rather than a "Delete
as spam" one.

Paul.

PS In case it's not obvious, I'll summarise what I've learnt once
   I get to grips with all of this. Hopefully, a summary would be
   useful for the documentation...

From Paul.Moore@atosorigin.com  Wed Nov  6 13:23:45 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 6 Nov 2002 13:23:45 -0000
Subject: [Spambayes] Outlook plugin - training
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com>

From: Moore, Paul
> What does it do about "Potential Spam"? Train as if it were spam,
> and then correct its assumption when I move it back to the inbox?
> That probably makes most sense, given that the "Potential Spam"
> folder gets a "Recover from Spam" button, rather than a "Delete
> as spam" one.

Actually, I'm not sure I like "Potential Spam" being treated as spam
until confirmed as OK. I have Rules Wizard rules which sort E-Mail
traffic out into folders. I'm entirely happy with the behaviour I
understand to be the case - rules processed before the plugin - as
I don't get spam on list addresses, so I'm OK with list traffic being
totally excluded from the spam process.

But I've had a couple of list messages end up in "Potential Spam".
Either the rules wizard is missing them (possible, I never had much
confidence in it :-() or the plugin is interfering somehow.

I think I may switch off the "potential spam" bit, and just filter
out known spam, and classify my Inbox by hand. I'll leave it a bit
longer before deciding, though.

Paul

From richie@entrian.com  Wed Nov  6 16:56:29 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 06 Nov 2002 16:56:29 +0000
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <15816.28937.523690.35204@montanaro.dyndns.org>
References: <3DC4D0F1.5000509@hooft.net> <3DC51403.6030208@hooft.net>
	<3DC52BE4.6010602@hooft.net> <lo7gsucdkikqfh13ogld0mfhgr1pc4redi@4ax.com>
	<15816.28937.523690.35204@montanaro.dyndns.org>
Message-ID: <t0iisuk0lji2vj4df8nabd174frvejnkp6@4ax.com>

Hi Skip,

> I still like "X-Ham-Status: {yes,no,unsure}" but never saw any responses pro
> or con to the idea, which I mentioned a couple times.  Did those messages
> ever turn up on the list?

Yes, I saw them, but didn't have a strong opinion on it so I kept quiet.

I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a
couple of reasons: the former contains the word 'Spam' in at least one
place, or two for spams.  It contains the word 'Spambayes', which
immediately tells you the name of the software that added the header (or at
least gives you something to tell Google you Feel Lucky about).

I like the 'Ham' word as well (notwithstanding the non-pig-eaters, who I
doubt will be offended), but not enough to want it in the name of the
header itself - using it as a header value is enough to make it a visible
USP.

All MHO.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Wed Nov  6 18:05:14 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 06 Nov 2002 18:05:14 +0000
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <t0iisuk0lji2vj4df8nabd174frvejnkp6@4ax.com>
References: <3DC4D0F1.5000509@hooft.net> <3DC51403.6030208@hooft.net>
	<3DC52BE4.6010602@hooft.net> <lo7gsucdkikqfh13ogld0mfhgr1pc4redi@4ax.com>
	<15816.28937.523690.35204@montanaro.dyndns.org>
	<t0iisuk0lji2vj4df8nabd174frvejnkp6@4ax.com>
Message-ID: <d8misu02anqjc7v6llb6himdujuk6ufhd0@4ax.com>


I said:
> I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a
> couple of reasons: the former contains the word 'Spam' in at least one
> place, or two for spams.

Sorry, I should have said _why_ I think that's an advantage - it
immediately tells you that the header has something to do with spam, rather
than being just another random email header.  If you haven't heard of our
cool 'Ham' term, the header is meaningless.

(OK, why are you looking at the header if you've never heard of Spambayes?
Maybe your giant multinational corporation is making us all rich by running
Enterprise Spambayes Double Pro Plus Premium Edition while we laze about on
our private island.  Or is that not how Open Source works?  8-)

-- 
Richie Hindle
richie@entrian.com


From tim@fourstonesExpressions.com  Wed Nov  6 18:16:32 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 06 Nov 2002 12:16:32 -0600
Subject: [Spambayes] Spambayes Header Format
In-Reply-To: <d8misu02anqjc7v6llb6himdujuk6ufhd0@4ax.com>
Message-ID: <E9TSYUWSJQO53USMQMVTHVQVQNMUR.3dc95c80@riven>

11/6/2002 12:05:14 PM, Richie Hindle <richie@entrian.com> wrote:

>
>I said:
>> I prefer "X-Spambayes-Classification: Spam" to "X-Ham-Status: No" for a
>> couple of reasons: the former contains the word 'Spam' in at least one
>> place, or two for spams.
>
>Sorry, I should have said _why_ I think that's an advantage - it
>immediately tells you that the header has something to do with spam, rather
>than being just another random email header.  If you haven't heard of our
>cool 'Ham' term, the header is meaningless.

The thing with including Spambayes in the header is that if someone hasn't
heard of it, and does a google on the term, they'll get something meaningful.
On the other hand, if they do a google on ham...

I vote for "X-Spambayes-Classification: Spam"

- TimS
>
>(OK, why are you looking at the header if you've never heard of Spambayes?
>Maybe your giant multinational corporation is making us all rich by running
>Enterprise Spambayes Double Pro Plus Premium Edition while we laze about on
>our private island.  Or is that not how Open Source works?  8-)
- Tim
www.fourstonesExpressions.com 


From papaDoc@videotron.ca  Wed Nov  6 18:55:55 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Wed, 06 Nov 2002 13:55:55 -0500
Subject: [Spambayes] Spambayes Header Format
References: <E9TSYUWSJQO53USMQMVTHVQVQNMUR.3dc95c80@riven>
Message-ID: <3DC965BB.80207@videotron.ca>

Hi,

I also prefer :

X-Spambayes-Classification: "Ham|Unsure|Spam" 
to 
X-Ham-Status: "Yes|No|Unsure"


Because it is less confusing: since X-Ham-Status is not a question, the
answer should not be "Yes|No" but a status like
"processed_and_found_to_be_ham" or "unsure_need_more_training".

papaDoc


From tim.one@comcast.net  Wed Nov  6 19:27:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 06 Nov 2002 14:27:41 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEDACHAB.tim.one@comcast.net>

[Moore, Paul, on the Outlook2K client]
> Actually, I'm not sure I like "Potential Spam" being treated as spam
> until confirmed as OK.

It doesn't.  "Potential Spam" really means "unsure" -- it would be as
accurate to call it "Potential Ham", but neither is as accurate as Unsure.
The system knows it doesn't know what to call msgs in this category, and the
client doesn't automatically train on Unsure msgs (unless you *manually*
drag one into your Spam folder, or into one of your Ham folders).

> I have Rules Wizard rules which sort E-Mail traffic out into folders.
> I'm entirely happy with the behaviour I understand to be the case - rules
> processed before the plugin - as I don't get spam on list addresses, so
> I'm OK with list traffic being totally excluded from the spam process.

The Define Filters dialog has a multi-selection folder control, so you can
tell the client to watch any number of folders (you're not limited to the
Inbox alone; add the destination folders of your other Outlook rules if you
want email coming into those watched too).

The interaction with Outlook's Rules Wizard (RW) remains unclear.  The RW's
internal workings appear undocumented, and there appears no way to hook into
it.  I've definitely seen the addin's filtering rules trigger *while* the RW
was still running, and in some cases that can lead to the addin's filtering
looking at a msg more than once.  For example, the addin's filter may
trigger when a msg first arrives in the Inbox, and then a second time on the
same msg when the RW moves it into a different folder that the addin's
filter is also watching.  In this case the client suffers an internal
exception, as the entry ID Outlook told it to use for the first trigger gets
invalidated by the move.  It works OK in the end, but "something isn't quite
right" about it.

> But I've had a couple of list messages end up in "Potential Spam".
> Either the rules wizard is missing them (possible, I never had much
> confidence in it :-() or the plugin is interfering somehow.

Sorry, can't say without a concrete example to stare at.  I haven't seen the
addin make any mistakes here, although it's common to get baffled about
exactly why the RW does what it does.

> I think I may switch off the "potential spam" bit, and just filter
> out known spam, and classify my Inbox by hand. I'll leave it a bit
> longer before deciding, though.

You'll be happier if you keep an Unsure folder.  For me, about 1% of my
email ends up there, about half-and-half ham vs spam, and my Inbox is
virtually spam-free (while my Spam folder is pure spam now -- about 100 per
day).

Another:  Note that this is pre-alpha software, and you should definitely
keep persistent Ham and Spam folders for training, as updating the code may
invalidate your database(s), or introduce tokenization and/or scoring and/or
configuration changes that render your database(s) worse than useless.  IOW,
you should stay prepared to retrain from scratch.  I set up a distinct .pst
file to hold Ham and Spam examples for this purpose, to keep from cluttering
my primary msg store.  The folder controls in the addin (unlike several in
Outlook itself!) allow selecting multiple folders from multiple msg stores
too, and my Spam folder is actually in this other .pst file.


From tim.one@comcast.net  Wed Nov  6 19:36:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 06 Nov 2002 14:36:04 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEDCCHAB.tim.one@comcast.net>

[Moore, Paul]
> ...
> I can do this by regular retraining, but that has 2 disadvantages:
> it's much clumsier than simply clicking on a "clever boy!" button, and
> it relies on me not deleting messages until I do a training run. Much
> of the ham I get is "read and forget", so I'd rather delete
> immediately.
>
> When I get a chance to dive into the code, I'll see how hard this
> would be to implement.

Automatic training needs lots of work.  The Outlook client has gotten
smarter than anything else about this so far, but at the moment it's
basically automating "mistake based" training, which I think will prove to
be a Bad Idea over time.

Ideal is to train regularly on a random sample of all msgs, whether or not
correctly classified (I fake this by hand for now).  That presents some UI
and algorithmic challenges.
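
A minimal sketch of the sampling step alone, assuming the messages are
already available as a list.  The function name and the 10% default are
hypothetical, and the UI question of getting correct labels for the sample is
left out.

```python
import random


def pick_training_sample(messages, fraction=0.1, seed=None):
    """Return a uniform random subset, ignoring how each msg was scored."""
    messages = list(messages)
    if not messages:
        return []
    # At least one message, at most all of them.
    k = min(len(messages), max(1, round(len(messages) * fraction)))
    return random.Random(seed).sample(messages, k)
```

Passing a seed only makes a run repeatable for testing; leave it as None for
real use.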

It will also create a database size problem:  without a strategy for pruning
useless words, the database will grow without bounds (an intuition that at a
certain non-fantastic size, "all words" will have been seen is incorrect for
computer-based indexing apps, and especially for email -- unique words keep
appearing and keep bloating the beast).  There's been no research done here
yet on how to prune a database over time without damaging accuracy.


From just@letterror.com  Wed Nov  6 19:55:28 2002
From: just@letterror.com (Just van Rossum)
Date: Wed,  6 Nov 2002 20:55:28 +0100
Subject: [Spambayes] Upgrade problem
Message-ID: <r01050400-1021-B40FB3A8F1C111D68CC8003065D5E7E4@[10.0.0.23]>

Hi there,

First off: I started playing with spambayes last sunday, and it's been a blast
so far. I'm using pop3proxy.py, love the brand new web interface.

However, I did a cvs up today, and unpickling the database stopped working, as
classifier.Bayes became a classic class. After some twiddling I managed to
repair it, but now I get AssertionErrors during training:

  [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix 
  Training ham (mymail/good.mbox.fix):
       4
  Traceback (most recent call last):
    File "./hammie.py", line 483, in ?
      main()
    File "./hammie.py", line 460, in main
      h.update_probabilities()
    File "./hammie.py", line 336, in update_probabilities
      self.bayes.update_probabilities()
    File "classifier.py", line 327, in update_probabilities
      assert hamcount <= nham
  AssertionError
  
Is my db screwed or is it repairable?
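
Going by the traceback, the failing assertion says that no word may have been
counted in more ham messages than the total number of ham messages trained.
A minimal sketch of a check for that, assuming a made-up layout in which the
database is a plain dict mapping each word to a (hamcount, spamcount) pair;
the real pickle is laid out differently.

```python
def find_inconsistent_words(wordinfo, nham, nspam):
    """Return the words whose counts exceed the message totals.

    wordinfo is a hypothetical dict: word -> (hamcount, spamcount).
    """
    bad = []
    for word, (hamcount, spamcount) in wordinfo.items():
        if hamcount > nham or spamcount > nspam:
            bad.append(word)
    return sorted(bad)
```

An empty result would mean the counts are at least self-consistent; anything
else suggests the totals and the per-word counts have drifted apart.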

Just

From tim@fourstonesExpressions.com  Wed Nov  6 20:02:53 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 06 Nov 2002 14:02:53 -0600
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-B40FB3A8F1C111D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <TS4IHEB514ZTPURVSIGOJTQ21FDMGZU.3dc9756d@riven>

Lemme answer before Tim gets to ya...

This is why you keep a corpus.  This is pre-alpha code, and anything that 
anyone does at any time can screw the world up.  You should simply delete your 
database and retrain it.  If you don't have a corpus, go ahead and make one 
now... <wink>

- TimS

- Tim
www.fourstonesExpressions.com 


From just@letterror.com  Wed Nov  6 20:12:44 2002
From: just@letterror.com (Just van Rossum)
Date: Wed,  6 Nov 2002 21:12:44 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <TS4IHEB514ZTPURVSIGOJTQ21FDMGZU.3dc9756d@riven>
Message-ID: <r01050400-1021-1D9D1FB9F1C411D68CC8003065D5E7E4@[10.0.0.23]>

Tim Stone - Four Stones Expressions wrote:

> Lemme answer before Tim gets to ya...
> 
> This is why you keep a corpus.  This is pre-alpha code, and anything that 
> anyone does at any time can screw the world up.  You should simply delete your 
> database and retrain it.  If you don't have a corpus, go ahead and make one 
> now... <wink>

Okelydokely! Hey, it already works so well, why not call it "beta"? <wink>

Just

From python-spambayes@discworld.dyndns.org  Wed Nov  6 20:16:09 2002
From: python-spambayes@discworld.dyndns.org (Charles Cazabon)
Date: Wed, 6 Nov 2002 14:16:09 -0600
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEDCCHAB.tim.one@comcast.net>;
	from tim.one@comcast.net on Wed, Nov 06, 2002 at 02:36:04PM -0500
References: 
	<16E1010E4581B049ABC51D4975CEDB88619926@UKDCX001.uk.int.atosorigin.com>
	<LNBBLJKPBEHFEDALKOLCCEDCCHAB.tim.one@comcast.net>
Message-ID: <20021106141609.B31428@discworld.dyndns.org>

Tim Peters <tim.one@comcast.net> wrote:
> 
> It will also create a database size problem:  without a strategy for pruning
> useless words, the database will grow without bounds (an intuition that at a
> certain non-fantastic size, "all words" will have been seen is incorrect for
> computer-based indexing apps, and especially for email -- unique words keep
> appearing and keep bloating the beast).

Did you actually find this?  I found the growth tailed off dramatically after
not too long.  I no longer have the exact numbers, but database growth for me
tailed off almost to nothing after I had trained on something like 1500
messages.
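
One way to settle this with numbers is to record the vocabulary size after
each training message.  A minimal sketch, using a whitespace split as a
stand-in for the real tokenizer (which produces many more token types); the
function name is made up.

```python
def vocabulary_growth(messages, tokenize=str.split):
    """Return the number of distinct tokens seen after each message."""
    seen = set()
    sizes = []
    for text in messages:
        seen.update(tokenize(text))
        sizes.append(len(seen))
    return sizes
```

A curve that flattens would support the "tails off" observation; one that
keeps climbing would support the "grows without bounds" one.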

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From just@letterror.com  Wed Nov  6 21:42:27 2002
From: just@letterror.com (Just van Rossum)
Date: Wed,  6 Nov 2002 22:42:27 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <TS4IHEB514ZTPURVSIGOJTQ21FDMGZU.3dc9756d@riven>
Message-ID: <r01050400-1021-A9382B2EF1D011D68CC8003065D5E7E4@[10.0.0.23]>

Tim Stone - Four Stones Expressions wrote:

> This is why you keep a corpus.  This is pre-alpha code, and anything that 
> anyone does at any time can screw the world up.  You should simply delete your 
> database and retrain it.  If you don't have a corpus, go ahead and make one 
> now... <wink>

Alright, this triggered a feature request in me, which resulted in some hacking
activity <wink>. The patch below appends training messages to one of two mbox
files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox' respectively), making it
easier to later rebuild the database from scratch, while still being able to
train ad hoc with the web interface of pop3proxy.py. Good idea?

Just


Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.10
diff -c -r1.10 pop3proxy.py
*** pop3proxy.py    5 Nov 2002 22:18:56 -0000   1.10
--- pop3proxy.py    6 Nov 2002 21:37:03 -0000
***************
*** 608,615 ****
          raise SystemExit
  
      def onUpload(self, params):
!         message = params.get('file') or params.get('text')            
          isSpam = (params['which'] == 'spam')
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
          self.push(" ")  # Flush... must find out how to do this properly...
--- 608,626 ----
          raise SystemExit
  
      def onUpload(self, params):
!         message = params.get('file') or params.get('text')
          isSpam = (params['which'] == 'spam')
+         # Append the message to a file, to make it easier to rebuild
+         # the database later.
+         message = message.replace('\r\n', '\n').replace('\r', '\n')
+         if isSpam:
+             f = open("_pop3proxyspam.mbox", "a")
+         else:
+             f = open("_pop3proxyham.mbox", "a")
+         f.write("From ???@???\n")  # fake From line (XXX good enough?)
+         f.write(message)
+         f.write("\n")
+         f.close()
          self.bayes.learn(tokenizer.tokenize(message), isSpam, True)
          self.push("""<p>Trained on your message. Saving database...</p>""")
          self.push(" ")  # Flush... must find out how to do this properly...
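
On the "good enough?" question about the fake From line: a possible
alternative, sketched here with the mailbox module from current Python
standard libraries rather than anything in the patch, is to let the library
write the separator line.  The helper name is made up.

```python
import mailbox


def append_to_mbox(path, message_text):
    """Append one message to the mbox file at path, creating it if needed."""
    # Normalise line endings the same way the patch does.
    message_text = message_text.replace("\r\n", "\n").replace("\r", "\n")
    box = mailbox.mbox(path)
    try:
        # add() writes its own "From " separator line and escapes body
        # lines that begin with "From ".  A str argument must be
        # ASCII-only; pass bytes for anything else.
        box.add(message_text)
    finally:
        box.close()
```

Usage would be append_to_mbox("_pop3proxyspam.mbox", message) or the ham
equivalent, chosen by isSpam as in the patch.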

From mhammond@skippinet.com.au  Wed Nov  6 22:09:04 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 7 Nov 2002 09:09:04 +1100
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEFCHJAA.mhammond@skippinet.com.au>

[Piers responding to Paul]
> I don't believe you need this. I think that the classifier automatically
> trains on messages as they arrive (or at least on messages that it's
> sure about). You only need to retrain if it has made a mistake, or if
> it's unsure.

As Tim says, we really only do "mistake" training - nothing is trained as it
comes in, only scored.  Manually moving messages (via the button or d&d) is
the only thing that triggers an incremental re-train.

The key limitation of this scheme, as Tim also alludes to, is that it never
trains on correctly classified ham.  However, I actually see this
incremental training more as a "get smarter now" than a "just get smarter"
technique -
ie, a user sees a mis-classified Spam, by re-training they are increasing
the chances that the next similar mail will be handled correctly.  Instant
feedback, especially while the user is getting started.

ie, it is indeed "mistake based training", but that may still prove useful
in addition to ongoing training.

I can't help thinking that we are somehow underestimating our own tool here.
As is common when people first use this tool, spam is generally found in the
ham set and vice-versa.  Because of this, I know that my Inbox is spam free
(but less sure about my other "ham" folders).  I'm also sure that my Spam
folder has no ham.  This should remain true while I continue to use the
tool.

So surely we can exploit this somehow.  Off the top of my head:
* Assume we don't trust the last 2 days of mail (as the user may not yet
have sorted them).  Anything in the "good" and "spam" folders older than
this can be assumed correctly classified, and able to be trained on.

* A process could go through all ham and spam trained on, and score each
message.  Any "suspect" messages are presented in a list (much like the
Outlook "Find Message" result list).  The user can indicate that the message
is correct (and the system will remember, never asking about this message
again) or is indeed incorrectly classified.  If incorrect, it will be moved,
and incrementally trained as per now.  (I can also picture a whitelist
kicking in here; if incorrect, offer to add user to whitelist.  If user in
the whitelist, assume ham thereby meaning mail from this person can never
again be spam)

I can picture this working in the background, and simply indicating to the
user that there are "conflicts" to be resolved at their leisure.  Further, I
imagine that as we build better training data for each message store, the
number of "conflicts" actually found would generally be zero - ie, the
system would find that all 2 day and older mail correctly classifies.
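
A rough sketch of the two steps above, assuming each message is a dict with
made-up keys "id", "folder" ("ham" or "spam") and "filed" (a datetime), and
that score is any callable returning a spam probability.  The cutoffs are
hypothetical too.

```python
from datetime import datetime, timedelta


def old_enough(messages, now, min_age=timedelta(days=2)):
    """Keep only messages the user has had time to sort by hand."""
    return [m for m in messages if now - m["filed"] >= min_age]


def find_conflicts(messages, score, confirmed=frozenset(),
                   ham_cutoff=0.20, spam_cutoff=0.90):
    """Return messages whose score strongly disagrees with their folder.

    confirmed holds ids the user has already vouched for, so those
    messages are never reported again.
    """
    conflicts = []
    for m in messages:
        if m["id"] in confirmed:
            continue
        s = score(m)
        looks_wrong = ((m["folder"] == "ham" and s >= spam_cutoff) or
                       (m["folder"] == "spam" and s < ham_cutoff))
        if looks_wrong:
            conflicts.append(m)
    return conflicts
```

Whether Unsure scores should also count as conflicts is a design choice this
sketch leaves out.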

While the above is more a brain-fart than a reasoned design, I agree that
staying out of your face is important for widespread use.

Mark.


From anthony@interlink.com.au  Wed Nov  6 22:19:40 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 07 Nov 2002 09:19:40 +1100
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEDCCHAB.tim.one@comcast.net> 
Message-ID: <200211062219.gA6MJe502959@localhost.localdomain>


>>> Tim Peters wrote
> Automatic training needs lots of work.  The Outlook client has gotten
> smarter than anything else about this so far, but at the moment it's
> basically automating "mistake based" training, which I think will prove to
> be a Bad Idea over time.
> 
> Ideal is to train regularly on a random sample of all msgs, whether or not
> correctly classified (I fake this by hand for now).  That presents some UI
> and algorithmic challenges.

Note that "random sample" is not as trivial as all that, either - if
you have a very high ham:spam ratio in your training DB, your accuracy
will suffer (see the tests from Alex, myself and others). 
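
One way to keep a random sample from inheriting a lopsided ham:spam ratio is
to sample each class separately.  A minimal sketch, assuming equal counts
per class is the goal (that policy is my assumption, not something the
tests settled), with hypothetical names throughout:

```python
import random


def balanced_sample(ham, spam, per_class, seed=None):
    """Pick the same number of ham and spam messages for training.

    `ham` and `spam` are lists of messages.  The sample size is capped by
    the smaller class, so a mailbox with very little spam yields a small but
    balanced training set rather than a large, ham-heavy one.
    """
    rng = random.Random(seed)
    n = min(per_class, len(ham), len(spam))
    return rng.sample(ham, n), rng.sample(spam, n)
```

With 1000 ham and 30 spam available and per_class=100, this returns 30 of
each.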

An easy example of this is those of us who are on a bunch of higher
volume python.org lists - Greg's sterling work there means that very
little spam gets through there. 

As spambayes takes over the world, this could be a larger problem.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From lists@morpheus.demon.co.uk  Wed Nov  6 22:38:46 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Wed, 06 Nov 2002 22:38:46 +0000
Subject: [Spambayes] Outlook plugin - training
References: 
	<16E1010E4581B049ABC51D4975CEDB885E2D92@UKDCX001.uk.int.atosorigin.com>
	<LNBBLJKPBEHFEDALKOLCKEDACHAB.tim.one@comcast.net>
Message-ID: <n2m-g.u1iue32h.fsf@morpheus.demon.co.uk>

Tim Peters <tim.one@comcast.net> writes:

> [Moore, Paul, on the Outlook2K client]
>> Actually, I'm not sure I like "Potential Spam" being treated as
>> spam until confirmed as OK.
>
> It doesn't.  "Potential Spam" really means "unsure" -- it would be
> as accurate to call it "Potential Ham", but neither is as accurate
> as Unsure.  The system knows it doesn't know what to call msgs in
> this category, and the client doesn't automatically train on Unsure
> msgs (unless you *manually* drag one into your Spam folder, or into
> one of your Ham folders).

That sounds like the best option. But it makes me wonder - what is a
"Spam" folder, and what is a "Ham" folder, in this context? My best
guess is that we're looking at the folders defined in the training
dialog. I'm having difficulty following the addin code, but that feels
logical (I've never seen an Outlook addin before, so I'm struggling
with "lots of code, can't see the flow" problems ATM...)

>> I have Rules Wizard rules which sort E-Mail traffic out into
>> folders.  I'm entirely happy with the behaviour I understand to be
>> the case - rules processed before the plugin - as I don't get spam
>> on list addresses, so I'm OK with list traffic being totally
>> excluded from the spam process.
>
> The Define Filters dialog has a multi-selection folder control, so
> you can tell the client to watch any number of folders (you're not
> limited to the Inbox alone; add the destination folders of your
> other Outlook rules if you want email coming into those watched
> too).

I'm not entirely sure I do. As I said, anything moved by the rules
wizard is list traffic, and as such is (a) non-spam (so no need to
check it) and (b) not at all typical of personal mail. My intuition
says that including list traffic will tend to dilute the clues which
distinguish personal mail and spam. Of course, I know that the
classifier *really* works by magic, and so my intuition is useless :-)


> The interaction with Outlook's Rules Wizard (RW) remains unclear.
> The RW's internal workings appear undocumented, and there appears no
> way to hook into it.  I've definitely seen the addin's filtering
> rules trigger *while* the RW was still running, and in some cases
> that can lead to the addin's filtering looking at a msg more than
> once.  For example, the addin's filter may trigger when a msg first
> arrives in the Inbox, and then a second time on the same msg when
> the RW moves it into a different folder that the addin's filter is
> also watching.  In this case the client suffers an internal
> exception, as the entry ID Outlook told it to use for the first
> trigger gets invalidated by the move.  It works OK in the end, but
> "something isn't quite right" about it.

Ooh, that's even worse than I thought (and also entirely consistent
with what I've come to expect from Outlook :-()

>> I think I may switch off the "potential spam" bit, and just filter
>> out known spam, and classify my Inbox by hand. I'll leave it a bit
>> longer before deciding, though.
>
> You'll be happier if you keep an Unsure folder.  For me, about 1% of
> my email ends up there, about half-and-half ham vs spam, and my
> Inbox is virtually spam-free (while my Spam folder is pure spam now
> -- about 100 per day).

You could easily be right on this. It's not so much that I don't want
an Unsure folder, as that I don't know how best to manage it. My
instinctive reaction is that I want "Spam" and "Not Spam" buttons, and
then I read or delete the message in situ. Using the act of moving the
message to indicate the status feels wrong.

But maybe, in the light of what you said above (about watching
multiple folders), I need to rethink this - for "normal mail" folders
at least, if not for list traffic.

OK, I'll try thinking in terms of 4 categories of folder - ham, spam,
unsure, and "list traffic". In real terms, "list traffic" is no
different than unsure, other than in that the addin will never put
mail into the "list traffic" folders. I think that fits what I'm
after, and doesn't stray too far from the "expected model". I even
think that (if it works) I can write the logic up well enough to serve
as the basis for some documentation :-)

> Another:  Note that this is pre-alpha software, and you should definitely
> keep persistent Ham and Spam folders for training, as updating the code may
> invalidate your database(s), or introduce tokenization and/or scoring and/or
> configuration changes that render your database(s) worse than useless.  IOW,
> you should stay prepared to retrain from scratch.  I set up a distinct .pst
> file to hold Ham and Spam examples for this purpose, to keep from cluttering
> my primary msg store.  The folder controls in the addin (unlike several in
> Outlook itself!) allow selecting multiple folders from multiple msg stores
> too, and my Spam folder is actually in this other .pst file.

Oh, I agree. I'm keeping spam now, so that I have a good training set
of spam. I already keep loads of ham, so I don't feel the need to keep
any more. But I do delete a particular *type* of message - the
one-liners from Accommodation Services about cars with their lights
left on, fire alarm tests, and the like. I'd rather not bother
retaining these - just read and hit the delete button. OK, maybe I
could code up a "move to ham archive" button which I could put next to
the delete button. Maybe that's worth doing. It's back to that "how
does the classifier know?" question again :-)

Paul.

-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Wed Nov  6 23:31:35 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Wed, 06 Nov 2002 23:31:35 +0000
Subject: [Spambayes] Outlook plugin - training
References: <9891913C5BFE87429D71E37F08210CB91839FE@zeus.sfhq.friskit.com>
	<LCEPIIGDJPKCOIHOBJEPMEFCHJAA.mhammond@skippinet.com.au>
Message-ID: <n2m-g.fzuewa08.fsf@morpheus.demon.co.uk>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> ie, it is indeed "mistake based training", but that may still prove useful
> in addition to ongoing training.

>From a newcomer's point of view, I think a key point is that "mistake
based training" is easy to understand.

I also believe that "confirmation based training" (my "clever boy!" 
button for specifically affirming that the classifier's magic gave the
right answer) is easy to understand. More than that, a new user
*expects* to need to do something like this, as the initial impression
is one of amazement at the accuracy of the classifier. But such a
gadget will fall into disuse as the user starts to expect the
classifier to be right - so it probably doesn't have enough long-term
value to be worth providing.

Batch training (keeping ham and spam, and pumping it into the
classifier in a regular training run) feels highly unnatural. My
instinct is to *delete* spam - keeping it feels wrong.

> I can't help thinking that we are somehow underestimating our own tool here.

Coming at it from cold, I can confirm that the effect feels like pure
magic. I trained on what I thought was a uselessly small corpus (I had
*no* historical spam, so I retrieved the day's batch from the wastebin
and used that). The results have been so good that I can already, 2
days later, feel myself tending to "trust" the classifier, and
forgetting about training issues.

But unlike Mark, my instinct is that this is not such a good thing
(solely from a training point of view). If people get such good
results on inadequate training, they won't work at it enough, so the
need is to make good training so easy and automatic that the tendency
to forget to bother is offset.

It's too late to think this through right now. I'll ponder some more
in the morning...

Paul.

-- 
This signature intentionally left blank

From tim.one@comcast.net  Thu Nov  7 02:16:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 06 Nov 2002 21:16:25 -0500
Subject: [Spambayes] My first non-personal personal false positive
Message-ID: <LNBBLJKPBEHFEDALKOLCKEFGCHAB.tim.one@comcast.net>

I think this is ham.  It just squeaked over my 0.80 ham cutoff, and suffers
because I get a hell of a lot more spam in Spanish.  BTW, if anyone can
really read what he's asking, feel encouraged to reply!

Spam Score: 0.800253


'*H*'                          0.397658
'*S*'                          0.998163
'python'                       0.000386299
'header:Return-path:1'         0.073398
'header:Message-id:1'          0.0778812
'espa?ol'                      0.0918367
'header:MIME-version:1'        0.09891
'header:Received:6'            0.1493
'programar'                    0.155172
'msn'                          0.193048
'url:msn'                      0.301957
'web'                          0.304265
'noheader:reply-to'            0.387101
'url:com'                      0.61757
'noheader:errors-to'           0.67097
'url:g'                        0.79048
'from:skip:= 40'               0.811451
'mediante'                     0.844828
'pagina'                       0.844828
'tiene'                        0.844828
'from:email addr:hotmail.com>' 0.867843
'clic'                         0.908163
'muy'                          0.908163
'pero'                         0.908163
'saber'                        0.908163
'con'                          0.922673
'bien'                         0.934783
'eso'                          0.934783
'hola'                         0.934783
'que'                          0.945895
'aqu?'                         0.949438
'les'                          0.973373
'por'                          0.991803
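
The reported score can be reproduced from the first two lines of the clue
list.  This is only a worked check, assuming '*H*' and '*S*' are the
combined ham and spam evidence figures and that the final score is their
difference rescaled into [0, 1] (the same expression that appears in the
chi-squared combining code discussed on this list):

```python
def combine(h, s):
    """Rescale (S - H), which lies in [-1, 1], into a score in [0, 1]."""
    return (s - h + 1.0) / 2.0


# '*H*' and '*S*' values taken from the clue list above.
score = combine(0.397658, 0.998163)
print("%.7f" % score)
```

(0.998163 - 0.397658 + 1) / 2 is 0.8002525, which agrees with the reported
0.800253 to within rounding of the printed inputs.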

Message Stream:


Return-path: <chestersv@hotmail.com>
Received: from bright07. (bright07-qfe0.icomcast.net [172.20.4.162])
 by msgstore01.icomcast.net
 (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
 with ESMTP id <0H5600EHCOJTG1@msgstore01.icomcast.net>; Wed,
 06 Nov 2002 21:07:05 -0500 (EST)
Received: from mtain03 (bright-LB.icomcast.net [172.20.3.155])
 by bright07. (8.11.6/8.11.6) with ESMTP id gA727TG20948; Wed,
 06 Nov 2002 21:07:29 -0500 (EST)
Received: from mail.python.org (mail.python.org [12.155.117.29])
 by mtain03.icomcast.net
 (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
 with ESMTP id <0H5600FCNOK1PT@mtain03.icomcast.net>; Wed,
 06 Nov 2002 21:07:13 -0500 (EST)
Received: from f226.pav1.hotmail.com ([64.4.31.226] helo=hotmail.com)
 by mail.python.org with esmtp (Exim 4.05)
 id 189c4g-00038I-00 for webmaster@python.org;
 Wed, 06 Nov 2002 21:07:14 -0500
Received: from mail pickup service by hotmail.com with Microsoft SMTPSVC;
 Wed, 06 Nov 2002 18:05:40 -0800
Received: from 168.243.104.248 by pv1fd.pav1.hotmail.msn.com with HTTP; Thu,
 07 Nov 2002 02:05:39 +0000 (GMT)
Date: Thu, 07 Nov 2002 02:05:39 +0000
From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?= <chestersv@hotmail.com>
Subject: PETICION
X-Originating-IP: [168.243.104.248]
To: webmaster@python.org
Message-id: <F226eUzmRf6Ci6nx8dy000000c5@hotmail.com>
MIME-version: 1.0
Content-type: text/html; charset=iso-8859-1
Content-transfer-encoding: 8BIT
X-Spam-Status: No, hits=1.5 required=5.0 tests=FROM_BIGISP,SPAM_PHRASE_00_01
X-Spam-Level: *
X-OriginalArrivalTime: 07 Nov 2002 02:05:40.0148 (UTC)
 FILETIME=[2BBA4740:01C28602]


<html><div style='background-color:'><DIV>
<DIV>
<DIV>hola a todos, bueno me encontraba navegando por la web y me tope con su
pagina ya que a mi me encantaria aprender a programar pero tengo un problema
yo no puedo Ingles muy bien por eso quiero saber si tiene el manual de
PYTHON en español si me ayudan les estare eternamente agradecido cuidense
mucho...</DIV></DIV></DIV></div><br clear=all><hr>Charla con tus amigos en
línea mediante MSN Messenger: <a href="http://g.msn.com/8HMYES/2015">Haz
clic aquí</a> </html>


From matt@mondoinfo.com  Thu Nov  7 02:47:22 2002
From: matt@mondoinfo.com (Matthew Dixon Cowles)
Date: Wed, 6 Nov 2002 20:47:22 -0600 (CST)
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFGCHAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEFGCHAB.tim.one@comcast.net>
Message-ID: <1036636892.95.852@sake.mondoinfo.com>

> hola a todos, bueno me encontraba navegando por la web y me tope
> con su pagina ya que a mi me encantaria aprender a programar pero
> tengo un problema yo no puedo Ingles muy bien por eso quiero saber
> si tiene el manual de PYTHON en español si me ayudan les estare
> eternamente agradecido cuidense mucho...

He's asking where he can find a Python manual in Spanish. I'd send a
reply to him but his address seems to have gone astray from the
message. The From header

From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?=

decodes to only a name.
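
For anyone wanting to check this by hand, here is a small sketch using
Python's standard email package.  The helper name split_from is made up;
the header value is the full From line as it appears in the message stream
Tim posted, where the address is still attached after the encoded word.

```python
from email.header import decode_header
from email.utils import parseaddr


def split_from(header_value):
    """Split a From header into (decoded display name, address)."""
    raw_name, addr = parseaddr(header_value)
    parts = []
    for chunk, charset in decode_header(raw_name):
        # decode_header yields bytes for encoded words and str for plain text
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts), addr


name, addr = split_from(
    "=?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?= <chestersv@hotmail.com>"
)
print(name, addr)
```

The encoded word on its own carries only the display name, so once the
angle-bracketed part is lost there is nothing left to reply to.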

Regards,
Matt


From tim.one@comcast.net  Thu Nov  7 03:03:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 06 Nov 2002 22:03:50 -0500
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To: <1036636892.95.852@sake.mondoinfo.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEFNCHAB.tim.one@comcast.net>

[Matthew Dixon Cowles]
> He's asking where he can find a Python manual in Spanish. I'd send a
> reply to him but his address seems to have gone astray from the
> message. The From header
>
> From: =?iso-8859-1?B?amFpciBjZXJvbiBEZWxl824=?=
>
> decodes to only a name.

Thanks!  I sent this to him:

"""
Apesadumbrado, no hablo español:

    http://www.python.org/doc/NonEnglish.html#spanish
"""

I expect Babelfish gave me an absurd translation for "Sorry,", but I have no
shame <wink>.  I copied Python-Help so someone else can take it from there.


From tim.one@comcast.net  Thu Nov  7 03:11:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 06 Nov 2002 22:11:50 -0500
Subject: [Spambayes] New scam
Message-ID: <LNBBLJKPBEHFEDALKOLCEEGACHAB.tim.one@comcast.net>

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
I haven't seen this one before.  The classifier nailed it, of course.  This
jerk even set up a bad web site to "confirm" the claims:

    http://www.delottonetherlands.net

That's worth looking at just for the photo of "our new computer room were
[sic] the computer balloting is carried out".

always-suspected-guido-had-a-night-job-ly y'rs  - tim

---------------------- multipart/mixed attachment
An embedded message was scrubbed...
From: DIRECTOR  OF PROMOTIONS <DELOTTONETHERLANDS@mail.netpiper.com>
Subject: CONGRATULATIONS! OUR LUCKY  WINNER.
Date: Tue, 05 Nov 2002 21:49:34 -0500
Size: 3712
Url: http://mail.python.org/pipermail/spambayes/attachments/20021106/a9727231/attachment.txt

---------------------- multipart/mixed attachment--

From francois.granger@free.fr  Thu Nov  7 08:56:13 2002
From: francois.granger@free.fr (François Granger)
Date: Thu, 07 Nov 2002 09:56:13 +0100
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFGCHAB.tim.one@comcast.net>
Message-ID: <B9EFE93D.5C286%francois.granger@free.fr>

on 7/11/02 3:16, Tim Peters at tim.one@comcast.net wrote:

> 'mediante', 'pagina', 'tiene', 'clic', 'muy', 'pero', 'saber', 'con', 'bien',
> 'eso', 'hola', 'que', 'aqu?', 'les', 'por'

Here are the most probable English equivalents of the Spanish words:

  'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with', 'good',
  'this', 'Hi', 'that', 'here', 'the', 'for'

This illustrates the need for properly balanced training sets and re-raises
the question of language discrimination. At the least, prior language
discrimination would allow for a different database for each language, or
for a systematic "unsure" flag for untrained languages. If you put my
messages in a ham training set, you will flag French spam as ham because of
my French sig ;-)

All these words should rate around 0.5 since they are among the most common
ones in the language.

-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From francois.granger@free.fr  Thu Nov  7 08:57:45 2002
From: francois.granger@free.fr (François Granger)
Date: Thu, 07 Nov 2002 09:57:45 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-B40FB3A8F1C111D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <B9EFE999.5C289%francois.granger@free.fr>

on 6/11/02 20:55, Just van Rossum at just@letterror.com wrote:

> First off: I started playing with spambayes last sunday, and it's been a blast
> so far. I'm using pop3proxy.py, love the brand new web interface.

Did you install it on MacOS9 or MacOSX?

-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From just@letterror.com  Thu Nov  7 09:05:59 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 10:05:59 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <B9EFE999.5C289%francois.granger@free.fr>
Message-ID: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>

François Granger wrote:

> Did you installed it on MacOS9 or MacOSX ?

OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with
2.2, so you can't use the Python shipped with 10.2. (In theory it might work
under OS9, but I've never had much luck with sockets in MacPython 2.x, but you
could try. It uses asyncore and not threading, so that's hopeful for 9.)

Just

PS: the web interface of pop3proxy.py is pretty good and useful, the only
downside is that it saves the database after each training, which makes it hard
to train with a few messages: after each message you have to wait (up to 10
seconds on my machine with my database) before you can continue. Maybe an
explicit "Save database" button is an idea?

From francois.granger@free.fr  Thu Nov  7 11:00:47 2002
From: francois.granger@free.fr (François Granger)
Date: Thu, 07 Nov 2002 12:00:47 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <B9F0066F.5C2A5%francois.granger@free.fr>

on 7/11/02 10:05, Just van Rossum at just@letterror.com wrote:

> François Granger wrote:
>
>> Did you installed it on MacOS9 or MacOSX ?
>
> OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't work with
> 2.2, so you can't use the Python shipped with 10.2. (In theory it might work
> under OS9, but I've never had much luck with sockets in MacPython 2.x, but you
> could try. It uses asyncore and not threading, so that's hopeful for 9.)

I got as far as having it running with MacOS 9.1 and Python 2.2.1. The web
server works and the proxy answers a telnet to 127.0.0.1:110. I don't think
I understand the settings for the proxy, though. I gave spambayes my POP3
server name, then changed the account in my mail reader to connect to
127.0.0.1 as its POP3 server. And nothing happens.

> after each message you have to wait (up to 10
> seconds on my machine with my database) before you can continue. Maybe an
> explicit "Save database" button is an idea?

With the -d parameter, you can use an anydbm database instead of a pickle.
With some hacking it can probably use gdbm as the anydbm backend.


-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From Paul.Moore@atosorigin.com  Thu Nov  7 11:01:14 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 7 Nov 2002 11:01:14 -0000
Subject: [Spambayes] Outlook plugin - training
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com>

> It's too late to think this through right now. I'll ponder some more
> in the morning...

Some post-ponder musings...

I'm assuming (based on a message I recall seeing recently) that it's
possible to "correct" training - ie, if I train the classifier that a
specific message is spam, I can later say "no it isn't, it's ham".

Assuming that this is so, is it not reasonable to train dynamically
on an "assume I got it right" basis? In other words, whenever the
addin filters a message as ham or spam, automatically train on that
basis as well. Then, if the user sees a mistake, he corrects it, which
automatically retrains the classifier (manually deleting as spam or
moving a message already does this).

This will keep the database right up to date, and all the user has to
do is correct any bad decisions the classifier makes (which he should
be doing anyway).
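
To make the proposal concrete, here is a minimal sketch of "train on the
verdict, retrain on correction".  It is not the addin's code: the Trainer
class, its method names and the per-message bookkeeping are all
hypothetical, and the "database" is just two word counters.

```python
from collections import Counter


class Trainer:
    """Toy word-count store that remembers how each message was trained."""

    def __init__(self):
        self.ham = Counter()
        self.spam = Counter()
        self.trained_as = {}   # msg_id -> True if trained as spam

    def train(self, msg_id, words, is_spam):
        """Train on a message, undoing any earlier training on it first."""
        if msg_id in self.trained_as:
            self.untrain(msg_id, words)
        bucket = self.spam if is_spam else self.ham
        bucket.update(set(words))
        self.trained_as[msg_id] = is_spam

    def untrain(self, msg_id, words):
        """Remove a message's contribution from whichever side it was on."""
        was_spam = self.trained_as.pop(msg_id)
        bucket = self.spam if was_spam else self.ham
        bucket -= Counter(set(words))   # in place; drops counts that reach zero


t = Trainer()
t.train("m1", ["cheap", "pills"], is_spam=True)    # filter's own verdict
t.train("m1", ["cheap", "pills"], is_spam=False)   # user says: no, that's ham
```

One thing the sketch makes visible: a wrong verdict that the user never
corrects stays in the counts, so the scheme leans on the user actually
fixing mistakes.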

I've ignored database growth issues, but other than that, is there any
other problem with this approach?

Paul.

From just@letterror.com  Thu Nov  7 11:11:35 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 12:11:35 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <B9F0066F.5C2A5%francois.granger@free.fr>
Message-ID: <r01050400-1021-4F723E12F24211D68CC8003065D5E7E4@[10.0.0.23]>

François Granger wrote:

> > after each message you have to wait (up to 10
> > seconds on my machine with my database) before you can continue. Maybe an
> > explicit "Save database" button is an idea?
>
> With the -d parameter, you can use a anydbm instead of Pickle. With some
> hack it can probably use gdbm as the anydbm db.

Right, that's the obvious solution. Thanks.

Just

From just@letterror.com  Thu Nov  7 14:21:21 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 15:21:21 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <B9F0066F.5C2A5%francois.granger@free.fr>
Message-ID: <r01050400-1021-34DAE228F25C11D68CC8003065D5E7E4@[10.0.0.23]>

François Granger wrote:

> > after each message you have to wait (up to 10
> > seconds on my machine with my database) before you can continue. Maybe an
> > explicit "Save database" button is an idea?
>
> With the -d parameter, you can use a anydbm instead of Pickle. With some
> hack it can probably use gdbm as the anydbm db.

Ok, so I did it. With my current setup anydbm uses dbhash/bsddb, and training
(on a single message) performance seems _worse_ than with the pickle (about 20
seconds now, around 10 with pickle). Don't know whether the training itself is
slower or updating the database. Training with my entire corpus took many times
longer as well. Not to mention that the database is now 20 megs instead of 5...
Would gdbm be expected to work faster? (I currently don't even have it.)

Just

From msergeant@startechgroup.co.uk  Thu Nov  7 14:21:11 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 07 Nov 2002 14:21:11 +0000
Subject: [Spambayes] Chi-squared perl port problems
Message-ID: <3DCA76D7.4070404@startechgroup.co.uk>

[Moderators - sent this from the wrong address. Please kill that mail]

OK, I've tried to convert your chi-squared stuff to Perl, but for some
reason it's producing bizarre results. It always scores low. And I have
no idea why, because I thought I'd copied the code pretty much verbatim
(albeit adding in a few $'s and {}'s ;-)

First of all, here's the token scores an email in question gets:

received-ip:207.230.250.119                        => 1.00000
AWESOME                                            => 1.00000
BUKAKKE                                            => 1.00000
GALLERIES!                                         => 1.00000
received-ip:218.53.86.224                          => 1.00000
orgasmic                                           => 1.00000
jism                                               => 1.00000
barrages                                           => 1.00000
href=http://205.197.95.39/users/belinda/bukkakehouse/index.html => 1.00000
href=http://205.197.95.39/remove.php               => 1.00000
border=20                                          => 1.00000
bukakke                                            => 1.00000
from:<hotchicks@ibm.com>                           => 1.00000
color=#FFCC33                                      => 1.00000
color=#FFCC99                                      => 1.00000
from:"Carol"                                       => 1.00000
size=+3                                            => 0.91967
content-type:text/html                             => 0.90484
size=+4                                            => 0.89516
bgcolor=#000000                                    => 0.89012
size=+2                                            => 0.88557
instant                                            => 0.79354
align=center                                       => 0.78635
access!                                            => 0.77353
color=#FFFFFF                                      => 0.77167
remove                                             => 0.76071
width=600                                          => 0.75813
now!                                               => 0.75306
click                                              => 0.66231
bukkake                                            => 0.59412
20123                                              => 0.59412
faces!                                             => 0.59412
color=#FF6600                                      => 0.55386
here                                               => 0.46364
face=verdana                                       => 0.44700
yourself                                           => 0.43374
action.                                            => 0.35701
bordercolor=Black                                  => 0.32793
blow                                               => 0.28531
stop                                               => 0.25280
japanese                                           => 0.14122
drenched                                           => 0.10872
facial                                             => 0.07652
please                                             => 0.07193
drinking                                           => 0.06229
house                                              => 0.04293

The resulting score my chi-squared code gives this is 0.331736284189509
- which to me is obviously incorrect (if you pass it through Paul
Graham's method it scores 1.0).

So here's the code I'm using:

      if (1) {
          # Chi-Squared method. Produces mostly boolean $result
          # but with a grey area.
          my ($H, $S);
          my ($Hexp, $Sexp);
          $H = $S = 1.0;
          $Hexp = $Sexp = 0;

          my $num_clues = @sorted;

          foreach my $row (@sorted) {
              $S *= 1.0 - $row->[PROB];
              $H *= $row->[PROB];
              if ($S < 1e-200) {
                  my $e;
                  ($S, $e) = frexp($S);
                  $Sexp += $e;
              }
              if ($H < 1e-200) {
                  my $e;
                  ($H, $e) = frexp($H);
                  $Hexp += $e;
              }
          }

          $S = log($S) + $Sexp + LN2;
          $H = log($H) + $Hexp + LN2;

          if ($num_clues) {
              $S = 1.0 - chi2q(-2.0 * $S, 2 * $num_clues);
              $H = 1.0 - chi2q(-2.0 * $H, 2 * $num_clues);

              $result = (($S - $H) + 1.0) / 2.0;
          }
          else {
              $result = 0.5;
          }
      }

And here's the chi2q routine, if that's relevant:

# Chi-squared function
sub chi2q {
      my ($x2, $v) = @_;

      die "v must be even in chi2q(x2, v)" if $v & 1;
      my $m = $x2 / 2.0;
      my ($sum, $term);
      $sum = $term = exp(0 - $m);
      for my $i (1 .. ($v >> 2)) {
          $term *= $m / $i;
          $sum += $term;
      }
      return $sum < 1.0 ? $sum : 1.0;
}
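
For cross-checking a port, it may help to have the underlying identity
written out.  For an even number of degrees of freedom v, the chi-squared
tail probability has the closed form

    Q(x2, v) = exp(-m) * sum(m**i / i!  for i = 0 .. v/2 - 1),  m = x2 / 2

The Python sketch below implements that sum directly.  It is written from
the textbook identity, not copied from the project's source, so treat it as
a reference for comparison only.

```python
import math


def chi2q(x2, v):
    """Probability that a chi-squared variate with v d.o.f. exceeds x2 (v even)."""
    if v <= 0 or v % 2:
        raise ValueError("v must be a positive even integer")
    m = x2 / 2.0
    term = total = math.exp(-m)          # the i = 0 term
    for i in range(1, v // 2):           # i = 1 .. v/2 - 1
        term *= m / i                    # builds m**i / i! incrementally
        total += term
    return min(total, 1.0)               # guard against rounding above 1
```

Note that the last index in the sum is v/2 - 1, which is worth comparing
against the upper bound of the loop in the routine above.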

I also added some debugging output so that I could see the three stages
of S and H (after the loop, after the log(), and after the chi2q bit).
Here's the output from that:

S1=1e-10; H1=1.25335384490988e-12
S2=-22.3327037493805; H2=-26.7120509011492
S3=0.389722189708954; H3=0.726249621329936

If you can help me at all, I would *really* appreciate it, as I honestly
can't see where your code and mine differs. Thanks!

Matt.


From sjoerd@acm.org  Thu Nov  7 14:34:45 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Thu, 07 Nov 2002 15:34:45 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-34DAE228F25C11D68CC8003065D5E7E4@[10.0.0.23]> 
References: <r01050400-1021-34DAE228F25C11D68CC8003065D5E7E4@[10.0.0.23]> 
Message-ID: <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>

On Thu, Nov 7 2002 Just van Rossum wrote:

> François Granger wrote:
>
> > > after each message you have to wait (up to 10
> > > seconds on my machine with my database) before you can continue. Maybe an
> > > explicit "Save database" button is an idea?
> >
> > With the -d parameter, you can use a anydbm instead of Pickle. With some
> > hack it can probably use gdbm as the anydbm db.
>
> Ok, so I did it. With my current setup anydbm uses dbhash/bsddb, and training
> (on a single message) performance seems _worse_ than with the pickle (about 20
> seconds now, around 10 with pickle). Don't know whether the training itself is
> slower or updating the database. Training with my entire corpus took many times
> longer as well. Not to mention that the database is now 20 megs instead of 5...
> Would gdbm be expected to work faster? (I currently don't even have it.)

The problem with training is that the update_probabilities() method
which is called at the end goes through the whole database and updates
just about every word.  So the whole database is touched and needs to
be written to disk.

-- Sjoerd Mullender <sjoerd@acm.org>

From skip@pobox.com  Thu Nov  7 14:42:08 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 7 Nov 2002 08:42:08 -0600
Subject: [Spambayes] Chi-squared perl port problems
In-Reply-To: <3DCA76D7.4070404@startechgroup.co.uk>
References: <3DCA76D7.4070404@startechgroup.co.uk>
Message-ID: <15818.31680.223575.90177@montanaro.dyndns.org>


    Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
    Matt> some reason it's producing bizarre results. 

I think

          $S = log($S) + $Sexp + LN2;
          $H = log($H) + $Hexp + LN2;

should be

          $S = log($S) + $Sexp * LN2;
          $H = log($H) + $Hexp * LN2;
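
The reason for the multiply: frexp() splits x into a mantissa m and an
integer exponent e with x == m * 2**e, so log(x) == log(m) + e * log(2).
A short Python check of the same bookkeeping, with a made-up helper name
and assuming every probability is strictly between 0 and 1:

```python
import math


def log_product(probs):
    """Return log(p1 * p2 * ...) without underflowing to zero.

    Whenever the running product gets tiny, frexp() pulls its binary
    exponent out into a separate integer, which is added back (times ln 2)
    at the end.
    """
    prod, exponent = 1.0, 0
    for p in probs:
        prod *= p
        if prod < 1e-200:
            prod, e = math.frexp(prod)
            exponent += e
    return math.log(prod) + exponent * math.log(2.0)


probs = [0.01] * 500                       # plain product would underflow
direct = sum(math.log(p) for p in probs)   # reference value
assert math.isclose(log_product(probs), direct, rel_tol=1e-9)
```

Adding LN2 instead of multiplying by it shifts the result by a constant
rather than by the accumulated exponent, which is why the scores come out
wrong.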

Skip

From popiel@wolfskeep.com  Thu Nov  7 15:01:13 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 07:01:13 -0800
Subject: [Spambayes] Upgrade problem 
In-Reply-To: Message from Sjoerd Mullender <sjoerd@acm.org> 
	<200211071434.gA7EYjZ28924@indus.ins.cwi.nl> 
References: <r01050400-1021-34DAE228F25C11D68CC8003065D5E7E4@[10.0.0.23]>
	<200211071434.gA7EYjZ28924@indus.ins.cwi.nl> 
Message-ID: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>

In message:  <200211071434.gA7EYjZ28924@indus.ins.cwi.nl>
             Sjoerd Mullender <sjoerd@acm.org> writes:
>
>The problem with training is that the update_probabilities() method
>which is called at the end goes through the whole database and updates
>just about every word.  So the whole database is touched and needs to
>be written to disk.

Why don't we just store the counts, and only compute the probabilities
when we need to reference them?  Yes, it is more efficient for bulk
testing to only compute the probabilities once, but it's definitely
a lose for incremental training.

Unless there's good arguments against, I'll make a patch for this
in the next day or two.
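
A rough sketch of the idea, in Python. The class and attribute names are
hypothetical and this is not the project's classifier. It assumes the
simplest ratio-based estimate and leaves out any further adjustment the real
update_probabilities() applies:

```python
class CountingClassifier:
    """Stores only counts; spamprobs are computed on demand and cached."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}    # word -> [hamcount, spamcount]
        self._cache = {}    # word -> spamprob, valid until the next learn()

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[1 if is_spam else 0] += 1
        # nham/nspam changed, so every cached probability is stale.
        # Dropping an in-memory cache is cheap; no database rows are rewritten.
        self._cache.clear()

    def spamprob(self, word):
        try:
            return self._cache[word]
        except KeyError:
            pass
        hamcount, spamcount = self.counts.get(word, (0, 0))
        hamratio = hamcount / self.nham if self.nham else 0.0
        spamratio = spamcount / self.nspam if self.nspam else 0.0
        if hamratio + spamratio == 0:
            prob = 0.5      # never-seen word: no evidence either way
        else:
            prob = spamratio / (hamratio + spamratio)
        self._cache[word] = prob
        return prob
```

Training on one message then touches only the words in that message, at the
cost of recomputing a probability the first time each word is scored.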

- Alex

From msergeant@startechgroup.co.uk  Thu Nov  7 15:03:43 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 07 Nov 2002 15:03:43 +0000
Subject: [Spambayes] Chi-squared perl port problems
References: <3DCA76D7.4070404@startechgroup.co.uk>
	<15818.31680.223575.90177@montanaro.dyndns.org>
Message-ID: <3DCA80CF.4040101@startechgroup.co.uk>

Skip Montanaro said the following on 07/11/02 14:42:
> 
>     Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
>     Matt> some reason it's producing bizarre results. 
> 
> I think
> 
>           $S = log($S) + $Sexp + LN2;
>           $H = log($H) + $Hexp + LN2;
> 
> should be
> 
>           $S = log($S) + $Sexp * LN2;
>           $H = log($H) + $Hexp * LN2;

Thanks. That was one difference. However I still get odd results. Here's 
another set of tokens, which scores 1.0 under graham, but 0.03-ish with 
my chi-squared code:

2161361384acrd-zgwm                                => 1.00000
FREE?!                                             => 1.00000
Schoolgirl                                         => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/spacer.gif => 1.00000
received-by:mail2.studiocev.com                    => 1.00000
amateuryouth.com                                   => 1.00000
bgcolor=#525D94                                    => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_09.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_07.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_05.jpg => 1.00000
received-ip:216.136.138.4                          => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_04.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_03.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_01.jpg => 1.00000
skip:21613 19                                      => 1.00000
from:<sforeman@studiocev.com>                      => 1.00000
height=188                                         => 1.00000
href=http://amateuryouth.com/enter.html            => 1.00000
href=http://www.studiocev.com/unsubscribe.html     => 1.00000
width=345                                          => 1.00000
width=376                                          => 1.00000
height=73                                          => 0.97424
height=141                                         => 0.96062
size=+1                                            => 0.95901
height=62                                          => 0.95276
free!!                                             => 0.91652
content-type:text/html                             => 0.90486
rowspan=4                                          => 0.89331
width=298                                          => 0.88645
size=5                                             => 0.88634
width=375                                          => 0.84656
align=center                                       => 0.78637
color=#FFFFFF                                      => 0.77172
remove                                             => 0.76079
width=30                                           => 0.74293
width=153                                          => 0.73757
rowspan=2                                          => 0.73211
height=47                                          => 0.71467
color=WHITE                                        => 0.70927
border=0                                           => 0.69548
target=_blank                                      => 0.69225
width=47                                           => 0.69081
width=1                                            => 0.67389
cellpadding=0                                      => 0.65914
colspan=6                                          => 0.65661
cellspacing=0                                      => 0.65439
colspan=5                                          => 0.64894
width=15                                           => 0.64689
width=130                                          => 0.64543
height=1                                           => 0.63505
colspan=2                                          => 0.61052
from:Foreman"                                      => 0.59412
face=Verdana                                       => 0.58404
sites                                              => 0.57298
enter                                              => 0.52994
colspan=4                                          => 0.52631
colspan=3                                          => 0.52302
absolutely                                         => 0.51730
width=77                                           => 0.49915
here                                               => 0.46375
yourself                                           => 0.43398
offer                                              => 0.33481
face=arial                                         => 0.33373
http-equiv=Content-Type                            => 0.27761
years                                              => 0.27318
models                                             => 0.25215
least                                              => 0.24571
time                                               => 0.22573
this                                               => 0.21638
within                                             => 0.19113
content=text/html; charset=iso-8859-1              => 0.16949
from                                               => 0.12558
come                                               => 0.10411
listed                                             => 0.07929
please                                             => 0.07193
service                                            => 0.02372
from:"Susan                                        => 0.02048
limited                                            => 0.01139

And here's the S and H values at each stage this time:

S1=1e-10; H1=1.36351472450952e-22
S2=-23.0258509299405; H2=-50.3468063235703
S3=0.00083341211626875; H3=0.941170965180294

Every single email I throw at this gives me a high H and a low S. I'm 
really not sure what I'm doing wrong here...

Matt.


From tim.one@comcast.net  Thu Nov  7 15:28:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 10:28:18 -0500
Subject: [Spambayes] Chi-squared perl port problems
Message-ID: <cb0b0cb348.cb348cb0b0@icomcast.net>

[Matt Sergeant]
>      for my $i (1 .. ($v >> 2)) {

Watch out for that too -- shifting right by 2 isn't dividing by 2, so 
you're systematically getting results (probably way) too small out of 
chi2Q.  The right range here is 1 .. ($v/2-1).

Suggestion:  use chi2.py's showscore() function to show the internal 
Python details on small artificial prob vectors.  Then you can check 
intermediates one-by-one against the Perl version.
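
For cross-checking, here is a small Python sketch of the series chi2Q sums,
next to a deliberately wrong variant that mimics the `1 .. ($v >> 2)` loop.
The name chi2Q_shift_bug is made up for illustration, and v is assumed to be
even:

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):          # i = 1 .. v/2 - 1
        term *= m / i
        total += term
    return min(total, 1.0)

def chi2Q_shift_bug(x2, v):
    """Same series, but with the loop bound taken from v >> 2 (i.e. v // 4)."""
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, (v >> 2) + 1):    # Perl's 1 .. ($v >> 2)
        term *= m / i
        total += term
    return min(total, 1.0)
```

For v = 2 the sum reduces to exp(-x2/2), which makes a handy first test
vector. For large v the shifted bound stops about halfway through the
series, so the result comes out far too small.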


From guido@python.org  Thu Nov  7 15:36:37 2002
From: guido@python.org (Guido van Rossum)
Date: Thu, 07 Nov 2002 10:36:37 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: Your message of "Thu, 07 Nov 2002 10:05:59 +0100."
             <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]> 
References: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]> 
Message-ID: <200211071536.gA7FabZ27507@odiug.zope.com>

> OSX, with unix Python 2.3a. In a way it's too bad spambayes doesn't
> work with 2.2, so you can't use the Python shipped with 10.2.

Long ago, we settled for Python 2.2 (some people wanted 2.1, but that
was unbearable).  If you see violations of 2.2 compatibility, please
supply patches (we'll also gladly give you checkin permission).

(If it makes a difference, I'd prefer aiming for 2.2 compatibility
over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS
10.2.  Unless it gets too ugly.)

--Guido van Rossum (home page: http://www.python.org/~guido/)

From just@letterror.com  Thu Nov  7 15:40:15 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 16:40:15 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <200211071536.gA7FabZ27507@odiug.zope.com>
Message-ID: <r01050400-1021-37DA84BAF26711D68CC8003065D5E7E4@[10.0.0.23]>

Guido van Rossum wrote:

> Long ago, we settled for Python 2.2 (some people wanted 2.1, but that
> was unbearable).  If you see violations of 2.2 compatibility, please
> supply patches (we'll also gladly give you checkin permission).
> 
> (If it makes a difference, I'd prefer aiming for 2.2 compatibility
> over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS
> 10.2.  Unless it gets too ugly.)

The docs say 2.2.1 and that's correct: the code is littered with True and False.
Those are the only 2.2.1-isms I've seen. But a patch would nevertheless be quite
big.

Just

From just@letterror.com  Thu Nov  7 15:47:22 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 16:47:22 +0100
Subject: [Spambayes] Upgrade problem 
In-Reply-To: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
Message-ID: <r01050400-1021-36396617F26811D68CC8003065D5E7E4@[10.0.0.23]>

T. Alexander Popiel wrote:

> Why don't we just store the counts, and only compute the probabilities
> when we need to reference them?  Yes, it is more efficient for bulk
> testing to only compute the probabilities once, but it's definitely
> a lose for incremental training.
> 
> Unless there's good arguments against, I'll make a patch for this
> in the next day or two.

+1 (I assume you mean implementing a caching scheme, right?)

Just

From just@letterror.com  Thu Nov  7 16:01:12 2002
From: just@letterror.com (Just van Rossum)
Date: Thu,  7 Nov 2002 17:01:12 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-37DA84BAF26711D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <r01050400-1021-249E5890F26A11D68CC8003065D5E7E4@[10.0.0.23]>

Just van Rossum wrote:

> Guido van Rossum wrote:
> 
> > Long ago, we settled for Python 2.2 (some people wanted 2.1, but that
> > was unbearable).  If you see violations of 2.2 compatibility, please
> > supply patches (we'll also gladly give you checkin permission).
> > 
> > (If it makes a difference, I'd prefer aiming for 2.2 compatibility
> > over 2.2.2 compatibility, since 2.2 is probably what comes with MacOS
> > 10.2.  Unless it gets too ugly.)
> 
> The docs say 2.2.1 and that's correct: the code is littered with True
> and False. Those are the only 2.2.1-isms I've seen. But a patch would
> nevertheless be quite big.

I just did a quick test with 2.2 (adding True and False to __builtins__ ;-), and
the only other 2.2.1-ism is bool(), which is only used in Options.py. After
fixing that everything seems to work just fine.

I'd be happy to add this

try:
    True, False
except NameError:
    True, False = 1, 0

to a bunch of files, and patch the docs.

Your call. My sf username is "jvr" ;-)

Just

From bfallik@attbi.com  Thu Nov  7 19:00:32 2002
From: bfallik@attbi.com (Brian Fallik)
Date: Thu, 7 Nov 2002 14:00:32 -0500
Subject: [Spambayes] FW: I finally found you!
Message-ID: <009601c2868f$f36eded0$0302a8c0@disaster>

I recently received this email message, which I believe is very clever SPAM.
In fact, it took me a few readings to actually figure it out (I admit I was
initially excited about JennyB).  I checked out the base URL without the
form data (http://www.5050dating.com) and became suspicious because it is a
dating service.  Then I performed a search for jenny on their site and didn't
find any girls matching that name.  However I did find about 5 other guys,
all recently registered, who had posted comments about looking for Jenny B.
D'oh.

The message is very generic, except for the reference to college.  The
mistake is that I graduated from Cornell several years ago, and anyone who
knew me from high school would know that.  I've concluded that 5050 dating
got their information from my email address, which is pretty obvious.

My question to the group is: how would the Bayesian filter handle a message
like this, which can even trick humans?

I'm not a member of this list so please CC me on any replies.

Thanks,
brian


-----Original Message-----
From: Jenny B [mailto:jenny14296@hotmail.com]
Sent: Wednesday, November 06, 2002 6:23 PM
To: baf11@cornell.edu
Subject: I finally found you!


Hey you, I haven't seen to you in sooooo long. I guess I was just a little
shy then and I don't remember if you would remember me. But I ran into some
of the guys we went to high school with and they said you were going to
Cornell now. Good for you.
Well, the reason I'm writing is because I've kinda always had a crush on
you. I wanted to see if there was any way I could get a second chance of
getting to know you better.
Anyway, a bunch of my girlfriends and I just got on 5050 Dating <--just
click on that to meet up. You've got to come up there. It is pretty wild &
there are a few things I've got to tell you. Who knows, maybe we'll hit it
off this time and next time you come in town we can get together and I can
show you a good time. I look forward to catching up. See ya soon!
Jenny:)

--------------------------------------------------------------------------------
Add photos to your e-mail with MSN 8. Get 2 months FREE*.

From jeremy@alum.mit.edu  Thu Nov  7 19:28:06 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Thu, 7 Nov 2002 14:28:06 -0500
Subject: [Spambayes] FW: I finally found you!
In-Reply-To: <009601c2868f$f36eded0$0302a8c0@disaster>
References: <009601c2868f$f36eded0$0302a8c0@disaster>
Message-ID: <15818.48838.108249.714677@slothrop.zope.com>

Hey you, I don't think we've exchanged email before.  Did we know each
other in college?  I hope you're not too shy to answer; don't worry
about your girlfriends getting a copy of this message.  Would you feel
better if you hadn't told us you were initially excited about JennyB?

(Let's see how that gets scored :-).

The answer to your question depends on the ham and spam you've trained
with.  My classifier was sure that your message, including the quoted
spam, was ham.  It was unsure about the spam by itself (although it
didn't have properly formed headers to tokenize).  Here's the detailed
scoring information.

Jeremy

Score: 0.336511997458

Clues
-----
*H* 0.532493458876
*S* 0.205517453791
anyway, 0.0196506550218
hey 0.0302013422819
bunch 0.0348837209302
soon! 0.0505617977528
kinda 0.0652173913043
catching 0.0918367346939
dating 0.0918367346939
guess 0.116621141434
i've 0.179689607795
maybe 0.201283517411
i'm 0.213137445659
haven't 0.232979675235
things 0.32151534693
were 0.329670122312
could 0.333164025881
we'll 0.349417503362
but 0.353848452351
would 0.362072927452
going 0.374658912657
got 0.378612196264
content-type:text/plain 0.387559318404
me. 0.391376634892
tell 0.603151379913
well, 0.606935963897
you've 0.608393669222
because 0.610456054038
getting 0.618613009077
seen 0.645415129199
hit 0.6617111985
high 0.668283205336
always 0.67912287871
forward 0.683019483647
now. 0.705543778678
you. 0.719307394428
you, 0.740951557525
show 0.750270607144
off 0.75066835312
town 0.763455632778
wild 0.763455632778
subject:you 0.775359384491
who 0.781981579908
better. 0.823070962358
girlfriends 0.844827586207
shy 0.844827586207
subject:found 0.844827586207
subject:! 0.849127345648
click 0.863834813092
message-id:invalid 0.908163265306
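
The overall score above can be reproduced from the *S* and *H* clues alone.
This small Python check assumes the final combining step is (S - H + 1) / 2,
which is what the reported figures are consistent with. The function name is
illustrative:

```python
def combine(S, H):
    """Map spam evidence S and ham evidence H (each in 0..1) to a 0..1 score."""
    return (S - H + 1.0) / 2.0

score = combine(0.205517453791, 0.532493458876)
```

A score near 0.5 means the two kinds of evidence roughly cancel. Here the ham
evidence is the stronger of the two, so the result lands at about 0.3365.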


From skip@pobox.com  Thu Nov  7 19:34:01 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 7 Nov 2002 13:34:01 -0600
Subject: [Spambayes] FW: I finally found you!
In-Reply-To: <009601c2868f$f36eded0$0302a8c0@disaster>
References: <009601c2868f$f36eded0$0302a8c0@disaster>
Message-ID: <15818.49193.777463.850934@montanaro.dyndns.org>


    Brian> My question to the group is: how would the Bayesian filter handle
    Brian> a message like this, which can even trick humans?

It all depends on what data you've trained on.  It's hard for us to get a
good read on this particular message because you left out most of the
headers, which are often good sources of clues.

-- 
Skip Montanaro - skip@pobox.com
http://www.mojam.com/
http://www.musi-cal.com/

From tim.one@comcast.net  Thu Nov  7 19:44:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 14:44:19 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-B40FB3A8F1C111D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <BIEJKCLHCIOIHAGOKOLHCECHDOAA.tim.one@comcast.net>

[Just van Rossum]
> ...
> However, I did a cvs up today, and unpickling the database
> stopped working, as classifier.Bayes became a classic class. After
> some twiddling I managed to repair it, but now I get AssertionErrors
> during training:

I suppose it would have worked to restore the inheritance from object long
enough to open the old pickle, then copy the contents into an instance of
the changed class and pickle that.

>   [python:~/code/spambayes] just% ./hammie.py -g mymail/good.mbox.fix
>   Training ham (mymail/good.mbox.fix):
>        4
>   Traceback (most recent call last):
>     File "./hammie.py", line 483, in ?
>       main()
>     File "./hammie.py", line 460, in main
>       h.update_probabilities()
>     File "./hammie.py", line 336, in update_probabilities
>       self.bayes.update_probabilities()
>     File "classifier.py", line 327, in update_probabilities
>       assert hamcount <= nham
>   AssertionError
>
> Is my db screwed or is it repairable?

It's obviously screwed, and whether it's repairable depends on exactly what
"some twiddling" meant.  I'm sure you've built a new one from scratch by now,
though!


From tim.one@comcast.net  Thu Nov  7 19:54:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 14:54:17 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-A9382B2EF1D011D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <BIEJKCLHCIOIHAGOKOLHMECIDOAA.tim.one@comcast.net>

[Just van Rossum]
> Alright, this triggered a feature request in me, which resulted in some
> hacking activity <wink>. The patch below appends training messages to
> one of two mbox files ('_pop3proxyspam.mbox' or '_pop3proxyham.mbox'
> respectively), making it easier to later rebuild the database from
> scratch, while still  being able to train ad hoc with the web interface
> of pop3proxy.py. Good idea?

Yes, and it's another reason to create a dedicated "training class" module,
so that various clients can at least share an *interface* for doing such
stuff (and so that new clients don't have to reinvent these concepts from
scratch each time around).
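
As a sketch of what such a shared helper might look like, here is a Python
function built on the standard mailbox module. The function name is
hypothetical, the two filenames are the ones from the patch description, and
the sketch assumes a single writer (no locking) and ASCII message text:

```python
import mailbox
import os

def archive_training_message(message_text, is_spam, directory="."):
    """Append one trained-on message to the ham or spam archive mbox.

    Returns the path of the mbox file that was written.
    """
    filename = "_pop3proxyspam.mbox" if is_spam else "_pop3proxyham.mbox"
    path = os.path.join(directory, filename)
    box = mailbox.mbox(path)        # created on first use
    try:
        box.add(message_text)
    finally:
        box.close()                 # close() flushes pending changes to disk
    return path
```

Rebuilding the database from scratch then amounts to walking the two mbox
files and training on each message in turn.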


From tim.one@comcast.net  Thu Nov  7 20:19:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 15:19:30 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEECLDOAA.tim.one@comcast.net>

[T. Alexander Popiel]
> Why don't we just store the counts, and only compute the probabilities
> when we need to reference them?  Yes, it is more efficient for bulk
> testing to only compute the probabilities once, but it's definitely
> a lose for incremental training.

Unqualified judgments are always wrong <wink>.  I often get email in batches
of 200, and scoring speed is important to me -- much more so than training
speed.  It will be even more so at python.org, where training probably won't
occur more often than once a week, but scoring is ongoing around the clock.
Note that for purposes of scoring, the *counts* needn't be saved at all now,
and a scoring-only database can exploit that (and this project's
neiltrain.py already does).

> Unless there's good arguments against, I'll make a patch for this
> in the next day or two.

When one size doesn't fit all, think instead about subclasses, different
methods, additional arguments, and/or instance attributes.  It's also nice
that the current code separates probability estimation algorithms from
probability combination algorithms.


From tim.one@comcast.net  Thu Nov  7 20:23:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 15:23:32 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-249E5890F26A11D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEECMDOAA.tim.one@comcast.net>

[Just van Rossum]
> ...
> I'd be happy to add this
>
> try:
>     True, False
> except NameError:
>     True, False = 1, 0
>
> to a bunch of files, and patch the docs.
>
> Your call. My sf username is "jvr" ;-)

It's fine by me, and you have commit privileges now.

From richie@entrian.com  Thu Nov  7 20:45:37 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 07 Nov 2002 20:45:37 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHEECLDOAA.tim.one@comcast.net>
References: <20021107150113.4AADCF5CC@cashew.wolfskeep.com>
	<BIEJKCLHCIOIHAGOKOLHEECLDOAA.tim.one@comcast.net>
Message-ID: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>


[Tim]
> Note that for purposes of scoring, the *counts* needn't be saved at all now

A quick note in case someone decides to remove the counts from the
database: the HTML front end has a "Word query" feature which will tell you
the information in the database for a given word - it's interesting to see
how many more times the word 'Viagra' appears in ham than in spam.  I mean
the other way round.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Thu Nov  7 20:53:06 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 07 Nov 2002 20:53:06 +0000
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHMECIDOAA.tim.one@comcast.net>
References: <r01050400-1021-A9382B2EF1D011D68CC8003065D5E7E4@[10.0.0.23]>
	<BIEJKCLHCIOIHAGOKOLHMECIDOAA.tim.one@comcast.net>
Message-ID: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>


[Tim Peters]
> it's another reason to create a dedicated "training class" module,
> so that various clients can at least share an *interface* for doing such
> stuff

Tim Stone and I have made a start on this (or rather Tim has and I've poked
my nose in) - I mention it because he's away until the weekend and we
wouldn't want anyone to duplicate the work.

It may be too early to talk details (and slightly rude in Tim's absence -
my apologies!) but here's the email I sent to Tim outlining how I thought
it might work.  I was thinking more about generic Message and Corpus
classes than specifically about training.

Laughing and pointing should be directed towards me rather than Tim.

-------------------------------------------------------------------------

[Tim S]
> We would include methods in Corpus to add a message to, remove a message from, 
> move from one to another, with the appropriate untraining/retraining built in.   
> We *could* have a method that, given a message substance (headers and body) 
> would find an existing message in a corpus that matched it (somehow).  We 
> would include metadata with the corpus that tells us whether it's a 
> spam/ham/untrained corpus, so the retraining can be done.  We could even 
> include a fourth type of corpus (cache) with methods to use expiry data in the 
> message metadata to remove old cache messages...

This is excellent stuff.  A Corpus contains Messages.  CacheCorpus is a
subclass of Corpus that adds the concept of expiry, and contains
CachedMessages (CachedMessage being a subclass of Message) that know about
their own expiry details (time of creation, size, time of last use,
whatever it depends on).  That's very neat.

A Corpus wouldn't know how to create Message objects, nor would a Message
object know how to create itself - classes *derived from* them would know
how to do that.  For instance (totally untested code, probably full of
typos) -

class Message:
    def __init__(self, messageText):
        """Pass in the text of the message, headers and body."""
        # etc.

    def name(self):
        """Returns a name for this message which is unique within its
        corpus."""
        raise NotImplementedError

class FileMessage(Message):
    """A Message representing an email stored in a file on disk."""

    def __init__(self, pathname):
        self.pathname = pathname
        messageFile = open(self.pathname)
        try:
            messageText = messageFile.read()
        finally:
            messageFile.close()
        # Unbound base-class call: 'self' must be passed explicitly.
        Message.__init__(self, messageText)

    def name(self):
        return self.pathname

...so the Message class dictates that all Messages must have a name unique to
their corpus, but doesn't dictate how that name is determined.  Concrete
Message-derived classes fill in that detail.  [I may be putting too much
into the base class by demanding that the text of the message be given to
the constructor - that precludes making FileMessage lazy, and only read the
file when it needs to.]

'Corpus' works the same way; again, the details may be naive, but this is
the general idea:

import fnmatch
import os

class Corpus:
    """A collection of Message objects."""

    def __getitem__(self, messageName):
        """Makes Corpus act like a dictionary: a la corpus[messageName]"""
        raise NotImplementedError

class DirectoryCorpus(Corpus):
    """Represents a corpus of messages stored as individual files in a
    directory.  Example: corpus = DirectoryCorpus('mydir', '*.msg')"""

    def __init__(self, directoryPathname, globPattern):
        self.directoryPathname = directoryPathname
        self.globPattern = globPattern
        self.messageCache = {}  # The messages we've read from disk so far.

    def __getitem__(self, messageName):
        try:
            return self.messageCache[messageName]
        except KeyError:
            if not fnmatch.fnmatch(messageName, self.globPattern):
                raise KeyError("Message name doesn't match naming pattern")
            pathname = os.path.join(self.directoryPathname, messageName)
            message = FileMessage(pathname)  # May raise IOError - let it.
            self.messageCache[messageName] = message
            return message

Here I've implemented the laziness idea by only reading the file when it's
asked for.

Maybe the message cache should go in Corpus - that would be useful for
*all* Corpus implementations.

You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an
IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

> move [Messages] from one [Corpus] to another, with the appropriate
> untraining/retraining built in.   

Yes - this could work using observer objects registered with Corpus
objects:

class CorpusObserver:
    """Derive your class from this and call corpus.addObserver to be
    informed when things happen to a corpus."""

    def onAddMessage(self, corpus, message):
        """Called when a message is added to a corpus."""
        pass   # Not NotImplementedError, so that people don't have to
               # implement *all* the event functions of CorpusObserver.

class Corpus:
    def __init__(self):
        self.observers = []   # List of CorpusObservers to inform of events
        self.messageCache = {}   # Messages in this corpus, keyed by name

    def addObserver(self, observer):
        self.observers.append(observer)

    def addMessage(self, message):
        """External code adds messages by calling this - for example, in an
        OutlookCorpus it would be called as a result of the user dragging
        a message into the folder."""
        self.messageCache[message.name()] = message
        for observer in self.observers:
            observer.onAddMessage(self, message)

class AutoTrainer(CorpusObserver):
    """Trains the given classifier when messages are added or removed from
    the given Ham/Spam corpuses."""

    def __init__(self, bayes, hamCorpus, spamCorpus):
        self.bayes = bayes
        self.hamCorpus = hamCorpus
        self.spamCorpus = spamCorpus
        hamCorpus.addObserver(self)
        spamCorpus.addObserver(self)

    def onAddMessage(self, corpus, message):
        if corpus == self.spamCorpus:
            self.bayes.learn(tokenize(message), True)
        else:
            assert corpus == self.hamCorpus, "Unknown corpus"
            self.bayes.learn(tokenize(message), False)

...and likewise for removeMessage, onRemoveMessage and unlearn.
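
Spelled out, the removal side might look like the following self-contained
Python sketch. TextMessage, tokenize, RecordingClassifier and moveMessage are
hypothetical stand-ins so the example runs on its own. The only thing assumed
about the real classifier is that it offers learn(tokens, is_spam) and
unlearn(tokens, is_spam):

```python
class CorpusObserver:
    """Do-nothing handlers, so subclasses override only what they need."""
    def onAddMessage(self, corpus, message):
        pass
    def onRemoveMessage(self, corpus, message):
        pass

class Corpus:
    def __init__(self):
        self.observers = []
        self.messageCache = {}
    def addObserver(self, observer):
        self.observers.append(observer)
    def addMessage(self, message):
        self.messageCache[message.name()] = message
        for observer in self.observers:
            observer.onAddMessage(self, message)
    def removeMessage(self, message):
        del self.messageCache[message.name()]
        for observer in self.observers:
            observer.onRemoveMessage(self, message)

class AutoTrainer(CorpusObserver):
    def __init__(self, bayes, hamCorpus, spamCorpus):
        self.bayes = bayes
        self.hamCorpus = hamCorpus
        self.spamCorpus = spamCorpus
        hamCorpus.addObserver(self)
        spamCorpus.addObserver(self)
    def _isSpam(self, corpus):
        if corpus is self.spamCorpus:
            return True
        assert corpus is self.hamCorpus, "Unknown corpus"
        return False
    def onAddMessage(self, corpus, message):
        self.bayes.learn(tokenize(message), self._isSpam(corpus))
    def onRemoveMessage(self, corpus, message):
        self.bayes.unlearn(tokenize(message), self._isSpam(corpus))

# --- hypothetical stand-ins, for illustration only ---
class TextMessage:
    def __init__(self, name, text):
        self._name = name
        self.text = text
    def name(self):
        return self._name

def tokenize(message):
    return message.text.split()

class RecordingClassifier:
    """Records calls instead of training, to make the event flow visible."""
    def __init__(self):
        self.calls = []
    def learn(self, tokens, is_spam):
        self.calls.append(("learn", list(tokens), is_spam))
    def unlearn(self, tokens, is_spam):
        self.calls.append(("unlearn", list(tokens), is_spam))

def moveMessage(message, source, destination):
    """Moving is a remove followed by an add, so retraining comes for free."""
    source.removeMessage(message)
    destination.addMessage(message)
```

Moving a message from the ham corpus to the spam corpus then produces an
unlearn-as-ham followed by a learn-as-spam, with no training logic in the
corpus classes themselves.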

> I'm going to be travelling for the rest of the week, and may not be able to 
> connect, so you may not hear from me till Friday sometime...

OK.  Hopefully this will get to you before you leave, and give you plenty
to think about.  You might want to run it past Tim Peters, 'cos he's *far*
better at this kind of thing than I am (though he's also busy).  I think
this is the sort of thing he has in mind.

Most of the *new* code that's needed is defining the abstract concepts and
their interfaces, rather than writing code that actually *does* anything -
it's building a framework.

Once the framework is there, most of the code needed to implement the
functionality should already be in the project - code to hook into Outlook,
to train on a message, to parse mbox files, and so on.  It just needs
hooking into the framework.

The mark of a good framework is when you write a tiny little class (like
AutoTrainer above for instance) that contains hardly any code but adds a
major new feature (in this case, automatic training when moving messages
around in Outlook).

-------------------------------------------------------------------------

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Thu Nov  7 21:00:21 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 16:00:21 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <20021106141609.B31428@discworld.dyndns.org>
Message-ID: <BIEJKCLHCIOIHAGOKOLHKEDADOAA.tim.one@comcast.net>

[Tim]
> It will also create a database size problem:  without a strategy for
> pruning useless words, the database will grow without bounds

[Charles Cazabon]
> Did you actually find this?

Yes.

> I found the growth tailed off dramatically after not too long.

That too -- the second derivative is negative from the start, but the first
remains positive.  "It's like" log that way, growing ever more slowly, but
inexorably.

> I no longer have the exact numbers, but database growth for
> me tailed off almost to nothing after I had trained on something like
> 1500 messages.

When I run my c.l.py test, 10 classifiers are built each training on about
30,000 msgs.  The classifier pickles hug 18MB each then.  My classifier at
work has been trained on about 1,100 msgs, and its classifier pickle is
about 2MB.  My classifier at home has been trained on about 3,000 msgs, and
its classifier pickle is about 4MB.  That last one is from memory, so when I
get home I'll make up a different number so that the three points exactly
fit a log curve <wink>.

Nobody has used this system long enough under a high enough daily load yet
to get frantic about database bloat, but the people who have run very large
tests must all be aware that it's inevitable (without pruning).  I've
already noticed the increase in startup time on my home box, due to loading
a bigger pickle every day.


From tim@zope.com  Thu Nov  7 22:11:22 2002
From: tim@zope.com (Tim Peters)
Date: Thu, 7 Nov 2002 17:11:22 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <200211062219.gA6MJe502959@localhost.localdomain>
Message-ID: <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>

[Anthony Baxter]
> Note that "random sample" is not as trivial as all that, either - if
> you have a very high ham:spam ratio in your training DB, your accuracy
> will suffer (see the tests from Alex, myself and others).

I still need to try to make sense of those tests.  A real complication is
that more than one thing changes when trying to test ratios:  it's not just
the ratio that changes, it's the absolute number of each trained on too.
For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
and 10000 spam.  The ratios are identical.  Do we expect the error rates to
be identical too?  I don't, but haven't tried it.  I expect the latter would
do better than the former, despite the identical ratios, simply because more
msgs allow better spamprob estimates.

Something missing in "the ratio tests" is a rationale (even an
after-the-fact one) for believing there's some aspect of the system that's
sensitive to the ratio.  The combining method certainly is not, and the
spamprob estimation (update_probabilities()) deliberately works with
percentages instead of raw counts so that the ham::spam training ratio has
no direct effect on the spamprobs calculated.
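
A minimal sketch of that point, using the same by-counting arithmetic as the
guessx() script posted elsewhere in this thread.  The function name and the
example counts are made up for illustration; this is not the body of
update_probabilities() itself.

```python
def counting_spamprob(hamcount, spamcount, nham, nspam):
    """By-counting spamprob built from per-corpus fractions, not raw counts."""
    hamratio = hamcount / nham      # fraction of ham msgs containing the word
    spamratio = spamcount / nspam   # fraction of spam msgs containing the word
    return spamratio / (spamratio + hamratio)


# A word present in 2% of ham and 10% of spam, under three training mixes.
small = counting_spamprob(hamcount=100, spamcount=100, nham=5000, nspam=1000)
large = counting_spamprob(hamcount=1000, spamcount=1000, nham=50000, nspam=10000)
balanced = counting_spamprob(hamcount=100, spamcount=500, nham=5000, nspam=5000)

print(small, large, balanced)
```

All three calls see the same 2% and 10% fractions, so the estimate does not
move with the ham::spam ratio.  What does change with corpus size is how much
evidence stands behind each fraction.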

> An easy example of this is those of us who are on a bunch of higher
> volume python.org lists - Greg's sterling work there means that very
> little spam gets through there.

The total # of spam training msgs does limit how high a spamprob can get,
and the total # of ham training msgs limits how low.  The *suspicion* I had
running my large c.l.py test is that it wasn't the ratio that mattered so
much as the absolute number, and that the error rates didn't "settle down"
to the 4th digit until I got near 10,000 spam total.

> As spambayes takes over the world, this could be a larger problem.

Despite all the above <wink>, when faking "random sample" by hand in my
personal classifiers, I see I've *ended up* aiming for about an equal number
of each in my training data.  That works well too (for me, and
anecdotally -- these aren't controlled experiments).


From tim@zope.com  Thu Nov  7 22:35:56 2002
From: tim@zope.com (Tim Peters)
Date: Thu, 7 Nov 2002 17:35:56 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <n2m-g.u1iue32h.fsf@morpheus.demon.co.uk>
Message-ID: <BIEJKCLHCIOIHAGOKOLHMEDGDOAA.tim@zope.com>

[Paul Moore]
> That sounds like the best option. But it makes me wonder - what is a
> "Spam" folder, and what is a "Ham" folder, in this context? My best
> guess is that we're looking at the folders defined in the training
> dialog.

Right, that's what I meant.

> I'm having difficulty following the addin code, but that feels
> logical (I've never seen an Outlook addin before, so I'm struggling
> with "lots of code, can't see the flow" problems ATM...)

It's a GUI app:  all the interesting things happen by magic via callbacks
and hooks, and tracing the connections between what the user sees and
"pieces of code" is puzzling.  MS MAPI is also a massive, low-level API.
Add prints to the code and they'll be displayed in PythonWin's trace window;
that can help.  I'm more lost than not in it myself!

>> The Define Filters dialog has a multi-selection folder control,
>> ...

> I'm not entirely sure I do. As I said, anything moved by the rules
> wizard is list traffic, and as such is (a) non-spam (so no need to
> check it) and (b) not at all typical of personal mail. My intuition
> says that including list traffic will tend to dilute the clues which
> distinguish personal mail and spam.

Don't worry about it before you try it.  I suggest trying it because I'm not
sure it's possible to *stop* the system now from scoring all incoming msgs
(the "new msg in Inbox" filter appears to trigger for every one, regardless
of whether the RW decides to move it; after that it may just be a race
between the RW and the addin deciding where to move each).

> Of course, I know that the classifier *really* works by magic, and
> so my intuition is useless :-)

It's more that unless you know exactly how the math works, your intuition is
simply baseless here, carried over from some other experience.  Do *you*
have trouble distinguishing personal and work email from spam?  There you
go, and you can't even compute inverse chi-squared probabilities to 14
significant digits on demand in your head <wink>.

> ..
> You could easily be right on this. It's not so much that I don't want
> an Unsure folder, as that I don't know how best to manage it.

What's to manage?  I get about 600 emails per day, and about 1% end up in
Unsure (about 6 -- actually less than that, lately; the system is learning).
Looking at 6 msgs is no burden.  I often find that msgs that end up here are
neither ham *nor* spam to me, and then don't train on them at all.  Jeremy
Hylton said the same today about his experience -- we're both glad we see
them instead of calling them spam, and we're both glad they didn't clutter
our Inbox.

It's peculiar that there are msgs that are subjectively neither ham nor spam
(I wasn't expecting this!), and it's downright spooky that the Unsure folder
tends to collect them.

> My instinctive reaction is that I want "Spam" and "Not Spam" buttons,
> and then I read or delete the message in situ.

MarkH has since implemented this in the Unsure folder.

> Using the act of moving the message to indicate the status feels wrong.
>
> But maybe, in the light of what you said above (about watching
> multiple folders), I need to rethink this - for "normal mail" folders
> at least, if not for list traffic.
>
> OK, I'll try thinking in terms of 4 categories of folder - ham, spam,
> unsure, and "list traffic".

I still think you're making life too complicated.  Is list traffic spam?  If
so, call it spam.  If not, call it ham.

> ...
> the delete button. Maybe that's worth doing. It's back to that "how
> does the classifier know?" question again :-)

It knows what you teach it, of course <wink>.


From just@letterror.com  Thu Nov  7 23:12:45 2002
From: just@letterror.com (Just van Rossum)
Date: Fri,  8 Nov 2002 00:12:45 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHEECLDOAA.tim.one@comcast.net>
Message-ID: <r01050400-1021-6EAB9700F2A611D68CC8003065D5E7E4@[10.0.0.23]>

Tim Peters wrote:

> [T. Alexander Popiel]
> > Why don't we just store the counts, and only compute the probabilities
> > when we need to reference them?  Yes, it is more efficient for bulk
> > testing to only compute the probabilities once, but it's definitely
> > a lose for incremental training.
> 
> Unqualified judgments are always wrong <wink>.  I often get email in batches
> of 200, and scoring speed is important to me -- much more so than training
> speed.  It will be even more so at python.org, where training probably won't
> occur more often than once a week, but scoring is ongoing around the clock.

I think it can be done with almost no extra overhead with a caching scheme. This
assumes (probably wrongly <wink>) that the cache stays in memory between runs.
Something like this perhaps:

*** classifier.py   Thu Nov  7 23:03:07 2002
--- classifier.py.hack  Fri Nov  8 00:04:05 2002
***************
*** 456,459 ****
--- 456,460 ----
  
          wordinfoget = self.wordinfo.get
+         spamprobget = self.spamprobcache.get
          now = time.time()
          for word in Set(wordstream):
***************
*** 463,467 ****
              else:
                  record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
--- 464,470 ----
              else:
                  record.atime = now
!                 prob = spamprobget(word)
!                 if prob is None:
!                     prob = self.calcspamprob(word, record)
              distance = abs(prob - 0.5)
              if distance >= mindist:


Just

From popiel@wolfskeep.com  Fri Nov  8 00:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 16:06:27 -0800
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: Message from "Tim Peters" <tim@zope.com> 
	<BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
References: <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com>

In message:  <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>
             "Tim Peters" <tim@zope.com> writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
>
>I still need to try to make sense of those tests.  A real complication is
>that more than one thing changes when trying to test ratios:  it's not just
>the ratio that changes, it's the absolute number of each trained on too.

True.

>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
>and 10000 spam.  The ratios are identical.  Do we expect the error rates to
>be identical too?  I don't, but haven't tried it.

I have tried this, and the effects of ratio were diminished
as the training set size increased.  For details, see
http://www.wolfskeep.com/~popiel/spambayes/ratio2 .  The
tests were done with gary-combining, not chi-square, so I
really ought to rerun them.

>I expect the latter would do better than the former, despite the identical
>ratios, simply because more msgs allow better spamprob estimates.

It depended on what the ratio in question was... for 1:4
ham:spam, increased training set size hurt instead of helped,
in the ranges that I was able to test.  For 1:1, increased
training helped instead of hurt.

>Something missing in "the ratio tests" is a rationale (even an
>after-the-fact one) for believing there's some aspect of the system that's
>sensitive to the ratio.  The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham::spam training ratio
>has no direct effect on the spamprobs calculated.

Eh, I have a perfectly good rationale for believing that
something is sensitive to the ratio: the tests I've run
show such a sensitivity.  What's missing is a theory on
_why_ there's a sensitivity. ;-)

I don't think the following theory is perfectly phrased, but
it seems plausible to me:

Perhaps the number of topics discussed in ham is greater
than that in spam.  Thus, the average percentage of ham
messages containing a particular significant ham word is
systematically lower than the average probability of a
particular significant spam word appearing in spam messages.
As the training set size increases, the percentage difference
becomes more consistent and pronounced.  Since we're then
combining the percentages, we systematically skew slightly
due to the differing averages.

Changing the ratio of ham to spam has the effect of changing
the number of topics discussed, particularly when the training
set size is small and random chance can exclude all instances
of a given topic.  Balancing the number of topics removes the
skew in the probabilities.  As training set size increases,
adjusting the ratio has less effect, because it has less
likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low.  The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.

I suspect that by the time the corpora got that large, adjusting
the training ratio wouldn't make a lick of difference if the
corpora were sampled randomly to achieve the given ratio.  There
would just be too little chance of excluding a topic from the
samples.  Systematically excluding a topic might produce equivalent
results to my ratio tests.

- Alex

From richie@entrian.com  Fri Nov  8 00:17:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 00:17:25 +0000
Subject: [Spambayes] SMTP proxy questions
Message-ID: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

-- 
Richie Hindle
richie@entrian.com


From anthony@interlink.com.au  Fri Nov  8 00:30:09 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 08 Nov 2002 11:30:09 +1100
Subject: [Spambayes] SMTP proxy questions 
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com> 
Message-ID: <200211080030.gA80UAf11390@localhost.localdomain>


> I've discussed this with Tim S, and he's going off the SMTP proxy idea
> while I'm still broadly in favour of it.  What do people think - do
> non-Outlook users want to forward messages to 'spam' and 'ham' to train the
> system, or use an HTML UI?

I'd have to say I don't like the idea. There are too many potential places
where it can all go horribly horribly pear-shaped, and too many rat-holes
that the various email clients can screw up with.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From jbublitz@nwinternet.com  Fri Nov  8 01:15:29 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <XFMail.021107171529.jbublitz@nwinternet.com>

On 08-Nov-02 Richie Hindle wrote:
> Or perhaps there's another way.  The only other option I'd
> thought of was to add two hyperlinks to the end of the message,
> "This is spam" and "This is ham" (in ways that would work for
> both HTML and plain-text messages, in both HTML and plain-text
> email clients).  They'd link to the HTML interface and tell it
> the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more
> intrusive than the .sig that mailman adds.

What about adding a MIME object to the msg with the Spambayes info
(text/spambayes?) - or will forwarding lose that info too? The
email module should be able to do this.
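
A rough sketch of what attaching such a part could look like with the
standard email package.  The application/x-spambayes content type, the
function names and the cache ID format are all assumptions made up for this
example, and it says nothing about whether a given mail client preserves the
part when forwarding.

```python
from email import message_from_bytes
from email.message import EmailMessage

SPAMBAYES_SUBTYPE = "x-spambayes"  # hypothetical subtype, not a registered one


def attach_cache_id(msg, cache_id):
    """Add a small extra MIME part carrying the proxy's cache ID."""
    msg.add_attachment(cache_id.encode("ascii"),
                       maintype="application", subtype=SPAMBAYES_SUBTYPE)
    return msg


def find_cache_id(msg):
    """Return the cache ID from the extra part, or None if it is missing."""
    wanted = "application/" + SPAMBAYES_SUBTYPE
    for part in msg.walk():
        if part.get_content_type() == wanted:
            return part.get_payload(decode=True).decode("ascii")
    return None


msg = EmailMessage()
msg["Subject"] = "test"
msg.set_content("Hello, world.\n")
attach_cache_id(msg, "cache-0001")

# Serialise and re-parse, as a proxy would see the message on the wire.
roundtrip = message_from_bytes(msg.as_bytes())
print(find_cache_id(roundtrip))
```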

Jim


From tim.one@comcast.net  Fri Nov  8 04:07:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:07:18 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEPMCHAB.tim.one@comcast.net>

FYI, that option is gone now.

From tim.one@comcast.net  Fri Nov  8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPOCHAB.tim.one@comcast.net>

The original names made more sense when we had half a dozen competing
schemes.

Current                         Proposed
-------                         --------
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Note:  unknown_word_prob is what the Bayesian prob adjustment moves toward,
more strongly the less evidence backs up a counting spamprob estimate (the
fewer the msgs a word has been seen in, the more the adjustment pushes the
spamprob toward unknown_word_prob; for a word that's never been seen before,
this reduces to unknown_word_prob exactly).
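
For readers who want the arithmetic spelled out, here is a small sketch of
that adjustment in the form Gary Robinson described it, (s*x + n*p) / (s + n).
The function name is made up, and the default strength of 0.45 is an assumed
value for illustration; check Options.py for what is really in use.

```python
def adjusted_spamprob(counting_prob, n,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Pull a by-counting spamprob toward unknown_word_prob.

    n is the number of trained msgs the word has been seen in.  The smaller
    n is, the more the result leans on unknown_word_prob.
    """
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * counting_prob) / (s + n)


# A word seen only in spam (by-counting prob 1.0), with growing evidence.
for n in (0, 1, 10, 1000):
    print(n, adjusted_spamprob(1.0, n))
```

With n == 0 the result is unknown_word_prob itself, and for a spam-only word
the adjusted value approaches 1.0 only as n grows, which is one way the total
number of training msgs caps how extreme a spamprob can get.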

We've always set it to 0.5 by default, and previous tests never showed
benefit from changing that.

We've gotten better since then, though, and it's possible to deduce "a more
correct" value.  For example, take the mean of all the by-counting spamprobs
in your database, across words that have appeared in at least 10 msgs (so
that there's reason to have *some* confidence in the by-counting guess).
That's then an estimate of the spamprob a new word will eventually get over
time.

Across 3 databases I tried this on, it turned out to be a little over 0.5,
from 0.513 (my home personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to
find this guess for one of your databases; then if it differs from 0.5, try
a before-and-after test just changing that much.  If there's any promise
here, update_probabilities() could easily be changed to compute and use this
automatically.

"""
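import cPickle as pickle
f = file('fat.pik', 'rb')  # your database pickle goes here

c = pickle.load(f)
f.close()
w = c.wordinfo

def guessx():
    # "or 1.0" keeps the ratios below from dividing by 0 on an empty corpus.
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        # Only trust words that have appeared in at least 10 msgs.
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            # The by-counting spamprob, before any Bayesian adjustment.
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    if n:
        print n, probsum / n
    else:
        print "no word has appeared in at least 10 msgs"

guessx()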
"""


From mhammond@skippinet.com.au  Fri Nov  8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEKIHJAA.mhammond@skippinet.com.au>

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts <wink>.

I think that the classes I posted a while ago suffer from the exact reverse
problem as your idea.  My idea was to make a "message store" that is largely
independent of training.  I believe the problem with your design is that it
deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests here.
My focus is towards clients, and specifically the outlook one (if there were
more clients I would be happy to think of them too <wink>).  A lot of the
focus of this group is towards admins rather than individuals (which is just
fine!)  But it seems the current thinking is of a corpus as being a fairly
static, well-controlled set of messages used almost purely for training
purposes.

For client programs, this may not be practical.  The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the
message in the user's real inbox.  This may or may not be what is intended -
a corpus "move" operation is more about changing a message's classification
than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that.  For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes
that into consideration would be good.

> I may be putting too much into the base class by demanding that the text
> of the message be given to the constructor - that precludes making
> FileMessage lazy, and only read the file when it needs to.]

It also defeats the abstract nature of the class.

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already
exist for this engine are the best ones to use to discover the naivety early
<wink>

> You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:
* An abstract "message id"
* A message classification database, as discussed before - basically just a
dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch
training.  It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability
and message databases.
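
One way the five pieces in the list above might fit together, as a sketch
only.  ToyClassifier stands in for the real classifier and just counts
tokens, the message store is a plain dict of message text, and every class
and method name here is hypothetical rather than taken from the checked-in
code.

```python
class ToyClassifier:
    """Stand-in for the real classifier: keeps token counts only."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.wordinfo = {}  # token -> [hamcount, spamcount]

    def learn(self, tokens, is_spam):
        self._update(tokens, is_spam, +1)

    def unlearn(self, tokens, is_spam):
        self._update(tokens, is_spam, -1)

    def _update(self, tokens, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        for token in set(tokens):
            counts = self.wordinfo.setdefault(token, [0, 0])
            counts[1 if is_spam else 0] += delta


class Trainer:
    """Training API: updates the probability and classification databases."""

    def __init__(self, classifier, store):
        self.classifier = classifier
        self.store = store       # message store: message id -> message text
        self.classified = {}     # classification db: id -> "spam" or "ham"

    def train(self, msg_id, is_spam):
        label = "spam" if is_spam else "ham"
        previous = self.classified.get(msg_id)
        if previous == label:
            return               # already trained this way, don't double-count
        tokens = self.store[msg_id].split()
        if previous is not None:
            # Undo the earlier, opposite training before learning anew.
            self.classifier.unlearn(tokens, previous == "spam")
        self.classifier.learn(tokens, is_spam)
        self.classified[msg_id] = label

    def train_corpus(self, msg_ids, is_spam):
        """A corpus is just an iterable of message ids for batch training."""
        for msg_id in msg_ids:
            self.train(msg_id, is_spam)
```

Keeping the classification database inside the training call is a design
choice of this sketch: it lets a client ask for a message to be trained as
spam without knowing whether it was trained before, or how.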

At that level, we really don't need much else - no folders or any other
grouping of messages.  I'm really not too sure there is much value in adding
higher-level concepts such as folders or message store "move" operations -
certainly not at the outset, where there are too many competing
requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but may be too simple to be necessary.  If the process of
re-training a message in the Outlook GUI becomes:

def RetrainMessageAsSpam():
	# Outlook specific code to get an ID.
	message = message_store.GetMessage(id)
	if not classifier.IsSpam(message):
		classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.  Unfortunately, the
decision to perform the retrain is the complex, but client specific part.
Is this a newly delivered message?  Did the user manually move the message
somewhere?  Did the user click one of our buttons?  Is the user deleting old
ham that we want to train on before it dies forever?

Outlook does this via examining what Outlook event we are seeing, and
looking at meta-data we possibly previously attached to the message.  I'm
not sure this can be encapsulated well at the moment without adding all our
meta-data etc baggage to the base classes.

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...  This is doomed to failure.  Code *must* do something to be
taken seriously.  At the very least, I would expect to see the existing test
driver framework running against these "abstract concepts" <wink>

> Once the framework is there, most of the code needed to implement the
> functionality should already be in the project - code to hook into
> Outlook, to train on a message, to parse mbox files, and so on.  It just
> needs hooking into the framework.

See above <wink>.

Mark.


From tim.one@comcast.net  Fri Nov  8 04:50:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:50:42 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAACIAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> The most difficult problem for retraining-by-forwarding is matching the
> forwarded message to one from the cache, after Outlook Express
> has stripped the headers, top-quoted the user's .sig, converted it
> to HTML and added fifteen macro viruses.  Any ideas?

If user can be convinced to forward as an *attachment*, those problems go
away, at least in OE.  You can create a new msg there, select any number of
msgs, drag them to the msg as a group, and OE will create an attachment for
each one.  Unlike Outlook, OE appears to save the original stuff that came
in over the wire (we're finding it's a real hoot in the OL client to try to
guess what the original MIME structure may have been).

> Can the tokeniser help?

If you put in a token unique to each msg, sure <wink>.  Perhaps the "loose
checksum" program Skip checked in could be useful for this.
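
To make the "loose checksum" idea concrete, here is a hypothetical sketch.
It is not Skip's program, just one way to build a fingerprint that survives
re-quoting, re-wrapping and case changes.  It would not survive a client that
adds a .sig or rewrites the body as HTML, which is the harder part of the
problem Richie describes.

```python
import hashlib
import re


def loose_fingerprint(text):
    """Hash only the lowercased runs of letters and digits in the text."""
    # Quote marks, punctuation and whitespace differences are all dropped.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()


original = "Buy cheap pills NOW!\nClick here."
forwarded = "> Buy cheap   pills now!\n> Click here.\n"
print(loose_fingerprint(original) == loose_fingerprint(forwarded))
```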


From tim.one@comcast.net  Fri Nov  8 05:06:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:06:43 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEACCIAB.tim.one@comcast.net>

[Richie Hindle]
> A quick note in case someone decides to remove the counts from the
> database:

Neil Schemenauer already does, in his CDB code (neil*.py).  It's a lean
scoring-only database, mapping tokens to *just* spamprobs.  If he went on to
store them as scaled ints, he could almost certainly reduce this to 2 bytes
of prob info per token, and possibly even just 1.
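
A sketch of the scaled-int idea, assuming a 16-bit unsigned value per token.
The function names and the little-endian layout are choices made for this
example, not anything taken from Neil's code.

```python
import struct

SCALE = 65535  # largest value that fits in an unsigned 16-bit int


def pack_prob(prob):
    """Store a spamprob in [0.0, 1.0] as 2 bytes."""
    return struct.pack("<H", int(round(prob * SCALE)))


def unpack_prob(data):
    """Recover the spamprob, accurate to within about 1/131070."""
    (scaled,) = struct.unpack("<H", data)
    return scaled / SCALE


packed = pack_prob(0.513)
print(len(packed), unpack_prob(packed))
```

The 1-byte variant would use a scale of 255 and the "B" format instead, at
the cost of resolving spamprobs only to about 0.004.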

> the HTML front end has a "Word query" feature which will tell you the
> information in the database for a given word - it's interesting to see
> how many more times the word 'Viagra' appears in ham than in spam.  I
> mean the other way round.

What a geek <wink>.


From tim.one@comcast.net  Fri Nov  8 05:48:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:48:25 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-6EAB9700F2A611D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>

[Just van Rossum]
> I think it can be done with almost no extra overhead with a
> caching scheme.  This assumes (probably wrongly <wink>) that
> the cache stays in memory between runs.
> Something like this perhaps:
>
> *** classifier.py   Thu Nov  7 23:03:07 2002
> --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> ***************
> *** 456,459 ****
> --- 456,460 ----
>
>           wordinfoget = self.wordinfo.get
> +         spamprobget = self.spamprobcache.get
>           now = time.time()
>           for word in Set(wordstream):
> ***************
> *** 463,467 ****
>               else:
>                   record.atime = now
> !                 prob = record.spamprob
>               distance = abs(prob - 0.5)
>               if distance >= mindist:
> --- 464,470 ----
>               else:
>                   record.atime = now
> !                 prob = spamprobget(word)
> !                 if prob is None:
> !                     prob = self.calcspamprob(word, record)
>               distance = abs(prob - 0.5)
>               if distance >= mindist:

Sorry, I don't know what this is trying to accomplish.  Like, what is
self.spamprobcache?  There's no such thing now, and the patch doesn't appear
to create one (i.e., this code doesn't run).  Whatever it's supposed to be,
why isn't spamprobcache.get *itself* responsible for returning a spamprob,
instead of making its caller deal with two cases?  If the answer is "it's
supposed to be a dict, so .get ain't that smart", then the memory burden for
a long-running scorer process will zoom, negating one of the benefits people
attached to "real databases" thought they were buying in return for giant
files and slothful performance <wink>.
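
For what it's worth, a sketch of a cache whose get() hides the miss case from
its caller and whose size is bounded, so a long-running scorer doesn't grow
without limit.  SpamProbCache, its compute callback and the maxsize default
are all hypothetical; nothing like this exists in classifier.py.

```python
from collections import OrderedDict


class SpamProbCache:
    """Bounded least-recently-used cache of word -> spamprob."""

    def __init__(self, compute, maxsize=10000):
        self._compute = compute    # callable(word) -> spamprob, run on a miss
        self._maxsize = maxsize
        self._probs = OrderedDict()

    def get(self, word):
        try:
            prob = self._probs[word]
        except KeyError:
            prob = self._compute(word)
            self._probs[word] = prob
            if len(self._probs) > self._maxsize:
                self._probs.popitem(last=False)   # evict least recently used
        else:
            self._probs.move_to_end(word)         # mark as recently used
        return prob

    def clear(self):
        """Drop everything; call this after any training."""
        self._probs.clear()
```

Note that clear() has to drop the whole cache rather than just the words in
the newly trained msg: a by-counting spamprob depends on the total ham and
spam counts, so training on one msg shifts the estimate for every word.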

Life would be easier if databaseheads trained all they liked as often as
they liked, but refrained from calling update_probabilities() until the end
of the day (or other "quiet time").  The idea that the model should be
updated after every msg trained on is an extreme.


From tim.one@comcast.net  Fri Nov  8 06:23:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:23:13 -0500
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAICIAB.tim.one@comcast.net>

[Richie Hindle, cogitates about Messages and their Corpus(ora)]

That's the ticket!  Backing off to a more fundamental level looks useful to
me too.  We never even straightened that much out for testing purposes
(msgs.py isn't general enough; for some custom test drivers (never checked
in), I couldn't even reuse the MsgStream class for my *own* directory
structures).

I disagree with Mark's

> If the process of re-training a message in the Outlook GUI becomes:
>
> def RetrainMessageAsSpam():
>     # Outlook specific code to get an ID.
>     message = message_store.GetMessage(id)
>     if not classifier.IsSpam(message):
>         classifier.train(message, is_spam=True)
>
> And not a whole lot else, it doesn't seem worth it.

because it illustrates the point <wink>:  it doesn't look like a correct
re-training method (although it may be, depending on assumptions about where
"id" comes from, and what assorted classifier methods do), and while a
correct method shouldn't be hard, in the absence of a class dedicated to
doing the simple common things that *can* be done in a common way, everyone
will keep screwing it up in their own client code.

> ...
> You might want to run it past Tim Peters, 'cos he's *far* better at this
> kind of thing than I am (though he's also busy).

I have to do more Python and Zope work now, so have to guard my time on
*this* project more jealously than I have.  MarkH and SeanT and JeremyH all
have ideas here too, and I trust you'll sort them out as a harmonious family
bent on world domination.  As a general strategy, the first person to check
code in usually wins <wink -- but take a clue from Mark, and from the
earlier days of this project, and from the pop3 proxy, and sling code more
than talk about it -- refactoring in Python is easy when the need becomes
apparent from real life>.

> ...
> The mark of a good framework is when you write a tiny little class (like
> AutoTrainer above for instance) that contains hardly any code but adds a
> major new feature (in this case, automatic training when moving messages
> around in Outlook).

The client-specific code to hook and track msg movement in Outlook is
relatively massive, so everything else appears a drop in the bucket to Mark.
Nevertheless, if a usable framework for capturing the *common* part of this
stuff were available, removing the 5 lines of code quoted above would help
(the Outlook client, and all others).


From B-Morgan@concentric.net  Fri Nov  8 06:25:30 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Thu, 7 Nov 2002 23:25:30 -0700
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <NABBJOOEOFODEALNMJAJMEOGHBAA.B-Morgan@concentric.net>

As I see it, having pop3proxy keep copies of the messages and using an HTML
UI for training has the least amount of dependency on the email client's
forwarding capabilities (or lack thereof).

I have a severe aversion to opening spam that will probably carry over to
unsure messages, so having a link added to the message body may not do me
much good.

I will, however, go to an HTML UI and examine a message if that UI doesn't
"execute" the HTML.  I don't want to see pretty, raw data is good enough for
me to decide.

I hate to keep mentioning a "rival" project <G>, but popfile's UI seems
pretty close to what I think would work best here.

Regards,

Brad

-----Original Message-----
From: spambayes-bounces@python.org
[mailto:spambayes-bounces@python.org]On Behalf Of Richie Hindle
Sent: Thursday, November 07, 2002 5:17 PM
To: spambayes@python.org
Subject: [Spambayes] SMTP proxy questions


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

--
Richie Hindle
richie@entrian.com



From tim.one@comcast.net  Fri Nov  8 06:46:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:46:14 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAKCIAB.tim.one@comcast.net>

[Moore, Paul]
> ...
> I'm assuming (based on a message I recall seeing recently) that it's
> possible to "correct" training - ie, if I train the classifier that a
> specific message is spam, I can later say "no it isn't, it's ham".

That's right, and at the level of classifier.py it's a two-step process:
unlearn() as spam, then learn() as ham.  It actually doesn't matter which
order those are done in, but I won't admit to that <wink>.
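
A minimal sketch of that two-step correction, assuming a toy count store
rather than the real classifier.py (the class TinyCounts and the helper
retrain() are hypothetical names used only for illustration):

```python
from collections import Counter


class TinyCounts:
    """Toy stand-in for a classifier's training counts (not classifier.py)."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, words, is_spam):
        # Count each distinct word once per message.
        counts = self.spam if is_spam else self.ham
        counts.update(set(words))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def unlearn(self, words, is_spam):
        # Exactly reverse an earlier learn() of the same message.
        counts = self.spam if is_spam else self.ham
        counts.subtract(set(words))
        for word in set(words):
            if counts[word] <= 0:
                del counts[word]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


def retrain(counts, words, was_spam):
    """Correct a wrong label: forget the old class, learn the new one."""
    counts.unlearn(words, was_spam)
    counts.learn(words, not was_spam)
```

Because the two steps touch separate counters, doing learn() before
unlearn() leaves this toy store in the same final state.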

> Assuming that this is so, is it not reasonable to train dynamically
> on an "assume I got it right" basis?

Depending on context, it *may* be.

> In other words, whenever the addin filters a message as ham or spam,
> automatically train on that basis as well. Then, if the user sees a
> mistake, he corrects it, which automatically retrains the classifier
> (manually deleting as spam or moving a message already does this).

Assuming a conscientious user, and a client that knows enough about what the
user is doing, that should work fine.

> This will keep the database right up to date, and all the user has to
> do is correct any bad decisions the classifier makes (which he should
> be doing anyway).
>
> I've ignored database growth issues, but other than that, is there any
> other problem with this approach?

Doubtless hundreds, but why quibble <wink>.  A misclassified msg will have
bad effects at once if the training gets reflected into the probabilities at
once, so it gets less appealing the less zealous the user is about
correcting mistakes right away.  That can be mitigated by doing the day's
training into a distinct dict, or not calling update_probabilities() in a
single dict, until "the end of the day", when the user has (presumably)
corrected all the day's mistakes they're going to correct.  But if the model
updating is going to be delayed anyway, then it makes as much sense to delay
doing any training on "the day's" msgs until the end of the day.
Determining what "the end of the day" means is a puzzle then too.  For
example, maybe I left my email client running and went on a week-long
vacation.  I'm not going to look over 700 presumed spam when I get back,
I'll just delete it.  But if ham was in there, I've now let it train in the
wrong direction, and that will hurt.
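
One way to picture the "hold the day's training back" idea is a small
pending queue.  This is only a sketch under assumptions: DeferredTrainer
and its learn callback are hypothetical, and it does nothing to answer the
question of when "the end of the day" actually is.

```python
class DeferredTrainer:
    """Collect the day's classifications; train only when flushed."""

    def __init__(self, learn):
        self._learn = learn    # hypothetical hook: learn(msg_id, is_spam)
        self._pending = {}     # msg_id -> is_spam, latest judgment wins

    def record(self, msg_id, is_spam):
        # The filter's own guess, not yet reflected in the model.
        self._pending[msg_id] = is_spam

    def correct(self, msg_id, is_spam):
        # A user correction simply overwrites the pending guess.
        self._pending[msg_id] = is_spam

    def flush(self):
        # Called at "quiet time": train on whatever labels survived.
        count = len(self._pending)
        for msg_id, is_spam in self._pending.items():
            self._learn(msg_id, is_spam)
        self._pending.clear()
        return count
```

A mistake corrected before flush() never reaches the model; a mistake
nobody corrects still does, which is the vacation problem described above.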

In other contexts, the scheme doesn't get off the ground.  For example, for
python.org use, nobody is going to review msgs claimed to be spam.  A system
feeding on its own judgments is going to reinforce its own mistakes too, so
the "conscientious, timely, reviewing human" bit is important.


From tim.one@comcast.net  Fri Nov  8 07:20:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 02:20:18 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMEFCHJAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> The key limitation of this scheme, as Tim also alludes to, is that this
> never correctly classifies ham.  However, I actually see this
> incremental training more as a "get smarter now" than a "just get
> smarter" technique - ie, a user sees a mis-classified Spam, by re-
> training they are increasing the chances that the next similar mail
> will be handled correctly.  Instant feedback, especially while the user
> is getting started.
>
> ie, it is indeed "mistake based training", but that may still prove
> useful in addition to ongoing training.

I sure agree it's *very* useful at the start, and expect it will continue to
be useful over time.

> I can't help thinking that we are somehow underestimating our own
> tool here.

I'm going to try an experiment:  I'm going to wipe my home database and
start over from scratch, training first on one ham and one spam, then only
on mistakes and unsures.  This should be fun <wink>.

> As is common when people first use this tool, spam is generally
> found in the ham set and vice-versa.  Because of this, I know that my
> Inbox is spam free (but less sure about my other "ham" folders).  I'm
> also sure that my Spam folder has no ham.  This should remain true
> while I continue to use the tool.

How do you know your Spam folder has no ham?  I know mine doesn't because I
routinely score it, sort on the score, and stare at "the wrong end".  I find
ham there as often as not, *usually* apparently due to mousing error when
dragging a training ham into the Ham folder and overshooting the mark.

> So surely we can exploit this somehow.  Off the top of my head:
> * Assume we don't trust the last 2 days of mail (as the user may not
> yet have sorted them).  Anything in the "good" and "spam" folders older
> than this can be assumed correctly classified, and able to be trained
> on.

Provided the user has already done a decent amount of training, then as Paul
Moore suggested it could even work to trust ham-vs-spam decisions
immediately, and let user corrections undo those as needed.  A well-trained
system should be pretty robust against a few misclassifications over the
short term.

> * A process could go through all ham and spam trained on, and score each
> message.  Any "suspect" messages are presented in a list (much like the
> Outlook "Find Message" result list).  The user can indicate that the
> message is correct (and the system will remember, never asking about
> this message again) or is indeed incorrectly classified.  If incorrect,
> it will be moved, and incrementally trained as per now.  (I can also
> picture a whitelist kicking in here; if incorrect, offer to add user to
> whitelist.  If user in the whitelist, assume ham thereby meaning mail
> from this person can never again be spam)

Tell us about the mistakes *you* see.  I feel like we're designing a
solution to a hypothetical problem otherwise.  The only "mistake" I
routinely see is that my cigarettes-via-web advertising keeps getting
knocked back into Unsure territory.  That doesn't bother me enough to do
anything about it, but if it bothers you enough <wink> then, yes, a
whitelist would solve that one.

> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder.  The user
should already know what to do about things in Unsure, so it's nothing new
to them.  Moving a msg out of Unsure could be taken as a positive sign that
the user has classified such a msg once and for all (well, until they move
it again, anyway).

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.


From just@letterror.com  Fri Nov  8 07:54:04 2002
From: just@letterror.com (Just van Rossum)
Date: Fri,  8 Nov 2002 08:54:04 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>
Message-ID: <r01050400-1021-4B54CB90F2EF11D68CC8003065D5E7E4@[10.0.0.23]>

Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme.  This assumes (probably wrongly <wink>) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py   Thu Nov  7 23:03:07 2002
> > --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> 
> Sorry, I don't know what this is trying to accomplish.  Like, what is
> self.spamprobcache?  There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run). 

Tim, don't be such a programmer <wink>. But ok, I promise I'll never post
pseudocode as a patch again...

> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases? 

I thought I was doing your performance needs a favor <wink>.

> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",

That's the answer.

> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance <wink>.

Right. If a float takes up 20 bytes in memory (just a guess, no time to look),
then for a database of 100000 words (that's roughly the size of my personal db)
the memory burden is 100000 * (8 + 20), almost three megs.
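
A sketch of the cache variant in which the lookup itself returns a
spamprob, so the caller has only one case to handle.  It assumes a modern
Python (dict.__missing__ did not exist in 2002); SpamProbCache and the
compute callback are hypothetical names.  The per-entry figures 8 and 20
are the guesses from the paragraph above, not measurements.

```python
class SpamProbCache(dict):
    """Dict that computes and remembers a spamprob on first lookup."""

    def __init__(self, compute):
        super().__init__()
        self._compute = compute   # hypothetical: compute(word) -> float

    def __missing__(self, word):
        # Reached via cache[word] (not cache.get(word)) on a miss.
        prob = self[word] = self._compute(word)
        return prob


def cache_bytes(nwords, per_entry_overhead=8, float_size=20):
    """Rough memory estimate using the guessed sizes quoted above."""
    return nwords * (per_entry_overhead + float_size)
```

With these guesses, cache_bytes(100000) comes to 2,800,000 bytes, the
"almost three megs" mentioned above.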

Just in case the higher memory usage is not an issue, there's a simpler
approach: don't store spamprob in the db, but call bayes.update_probabilities()
on startup. update_probabilities() takes about 2 seconds on my lowly 400Mhz PPC
on my db (hm, that's using pickle, so will be a lot more when using a db :-( ).
You can tell I'm thinking mostly about long running processes...

I guess you're right, one size doesn't fit all. One last idea for this morning:
how about splitting the db in a training db (storing hamcount and spamcount) and
a classifying db (storing only spamprob)?
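
A rough sketch of that split, assuming plain dicts stand in for both
databases.  build_classifying_db and naive_prob are hypothetical names, and
naive_prob is only a placeholder estimate, not the project's formula.

```python
def build_classifying_db(training_db, nham, nspam, calc_prob):
    """Derive word -> spamprob from word -> (hamcount, spamcount)."""
    return {
        word: calc_prob(hamcount, spamcount, nham, nspam)
        for word, (hamcount, spamcount) in training_db.items()
    }


def naive_prob(hamcount, spamcount, nham, nspam):
    # Placeholder: share of the word's relative frequency that is spam.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)
```

Training touches only the counts; the classifying db is rebuilt from them
whenever a refresh is wanted, so a scorer never has to load the counts.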

> Life would be easier if databaseheads trained all they liked as often as
> they liked, but refrained from calling update_probabilities() until the end
> of the day (or other "quiet time").  The idea that the model should be
> updated after every msg trained on is an extreme.

Good points.

Just

From richie@entrian.com  Fri Nov  8 08:06:33 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 08:06:33 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
References: <B9EFE999.5C289%francois.granger@free.fr>
	<r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <rirmsuse1blns4r3h9apiibvcluabnd9g7@4ax.com>


[Just]
> the web interface of pop3proxy.py is pretty good and useful, the only
> downside is that it saves the database after each training

That's now fixed (at least partly) along with some other bits:

 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome command.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Fri Nov  8 09:15:24 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 04:15:24 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBFCIAB.tim.one@comcast.net>

[Tim]
> ...
> I'm going to try an experiment:  I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures.  This should be fun <wink>.

It is!  The msg from me I'm replying to here scored 94 (solid spam).  I've
now got 5 ham and 5 spam in my training set, most of the new ones from
Unsures.  The latest spam was a blatant false negative, from Hapax City:

'*H*'                          0.998601
'*S*'                          8.60833e-005
'can'                          0.0652174
'have'                         0.0652174
"don't"                        0.0918367
'never'                        0.0918367
'number'                       0.0918367
'one'                          0.0918367
'what'                         0.0918367
'"the'                         0.155172   ham hapaxes from here
'able'                         0.155172
'about'                        0.155172
'against'                      0.155172
'also'                         0.155172
'any'                          0.155172
'anything'                     0.155172
'back'                         0.155172
'because'                      0.155172
'been'                         0.155172
'check'                        0.155172
'even'                         0.155172
'find'                         0.155172
'found'                        0.155172
'heard'                        0.155172
'how'                          0.155172
'into'                         0.155172
"it's"                         0.155172
'more'                         0.155172
'needed'                       0.155172
'other'                        0.155172
'out'                          0.155172
'own'                          0.155172
'people'                       0.155172
'skip:a 10'                    0.155172
'skip:i 10'                    0.155172
'special'                      0.155172
'subject:.'                    0.155172
'subject:: '                   0.155172
'their'                        0.155172
'them.'                        0.155172
'they'                         0.155172
'those'                        0.155172
'time'                         0.155172
'time.'                        0.155172
'unsubscribe'                  0.155172
'until'                        0.155172
'useful'                       0.155172
'using'                        0.155172   to here
'and'                          0.275281
'for'                          0.275281
'subject: '                    0.275281
'you'                          0.275281
'from'                         0.355072
'not'                          0.355072
'off'                          0.355072
'our'                          0.355072
'when'                         0.355072
'new'                          0.644928
'see'                          0.644928
'url:gif'                      0.724719
'url:www'                      0.724719
'call'                         0.844828   spam hapaxes from here
'contact'                      0.844828
'credit'                       0.844828
'email.'                       0.844828
'every'                        0.844828
'further'                      0.844828
'header:Received:2'            0.844828
'made'                         0.844828
'more!'                        0.844828
'most'                         0.844828
'now'                          0.844828
'plus,'                        0.844828
'receive'                      0.844828
'search'                       0.844828
'skip:1 10'                    0.844828
'url:jpg'                      0.844828   to here
'email'                        0.908163

I think I've established that 5+5 isn't enough for great results <snort>.
However, 80% of its decisions have been correct so far!
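
The repeated values in that listing can be reproduced with a Bayesian
adjustment of the raw ratio, f = (s*x + n*p) / (s + n).  The sketch below
assumes a strength s = 0.45, a prior x = 0.5, and equal numbers of trained
ham and spam; those values are assumptions, chosen because they match the
figures shown.

```python
def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability for a word seen in at least one message."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)   # raw estimate
    n = hamcount + spamcount                 # evidence behind the estimate
    return (s * x + n * p) / (s + n)         # pull toward x when n is small


# Words seen once, with 5 ham and 5 spam trained:
print(round(spamprob(1, 0, 5, 5), 6))   # ham hapax
print(round(spamprob(0, 1, 5, 5), 6))   # spam hapax
```

Under these assumptions a ham hapax comes out at 0.155172 and a spam hapax
at 0.844828, and a word seen in 3 ham and 1 spam gives 0.275281, as for
'and' above.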


From tdickenson@devmail.geminidataloggers.co.uk  Fri Nov  8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed.  A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of
unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two
separate mailboxes. These mailboxes are used for overnight batch training,
then deleted. Messages marked 'Unsure' do not take part in this automatic
training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am
manually inspecting the unsure folder, and manually adding them to the
appropriate training mailboxes. Initially about 3% of mails were 'unsure',
but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the
same as 'ham', and stop all manual training. It will be interesting to see if
the training remains stable.


From bkc@murkworks.com  Fri Nov  8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way.  The only other option I'd thought of was
> to add two hyperlinks to the end of the message, "This is spam" and "This
> is ham" (in ways that would work for both HTML and plain-text messages, in
> both HTML and plain-text email clients).  They'd link to the HTML interface
> and tell it the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more intrusive
> than the .sig that mailman adds.

If you do this, what's to keep spammers from also adding similar looking URLs?

A busy person might not notice any difference, could click through and confirm their 
email address...

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From barry@python.org  Fri Nov  8 15:04:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 10:04:56 -0500
Subject: [Spambayes] SMTP proxy questions
References: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
	<XFMail.021107171529.jbublitz@nwinternet.com>
Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> What about adding a MIME object to the msg with the Spambayes
    JB> info (text/spambayes?) - or will forwarding lose that info
    JB> too? The email module should be able to do this.

Of course that would have to be text/x-spambayes :)

-Barry


From randy.diffenderfer@eds.com  Fri Nov  8 17:21:25 2002
From: randy.diffenderfer@eds.com (Diffenderfer, Randy)
Date: Fri, 8 Nov 2002 12:21:25 -0500 
Subject: [Spambayes] SMTP proxy questions
Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com>

|>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
|
|    JB> What about adding a MIME object to the msg with the Spambayes
|    JB> info (text/spambayes?) - or will forwarding lose that info
|    JB> too? The email module should be able to do this.
|
|Of course that would have to be text/x-spambayes :)
|
|-Barry

While a fair portion of messages may very well be MIME compliant, this
wouldn't work without some serious munging around for non-MIME messages, as
well as being very problematic for the many deformed MIME (read very NON
compliant :-) ) messages floating around out there!
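
As a rough sketch of the MIME-part idea for the two easy cases only (a
plain non-MIME message and an already multipart one), using the standard
email package.  add_spambayes_part is a hypothetical helper, and badly
formed MIME, the hard case raised above, is deliberately not handled.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

# Headers that describe the old body and must not be copied to the wrapper.
_BODY_HEADERS = {"content-type", "content-transfer-encoding", "mime-version"}


def add_spambayes_part(raw, info):
    """Return a message carrying `info` as a text/x-spambayes part."""
    msg = message_from_string(raw)
    extra = MIMEText(info, "x-spambayes")
    if msg.is_multipart():
        msg.attach(extra)
        return msg
    # Non-MIME or single-part text: wrap the old body in multipart/mixed.
    wrapper = MIMEMultipart("mixed")
    for name, value in msg.items():
        if name.lower() not in _BODY_HEADERS:
            wrapper[name] = value
    wrapper.attach(MIMEText(msg.get_payload(), "plain"))
    wrapper.attach(extra)
    return wrapper
```

Whether a mail client keeps such a part when forwarding is a separate
question that this sketch does not answer.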

Just an observation...


From jbublitz@nwinternet.com  Fri Nov  8 17:10:33 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL>
Message-ID: <XFMail.021108091033.jbublitz@nwinternet.com>

On 08-Nov-02 Barry A. Warsaw wrote:
> 
>>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
> 
>     JB> What about adding a MIME object to the msg with the Spambayes
>     JB> info (text/spambayes?) - or will forwarding lose that info
>     JB> too? The email module should be able to do this.
> 
> Of course that would have to be text/x-spambayes :)

<searching for an excuse for my embarrassing oversight>
Well - there's application/ms-excel or some such. Isn't spambayes
just as good? :)

Point taken.

Jim


From barry@python.org  Fri Nov  8 17:33:53 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 12:33:53 -0500
Subject: [Spambayes] SMTP proxy questions
References: <15819.53912.407893.819241@gargle.gargle.HOWL>
	<XFMail.021108091033.jbublitz@nwinternet.com>
Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> <searching for an excuse for my embarrassing oversight> Well -
    JB> there's application/ms-excel or some such. Isn't spambayes
    JB> just as good? :)

It depends on whether you hold the IETF and IANA in as high regard as
Microsoft does <wink>.

http://www.iana.org/assignments/media-types/

-Barry

From lists@morpheus.demon.co.uk  Fri Nov  8 21:07:45 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:07:45 +0000
Subject: [Spambayes] Outlook plugin - training
References: <n2m-g.u1iue32h.fsf@morpheus.demon.co.uk>
	<BIEJKCLHCIOIHAGOKOLHMEDGDOAA.tim@zope.com>
Message-ID: <n2m-g.wunn3h3y.fsf@morpheus.demon.co.uk>

"Tim Peters" <tim@zope.com> writes:

[About the plugin code...]
> I'm more lost than not in it myself!

That makes me feel better :-)

[About bothering with leaving list traffic out]
> Don't worry about it before you try it.  I suggest trying it because I'm not
> sure it's possible to *stop* the system now from scoring all incoming msgs
> (the "new msg in Inbox" filter appears to trigger for every one, regardless
> of whether the RW decides to move it; after that it may just be a race
> between the RW and the addin deciding where to move each).

OK, I've switched over. I now have one Spam folder, one Potential Spam
folder, and the rest are Ham (actually, some historic archive folders
I've left out, but that's just because I never use them any
more). We'll see how it goes.

>> Of course, I know that the classifier *really* works by magic, and
>> so my intuition is useless :-)
>
> It's more that unless you know exactly how the math works, your intuition is
> simply baseless here, carried over from some other experience.  Do *you*
> have trouble distinguishing personal and work email from spam?  There you
> go, and you can't even compute inverse chi-squared probabilities to 14
> significant digits on demand in your head <wink>.

How do *you* know I can't compute inverse chi-squared probabilities in
my head? Oh, hang on - you wanted me to get the right answer, didn't
you? :-)

> What's to manage?  I get about 600 emails per day, and about 1% end
> up in Unsure (about 6 -- actually less than that, lately; the system
> is learning).

My ratio is still a lot worse than that. But as I say, my training
corpus is still quite small. But you're right - managing a few mails
isn't hard. It's just that the overall results are *so* much better
than the old home-grown solution I used that I became instantly spoiled
:-)

Seriously, I've said this before, but what you guys have developed
here is *phenomenally* good. I've reached the point where I look
forward to getting spam, just because I enjoy so much seeing it
automatically appear in the spam folder :-)

>> My instinctive reaction is that I want "Spam" and "Not Spam" buttons,
>> and then I read or delete the message in situ.
>
> MarkH has since implemented this in the Unsure folder.

Time for a CVS update, I guess...

> I still think you're making life too complicated.  Is list traffic
> spam?  If so, call it spam.  If not, call it ham.

Sounds sensible. I think that all the troubles I've had in the past
trying to manage spam have left me with an instinctive feeling that
the problem is complicated. This leads to looking for complicated
solutions.

But you're right. The spam/ham distinction itself is a simple yes/no,
so the setup should be, too.

But permit me to drag my feet a little, as I throw away all my
cherished preconceptions :-)

More seriously, I'm putting this point into my spambayes notes
folder. I suspect it's something a lot of new users will have to get
used to.

Thanks for the comments,
Paul.

-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Fri Nov  8 21:12:17 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:12:17 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
Message-ID: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>

I've noticed a couple of strange effects with the Outlook plugin used
against an Exchange server. The main one is that when I start up the
client in the morning, there are a lot of overnight messages in my
inbox. They don't seem to get filtered. I suspect this is to do with
Outlook not firing the "new mail" event on stuff that's in the
Exchange store when the client starts up. But I'll need to test this.

Unfortunately, the Exchange server is at work, and I can only do any
serious hacking on this at home, so I'm running a batch cycle (code at
home, take into work, try out, take bugs home, and repeat). So it'll
take me a while to make any progress.

I'll report back when I get more details.

Paul (Off to look at Outlook events in MSDN, and to write a simple
"log the events and see what is going on" plugin to test with)

-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Fri Nov  8 21:52:20 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 9 Nov 2002 08:52:20 +1100
Subject: [Spambayes] Outlook plugin plus Exchange
In-Reply-To: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>

> I've noticed a couple of strange effects with the Outlook plugin used
> against an Exchange server. The main one is that when I start up the
> client in the morning, there are a lot of overnight messages in my
> inbox. They don't seem to get filtered. I suspect this is to do with
> Outlook not firing the "new mail" event on stuff that's in the
> Exchange store when the client starts up. But I'll need to test this.

I am working on code that optionally processes "missed" messages at startup.
It looks like I can list all unread, unscored mail in my 1000+ item inbox
very quickly, so this should be feasible.

> Paul (Off to look at Outlook events in MSDN, and to write a simple
> "log the events and see what is going on" plugin to test with)

Check out the Outlook plugin in the win32com\demos directory - probably a
good place to start.

Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a
decent job at killing them.

Mark.


From francois.granger@free.fr  Fri Nov  8 23:25:51 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Sat, 9 Nov 2002 00:25:51 +0100
Subject: [Spambayes] pop3proxy
Message-ID: <a05100312b9f1f8357982@[192.168.1.11]>

Thanks to Richie Hindle, it now works on MacOS 9.

Excellent job !
-- 
Email is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies.
For clean email: http://minilien.com/?IXZneLoID0 -
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net  Fri Nov  8 23:33:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 18:33:50 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEBFCIAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>

[Tim]
> ...
> I'm going to try an experiment:  I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures.  This should be fun <wink>.
> ...

After enduring the first round of gross mistakes, when I got up today I did
this:

while some ham in my inbox scores above 0.20 (my ham_cutoff):
    pick the highest-scoring ham in the inbox
    add it to the ham training set
    train on it
    rescore the inbox

These are false positives and unsures the classifier would have had if these
msgs had come in after I started the experiment.  There were about 700 msgs
in the inbox.
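
The loop above could be written roughly as follows.  This is a sketch
under assumptions: the inbox holds only ham, and score and learn_ham are
hypothetical callables standing in for the real classifier.

```python
def train_worst_first(inbox, score, learn_ham, ham_cutoff=0.20):
    """Train on the highest-scoring ham, rescore, repeat until all pass."""
    trained = []
    remaining = list(inbox)
    while remaining:
        # score() is called afresh each pass, so this is the rescoring step.
        worst = max(remaining, key=score)
        if score(worst) <= ham_cutoff:
            break                    # nothing left above the cutoff
        learn_ham(worst)
        trained.append(worst)
        remaining.remove(worst)      # never train the same msg twice
    return trained
```

Training one message at a time matters because each training step can
pull several similar messages under the cutoff, so they never need to be
trained on at all.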

Other than that, I've left it mistake-driven and unsure-driven on live
incoming email.  Spam that's correctly classified simply gets deleted (no
training on it), ditto ham.

It's been a light spam day, but hundreds of msgs have come in since then and
I haven't seen a mistake or unsure in about 5 hours, although plenty of ham
gets near ham_cutoff and plenty of spam near spam_cutoff.  Total training
data now:  just 45 ham and 20 spam.

Scores remain grossly hapax-driven, but that's actually enough to classify
most of my email correctly:  a small number of subjects and senders and
mailing lists overwhelmingly dominate my ham mix, and one email account
accounts for the vast bulk of my spam.  Removing the hapaxes from the
database dropped the # of words from 5500 to about 1700.  Rescoring the
inbox with this reduced database then pushed about 5% of the msgs back into
Unsure.

So (no surprise here) hapaxes are vital with little training data.  That
also means that as soon as one of those words shows up in the other kind of
email, it changes from a strong clue to neutral, *provided that* I actually
train on the new email.  I'm not training now unless there's a
mistake/unsure, so the hapaxes remain strong clues (even when they point in
the wrong direction).  BTW, when there are mistakes/unsures, I'm not
training on all of them:  as I did when I got up, I train the worst example
then rescore, one at a time, until no mistakes/unsures remain.

I'm never going to get sub-0.1% error rates this way, but if this is the
best it ever got, I'd be quite happy with it for my personal email.
Something to ponder?  If so, you can get away with a very small database,
and while hapaxes must not be removed blindly in this extreme scheme, using
the atime field could (I suspect) be very effective in slashing the
already-small database size (lots of hapaxes will never be seen again even
if you train on everything; the WordInfo atime field tells you when a word
was last used at all).


From rob@hooft.net  Fri Nov  8 23:49:59 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 00:49:59 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>
Message-ID: <3DCC4DA7.80401@hooft.net>

Tim Peters wrote:
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?  If so, you can get away with a very small database,
> and while hapaxes must not be removed blindly in this extreme scheme, using
> the atime field could (I suspect) be very effective in slashing the
> already-small database size (lots of hapaxes will never be seen again even
> if you train on everything; the WordInfo atime field tells you when a word
> was last used at all).

Tim,

This seems to imply that you're still playing with the idea that hapaxes 
could be "slashed" from the database when using the "old" train-on-all
procedure. I don't see how that can ever work, as all words pass through 
the hapax stage at some point. Or do you mean to slash "old" hapaxes 
only? And what is "old"?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim@fourstonesExpressions.com  Sat Nov  9 00:55:07 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 08 Nov 2002 18:55:07 -0600
Subject: [Spambayes] Persisting a pickled bayes database
Message-ID: <MJDARN31HEKEJDK21NLB972UTMGVTFE.3dcc5ceb@riven>

I can see the nice createbayes function in hammie, but I don't see any 
persistence function anywhere.  I do see several places where code to write a 
pickled bayes database is hard coded, and I understand the PersistentBayes 
thing.  I might be missing something...

I've been using a simple class to handle creating and persisting my bayes 
databases.  I *think* this stuff should probably go somewhere, but beats me 
where... classifier?  doesn't make much sense there... hammie?  Any ideas 
anybody?

Here's the class...  (kinda a dumb name  ;))

import pickle

import hammie

class BayesHelper:
    '''helper class for bayes databases'''

    def __init__(self, db_name, useDB):
        '''constructor'''
        self.db_name = db_name
        self.useDB = useDB
        self.bayes = hammie.createbayes(db_name, useDB)

    # no __del__() method, because we don't *necessarily* want to persist

    def persist(self):
        '''store the bayes database'''
        # only the pickle flavor needs an explicit write
        if not self.useDB:
            fp = open(self.db_name, 'wb')
            try:
                pickle.dump(self.bayes, fp, 1)
            finally:
                fp.close()


- TimS


From tim.one@comcast.net  Sat Nov  9 18:35:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 13:35:43 -0500
Subject: [Spambayes] Persisting a pickled bayes database
In-Reply-To: <MJDARN31HEKEJDK21NLB972UTMGVTFE.3dcc5ceb@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCGENECIAB.tim.one@comcast.net>

[Tim Stone]
> I can see the nice createbayes function in hammie, but I don't see any
> persistence function anywhere.  I do see several places where code
> to write a pickled bayes database is hard coded, and I understand the
> PersistentBayes thing.  I might be missing something...

Just experience with idiomatic Python persistence.  The persistence was all
in DBDict.__init__'s:

        self.hash = anydbm.open(dbname, 'c')

The tradition in Python is that "a persistent database" supplies an
interface much like a Python dict, but persists almost purely by magic.

For example, here's a brief Python session:

C:\Code\python\PCbuild>python
Python 2.3a0 (#29, Nov  8 2002, 10:51:55) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import anydbm
>>> d = anydbm.open('example.dat', 'n')
>>> d['an'] = 'example'
>>> # and quit Python at this point

Then in another session:

>>> import anydbm
>>> d = anydbm.open('example.dat')
>>> print d
<bsddb.bsddb object at 0x0064E158>
>>> print d.keys()
['an']
>>> print d['an']
example
>>>

Note that anydbm used bsddb as the underlying database mechanism on my box.
It may use some other database mechanism on some other box (it depends on
what it finds available).  I could have used bsddb directly instead, of
course, but then my code would require that bsddb be available.  anydbm uses
whatever it can scrounge up.
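
For readers on a current interpreter: anydbm was renamed dbm in Python 3. A rough equivalent of the two sessions above, as a sketch only (the temporary path is made up for the example, and which backend gets picked still depends on the box):

```python
import dbm
import os
import tempfile

# Hypothetical location for the example database.
path = os.path.join(tempfile.mkdtemp(), "example")

# First "session": create a fresh database and store one item.
with dbm.open(path, "n") as db:
    db["an"] = "example"

# Second "session": reopen read-only.  Keys and values come back as bytes.
with dbm.open(path, "r") as db:
    keys = list(db.keys())
    stored = db["an"]

print(keys, stored)
```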

Subclassing the builtin dict type can give a similar "by magic" facility;
e.g., here's temp.py:

"""
import cPickle as pickle
import os

class PDict(dict):
    def __init__(self, fname):
        self.fname = fname
        if os.path.exists(fname):
            f = file(fname, 'rb')
            guts = pickle.load(f)
            f.close()
            self.update(guts)
        self.is_open = True

    def close(self):
        if self.is_open:
            f = file(self.fname, 'wb')
            pickle.dump(self, f, 1)
            f.close()
            self.is_open = False

    def __del__(self):
        self.close()
"""

That just adds a few methods to a regular dict, arranging to dump its value
to a pickle when .close() is called or when it becomes unreachable.  It's
intended that .close() be called explicitly, though (by-magic shutdown
semantics are never something to bet your life on).

Then in one Python session:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d['another'] = 'example'

and in another:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d
{'another': 'example'}
>>>

In your example helper class, you decided you don't necessarily want to
persist.  That may or may not be a useful ability, but "the usual" simple
Python database facilities don't give you a choice about that:  they commit
changes to disk *as* mutations occur.  In DB terms, they view each mutation
as a transaction.  The ZODB-based stuff Jeremy is doing is different that
way:  changes to a ZODB db have to be explicitly committed.  That's what the

    get_transaction().commit()

lines in the pspam directory are doing.  ZODB is much more of "a real
database" than these other gimmicks, by which I mean it has an explicit and
pretty rich transactional model and API.


From tim.one@comcast.net  Sat Nov  9 20:00:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 15:00:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCC4DA7.80401@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMENJCIAB.tim.one@comcast.net>

[Tim]
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?  If so, you can get away with a very small
> database, and while hapaxes must not be removed blindly in this extreme
> scheme, using the atime field could (I suspect) be very effective in
> slashing the already-small database size (lots of hapaxes will never be
> seen again even if you train on everything; the WordInfo atime field
> tells you when a word was last used at all).

BTW, I'm still doing this experiment, and my total training data is up to 45
ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
and FN are both rare now, and the Unsure rate is about 5% overall and
visibly falling.  The Unsure spam are more surprising than the Unsure ham,
but that may be more psychological than real.  For example, it took about 24
hours before I got my first Nigerian spam, and it was shocking to see it
score at the low end of the Unsure range.

Looking at the internals is scary.  I have entire folders that are called
ham seemingly because the mailing list they come from has a few lexical
conventions unique to it, and the hapaxes from the single training msg from
that list save almost all of that list's msgs from Unsure status.

In the msg of Rob's I'm replying to, these are all ham hapaxes:

'database'                     0.155172
'database,'                    0.155172
'ever'                         0.155172
'idea'                         0.155172
'quite'                        0.155172
'scheme,'                      0.155172
'seen'                         0.155172
'subject:Outlook'              0.155172
'subject:Spambayes'            0.155172
'subject:plugin'               0.155172
'subject:training'             0.155172
'tells'                        0.155172
'words'                        0.155172

and they slug it out with these spam hapaxes:

'away'                         0.844828
'effective'                    0.844828
'field'                        0.844828
'mean'                         0.844828
'word'                         0.844828

That 'word' is a strong spam clue but 'words' a strong ham clue should tell
us something about how robust this is <wink>.
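
The two values in those lists are what a Robinson-style adjustment, (s*x + n*p) / (s + n), yields for a word seen in exactly one trained message. The sketch below assumes s = 0.45 and x = 0.5; those values are not stated in this message and are used only because they reproduce the figures shown. `adjusted_prob` is a hypothetical name.

```python
def adjusted_prob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Spam probability for one word, pulled toward the prior x
    with strength s (assumed values, see the note above)."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)   # raw estimate: 0.0 or 1.0 for a hapax
    n = hamcount + spamcount                 # number of messages containing the word
    return (s * x + n * p) / (s + n)

# Training data at this point: 45 ham and 38 spam.
ham_hapax = adjusted_prob(0, 1, nspam=38, nham=45)
spam_hapax = adjusted_prob(1, 0, nspam=38, nham=45)
print("%.6f %.6f" % (ham_hapax, spam_hapax))
```

With n = 1 the prior still carries about a third of the weight, which is why a hapax lands at 0.155 or 0.845 rather than at 0 or 1.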

[Rob Hooft]
> This seems to imply that you're still playing with the idea that hapaxes
> could be "slashed" from the database when using the "old" train-on-all
> procedure. I don't see how that can ever work, as all words pass through
> the hapax stage at some point. Or do you mean to slash "old" hapaxes
> only?

Well, training has no effect on scoring until update_probabilities() is
called, and in a batch-training context I mean hapax from
update_probabilities's POV.  Of course hamcounts or spamcounts for new words
start life at 1, but when doing batch training I don't mean to look at the
counts until the probabilities are updated.  At that point, a hapax is a
word that was seen in only one msg from the entire batch of new msgs.

Here's a quick test, based on unpublished general python.org email (we can't
publish the ham because it includes some personal email; GregW was working
on making the spam collection available, but I haven't heard about that in a
week; ditto his very large python.org virus collection).

In each case, it trains on 2,741 ham and 948 spam, then predicts the same
numbers of each.  The "all" column includes hapaxes (wrt counts at the *end*
of training).  The gt1 column threw away words at the end of training where
spamcount+hamcount <= 1; i.e., it retained only words that appeared more
than once, the non-hapaxes.   The gt2 column retained only words that
appeared more than twice; and so on.  ham_cutoff was 0.20 here, and
spam_cutoff 0.90.

filename:      all     gt1     gt2     gt3     gt4     gt5     gt6
ham:spam:  2741:948        2741:948        2741:948        2741:948
                   2741:948        2741:948        2741:948
fp total:        1       0       1       0       0       0       0
fp %:         0.04    0.00    0.04    0.00    0.00    0.00    0.00
fn total:        2       2       2       1       2       3       4
fn %:         0.21    0.21    0.21    0.11    0.21    0.32    0.42
unsure t:       81      87      89      82      98      96     100
unsure %:     2.20    2.36    2.41    2.22    2.66    2.60    2.71
real cost:  $28.20  $19.40  $29.80  $17.40  $21.60  $22.20  $24.00
best cost:  $22.20  $17.60  $20.00  $15.40  $16.80  $17.40  $22.40
h mean:       0.81    0.86    0.87    0.72    0.67    0.64    0.65
h sdev:       6.05    6.18    6.17    5.42    5.13    4.94    5.11
s mean:      98.00   97.66   97.54   97.38   97.03   96.62   96.52
s sdev:       9.26   10.22   10.37   10.62   11.19   12.49   12.61
mean diff:   97.19   96.80   96.67   96.66   96.36   95.98   95.87
k:            6.35    5.90    5.84    6.03    5.90    5.51    5.41

# retained
     words:  74327   36437   23877   16143   12798   10719    9157

So while hapaxes are vital with very little training data, even with "just"
about 4K training msgs they didn't buy anything in this test, and neither
did words that appeared only two or three times, and it doesn't appear to be
touchy (all of these columns show excellent results!).
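
The gt1, gt2, ... pruning amounts to a one-line filter on the word counts at the end of training. A minimal sketch, assuming a plain dict mapping each word to a (spamcount, hamcount) pair; that layout and the name `prune_rare_words` are made up for the illustration and are not the classifier's actual storage:

```python
def prune_rare_words(wordinfo, min_total):
    """Keep only words whose spamcount + hamcount exceeds min_total."""
    return {word: (spamcount, hamcount)
            for word, (spamcount, hamcount) in wordinfo.items()
            if spamcount + hamcount > min_total}

# Invented counts, for illustration only.
wordinfo = {
    "python": (0, 12),
    "viagra": (7, 0),
    "hooft": (0, 1),    # a hapax
    "free": (2, 1),
}
gt1 = prune_rare_words(wordinfo, 1)   # drops hapaxes
gt3 = prune_rare_words(wordinfo, 3)   # keeps words seen more than 3 times
print(sorted(gt1), sorted(gt3))
```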

> And what is "old"?

That remains a good question, and a good answer may differ between personal
email and bulk email applications.  A problem I see coming up in my personal
email is that some correspondents only show up once a year, and the hapaxes
they generate remain valuable clues, but only once a year.  General
python.org email doesn't appear to suffer anything like that (so long as
personal email is kept out of the python.org mix).


From rob@hooft.net  Sat Nov  9 22:24:52 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 23:24:52 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCMENJCIAB.tim.one@comcast.net>
Message-ID: <3DCD8B34.6040903@hooft.net>

Tim Peters wrote:
> [Tim]

>>I'm never going to get sub-0.1% error rates this way, but if this is the
>>best it ever got, I'd be quite happy with it for my personal email. 

> BTW, I'm still doing this experiment, and my total training data is up to 45
> ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
> and FN are both rare now, and the Unsure rate is about 5% overall and
> visibly falling. 

I just added a testdriver to CVS that simulates your behaviour as I 
understand it: It will train on the first 30 messages, plus on all 
misclassified and all unsure messages. It is called "weaktest.py", and 
uses the good-old-Data/{Sp|H}am hierarchy.
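
For readers without CVS access, the regime described above can be sketched as follows. This is not weaktest.py's code: `weak_training_run`, the exact cutoff comparisons, and the toy word-vote classifier are all assumptions for illustration. Unlike the totals printed further down, startup messages are counted here only under "trained", not under "unsure".

```python
def weak_training_run(stream, score, train, startup=30,
                      ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on the first `startup` messages, then only on mistakes and unsures."""
    counts = {"fp": 0, "fn": 0, "unsure": 0, "trained": 0}
    for i, (msg, is_spam) in enumerate(stream):
        if i >= startup:
            prob = score(msg)
            if ham_cutoff < prob < spam_cutoff:
                counts["unsure"] += 1
            elif prob >= spam_cutoff and not is_spam:
                counts["fp"] += 1
            elif prob <= ham_cutoff and is_spam:
                counts["fn"] += 1
            else:
                continue   # confidently correct: no training
        train(msg, is_spam)
        counts["trained"] += 1
    return counts


# Toy classifier (hypothetical): each known word votes with its spam fraction.
spam_seen, ham_seen = {}, {}

def toy_train(msg, is_spam):
    table = spam_seen if is_spam else ham_seen
    for word in msg.split():
        table[word] = table.get(word, 0) + 1

def toy_score(msg):
    votes = []
    for word in msg.split():
        s, h = spam_seen.get(word, 0), ham_seen.get(word, 0)
        if s + h:
            votes.append(s / (s + h))
    return sum(votes) / len(votes) if votes else 0.5

stream = [
    ("meeting agenda", False), ("cheap pills", True),      # startup messages
    ("meeting notes", False), ("cheap watches", True),     # confidently correct
    ("lunch tomorrow", False), ("cheap lunch", False),     # unsure
    ("pills pills", True),                                 # confidently correct
    ("pills", False),                                      # false positive
]
result = weak_training_run(stream, toy_score, toy_train, startup=2)
print(result)
```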

I think we should test its performance at different Options settings.

It may not even be very realistic to train on fp's, as I think in my 
private E-mail I won't even check the spam folder very thoroughly at all.

Anyway, a default run for me now gives:

   100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
   200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
   300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
   400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
   500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
   600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
   700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
   800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
   900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
  1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
  1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
  1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
  1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
  1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
  1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
  1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
  1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
  1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
  1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
  2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
  2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
  2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
  2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
  2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
  2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
  2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
  2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
  2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
  2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
  3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
  3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
  3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
  3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
  3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
  3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
  3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
  3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
  3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
  3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
  4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
  4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
  4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
  4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
  4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
  4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
  4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
  4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
  4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
  4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
  5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
  5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
  5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
  5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
  5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
  5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
  5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
  5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
  5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
  5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
  6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
  6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
  6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
  6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
  6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
  6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20

(This is on 3 out of my 10 test directories).

Interesting to note so far:
  * The "Total cost" is much higher than for train-on-all schemes,
    but it is only due to Unsures; fp and fn are still small.
  * The database growth doesn't decay with time after a while;
    it can be described as:
       nwords = 9200 + 1.6 * nmessages
    or alternatively:
       nwords = 5700 + 40 * ntrained
    ..as can be seen in the attached png's
  * The training set is almost balanced, even though I scored
    many more ham than spam
  * The unsure rate drops over time:
         0- 1000: 11.2% (minus 3.0% to be fair)
      1000- 2000:  5.4%
      2000- 3000:  4.3%
      3000- 4000:  4.0%
      4000- 5000:  3.9%
      5000- 6000:  3.3%
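
The "Total cost" line is consistent with charging $10 per fp, $1 per fn and $0.20 per unsure. Those weights are an assumption here (they are not stated in this message), but they reproduce the $89.20 above. The sketch works in cents to avoid floating-point drift; `total_cost_dollars` is a hypothetical helper name.

```python
def total_cost_dollars(fp, fn, unsure,
                       fp_cents=1000, fn_cents=100, unsure_cents=20):
    """Weighted cost of a run; the weights are assumed, see the note above."""
    return (fp * fp_cents + fn * fn_cents + unsure * unsure_cents) / 100

default_run = total_cost_dollars(fp=2, fn=2, unsure=336)
unsure_share = 336 * 20 / 100
print("$%.2f total, $%.2f of it from unsures" % (default_run, unsure_share))
```

Under these weights the unsures account for $67.20 of the total, which is the first bullet's point.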

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words1.png
Type: image/png
Size: 12191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1.png

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words2.png
Type: image/png
Size: 12807 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2.png

---------------------- multipart/mixed attachment--


From rob@hooft.net  Sat Nov  9 23:46:02 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 00:46:02 +0100
Subject: [Spambayes] More experiments with weaktest.py
Message-ID: <3DCD9E3A.4040809@hooft.net>

These were results of weaktest with default parameters:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 336 (5.1%)
   Trained on 178 ham and 162 spam
   fp: 2 fn: 2
   Total cost: $89.20

If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical 
(spam_cutoff is 90 by default):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 442 (6.8%)
   Trained on 292 ham and 152 spam
   fp: 2 fn: 0
   Total cost: $108.40

So the database grows by 30% but it didn't help my cost. The training 
set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to 
the default 20:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 213 ham and 101 spam
   fp: 7 fn: 3
   Total cost: $133.80

This reduces the database by only 10%, but at very high fp cost. Same
2:1 unbalance in the training set.
Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 933 (14.3%)
   Trained on 497 ham and 437 spam
   fp: 0 fn: 1
   Total cost: $187.60

OK, so that didn't work either. How about setting it to 0.2?

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 134 ham and 177 spam
   fp: 2 fn: 5
   Total cost: $85.80

Hm. That is slightly better. Funny, we are suddenly training on more 
spam than ham.... Back to 0.1 anyway ---the differences are too small--- 
and set robinson_probability_x = 0.3 (default is 0.5):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 602 (9.2%)
   Trained on 54 ham and 616 spam
   fp: 1 fn: 67
   Total cost: $197.40

Very interesting: this changes the training ratio to 1:12, at huge cost!
(less than two in three spams was recognized solidly as such).
Wonder what this could do if changed together with the cutoff....
Let's move it back to 0.5, and try "robinson_probability_s = 0.3":

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 348 (5.3%)
   Trained on 237 ham and 120 spam
   fp: 7 fn: 2
   Total cost: $141.60

Ouf.

I am back with the defaults, but I'd still like to do an automated 
optimization of everything simultaneously. Might try that.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From trebor@animeigo.com  Sun Nov 10 00:32:36 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Sat, 9 Nov 2002 19:32:36 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <E18AYxq-0006sT-00@mail.python.org>
References: <E18AYxq-0006sT-00@mail.python.org>
Message-ID: <a05200f03b9f34909a00b@[192.168.1.103]>

Hello everyone,

Just a quick note to introduce myself; I ran the session at that 
Hacker's conference that Guido mentioned, and passed on the 
suggestion of checking out Bill Y's combinatorial approach.

I've been playing with rules-based techniques for almost a year (see 
http://www.madoverlord.com/projects/told.t for details) and toying 
with bayesian  systems for only the last couple of months, on and 
off.  So no expert in that regard; I have mostly replicated the early 
work you guys have done (skimmed the archive today).

I'm particularly impressed with the chi-square work, it looks very 
interesting (but more stats for my poor stats-challenged mind to work 
on; not to mention that now I'm going to have to get around to 
cramming python in there with all the other languages that have 
accumulated over the years...).  Also, it's nice the way you're 
testing a lot of variants, I've been crossing things off my "try 
this" list all afternoon.

Couple of comments (bear in mind, I haven't grabbed the source yet, 
and only skimmed the archive, so if this repeats things you've 
already tried, sorry).  This is just stuff that's been in my mind 
recently, plus stuff stimulated by my skim.

* The great headers debate; suggest you put both machine and human 
readable opinions in the header, eg:

	X-SpamBayes-Rating: 9 (Very Spammy)
	X-SpamBayes-Rating: 7 (Somewhat Spammy)
	X-SpamBayes-Rating: 5 (Unsure)
	X-SpamBayes-Rating: 3 (Probably Innocent)
	X-SpamBayes-Rating: 0 (The Finest Ham)

The reason being, many mailreaders can use a finer discriminant than 
(yes,no,beats me) in ranking spam.  A common strategy (which I like 
myself) is to start an email at neutral priority and bump it up and 
down based on various triggers, whitelists, whatever, then sort the 
inbox by the final priority.

A cute hack I used in TOLD was to output the result like this:

	X-SpamBayes-Rating: 0123456789 (Very Spammy)
	X-SpamBayes-Rating: 012345 (Unsure)

This permits a mailreader with limited filtering tools (like Eudora) 
to classify multiple results with a single rule (such as "if an 
X-SpamBayes-Rating header contains the string 12345678, set priority 
to double-low", which catches both 8 and 9 rated emails).
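
A small sketch of how such a header could be generated and matched. The function names are hypothetical; the header name and labels are the ones proposed above.

```python
def rating_header(rating, label):
    """Build the header with digits 0..rating run together."""
    digits = "".join(str(d) for d in range(rating + 1))
    return "X-SpamBayes-Rating: %s (%s)" % (digits, label)

def rated_at_least(header, threshold):
    """The single 'contains' rule a limited mail client would apply."""
    needle = "".join(str(d) for d in range(1, threshold + 1))
    return needle in header

unsure = rating_header(5, "Unsure")
very_spammy = rating_header(9, "Very Spammy")
print(unsure)
print(rated_at_least(very_spammy, 8), rated_at_least(unsure, 8))
```

One substring rule per threshold is enough, because every rating at or above the threshold contains the threshold's digit run.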

BTW, being pedantic, "rating" is a better word to use, it is more 
precisely what the discriminator is doing, is the same in all flavors 
of english, and is shorter.  "Score" might be even better.  ;^)

* Hashing to a 32-bit token is very fast, saves a ton of memory, and 
the number of collisions using the python hash (I appealed for hash 
functions on the hackers-l and Guido was kind enough to send me the 
source) is low.  About 1100 collisions out of 3.3 million unique 
tokens on a training set I was using.  CRC32, of all things, is 
actually slightly better, but only by a hair.  So this kind of 
hashing probably won't have much effect on the statistical results.
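
As a sanity check on that collision figure, here is a sketch using zlib.crc32 and the usual birthday approximation n*(n-1)/2 / 2**32 for an ideal 32-bit hash. The token list is synthetic and the helper names are hypothetical; this is not the TOLD code.

```python
import zlib

def crc32_token(token):
    """Hash a token string to an unsigned 32-bit integer."""
    return zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF

def count_collisions(tokens):
    """Count unique tokens whose hash was already taken by another token."""
    seen = set()
    collisions = 0
    for token in set(tokens):
        h = crc32_token(token)
        if h in seen:
            collisions += 1
        else:
            seen.add(h)
    return collisions

def expected_collisions(n, bits=32):
    """Birthday estimate of colliding pairs for an ideal hash."""
    return n * (n - 1) / 2 / 2 ** bits

tokens = ["word%d" % i for i in range(100000)]   # synthetic tokens
observed = count_collisions(tokens)
estimate = expected_collisions(3300000)
print(observed, round(estimate))
```

The estimate for 3.3 million tokens comes out a little under 1300, so roughly 1100 observed collisions is about what an ideal 32-bit hash would give.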

* Bill Y's byte bucket system has a lot of problems, but there are 
probably some data reduction techniques that would work well.  One 
that occurred to me on the way back from Hackers would be simply to 
keep a 1-byte count of ham/spam hits for each token, and when the ham 
or spam count is about to wrap, cut each count in half, rounding up 
the other value; ie:

	// increment ham count for bucket i
	// apologies, my pseudocode is a bizarre mixture of
	// now-unknown languages

	if (ham[i]==255)
	   {
	      ham[i]=128;
	      spam[i]=(spam[i]/2)+(spam[i]%2);
	   }
	else
	   ham[i]++;

The nice thing about this is that it would bias in favor of more 
recent email; things would "age".  But note this means when building 
the original database you have to feed it ham and spam in small 
chunks, or use a greater resolution before cramming it into 
individual bytes.
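
The same idea in runnable form, as a sketch: `bump` is a hypothetical helper, and bytearray is used only because it enforces the 0..255 range per bucket.

```python
def bump(counts, other, i, limit=255):
    """Increment counts[i]; if it is already at the limit, halve both
    counters for bucket i (rounding up) instead of wrapping."""
    if counts[i] == limit:
        counts[i] = (limit + 1) // 2          # 255 -> 128
        other[i] = (other[i] + 1) // 2        # halve, rounding up
    else:
        counts[i] += 1

ham = bytearray(4)
spam = bytearray(4)
ham[0], spam[0] = 255, 7
bump(ham, spam, 0)      # bucket 0 is saturated: both counts are halved
bump(ham, spam, 1)      # bucket 1 just increments
print(list(ham), list(spam))
```

Each halving keeps the ham/spam ratio of a bucket roughly intact while shrinking the weight of older evidence, which is where the aging effect comes from.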

* I was playing a week or two back with 1 and 2 token groups, and 
found that a useful technique was, for each new token, to only 
consider the most deviant result.  So if the individual word was .99 
spam, and the two word phrase was .95, it would only consider the .99 
result.  This would probably help with Bill Y's combinatorial scheme. 
Dunno if you've tried this; it prevents a particularly spammy or 
hammy sequence from dominating the results (I was only considering 
the 16 or so most deviant results in my bayesian calc, at least on my 
corpus, more than that didn't really help).
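
A sketch of that selection rule, under the assumption that each token position yields a small group of candidate probabilities (single word, two-word phrase). The function names and sample numbers are illustrative only.

```python
def most_deviant(probs):
    """Pick the probability farthest from the neutral 0.5."""
    return max(probs, key=lambda p: abs(p - 0.5))

def select_clues(per_token_probs, max_clues=16):
    """One clue per token position, then keep the most extreme ones."""
    picks = [most_deviant(group) for group in per_token_probs if group]
    picks.sort(key=lambda p: abs(p - 0.5), reverse=True)
    return picks[:max_clues]

# (single-word prob, two-word-phrase prob) for three token positions
groups = [(0.99, 0.95), (0.40, 0.10), (0.50, 0.55)]
clues = select_clues(groups)
print(clues)
```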

* My personal bias (as I think Guido mentioned) is for a multifaceted 
approach, using Bayesian, rules-based (attacking things that bayesian 
isn't good at, like looking for obfuscated url structures), DNSBL, 
and whitelisting heuristics to generate an overall ranking.  So a 
hammy mail from a guy in your address book would bubble up to highest 
priority, whereas something spammy from him would stay neutral. 
There's lots of room for cooperation between the various approaches 
and multiple agents mean it's less likely that a spam will get by. 
In particular, whitelisting heuristics can almost eliminate false 
positives.

* Finally, if anyone needs more spam, I get over 300 a day (I've been 
around a while!) and have a cleaned corpus of over 130MB of spam and 
foreign email.  Also, given all the legit web-marketing email I get 
because of the url registration work I've done, I've got tons of the 
spammiest ham you could imagine.

Best
R

-- 
-----------------------------------------------------------------------
http://madoverlord.com/    World Domination - a fun family activity!

From tim.one@comcast.net  Sun Nov 10 06:55:59 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 01:55:59 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f03b9f34909a00b@[192.168.1.103]>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>

[Robert Woodhead]
> Hello everyone,

Hi!

> Just a quick note to introduce myself; I ran the session at that
> Hacker's conference that Guido mentioned, and passed on the
> suggestion of checking out Bill Y's combinatorial approach.

You can find test results for that in the list archives.  Bottom line is
that it did worse than what we're doing now, to such an extent that I'm the
only one who appeared to try it (my reports weren't encouraging).  I may
have misimplemented the idea, but I don't think so.  The results were in
line with earlier experiments we tried on gimmicks that systematically
generate highly correlated "words".  Such things appear to learn a lot
faster than word unigrams, but we've always found (so far) that unigrams
soon enough overcome that, and then go on to win.  What we're missing is any
practical approach to a scheme that can suck out phrases without identifying
them by hand first, and without generating highly correlated phrases
(overlapping word n-grams are highly correlated, of course, and Bill carries
that to extremes).

Something I didn't report on is later experiments using chi-combining
instead of Bill's "add up the raw counts".  chi-combining worked better.  I
know Bill has gone on to do a more Bayesian-like combination method, but I
expect that to do worse than what we've got now for the same reasons we gave
up on Paul Graham's combining scheme, but more so:  the word independence
assumption is bogus, and feeding the Bayesian calcs highly correlated words
grossly over- or under- estimates the true probability as a result.  In the
end you get a scheme that claims certainty even when it's dead wrong, and
although it's not dead wrong often, it is dead wrong at a non-zero rate.

Revealing:  fiddle our chi2.py to use whatever combining scheme Bill is
using now, and feed it vectors of *random* probabilities.  Most of the code
needed for that, and to display a histogram of results, is already there.
Try it with Graham's combining scheme and you'll find that scores are almost
always very near 0 or very near 1 even when the inputs are random and
uniformly distributed.  I expect that can only get worse by doing "chain
rule" calcs on probs that are highly correlated to begin with.

The internal chi-combining S and H scores are uniformly distributed given
random inputs, so chi-combining doesn't infer certainty by chance any more
often than it "should" infer certainty by chance.  That appears to be what
makes it far more robust against embarrassing mistakes, and it reliably
pumps out a score near 0.5 given a highly ambiguous input msg (many strong
ham and many strong spam clues -- we call that "cancellation disease" here,
and chi-combining doesn't infer certainty when it happens; all other schemes
did, and didn't do better than chance when it happened).
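
The random-input experiment can be approximated with a short standalone script. This is a sketch, not the project's chi2.py: the function names, the 50-word vectors, the seed, and the 0.01/0.99 "extreme" thresholds are choices made for the illustration. Graham-style combining is taken as prod(p) / (prod(p) + prod(1-p)); chi-combining as (S - H + 1) / 2 with S and H derived from chi-squared tail probabilities.

```python
import math
import random

def chi2Q(x2, v):
    """Probability that a chi-squared variable with even v degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def graham_combine(probs):
    prod_p = math.prod(probs)
    prod_q = math.prod(1.0 - p for p in probs)
    return prod_p / (prod_p + prod_q)

def chi_combine(probs):
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0

def extreme_fraction(combine, trials=2000, nwords=50, seed=2002):
    """Fraction of random-input scores landing below 0.01 or above 0.99."""
    rng = random.Random(seed)
    extreme = 0
    for _ in range(trials):
        probs = [rng.uniform(1e-6, 1.0 - 1e-6) for _ in range(nwords)]
        score = combine(probs)
        if score < 0.01 or score > 0.99:
            extreme += 1
    return extreme / trials

graham_extreme = extreme_fraction(graham_combine)
chi_extreme = extreme_fraction(chi_combine)
print("graham: %.2f  chi: %.2f" % (graham_extreme, chi_extreme))
```

On pure noise the product rule should report near-certainty for most vectors, while chi-combining should rarely do so.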

> I've been playing with rules-based techniques for almost a year (see
> http://www.madoverlord.com/projects/told.t for details) and toying
> with bayesian  systems for only the last couple of months, on and
> off.  So no expert in that regard; I have mostly replicated the early
> work you guys have done (skimmed the archive today).
>
> I'm particularly impressed with the chi-square work, it looks very
> interesting (but more stats for my poor stats-challenged mind to work
> on;

So copy and paste <wink>.

> not to mention that now I'm going to have to get around to
> cramming python in there with all the other languages that have
> accumulated over the years...).

In return, you can throw twelve other languages out <0.7 wink>.

> Also, it's nice the way you're testing a lot of variants, I've been
> crossing things off my "try this" list all afternoon.

Testing has pretty much run out of steam here, though.  My error rates are
so low now I couldn't measure an improvement in a convincing way even if one
were to be made, and the same is true of a few others here too.  We appear
to be fresh out of big algorithmic wins, so are pushing on to wrestling with
deployment issues.

BTW, download the source code and read the comments in tokenizer.py:  the
results of many early experiments are given there in comment blocks.

> Couple of comments (bear in mind, I haven't grabbed the source yet,
> and only skimmed the archive, so if this repeats things you've
> already tried, sorry).  This is just stuff that's been in my mind
> recently, plus stuff stimulated by my skim.
>
> * The great headers debate; suggest you put both machine and human
> readable opinions in the header, eg:
>
> 	X-SpamBayes-Rating: 9 (Very Spammy)
> 	X-SpamBayes-Rating: 7 (Somewhat Spammy)
> 	X-SpamBayes-Rating: 5 (Unsure)
> 	X-SpamBayes-Rating: 3 (Probably Innocent)
> 	X-SpamBayes-Rating: 0 (The Finest Ham)
>
> The reason being, many mailreaders can use a finer discriminant than
> (yes,no,beats me) in ranking spam.  A common strategy (which I like
> myself) is to start an email at neutral priority and bump it up and
> down based on various triggers, whitelists, whatever, then sort the
> inbox by the final priority.

Spoken like someone who worked on a rule-based system <wink>.  We have three
categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
believe that a finer distinction than that can be quantitatively justified
(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's
what I mean by "can't measure an improvement anymore", and a finer-grained
scheme isn't going to touch those 2 mistakes; one of them is formally ham
because it was sent by a real person, but consists of a one-line comment
followed by a quote of an entire Nigerian scam spam -- nothing useful is
ever going to *call* that one ham, and it scores as spam *almost* as solidly
as an original Nigerian spam).

> A cute hack I used in TOLD was to output the result like this:
>
> 	X-SpamBayes-Rating: 0123456789 (Very Spammy)
> 	X-SpamBayes-Rating: 012345 (Unsure)
>
> This permits a mailreader with limited filtering tools (like Eudora)
> to classify multiple results with a single rule (such as "if an
> X-SpamBayes-Rating header contains the string 12345678, set priority
> to double-low", which catches both 8 and 9 rated emails).
>
> BTW, being pedantic, "rating" is a better word to use, it is more
> precisely what the discriminator is doing, is the same in all flavors
> of english, and is shorter.  "Score" might be even better.  ;^)

"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
"not spam" was my invention, and since that one caught on big, I'm not
fighting to the death for any others <wink>.

> * Hashing to a 32-bit token is very fast, saves a ton of memory,
> and the number of collisions using the python hash (I appealed for hash
> functions on the hackers-l and Guido was kind enough to send me the
> source) is low.  About 1100 collisions out of 3.3 million unique
> tokens on a training set I was using.

That's significantly better than you could expect from a truly random hash
function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
should leave 3298733 buckets occupied on average, with an sdev of 35.58
buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
mean.
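
The arithmetic behind those figures can be reproduced with a short
sketch.  It assumes "collisions" means tokens minus occupied buckets, and
it uses the standard large-N approximation for the variance of the
empty-bucket count (an exact evaluation cancels badly in floating point).
The function name is illustrative.

```python
import math

def occupancy_stats(balls, buckets):
    """Mean and approximate sdev of occupied buckets for random tossing."""
    lam = balls / buckets
    # E[occupied] = N * (1 - (1 - 1/N)**n), computed without cancellation.
    mean = -buckets * math.expm1(balls * math.log1p(-1.0 / buckets))
    # Asymptotic variance of the empty (hence also occupied) bucket count.
    var = buckets * math.exp(-lam) * (1.0 - (1.0 + lam) * math.exp(-lam))
    return mean, math.sqrt(var)

n_tokens, n_buckets = 3_300_000, 2 ** 32
occupied, sdev = occupancy_stats(n_tokens, n_buckets)
expected_collisions = n_tokens - occupied
# How many sdevs below the random expectation 1100 collisions would be.
z = (expected_collisions - 1100) / sdev
print(round(occupied), round(sdev, 2), round(z, 1))
```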

> CRC32, of all things, is actually slightly better,

With sparse occupancy (3.3e6 out of 4.3e9 buckets is sparse) they may be
comparable.  PythonLabs ran large-scale statistical tests a few years ago on
this.  The Python string hash produced 32-bit numbers indistinguishable from
random (on the #-of-collision basis) as far as we pushed it; crc32 broke
down *very* badly as occupancy increased, with collision rates hundreds of
sdevs worse than random.  So I can't recommend crc32 for general string
hashing (and the Python docs indeed warn about this now), but can recommend
Python's string hash.  By coincidence, it turns out that Python's string
hash is very similar to what later became "the standard" Fowler-Noll-Vo
string hash, which may be the most widely tested "seems as good as random"
fast string hash now:

    http://www.isthe.com/chongo/tech/comp/fnv/

> but only by a hair.  So this kind of hashing probably won't have much
> effect on the statistical results.

Since we're sticking to unigrams, we don't have an insane database burden.
We also (by default) limit ourselves to looking at no more than 150 words
per msg.  So I'm not sure saving some bytes of string storage is "worth it"
for us, and it's very nice that we can get back the exact list of words that
went into computing a score later.  A pile of hash codes wouldn't give the
same loving effect <wink>.

> * Bill Y's byte bucket system has a lot of problems, but there are
> probably some data reduction techniques that would work well.  One
> that occurred to me on the way back from Hackers would be simply to
> keep a 1-byte count of ham/spam hits for each token, and when the ham
> or spam count is about to wrap, cut each count in half, rounding up
> the other value; ie:
>
> 	// increment ham count for bucket i
> 	// apologies, my pseudocode is a bizarre mixture of
> 	// now-unknown languages
>
> 	if (ham[i] == 255)
> 	   {
> 	      ham[i] = 128;
> 	      spam[i] = (spam[i] / 2) + (spam[i] % 2);
> 	   }
> 	else
> 	   ham[i]++;
>
> The nice thing about this is that it would bias in favor of more
> recent email; things would "age".  But note this means when building
> the original database you have to feed it ham and spam in small
> chunks, or use a greater resolution before cramming it into
> individual bytes.

Except I didn't get good enough results from his approach to justify
pursuing it here, even leaving the hash codes at the full 32 bits.  When I
went on to squash them to fit in a million buckets, a few false positives
popped up that were just too bad to bear (two can be found in the list
archives):  ham that was so obviously ham that no system that called them
spam would be acceptable to most people.
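
For reference, the one-byte aging counter from the quoted pseudocode can
be sketched in Python as follows.  The function name bump and the use of
bytearray are illustrative choices, not anything from the project.

```python
def bump(own, other, i):
    """Increment own[i]; when it would wrap, halve both counts (rounding up)."""
    if own[i] == 255:
        own[i] = 128                      # 255 halved, rounded up
        other[i] = (other[i] + 1) // 2    # same as x/2 + x%2
    else:
        own[i] += 1

# One-byte ham and spam counts per hash bucket.
ham = bytearray([255, 3])
spam = bytearray([7, 9])
bump(ham, spam, 0)   # bucket 0 was about to wrap: both counts are halved
bump(ham, spam, 1)   # bucket 1 is a plain increment
```

Because halving preserves the ham/spam ratio only roughly, small counts
lose precision each time a bucket ages.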

> * I was playing a week or two back with 1 and 2 token groups, and
> found that a useful technique was, for each new token, to only
> consider the most deviant result.  So if the individual word was .99
> spam, and the two word phrase was .95, it would only consider the .99
> result.  This would probably help with Bill Y's combinatorial scheme.

It could be a viable approach to the problem mentioned above:  a scheme to
suck out more than one word that doesn't systematically generate mounds of
nearly redundant (highly correlated) clues.  We're clearly missing info by
never looking at bigrams (or beyond) now, and that continues to bother me
(even if it doesn't seem to be bothering the error rates <wink>).
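
A rough sketch of the "most deviant result" idea, assuming a plain dict
of spamprobs keyed by unigram and by space-joined bigram.  The function
name and data layout are hypothetical, chosen only to illustrate the
selection rule.

```python
def most_deviant_clues(tokens, spamprob):
    """For each token, keep the unigram or the bigram ending at it,
    whichever has the spamprob farther from 0.5."""
    clues = []
    prev = None
    for tok in tokens:
        candidates = [tok]
        if prev is not None:
            candidates.append(prev + " " + tok)
        known = [(abs(spamprob[c] - 0.5), c)
                 for c in candidates if c in spamprob]
        if known:
            _, best = max(known)
            clues.append((best, spamprob[best]))
        prev = tok
    return clues

probs = {"buy": 0.70, "viagra": 0.99, "buy viagra": 0.95}
clues = most_deviant_clues(["buy", "viagra"], probs)
```

Here the .99 unigram wins over the .95 bigram, so only one of the two
highly correlated clues reaches the combining step.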

> Dunno if you've tried this; it prevents a particularly spammy or
> hammy sequence from dominating the results (I was only considering
> the 16 or so most deviant results in my bayesian calc, at least on my
> corpus, more than that didn't really help).

There's too much I don't know about everything you're doing to say much
about that.  *All* the biases in Graham's original scheme eventually went
away in this project, and things like clamping the spamprobs into [.01,
0.99] turned out to make it systematically useless to try to use more than
16 words under Graham-combining (it just caused more "cancellation disease",
and so caused more wildly wrong mistakes).  We use 150 now, but IIRC we
generally stopped seeing strong benefits after hitting about 40.  That 40
was better than 16 very much relied on removing all the biases, though (no
"ham boosts", no prob clamping, no minimum word count, no giving unknown
words spamprobs above 0.5 to favor ham, no doubling the ham count when
computing a spam prob, etc).

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL,
> and whitelisting heuristics to generate an overall ranking.  So a
> hammy mail from a guy in your address book would bubble up to highest
> priority, whereas something spammy from him would stay neutral.

I'm not sure we really need it.  For example, *lots* of spam has been
discussed on this mailing list, so much so that the python.org email admin
had to castrate SpamAssassin for msgs to this list address else it kept
blocking ordinary list traffic.  My personal email classifier never calls
anything here spam, though, nor does it call the originals of the spams
posted here ham.

I do worry a little about obfuscated HTML.  We strip almost all HTML tags
by default for a reason I've harped on enough <wink>:  all HTML decorations
have very high spamprobs, and counting more than one of them as "a clue"
fools almost every combining scheme into believing the msg containing them
is spam (if you know a msg contains both <br> and <p>, it's not really more
likely to be spam than if you just know it contains <br>!).  So we blind the
classifier to HTML decorations now.

But a spam I forwarded here a week or so ago exploited that:  the spam was
interleaved with size=1 white-on-white news stories and tech mailing list
postings.  The classifier *did* see those, but didn't see the HTML
decorations hiding them.  This was a cancellation-disease-by-construction
kind of msg, and chi-combining scored it near 0.5 as a result (solidly
Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
popular technique, we'll have to take more HTML blinders off the classifier.

> There's lots of room for cooperation between the various approaches
> and multiple agents means it's less likely that a spam will get by.
> In particular, whitelisting heuristics can almost eliminate false
> positives.

I'll let you know if I ever see one <wink>.  Seriously, one of the apps I've
especially got in mind is filtering the high-volume mailing lists on
python.org.  The only kind of FP I see there now in tests is administrative
requests to *-request addresses, which typically consist of a one word
"subscribe" or "unsubscribe" (themselves words with high spamprobs!),
followed by 6KB of employer-generated HTML disclaimers, and/or a forwarded
spam or conference announcement the sender didn't like.  There's still a
very low FP rate even on those, but text analysis simply can't be expected
to nail them every time.  Under SpamAssassin, those recipient addresses are
given strong ham boosts by the python.org email admin.

> * Finally, if anyone needs more spam, I get over 300 a day (I've been
> around a while!) and have a cleaned corpus of over 130MB of spam and
> foreign email.  Also, given all the legit web-marketing email I get
> because of the url registration work I've done, I've got tons of the
> spammiest ham you could imagine.

Wasn't Paul Graham collecting corpora?  Yup, still is:

    http://www.paulgraham.com/spamarchives.html

Getting vast quantities of spam isn't a problem anymore, but getting vast
quantities of ham is.  Since your spammy ham is presumably business-related,
I assume you can't share it.  Or can you?  Mixing spam and ham from
different sources also causes worlds of problems (indeed, we still (by
default) ignore most of the header lines partly for that reason, else the
system gets great results for bogus reasons).


From tim.one@comcast.net  Sun Nov 10 07:27:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 02:27:38 -0500
Subject: [Spambayes] More experiments with weaktest.py
In-Reply-To: <3DCD9E3A.4040809@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEPOCIAB.tim.one@comcast.net>

[Rob Hooft]
> These were results of weaktest with default parameters:

Very interesting!  I'll have to try that too.  Note that in my live email
experiment here, I'm (except for the very start) also scoring/training msgs
in (with small lapses) the order they arrive.  It's been reported before
that this helps; although I still haven't run a controlled experiment on
that, my *impression* is that it does help.

>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 336 (5.1%)
>    Trained on 178 ham and 162 spam
>    fp: 2 fn: 2
>    Total cost: $89.20
>
> If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical
> (spam_cutoff is 90 by default):

The asymmetry is intentional:  most people hate FP more than FN, so by
default I made it harder for a thing to get called spam.  In test after test
we've also seen that spam has a tighter score distribution than ham, which
is a more formal justification for setting the spam cutoff closer to its
endpoint than the ham cutoff.  Setting ham_cutoff as low as 10 is for the
truly paranoid <0.9 wink>.
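
To make the cutoffs and the "Total cost" lines concrete, here is a small
sketch.  Scores are on a 0..1 scale (so the 20 and 90 above become 0.20
and 0.90), the handling of scores exactly at a cutoff is an assumption,
and the weights of $10 per fp, $1 per fn and $0.20 per unsure are
inferred from the quoted totals rather than stated in this thread.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way decision; the asymmetric defaults make 'spam' harder to reach."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"

def total_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Weighted error cost; the default weights are inferred, see above."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

# The default run quoted above: 2 fp, 2 fn, 336 unsure.
default_run = total_cost(fp=2, fn=2, unsure=336)
```

Under these weights the default run comes out at the quoted $89.20, and
the other quoted runs match their totals as well.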

>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 442 (6.8%)
>    Trained on 292 ham and 152 spam
>    fp: 2 fn: 0
>    Total cost: $108.40
>
> So the database grows by 30% but it didn't help my cost. The training
> set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to
> the default 20:
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 304 (4.6%)
>    Trained on 213 ham and 101 spam
>    fp: 7 fn: 3
>    Total cost: $133.80
>
> This reduces the database by only 10%, but at very high fp cost. Same
> 2:1 unbalance in the training set.
> Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 933 (14.3%)
>    Trained on 497 ham and 437 spam
>    fp: 0 fn: 1
>    Total cost: $187.60
>
> OK, so that didn't work either. How about setting it to 0.2?
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 304 (4.6%)
>    Trained on 134 ham and 177 spam
>    fp: 2 fn: 5
>    Total cost: $85.80
>
> Hm. That is slightly better. Funny, we are suddenly training on more
> spam than ham.... Back to 0.1 anyway ---the differences are too small---
> and set robinson_probability_x = 0.3 (default is 0.5):
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 602 (9.2%)
>    Trained on 54 ham and 616 spam
>    fp: 1 fn: 67
>    Total cost: $197.40
>
> Very interesting: this changes the training ratio to 1:12, at huge cost!
> (less than one in three spams was recognized solidly as such).

Note that in calculations I reported a day or two ago, the measured mean of
spamprobs across 3 different corpora was > 0.5, but not by a lot.  .3 moves
it outside the range minimum_prob_strength ignores, so now every "new word"
is instantly taken as a ham clue, where before all new words were ignored by
default.  So that it grossly inflated the FN rate isn't surprising;
everything that will *eventually* become a hapax is initially taken to be a
ham clue, even when it's never been seen before.
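
A sketch of why x = 0.3 has that effect, using Robinson's adjustment
f(w) = (s*x + n*p(w)) / (s + n).  The default s = 0.45 below is an
assumption (this message only gives the defaults for x and for
minimum_prob_strength), as is the >= comparison in the strength test.

```python
def robinson_prob(p, n, s=0.45, x=0.5):
    """Shrink the raw spamprob p toward the prior x; n = msgs containing the word."""
    return (s * x + n * p) / (s + n)

def is_used(prob, minimum_prob_strength=0.1):
    """A word counts as a clue only if its prob is far enough from 0.5."""
    return abs(prob - 0.5) >= minimum_prob_strength

# A never-seen word has n == 0, so its adjusted prob is simply x.
new_word_default = robinson_prob(p=0.0, n=0, x=0.5)   # ignored by default
new_word_shifted = robinson_prob(p=0.0, n=0, x=0.3)   # now a ham clue
```

With x = 0.5 a brand-new word sits exactly at 0.5 and is skipped; with
x = 0.3 it is 0.2 away from 0.5, clears the 0.1 strength threshold, and
pulls every message containing new words toward ham.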

> Wonder what this could do if changed together with the cutoff....
> Lets move it back to 0.5, and try "robinson_probability_s = 0.3":
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 348 (5.3%)
>    Trained on 237 ham and 120 spam
>    fp: 7 fn: 2
>    Total cost: $141.60
>
> Ouf.

I hope you're at least gaining some respect for how much work went into
picking the defaults <wink>.

> I am back with the defaults, but I'd still like to do an automated
> optimization of everything simultaneously. Might try that.

Now *that* could be a useful system regardless of scheme.  I've tended to do
hill-climbing across one dimension at a time, occasionally moving batches of
params random amounts at once (to see whether that kicks it out of a
stubborn local minimum).


From tim.one@comcast.net  Sun Nov 10 07:52:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 02:52:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCD8B34.6040903@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAACJAB.tim.one@comcast.net>

[Rob Hooft]
> I just added a testdriver to CVS that simulates your behaviour as I
> understand it: It will train on the first 30 messages,

I trained on 1 of each at the start.  If I were to do it over, I'd start
with an empty database <wink>.

> plus on all misclassified and all unsure messages.

Since I'm doing this real-time on my live email, I've been training "on the
worst" (farthest away from correct) msg that arrives in a batch, then
rescoring all the ones that arrived in the batch, then training the worst
remaining, ... until all new ham is below ham_cutoff and all new spam above
spam_cutoff.  I don't know that it matters, just being clear(er).  As things
turned out, this worst-at-a-time training never managed to push one of the
remaining mistakes/unsures into the correct category, *except* for cases
where I got more than one copy of a spam from different accounts at the same
time.  Then it always pushed the copies into scoring near 1.0, since the
hapaxes in the training copy are abundant.
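
The worst-at-a-time loop can be sketched as below.  The score and train
callables stand in for whatever classifier is in use (they are
hypothetical, not the project's API), and training each message at most
once is an added guard so that the loop always terminates.

```python
def needs_training(s, is_spam, ham_cutoff, spam_cutoff):
    """True if the message is not yet solidly in its correct category."""
    return s <= spam_cutoff if is_spam else s >= ham_cutoff

def train_worst_first(batch, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """batch: list of (msg, is_spam).  Returns the msgs trained on, in order."""
    remaining = list(batch)
    trained = []
    while True:
        scored = [(score(msg), msg, is_spam) for msg, is_spam in remaining]
        bad = [t for t in scored
               if needs_training(t[0], t[2], ham_cutoff, spam_cutoff)]
        if not bad:
            return trained
        # "Worst" = farthest from the correct end of the scale.
        s, msg, is_spam = max(bad, key=lambda t: (1.0 - t[0]) if t[2] else t[0])
        train(msg, is_spam)
        trained.append(msg)
        remaining.remove((msg, is_spam))   # never train the same copy twice
```

Rescoring after every single training step is what lets a second copy of
the same spam drop out of the "needs training" set without being trained.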

> It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am
> hierarchy.
>
> I think we should test its performance at different Options settings.
>
> It may not even be very realistic to train on fp's, as I think in my
> private E-mail I won't even check the spam folder very thoroughly at all.

But I will (and do), and my primary interest here is to see how bad things
can get if a user takes mistake-based training to an extreme.  Despite that
it's heavily hapax-driven, it appears to do very well when judged by error
rate.

I've been doing it long enough now, though, that it doesn't do so well
subjectively:  the Unsures are too often bizarre.  For example, I sent a
long reply here to Robert Woodhead, and the copy I got back showed up as
Unsure, with H=1 and S=0.66.  There were a lot of accidental spam hapaxes in
that msg!  Training on it as ham then eliminated about 30 spam hapaxes
(they're now neutral, having been seen in one ham and one spam each).

So it's no different from my POV than the cases where people have sent me
"surprising msgs" in the past, and my carefully trained slice-of-life
classifier (regularly trained on a sampling of correctly classified msgs
too) at the time had no trouble nailing them as ham or spam, with lots of
non-hapax evidence to back it up.

IOW, I'm still sticking to what I guessed before I started this:
mistake-driven training will appear to work well over the short term, but
it's brittle, and is brittle because of its reliance on hapaxes.

> Anyway, a default run for me now gives:
>
>    100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
>    200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
>    300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
>    400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
>    500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
>    600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
>    700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
>    800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
>    900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
>   1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
>   1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
>   1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
>   1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
>   1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
>   1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
>   1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
>   1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
>   1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
>   1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
>   2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
>   2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
>   2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
>   2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
>   2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
>   2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
>   2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
>   2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
>   2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
>   2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
>   3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
> fp: Data/Ham/Set2/n05250.txt score:0.9312
>   3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
>   3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
>   3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
>   3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
>   3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
>   3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
>   3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
>   3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
>   3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
>   4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
>   4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
>   4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
>   4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
>   4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
>   4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
>   4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
>   4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
>   4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
>   4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
> fp: Data/Ham/Set1/n01590.txt score:0.9092
>   5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
>   5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
>   5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
>   5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
>   5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
>   5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
>   5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
>   5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
>   5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
>   5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
>   6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
>   6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
>   6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
>   6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
>   6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
>   6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> (This is on 3 out of my 10 test directories).
>
> Interesting to note so far:
>   * The "Total cost" is much higher than for train-on-all schemes,
>     but it is only due to Unsures; fp and fn are still small.

That matches my experience too, although I started with 1 ham and 1 spam and
had high FP and FN rates over the first few hours.

>   * The database growth doesn't decay with time after a while;
>     it can be described as:
>        nwords = 9200 + 1.6 * nmessages
>     or alternatively:
>        nwords = 5700 + 40 * ntrained
>     ..as can be seen in the attached png's

I expect that's mostly because there are still (relatively) few total msgs
trained on.

>   * The training set is almost balanced, even though I scored
>     many more ham than spam

Curiously, same here!  I get about 500 ham and 100 spam per day, but my
training database now has 47 ham and 41 spam.  It does well, except when it
sucks <wink>.

>   * The unsure rate drops over time:

I haven't measured that, but it's clearly been so here too (as I said
before).

>          0- 1000: 11.2% (minus 3.0% to be fair)
>       1000- 2000:  5.4%
>       2000- 3000:  4.3%
>       3000- 4000:  4.0%
>       4000- 5000:  3.9%
>       5000- 6000:  3.3%

Proving what I've always suspected:  over time, all msgs are repetitions of
ones you've seen before <0.9 wink>.


From tim.one@comcast.net  Sun Nov 10 08:36:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 03:36:10 -0500
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To: <B9EFE93D.5C286%francois.granger@free.fr>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEAFCJAB.tim.one@comcast.net>

[Tim, asks for help on a Spanish Unsure]

[François Granger]
> Here are the most probable English equivalents of the Spanish words.
> 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with',
> 'good', 'this', 'Hi', 'that', 'here', 'the', 'for'
>
> This illustrates the need for properly balanced training sets and
> re-raises the question of language discrimination.

It really doesn't raise it for me:  this was in my personal email, and since
I couldn't read the msg anyway, it may as well have been spam.  I get way
too much email to bother more than 2 seconds with something I can't read.  I
only looked at this one because I'm paying heavy attention to everything the
automatic classifier calls spam.  If I weren't using this system, I would
have thrown out that msg at once.

If I were someone who got any quantity of Spanish ham, the system would have
scored it as ham.  As is, the only Spanish I get is in Spanish spam, so the
system correctly judged it for my personal email mix.

> At least prior language discrimination would allow for a different
> database for each language

Whether that would improve results is a testable hypothesis; I've already
said I doubt it would be helpful, and have no motivation to try such an
experiment myself.

> or for a systematic "unsure" flag for not trained languages.

But I *do* train on Spanish -- and Russian, and Turkish, and Chinese, and
Japanese, and German, and French, and Polish (at least):  in my email mix,
they're all used in spam, aren't used in my ham, and are spam to me because
they're unreadable by me.

> If you put my messages in a Ham training set, you will flag French spams
> as ham because of my French sig ;-)

Nope, the system isn't that stupid (or, rather, it is <wink>).  What it will
do is knock down the spamprobs of those words.  Despite that I've got French
spam in my training data, your msg here -- including the French sig -- got a
solid ham score, with H=1 (to six significant digits) and S=1.1e-11.  The
strongest spam word in fact came from your sig, spamprob('est')=0.84.  It
didn't matter, because I could actually read most of what you wrote, and it
wasn't trying to sell me Viagra <wink>.

> All these words should rate around 0.5 since they are among the
> most common ones in this language.

If I got any French ham, they would rate around 0.5, but for my personal
email it's Just Fine that they're considered spam words.  It wouldn't be OK
for python.org use, but python.org gets a non-trivial amount of non-English
ham, so it trains there accordingly.

> Le courrier est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies. Pour des courriers propres :
> <http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>

Indeed <wink>.


From rob@hooft.net  Sun Nov 10 11:09:28 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 12:09:28 +0100
Subject: [Spambayes] Introducing myself
References: <E18AYxq-0006sT-00@mail.python.org>
	<a05200f03b9f34909a00b@[192.168.1.103]>
Message-ID: <3DCE3E68.2060101@hooft.net>

Robert Woodhead wrote:

> * My personal bias (as I think Guido mentioned) is for a multifaceted 
> approach, using Bayesian, rules-based (attacking things that bayesian 
> isn't good at, like looking for obfuscated url structures), DNSBL, and 
> whitelisting heuristics to generate an overall ranking.  So a hammy mail 
> from a guy in your address book would bubble up to highest priority, 
> whereas something spammy from him would stay neutral. There's lots of 
> room for cooperation between the various approaches and multiple agents 
> means it's less likely that a spam will get by. In particular,
> whitelisting heuristics can almost eliminate false positives.

I think our very good experience with the bayesian classifier would
"forbid" the use of whitelisting. Once a whitelisted feature "leaks" into
the spam community, it will be useless.

But there is a bayesian solution to it: Make the tokenizer recognize the 
feature that you want to whitelist or blacklist, and emit a new token to 
that effect.

    From:<in-address-book>  --> Will have a low spamprob
    url:numeric-host        --> Will have a high spamprob

We're already doing something like that for a number of the SpamAssassin
tests (e.g. mime-type tokens). This approach still uses a purely 
bayesian classifier, and it will follow reality automatically.
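
As a sketch of that idea, a tokenizer hook could emit the two synthetic
tokens shown above alongside the ordinary word tokens.  The function
name, its arguments, and the URL regular expression are all hypothetical;
this is not the project's tokenizer.

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\"']+")
NUMERIC_HOST_RE = re.compile(r"\d{1,3}(\.\d{1,3}){3}")

def extra_tokens(from_addr, body, address_book):
    """Yield synthetic tokens; the classifier learns their spamprobs itself."""
    if from_addr.lower() in address_book:
        yield "From:<in-address-book>"
    for url in URL_RE.findall(body):
        host = urlparse(url).hostname or ""
        if NUMERIC_HOST_RE.fullmatch(host):
            yield "url:numeric-host"

tokens = list(extra_tokens("friend@example.com",
                           "cheap pills at http://10.0.0.1/buy today",
                           {"friend@example.com"}))
```

Nothing here assigns a weight to either token; if address-book senders
start appearing in spam, training lowers the token's value on its own.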

I'd like to note that a lot of what you were saying and what was in
Tim's response (and mine here) is only valid in a train-on-all scheme. 
i.e. like we've been using until a week ago....

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sun Nov 10 12:11:46 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:11:46 +0100
Subject: [Spambayes] More experiments with weaktest.py
References: <LNBBLJKPBEHFEDALKOLCIEPOCIAB.tim.one@comcast.net>
Message-ID: <3DCE4D02.6060907@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
> 
>>These were results of weaktest with default parameters:
> 
> 
> Very interesting!  I'll have to try that too.  Note that in my live email
> experiment here, I'm (except for the very start) also scoring/training msgs
> in (with small lapses) the order they arrive.  It's been reported before
> that this helps; although I still haven't run a controlled experiment on
> that, my *impression* is that it does help.

I toyed with the idea, but that would involve parsing all messages once 
before starting, and sorting them on date. Putting them in a set to 
"randomize" the order is much easier, so I was lazy.

> Setting ham_cutoff as low as 10 is for the
> truly paranoid <0.9 wink>.

Very much so. For my "production" systems, I have ham_cutoff at 40...

> I hope you're at least gaining some respect for how much work went into
> picking the defaults <wink>.

I was just arriving when it happened. But that was on a completely 
different classifier, so I'm still convinced these need to be thoroughly 
tested.

>>I am back with the defaults, but I'd still like to do an automated
>>optimization of everything simultaneously. Might try that.

> Now *that* could be a useful system regardless of scheme.  I've tended to do
> hill-climbing across one dimension at a time, occasionally moving batches of
> params random amounts at once (to see whether that kicks it out of a
> stubborn local minimum).

Hm. That sounds so enthusiastic that I just might commit what I have 
gone through this night. Some more info:

  * No, I have not used a "Simulated Annealing" or "Threshold Accepting"
    yet. Please keep in mind that each step in the optimization takes
    between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
    work PC). This would be way too costly. Just minimization it will be.
  * I tried to use "Simplex optimization" (let a multidimensional
    triangle walk through phase space) on the "Total cost" parameter.
    This was simply disastrous. Phase space consists of plateau regions
    that are exactly flat, joined by huge ridges. Think about that one
    spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
    bang to the cost. This field is impossible to optimize.
  * I designed a new "Flex cost" field. That one does away with the
    "unsure cost". The cost of a message is 0.0 at its own cutoff, and
    increases linearly towards its "false" cost at the other cutoff,
    and increases further to the other end. Hm. Unreadable. A table:

           Score    Spam with this   Ham with this
                      score costs     score costs
            0.00         $ 1.29          $ 0.00
            0.20         $ 1.00          $ 0.00
            0.55         $ 0.50          $ 5.00
            0.90         $ 0.00          $10.00
            1.00         $ 0.00          $11.43

     This field is much more smooth than the total cost field, so I was
     hoping that pure minimization will do. Obviously, the flex cost is
     much, much higher than the total cost because unsures are so much
     more expensive. The flex cost field will also be less sensitive to
     the {sp|h}am_cutoff parameters than the total cost field, because
     there are no sudden cost jumps.
   * Results are not great; I need to experiment more before reporting
     on them.
   * I just committed:
      weaktest.py: introduction of the flexcost measure
      optimize.py: simplex optimization (needs Numeric python; sorry)
      weakloop.py: run weaktest.py repeatedly under simplex optimization
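
A minimal sketch of the flex cost that reproduces the table above, assuming ham_cutoff=0.20, spam_cutoff=0.90, a $1.00 cost for a missed spam and a $10.00 cost for a misfiled ham. These values are inferred from the table rows, and `flex_cost` is an illustrative name; weaktest.py remains the authoritative version.

```python
# Illustrative only: cutoffs and costs are inferred from the table above.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90
FN_COST = 1.00    # assumed cost of a spam scored as ham
FP_COST = 10.00   # assumed cost of a ham scored as spam

def flex_cost(score, is_spam, hc=HAM_CUTOFF, spc=SPAM_CUTOFF):
    """Cost is 0 at the message's own cutoff and grows linearly from there."""
    if is_spam:
        distance = spc - score   # how far below spam_cutoff a spam landed
        weight = FN_COST
    else:
        distance = score - hc    # how far above ham_cutoff a ham landed
        weight = FP_COST
    # Scaling by the cutoff gap makes the cost equal the "false" cost
    # exactly at the other cutoff.
    return weight * max(0.0, distance) / (spc - hc)

for s in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f   spam $%5.2f   ham $%5.2f"
          % (s, flex_cost(s, True), flex_cost(s, False)))
```

The loop prints one line per table row, so the two can be compared directly.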

Regards,

Rob Hooft
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sun Nov 10 12:28:44 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:28:44 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCCEAACJAB.tim.one@comcast.net>
Message-ID: <3DCE50FC.3050005@hooft.net>

This is a multi-part message in MIME format.
---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Rob Hooft]
> 
>>I just added a testdriver to CVS that simulates your behaviour as I
>>understand it: It will train on the first 30 messages,
> 
> 
> I trained on 1 of each at the start.  If I were to do it over, I'd start
> with an empty database <wink>.

This is easy enough to change, but I left it at 30 for now.

> Since I'm doing this real-time on my live email, I've been training "on the
> worst" (farthest away from correct) msg that arrives in a batch, then
> rescoring all the ones that arrived in the batch, then training the worst
> remaining, ... until all new ham is below ham_cutoff and all new spam above
> spam_cutoff.  I don't know that it matters, just being clear(er).  As things
> turned out, this worst-at-a-time training never managed to push one of the
> remaining mistakes/unsures into the correct category, *except* for cases
> where I got more than one copy of a spam from different accounts at the same
> time.  Then it always pushed the copies into scoring near 1.0, since the
> hapaxes in the training copy are abundant.
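
The worst-at-a-time procedure quoted above can be sketched as follows. This is a hedged illustration: `train_worst_first`, the `score` and `train` callables, and the default cutoffs are assumptions, and the real classifier would be plugged in by the caller.

```python
def train_worst_first(batch, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """batch: list of (msg, is_spam) pairs that arrived together.

    score(msg) returns a spam score in [0, 1]; train(msg, is_spam) updates
    the classifier. Both are supplied by the caller. Returns the messages
    that had to be trained on.
    """
    def error(item):
        msg, is_spam = item
        s = score(msg)
        # Positive when the message sits on the wrong side of its own cutoff.
        return (spam_cutoff - s) if is_spam else (s - ham_cutoff)

    remaining = list(batch)
    trained = []
    while remaining:
        worst = max(remaining, key=error)   # rescored on every pass
        if error(worst) <= 0:
            break   # everything left is already classified correctly
        train(*worst)
        trained.append(worst)
        remaining.remove(worst)
    return trained
```

Because the remainder is rescored after each training step, a second copy of a spam that has just been trained on drops out of the loop without being trained itself.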

But I'm doing exactly the same, except that my batch size is always 1 ;-)

>>It may not even be very realistic to train on fp's, as I think in my
>>private E-mail I won't even check the spam folder very thoroughly at all.

> But I will (and do), and my primary interest here is to see how bad things
> can get if a user takes mistake-based training to an extreme.  Despite that
> it's heavily hapax-driven, it appears to do very well when judged by error
> rate.

Hm. There are so few fp/fn's relative to unsures (at least after 30 
messages initial training), that it wouldn't matter much (I think).

>>  * The database growth doesn't decay with time after a while;
>>    it can be described as:
>>       nwords = 9200 + 1.6 * nmessages
>>    or alternatively:
>>       nwords = 5700 + 40 * ntrained
>>    ..as can be seen in the attached png's
> 
> 
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.

Hm, it is more like a sqrt after more messages. See attached image which 
has a sqrt X axis. The fit fits the data even at the lowest end.

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3.png

---------------------- multipart/mixed attachment--


From lists@morpheus.demon.co.uk  Sun Nov 10 14:31:30 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 10 Nov 2002 14:31:30 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
References: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
	<LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>
Message-ID: <n2m-g.d6pdqywt.fsf@morpheus.demon.co.uk>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> I am working on code that optionally processes "missed" messages at startup.
> It looks like I can list all unread, unscored mail in my 1000+ item inbox
> very quickly, so this should be feasible.

That sounds like the best option. I haven't had a chance to check
Exchange yet, but with an IMAP store there are no "New mail" events
triggered when I start Outlook with new mail in the IMAP inbox. I'd
expect Exchange to be the same. (I didn't write a new addin; the
spambayes addin does log when it gets a NewMail event, which I can see
via win32traceutil...)

I'll be interested to see the code, in any case, as when I tried to
list unread mail for another project, I couldn't get it to be fast
:-(

Paul.

-- 
This signature intentionally left blank

From trebor@animeigo.com  Sun Nov 10 21:59:28 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Sun, 10 Nov 2002 16:59:28 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>
Message-ID: <a05200f19b9f46cfba659@[192.168.1.103]>

[my apologies if some of the suggestions/comments below have been previously
discussed, I'm still getting up to speed on the list]

>  > I'm particularly impressed with the chi-square work, it looks very
>>  interesting (but more stats for my poor stats-challenged mind to work
>>  on;
>
>So copy and paste <wink>.

Heh, call me old fashioned, but I actually like to know how things 
work, rather than relying on black magic.  ;^)

>  > not to mention that now I'm going to have to get around to
>>  cramming python in there with all the other languages that have
>>  accumulated over the years...).
>
>In return, you can throw twelve other languages out <0.7 wink>.

Why would I ever want to do that?  You never know when you'll need to 
be able to remember PL/C, JPL, APL, TUTOR, etc., etc., etc.  Though I 
pray I never have to remember NOVA MOBOL ("Language of Kings") ;^)

>Testing has pretty much run out of steam here, though.  My error rates are
>so low now I couldn't measure an improvement in a convincing way even if one
>were to be made, and the same is true of a few others here too.  We appear
>to be fresh out of big algorithmic wins, so are pushing on to wrestling with
>deployment issues.

Indeed.  And you also have to start worrying about the metagame; 
assuming your system goes into widespread deployment, what will the 
intelligent spammer (oxymoron) responses be?

>BTW, download the source code and read the comments in tokenizer.py:  the
>results of many early experiments are given there in comment blocks.

Will be doing this over the next day or so.

>Spoken like someone who worked on a rule-based system <wink>.  We have three
>categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
>believe that a finer distinction than that can be quantitatively justified
>(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's
>what I mean by "can't measure an improvement anymore", and a finer-grained
>scheme isn't going to touch those 2 mistakes; one of them is formally ham
>because it was sent by a real person, but consists of a one-line comment
>followed by a quote of an entire Nigerian scam spam -- nothing useful is
>ever going to *call* that one ham, and it scores as spam *almost* as solidly
>as an original Nigerian spam).

Ah, but there are more considerations.  First, many people's training 
sets may not be as distinct as yours, so the results might be more 
blurry.  Second, future versions of the software might end up 
including other recognizers in the mix (for example, DNSBL, url 
heuristics, whitelists, stamping systems, etc), so adding a bit of 
flexibility at the start doesn't cost you anything, but could end up 
saving everyone a lot of work down the road.  Since most existing 
mailreader filter schemes are relatively primitive, more than 10 
levels of discrimination isn't going to be all that useful.  But only 
3 would seem to me to be too few.  In a 1-9 scheme, the current 3 
levels would map to (say), 2,5,8.

It's just a syntactic difference, but it gives you precious wiggle room.

>"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
>"not spam" was my invention, and since that one caught on big, I'm not
>fighting to the death for any others <wink>.

Hey, why quit when you're on a roll?

>
>>  * Hashing to a 32-bit token is very fast, saves a ton of memory,
>>  and the number of collisions using the python hash (I appealed for hash
>>  functions on the hackers-l and Guido was kind enough to send me the
>>  source) is low.  About 1100 collisions out of 3.3 million unique
>>  tokens on a training set I was using.
>
>That's significantly better than you could expect from a truly random hash
>function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
>should leave 3298733 buckets occupied on average, with an sdev of 35.58
>buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
>mean.

I may have gotten the # of tokens wrong.  Currently my test runs are 
using 3.3M tokens but it may have been fewer when I was doing the 
hash tests.  Maybe 2.3-2.4M tokens at that time?  Anyway, thanks for 
the info about the relative merits of CRC32 and the Python hash; I'd 
been told CRC32 was bad and so was really surprised when it was 
marginally better.

>Since we're sticking to unigrams, we don't have an insane database burden.
>We also (by default) limit ourselves to looking at no more than 150 words
>per msg.  So I'm not sure saving some bytes of string storage is "worth it"
>for us, and it's very nice that we can get back the exact list of words that
>went into computing a score later.  A pile of hash codes wouldn't give the
>same loving effect <wink>.

Well, unless I'm missing something, you've got to keep track of every 
token you've ever seen, and you've got to look up every token you 
encounter to determine if it's significant enough to consider in the 
final calc.  If so, assuming the final calc isn't exponential, 
reducing the lookup time/resources can be a big win performance-wise.

Note that since you have the text of the token before you hash it, 
you can keep that around for significant tokens and display it later. 
The only reason to hash is for speed of access to the probability 
data.  The cost of the hashing is the inevitable collisions, which 
blur the probabilities for colliding tokens.

>Except I didn't get good enough results from his approach to justify
>pursuing it here, even leaving the hash codes at the full 32 bits.  When I
>went on to squash them to fit in a million buckets, a few false positives
>popped up that were just too bad to bear (two can be found in the list
>archives):  ham that was so obviously ham that no system that called them
>spam would be acceptable to most people.

I wasn't commenting on the phrase system, or even hashing, but rather 
on data reduction to reduce the memory footprint required of the 
statistical tables (ie: using 1 byte frequency counts vs. 4 byte 
ones).

Also, a cautionary note: just because the current system doesn't 
generate any horrible false positives on your corpora doesn't mean it 
won't do so on Joe Schmoe's.  Or my slightly smelly ham.

>  > * I was playing a week or two back with 1 and 2 token groups, and
>>  found that a useful technique was, for each new token, to only
>>  consider the most deviant result.  So if the individual word was .99
>>  spam, and the two word phrase was .95, it would only consider the .99
>>  result.  This would probably help with Bill Y's combinatorial scheme.
>
>It could be a viable approach to the problem mentioned above:  a scheme to
>suck out more than one word that doesn't systematically generate mounds of
>nearly redundant (highly correlated) clues.  We're clearly missing info by
>never looking at bigrams (or beyond) now, and that continues to bother me
>(even if it doesn't seem to be bothering the error rates <wink>).
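
The "most deviant result" idea quoted above can be illustrated with a short sketch. The function name and the `spamprob` callable are hypothetical; the only thing taken from the discussion is the rule of keeping whichever of the single word and the two-word phrase lies farther from 0.5.

```python
def most_deviant_clues(tokens, spamprob):
    """For each position, keep either the word or the two-word phrase
    ending there, whichever has the probability farther from 0.5.

    spamprob is a caller-supplied function mapping a token string to a
    probability in [0, 1].
    """
    clues = []
    for i, word in enumerate(tokens):
        candidates = [word]
        if i > 0:
            candidates.append(tokens[i - 1] + " " + word)
        best = max(candidates, key=lambda t: abs(spamprob(t) - 0.5))
        clues.append((best, spamprob(best)))
    return clues
```

Each position contributes exactly one clue, so a phrase and its final word are never both counted.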

Right; and, related to the metagame, you've got to consider responses 
by the spammers.  The initial attempt to defeat this kind of 
recognizer is going to try and exploit cancellation disease, 
probably by having a spammy preamble and a very hammy postscript.

So one possible approach would be to gradually degrade the 
significance of a token the further along in the email it is (both 
during training and recognition).  But of course, then you'll have to 
watch for html email that loads the front of the message with 
invisible ham.  So a parser that spits out only the tokens a human is 
going to see is indicated.

>  > * My personal bias (as I think Guido mentioned) is for a multifaceted
>>  approach, using Bayesian, rules-based (attacking things that bayesian
>>  isn't good at, like looking for obfuscated url structures), DNSBL,
>>  and whitelisting heuristics to generate an overall ranking.  So a
>>  hammy mail from a guy in your address book would bubble up to highest
>>  priority, whereas something spammy from him would stay neutral.
>
>I'm not sure we really need it.  For example, *lots* of spam has been
>discussed on this mailing list, so much so that the python.org email admin
>had to castrate SpamAssassin for msgs to this list address else it kept
>blocking ordinary list traffic.  My personal email classifier never calls
>anything here spam, though, nor does it call the originals of the spams
>posted here ham.

Beware the One True Path.  There is strength in diversity.

Or, as the noted philosopher D. Vader put it, "Don't be too proud of 
this technological terror you have created."  As you will recall, 
those rebel scum managed to craft a nasty false positive.

>
>I do worry a little about obfuscated HTML.  We strip almost all HTML tags
>by default for a reason I've harped on enough <wink>:  all HTML decorations
>have very high spamprobs, and counting more than one of them as "a clue"
>fools almost every combining scheme into believing the msg containing them
>is spam (if you know a msg contains both <br> and <p>, it's not really more
>likely to be spam than if you just know it contains <br>!).  So we blind the
>classifier to HTML decorations now.
>
>But a spam I forwarded here a week or so ago exploited that:  the spam was
>interleaved with size=1 white-on-white news stories and tech mailing list
>postings.  The classifier *did* see those, but didn't see the HTML
>decorations hiding them.  This was a cancellation-disease-by-construction
>kind of msg, and chi-combining scored it near 0.5 as a result (solidly
>Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
>popular technique, we'll have to take more HTML blinders off the classifier.

That's a classic example of metagaming.  Seems to me, the strength of 
the spambayes recognizer is in recognizing the semantics (the spammy 
meaning of the message), not the syntactics.  So train it only on 
what a human would see reading the message.  Have another recognizer 
(either rules-based, bayesian, whatever works) that deals with the 
syntactics, and picks up on the html decoration tricks.  In other 
words, one that looks at what the message says, and another that 
looks at how it is presented.  This will prevent that particular kind 
of simple cancellation attack.

And that wraps back to the "more responses" suggestion above.  How do 
you rate a hammy message with spammy html ornaments?  Might not "a 
little hammy" be a better response than "beats me, boss!"?

>
>>  There's lots of room for cooperation between the various approaches
>>  and multiple agents means it's less likely that a spam will get by.
>>  In particular, whitelisting heuristics can almost eliminate false
>>  positives.
>
>I'll let you know if I ever see one <wink>.

You will.  And it will be the one email that you really, really 
needed to read.  Murphy's Law guarantees that it will happen.  In 
fact, it typically happens (in my painful personal experience) soon 
after you make comments like the above.

>Getting vast quantities of spam isn't a problem anymore, but getting vast
>quantities of ham is.  Since your spammy ham is presumably business-related,
>I assume you can't share it.  Or can you?

Probably not.  Unless I could process them and just give you the 
tokens and frequencies in some useable format.  I'll see what I can 
do next week, gotta get python up and running along on my Mac.  Also 
gotta get the battlebot finished or my kids will hurt me.

>   Mixing spam and ham from
>different sources also causes worlds of problems (indeed, we still (by
>default) ignore most of the header lines partly for that reason, else the
>system gets great results for bogus reasons).

I do the same, I'm currently just looking at the subject line.

At 12:09 PM +0100 11/10/02, Rob Hooft wrote:
>I think our very good experience with the bayesian classifier would 
>"forbid" to use whitelisting. Once a whitelisted feature "leaks" 
>into the spam community, it will be useless.

Not if the whitelist heuristics are based on the individual user's 
environment, as opposed to global features.

>But there is a bayesian solution to it: Make the tokenizer recognize 
>the feature that you want to whitelist or blacklist, and emit a new 
>token to that effect.
>
>    From:<in-address-book>  --> Will have a low spamprob
>    url:numeric-host        --> Will have a high spamprob
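
A rough sketch of the token-emitting approach in the quote. The two token strings come from the quote; the function name, the regular expression, and the shape of the address book are assumptions made for illustration, not the project's tokenizer.

```python
import re

# Matches URLs whose host is a dotted-quad IP address (illustrative pattern).
NUMERIC_HOST = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}(?=[:/\s]|$)", re.I)

def feature_tokens(from_addr, body, address_book):
    """Yield synthetic tokens for the classifier to weigh like any word.

    address_book is assumed to be a set of lower-cased addresses.
    """
    if from_addr.lower() in address_book:
        yield "From:<in-address-book>"
    if NUMERIC_HOST.search(body):
        yield "url:numeric-host"
```

The tokens carry no fixed weight of their own; training decides how hammy or spammy each one turns out to be for a given user.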


While this is a useful approach, there is (IMHO) a need for users to 
be able to override, or at least modulate, the bayesian results in 
certain circumstances.  The classic example would be your boss 
forwarding a 419 scam to you with the comment "Looks good, I'm going 
to invest in this, what do you think?".  The spamminess might 
overwhelm the low spamprob From:<in-address-book>

A (paranoid) user needs to be able to tell the system "I don't care 
how spammy an email looks, if it's got this feature, I've got to at 
least glance at it with the Mk.1 Eyeball Recognition System".  Note 
that this doesn't mean that it should be declared "clean as the 
driven snow", just "might not be a pile of decomposing lunchmeat"
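
One way to picture such an override is a rule that never lets a message with the trusted feature be filed as spam, without promoting it to ham either. The function name, the flag, and the cutoffs below are hypothetical.

```python
def classify_with_override(score, has_trusted_feature,
                           ham_cutoff=0.20, spam_cutoff=0.90):
    """Three-way decision in which a trusted feature caps the verdict at
    "unsure", so the message still gets a human glance."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff and not has_trusted_feature:
        return "spam"
    return "unsure"
```

Under this rule the boss's forwarded 419 scam would land in the Unsure pile rather than the spam folder.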

Yeah, this means that every spam going into Microsoft will eventually 
be from "billg@microsoft.com", but the consequences of this might be 
interesting.  Or at least, amusing.

best,R

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."

From tim.one@comcast.net  Mon Nov 11 00:59:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 19:59:05 -0500
Subject: [Spambayes] More experiments with weaktest.py
In-Reply-To: <3DCE4D02.6060907@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net>

[Tim, notes that his mistake-only training works in the order msgs
 come in]

[Rob Hooft]
> I toyed with the idea, but that would involve parsing all messages once
> before starting, and sorting them on date. Putting them in a set to
> "randomize" the order is much easier, so I was lazy.

That's fine.  For purposes of comparing this against previous tests, I
expect it's even good, since they were randomized too.

> ...
> Hm. That sounds so enthusiastic that I just might commit what I have
> gone through this night.

You did, and I thank you!  Note that there were already three Simplex pkgs
linked from

    http://www.python.org/topics/scicomp/numbercrunching.html

but I know how much fun it is to write such stuff again <wink>.

> Some more info:
>
>   * No, I have not used a "Simulated Annealing" or "Threshold Accepting"
>     yet. Please keep in mind that each step in the optimization takes
>     between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
>     work PC). This would be way too costly. Just minimization it will be.

Understood.

>   * I tried to use "Simplex optimization" (let a multidimensional
>     triangle walk through phase space) on the "Total cost" parameter.
>     This was simply disastrous. Phase space consists of plateau regions
>     that are exactly flat, joined by huge ridges. Think about that one
>     spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
>     bang to the cost. This field is impossible to optimize.

Yes, it's a sum of step functions in the end, and at every point "the
derivative" is either 0 or infinite, depending on where you are and which
direction you look.  Making a new "smooth" cost measure was thoroughly
appropriate:

>   * I designed a new "Flex cost" field. That one does away with the
>     "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>     increases linearly towards its "false" cost at the other cutoff,
>     and increases further to the other end. Hm. Unreadable.

The code is clear enough, though.  What I didn't understand is why each term
in the flexcost is divided by the difference between the (fixed per run)
cutoff levels:   / (SPC - HC).  That seems to systematically penalize, e.g.,
ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
In the limit, if someone wanted a binary classifier (ham_cutoff ==
spam_cutoff), any mistake would be charged an infinite penalty.

> A table:
>
>            Score    Spam with this   Ham with this
>                       score costs     score costs
>             0.00         $ 1.29          $ 0.00

It's hard to see where that comes from.  Assuming ham_cutoff is 0.2 and
spam_cutoff is 0.9, does a spam scoring 0.0 work out to $1 *
(.9-0.0)/(.9-.2)?

>             0.20         $ 1.00          $ 0.00
>             0.55         $ 0.50          $ 5.00
>             0.90         $ 0.00          $10.00
>             1.00         $ 0.00          $11.43
>
>      This field is much more smooth than the total cost field, so I was
>      hoping that pure minimization will do. Obviously, the flex cost is
>      much, much higher than the total cost because unsures are so much
>      more expensive. The flex cost field will also be less sensitive to
>      the {sp|h}am_cutoff parameters than the total cost field, because
>      there are no sudden cost jumps.

Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a
ZeroDivisionError exception, so it's sure sensitive there <wink>.  I suspect it
might work better if the "/(SPC-HC)" business were simply removed?
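
The effect of the division can be seen in a small sketch. `flex_term` is a hypothetical helper standing in for one term of the flexcost sum, and the $10 cost is an assumed value.

```python
def flex_term(distance, cost, hc, spc, normalize=True):
    """One flexcost term; distance is how far past its own cutoff a msg scored."""
    term = cost * max(0.0, distance)
    return term / (spc - hc) if normalize else term

# The same mistake (a ham scoring 0.1 above ham_cutoff, assumed $10 cost)
# under the two cutoff pairs mentioned above.
narrow = flex_term(0.1, 10.0, hc=0.4, spc=0.8)   # divided by 0.4
wide = flex_term(0.1, 10.0, hc=0.1, spc=0.9)     # divided by 0.8
print(narrow, wide)   # the narrow setting is charged twice as much

try:
    flex_term(0.1, 10.0, hc=0.5, spc=0.5)
except ZeroDivisionError:
    print("ham_cutoff == spam_cutoff: division by zero")
```

With `normalize=False` both cutoff pairs are charged the same amount for the same distance, which is the variant suggested above.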

>    * Results are not great; I need to experiment more before reporting
>      on them.
>    * I just committed:
>       weaktest.py: introduction of the flexcost measure
>       optimize.py: simplex optimization (needs Numeric python; sorry)
>       weakloop.py: run weaktest.py repeatedly under simplex optimization

I've been running weakloop.py over two sets of my c.l.py data while typing
this.  That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total
msgs.  It's been thru the whole business about 25 times now.  At the start,

Trained on 88 ham and 66 spam
fp: 0 fn: 0
Total cost: $30.80
Flex cost: $212.3120
x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31

It's having a hard time doing better than that.  The best so far seems to be

Trained on 82 ham and 66 spam
fp: 0 fn: 0
Total cost: $29.60
Flex cost: $200.0924
x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09

which is so close to the starting point that it's hard to believe it's
finding something "real".  It *does* seem to be in a nasty local minimum,
though, as the next attempt was:

Trained on 118 ham and 69 spam
fp: 1 fn: 0
Total cost: $47.20
Flex cost: $344.7334
x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73

I'm afraid it looks like it's eventually going to converge on the most
delicate possible settings that barely manage to avoid that 1 FP.
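
The "Total cost" lines in the three runs above are consistent with a simple step-function sum. The sketch below assumes costs of $10 per fp, $1 per fn and $0.20 per unsure, and assumes that every trained-on message was either a mistake or an unsure; both assumptions are inferred from the reported numbers rather than stated in them.

```python
# Assumed per-message costs; they reproduce the three totals reported above.
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20

def total_cost(n_trained, fp, fn):
    """Step-function cost: each message contributes one fixed amount."""
    unsure = n_trained - fp - fn   # assumption: trained = mistakes + unsures
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

runs = [(88 + 66, 0, 0), (82 + 66, 0, 0), (118 + 69, 1, 0)]
for n_trained, fp, fn in runs:
    print("$%.2f" % total_cost(n_trained, fp, fn))
```

This prints $30.80, $29.60 and $47.20. The jump in the third run is almost entirely the single $10 false positive, which is the kind of cliff that makes this measure hard to minimize.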


From tim.one@comcast.net  Mon Nov 11 01:17:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 20:17:46 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCE50FC.3050005@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECGCJAB.tim.one@comcast.net>

[Tim]
>> ... my primary interest here is to see how bad things can get if
>> a user takes mistake-based training to an extreme.  Despite that
>> it's heavily hapax-driven, it appears to do very well when judged by
>> error rate.

[Rob Hooft]
> Hm. There are so few fp/fn's relative to unsures (at least after 30
> messages initial training), that it wouldn't matter much (I think).

As I tried to explain later, the psychological impact of the Unsures isn't
attractive, though -- they remain bizarre to human eyes.  When I got up
today, I got 6 new Unsure spam:  human growth hormone, gay porn, life
insurance, mortgage rates, a msg that made no sense (empty except for a
Yahoo auto-generated sig), and Genuine Leather Jackets.  It's not picking up
on general "this is advertising" clues, or even on general "this is gay
porn" clues.  Indeed, "XXX" is still a hapax!  This particular HGH spam will
never get through again, because training it found 80(!) hapaxes unique to
it.  It's not going to do much to stop other HGH spam, though -- this one
was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost'
and 'anywhere' to the collection of (what are now, after training on it)
spam hapaxes -- just as previous HGH spam trained on didn't stop this one.
To my eyes, I had already told it about HGH spam, and I'm irked that it
showed me another one.  Ditto gay porn, ditto life insurance, etc.


[on database growth as a function of # of msgs]
> Hm, it is more like a sqrt after more messages. See attached image which
> has a sqrt X axis. The fit fits the data even at the lowest end.

Cool!  That was a dramatic graph indeed.  Soon there will be no mysteries
remaining <wink>.


From tim.one@comcast.net  Mon Nov 11 02:00:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 21:00:20 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEPOCHAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECJCJAB.tim.one@comcast.net>

[Tim]
> The original names made more sense when we had half a dozen competing
> schemes.
>
> Current                         Proposed
> -------                         --------
> robinson_probability_x          unknown_word_prob
> robinson_probability_s          unknown_word_strength
> robinson_minimum_prob_strength  minimum_prob_strength

This renaming has been done.  It should have no effect on pickles or
databases (i.e., no need to retrain).


From anthony@interlink.com.au  Mon Nov 11 02:22:26 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 11 Nov 2002 13:22:26 +1100
Subject: [Spambayes] helping push the ham score for "nigeria" higher.
Message-ID: <200211110222.gAB2MQB11817@localhost.localdomain>


apologies for the marginal relevance, but it entertained me :)

http://news.bbc.co.uk/1/hi/world/africa/2423283.stm
"I am writing to you in the hope that you are under god and well. My naming
is Professor Isoun Turner, and I am having hope you can assist. We are 
having a communications sattelite worth $15 millon US dollars that needs
to be launched, but we need to find an international  launch pad"


From tim.one@comcast.net  Mon Nov 11 05:42:51 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 00:42:51 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f19b9f46cfba659@[192.168.1.103]>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>

[Robert Woodhead]
> ...
> Heh, call me old fashioned, but I actually like to know how things
> work, rather than relying on black magic.  ;^)

You'll like this code, then!  We hate "mystery knobs", and everything has a
purpose.  A purpose may not make sense, but at least it has one.

> ...
> Indeed.  And you also have to start worrying about the metagame;
> assuming your system goes into widespread deployment, what will the
> intelligent spammer (oxymoron) responses be?

I expect to get rich by selling spammer software to defeat this latest round
of classifiers, so it's not that I can't tell you what their responses will
be, it's that I don't want to reveal trade secrets <wink>.  Indeed, if there
are technically savvy spammers, they're subscribed to this list (and others
like it).

> ...
> Ah, but there are more considerations.  First, many people's training
> sets may not be as distinct as yours, so the results might be more
> blurry.

Of all the things this project has done I find lacking in other projects,
this is the part I think gives this project its clearest advantage:  we have
a statistically sound testing framework, more than one person testing on
more than one corpus, people are beat up for running sloppy tests, and major
algorithm improvements have been vetted by many here on their own data, with
results publicly reported.  Winners survived and losers got purged from the
codebase, and no single test corpus ruled that.  Even for people with a
single test corpus, the testing framework slices-and-dices it into multiple
runs, so that results specific to a quirk of one subset can't be mistaken
for "the truth".  The project's TESTING.txt talks more about this.

My tech mailing-list data turned out to be easier than most people's,
seemingly because almost all forms of advertising, and of HTML, are despised
on tech mailing lists.  But I've got other, harder test data too, and at
least one person here (hi, Anthony!) has a flatly horrid corpus.

> Second, future versions of the software might end up including other
> recognizers in the mix (for example, DNSBL, url heuristics, whitelists,
> stamping systems, etc), so adding a bit of flexibility at the start
> doesn't cost you anything, but could end up saving everyone a lot of
> work down the road.

We'll define a stable API for accessing this system.  If people want to
combine it with other systems, that's fine, and Python excels at playing
nice with other systems.  If someone wants to add, e.g., a DNSBL gimmick to
*this* codebase, they should write a new module to do so.  I don't want
fundamentally different approaches mixed into one module, let alone one
function.

> Since most existing mailreader filter schemes are relatively primitive,
> more than 10 levels of discrimination isn't going to be all that useful.
> But only 3 would seem to me to be too few.  In a 1-9 scheme, the
> current 3 levels would map to (say), 2,5,8.

Let me clarify:  I don't object to defining a billion levels, the problem is
that I've seen no evidence that the algorithm in use here *can* provide more
than 3 meaningful levels.  chi-combining usually gives extreme scores.  The
median spam score is (to 6 significant digits) 1.0; the median ham score is
on the order of 1e-10.  The difference between, e.g., 1e-20 and 1e-5 appears
meaningless, despite that it's 15 orders of magnitude.  When chi doesn't
give an extreme score, it tends to give one near 0.5, and which side of 0.5
it lies on doesn't appear to have strong correlation with whether a thing is
ham or spam.  The system is saying "I'm lost!" then, and it is.

In effect, it's a 1-bit classifier but with a very useful middle ground.
That it only gives about 1 bit of info follows from the fact that the underlying math
is a statistical accept/reject test (a two-outcome decision).  Well, it's
actually two accept/reject tests under the covers (one for ham, one for
spam), and that's where the middle ground comes from (they both accept or
both reject).
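
A minimal sketch of the two-test structure, written as Fisher-style chi-squared combining. The function names are illustrative and details such as clue selection are left out, so treat the project's classifier as the authoritative version.

```python
import math

def chi2Q(x2, v):
    """Probability that a chi-squared variate with v (even) degrees of
    freedom is at least x2."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-word spam probabilities, each strictly between 0 and 1."""
    n = len(probs)
    if n == 0:
        return 0.5
    # Spam test: near 1 when the probabilities are too high to be chance.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Ham test: near 1 when the probabilities are too low to be chance.
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Both tests near 1, or both near 0, gives the middle ground at 0.5.
    return (S - H + 1.0) / 2.0
```

Ten strongly spammy clues push the result toward 1.0, ten strongly hammy clues toward 0.0, and an even mix of the two lands at about 0.5 because both tests fire at once.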

If we were to call our middle ground 5, what good would that do anyone else?
It doesn't mean we judge the odds of a msg being spam at 1 in 2.  It means
we have no idea.  It certainly doesn't mean what, e.g., a 5 coming out of
SpamAssassin means.  "Unsure" means what it says.  If, in the future, a new
and better algorithm comes along with 6 meaningful digits, then I expect a
new X- header would be defined to report it.

> It's just a syntactic difference, but it gives you precious wiggle room.

I'll leave more on this to people adding headers (the client I'm using
doesn't use headers, but does attach integer score (in 0-100) metadata to
msgs).

[on hash collisions]
> ...
> I may have gotten the # of tokens wrong.  Currently my test runs are
> using 3.3M tokens but it may have been fewer when I was doing the
> hash tests.  Maybe 2.3-2.4M tokens at that time?  Anyway, thanks for
> the info about the relative merits of CRC32 and the Python hash; I'd
> been told CRC32 was bad and so was really surprised when it was
> marginally better.

Hard to say.  Neither CRC32 nor Python's string hash makes any effort toward
being "cryptographically secure", and Python's string hash is in fact and
deliberately "better than random" in some common cases:

>>> hash('x1')
739453787
>>> hash('x2')
739453784
>>> hash('x3')
739453785
>>> hash('x4')
739453790
>>>

That is, it's very regular in a way that most often yields fewer 32-bit
collisions than a truly random hash function would yield when fed input
strings with regularities.  That eventually breaks down if you throw enough
strings at it -- but it doesn't get "worse than random" then either, so far
as it's ever been pushed.
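
For anyone repeating the comparison today, here is a rough sketch of a
collision counter, assuming Python 3.  Note that Python 3 salts str hashes
per process unless PYTHONHASHSEED is fixed, so the hash() values shown above
will not reproduce there; CRC32 via zlib is stable.  The helper names
count_collisions and crc32_of are made up for this example.

```python
import zlib

def count_collisions(tokens, hashfunc):
    """Return how many distinct tokens land on a 32-bit hash value that
    some other distinct token already occupies."""
    distinct = set(tokens)
    buckets = {hashfunc(tok) & 0xFFFFFFFF for tok in distinct}
    return len(distinct) - len(buckets)

def crc32_of(token):
    """CRC32 of the token's UTF-8 bytes."""
    return zlib.crc32(token.encode("utf-8"))
```

Calling count_collisions(tokens, crc32_of) and count_collisions(tokens, hash)
on the same token list gives the two numbers being compared in this thread.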

> ...
> Well, unless I'm missing something, you've got to keep track of every
> token you've ever seen,

So far we have, but there's slow-motion work in progress on database
pruning.

> and you've got to look up every token you encounter to determine if
> it's significant enough to consider in the final calc.

Yes, and that will probably always be true.

> If so, assuming the final calc isn't exponential, reducing the lookup
> time/resources can be a big win performance-wise.

I don't believe so.  When using a Python dict as "the database", the time
for scoring a msg is minor compared to the time taken by parsing and
tokenization, and especially compared to the time just to get the msg *into*
the system (whether that's file I/O, or socket I/O, or some email pkg's
programming API, or whatever -- that part is the bottleneck when using a
dict; when not using a dict, database access time may become a burden, and
most databases in use here require string keys even if you're working with
ints -- the database user has to convert the hash code to a string!  Other
databases (like ZODB) could use ints directly as keys, but they're rare.).

> Note that since you have the text of the token before you hash it,
> you can keep that around for significant tokens and display it later.

Good point!  I had overlooked that indeed.

> The only reason to hash is for speed of access to the probability
> data.

Feel free to experiment; as above, I don't have reason to suspect that
switching to hash codes would speed anything here, except for Jeremy's ZODB
database (which could switch to using an IOBTree, which is zippier than an
OOBTree).

> The cost of the hashing is the inevitable collisions, which
> blur the probabilities for colliding tokens.

Another cost is obscuring the code.

> ...
> I wasn't commenting on the phrase system, or even hashing, but rather
> on data reduction to reduce the memory footprint required of the
> statistical tables (ie: using 1 byte frequency counts vs. 4 byte
> ones).

Ours are actually unbounded, but I don't have any problem with the memory
footprint now.  Others do.  It seems more fruitful at this point to
concentrate on ways to reduce the # of tokens, rather than the size burden
per token.  BTW, see the neil*.py files for how one person here builds a
lean scoring-only CDB database -- you can store things any way you like,
provided that the database access function is fiddled to convert to what the
classifier expects to use.  I don't believe such conversion is a significant
time burden, but I haven't run the CDB variant and so haven't timed it
(Neil, do you have gripes about memory or time?  Spit 'em out.).

> Also, a cautionary note: just because the current system doesn't
> generate any horrible false positives on your corpora doesn't mean it
> won't do so on Joe Schmoe's.  Or my slightly smelly ham.

Sure, but I'm a realist:  any non-trivial scheme has a non-zero FP rate.
That's life.  What users choose to do about that isn't for this project to
dictate.  It is our responsibility to say up-front that there will be false
positives, and we do say so.

> ...
> Right; and, related to the metagame, you've got to consider responses
> by the spammers.  The initial attempt to defeat these kind of
> recognizers is going to try and exploit cancellation disease,
> probably by having a spammy preamble and a very hammy postscript.

They can't really defeat this scheme that way.  At best they can hope to
push msgs into Unsure territory.  What constitutes "very hammy" is a
function of each user's database here, and no generic blob of text is going
to score high for hamminess everywhere.  The spam in question happened to
include a news story about the DC-area snipers, and that was very hammy for
*me* because I live in that area and many friends and relatives had
corresponded about the snipers (including forwarding the text of that very
news story, as if we were suffering a news blackout here <wink>).  Even so,
the message ended up as Unsure for me, not as Ham.  That's to the credit of
chi-combining, which is very good about knowing when it's confused.

> So one possible approach would be to gradually degrade the
> significance of a token the further along in the email it is (both
> during training and recognition).

I think there is reason to believe that spammers have to get your attention
early.  OTOH, many pieces of incriminating evidence also live at the end of
spams ("this is not spam!" blurbs, the explanation that you got this because
you're on an opt-in list run by one of their "partners", references to
various state and federal bills, the "unsubscribe me" URL slash address
harvester, etc).

The white-on-white spam I mentioned before had hammish stuff at the start,
and at the end, and between each pair of paragraphs.

> But of course, then you'll have to watch for html email that loads the
> front of the message with invisible ham.  So a parser that spits out
> only the tokens a human is going to see is indicated.

Yup.  Guido suggested that at the start, but that level of HTML analysis
gets a lot more expensive too.  We'll see.

BTW, on large tests this system scores about 80 msgs/second on my box,
including everything (system time, training, I/O, parsing, tokenizing,
scoring, reporting, recording, and analyzing results -- this is # of msgs
divided by elapsed wall-clock time).  We could afford to get slower, if
necessary.

> ...
> Beware the One True Path.  There is strength in diversity.

Let a thousand classifiers bloom.  If someone here wants to volunteer the
effort to try a different approach, that's always been welcome.  But the
results have been so good sticking to one basic approach that I don't see
that happening.  We ended up doing one thing exceedingly well, and that's a
contribution to diversity too, of a kind you may be undervaluing <wink>.

> Or, as the noted philosopher D. Vader put it, "Don't be too proud of
> this technological terror you have created."  As you will recall,
> those rebel scum managed to craft a nasty false positive.

I don't view an FP as being as costly as needing to build a new Death Star.
For goodness sake, this is email we're talking about -- anyone trusting a
truly critical msg to email is dreaming to begin with.

> ...
> That's a classic example of metagaming.  Seems to me, the strength of
> the spambayes recognizer is in recognizing the semantics (the spammy
> meaning of the message), not the syntactics.

Well, it's got no semantic knowledge at all.  It doesn't even know which
language a msg is written in, let alone what it means, and has no concept of
"word" beyond "stuff that appears between whitespace".  It's very much
focused on purely local lexical structure.

> So train it only on what a human would see reading the message.

We get a lot of value out of mining a handful of header lines.  We also get
a lot of value out of tokenizing embedded "invisible" URLs.  The theme here
is that we tokenize "what works", and that's driven by measured error rates;
philosophy doesn't enter into that part.

> Have another recognizer (either rules-based, bayesian, whatever works)
> that deals with the syntactics, and picks up on the html decoration
> tricks.  In other words, one that looks at what the message says, and
> another that looks at how it is presented.  This will prevent that
> particular kind of simple cancellation attacks.

A rule-based system seems more effective to me too against that particular
gimmick.  Also against viruses.

> And that wraps back to the "more responses" suggestion above.  How do
> you rate a hammy message with spammy html ornaments?  Might not "a
> little hammy" be a better response than "beats me, boss!"?

I have no real idea, but fear that presuming "yes" is presuming a lot of
intelligence that systems parsing this header won't actually have.  The
fancier the rating scheme the fancier they have to be too.  In the end, the
user has to decide what to do about everything that's not called ham, no
matter how many or few the non-ham categories.  As a user myself, I've got
no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats
me".  That already gives two categories I have to check, and that's enough.
I do find it useful that my client can sort on the score metadata, and there
are proposals here too to add fancier header lines beyond the basic
spam/ham/unsure one.

[on FPs]
> You will.

Of course I will.

> And it will be the one email that you really, really needed to read.

It doesn't matter -- I review all my spam.  Other people won't, and so it
goes.

> Murphy's Law guarantees that it will happen.  In fact, it typically
> happens (in my painful personal experience) soon  after you make
> comments like the above.

You realize you're overselling badly here, right <wink>?

> ...
> I do the same, I'm currently just looking at the subject line.

Look at tokenize_headers() in tokenizer.py for a number of other
corpus-independent header lines that proved useful to tokenize.  Surprising
but true:  we can get a very good classifier by looking at this handful of
header lines alone.  Or by looking at the body alone.  Looking at both takes
longer <wink>.

> ...
> While this is a useful approach, there is (IMHO) a need for users to
> be able to override, or at least modulate, the bayesian results in
> certain circumstances.  The classic example would be your boss
> forwarding a 419 scam to you with the comment "Looks good, I'm going
> to invest in this, what do you think?".  The spamminess might
> overwhelm the low spamprob From:<in-address-book>

This is akin to my "entire Nigerian scam quote" FP, and it's all but certain
that the spam content would overwhelm the brief "from the boss" clues.
OTOH, if my boss didn't wait for my reply and went ahead and invested
anyway, the subsequent financial disgrace would open the door for me to take
his job.  After all, he relied on me for advice, so who more logical to
succeed him?

two-winners-and-only-one-loser-ly y'rs  - tim


From popiel@wolfskeep.com  Mon Nov 11 06:11:25 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 10 Nov 2002 22:11:25 -0800
Subject: [Spambayes] More experiments with weaktest.py 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net> 
Message-ID: <20021111061126.211B5F4CD@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>I've been running weakloop.py over two sets of my c.l.py data while typing

I've now run weakloop.py over three sets of my private data;
that's 3*200 ham and 3*200 spam, for a total of 1200 messages.

The best few it came up with were:

Trained on 39 ham and 61 spam
fp: 4 fn: 3
Total cost: $61.60
Flex cost: $189.7713
x=0.5040 p=0.1040 s=0.4400 sc=0.902 hc=0.204 189.77

Trained on 38 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.60
Flex cost: $189.9767
x=0.5060 p=0.1060 s=0.4300 sc=0.903 hc=0.206 189.98

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.2842
x=0.5054 p=0.0980 s=0.4436 sc=0.905 hc=0.209 189.28

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.8255
x=0.5033 p=0.0981 s=0.4456 sc=0.903 hc=0.206 189.83

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.8260
x=0.5026 p=0.1000 s=0.4458 sc=0.902 hc=0.207 189.83

There were a few where it trained on a couple more or less ham and
spam... but I had to go hunting for them.  I find it quite interesting
that my ham:spam training ratio here (about 2:3, about where all my
ratio tests have been pointing as a sweet spot) is significantly
different than that reported by others (which has been much closer
to 1:1 or favoring more ham than spam).  I guess my corpus really
is unusual.

FWIW, I'm running it again with all 10 of my sets (4000 messages
total) overnight.

- Alex

From popiel@wolfskeep.com  Fri Nov  8 00:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 16:06:27 -0800
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: Message from "Tim Peters" <tim@zope.com> 
	<BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
References: <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com>

In message:  <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>
             "Tim Peters" <tim@zope.com> writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
>
>I still need to try to make sense of those tests.  A real complication is
>that more than one thing changes when trying to test ratios:  it's not just
>the ratio that changes, it's the absolute number of each trained on too.

True.

>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
>and 10000 spam.  The ratios are identical.  Do we expect the error rates to
>be identical too?  I don't, but haven't tried it.

I have tried this, and the effects of ratio were diminished
as the training set size increased.  For details, see
http://www.wolfskeep.com/~popiel/spambayes/ratio2 .  The
tests were done with gary-combining, not chi-square, so I
really ought to rerun them.

>I expect the latter would do better than the former, despite the identical
>ratios, simply because more msgs allow better spamprob estimates.

It depended on what the ratio in question was... for 1:4
ham:spam, increased training set size hurt instead of helped,
in the ranges that I was able to test.  For 1:1, increased
training helped instead of hurt.

>Something missing in "the ratio tests" is a rationale (even an
>after-the-fact one) for believing there's some aspect of the system that's
>sensitive to the ratio.  The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham::spam training ratio
>has no direct effect on the spamprobs calculated.

Eh, I have a perfectly good rationale for believing that
something is sensitive to the ratio: the tests I've run
show such a sensitivity.  What's missing is a theory on
_why_ there's a sensitivity. ;-)

I don't think the following theory is perfectly phrased, but
it seems plausible to me:

Perhaps the number of topics discussed in ham is greater
than that in spam.  Thus, the average percentage of ham
messages containing a particular significant ham word is
systematically lower than the average probability of a
particular significant spam word appearing in spam messages.
As the training set size increases, the percentage difference
becomes more consistent and pronounced.  Since we're then
combining the percentages, we systematically skew slightly
due to the differing averages.

Changing the ratio of ham to spam has the effect of changing
the number of topics discussed, particularly when the training
set size is small and random chance can exclude all instances
of a given topic.  Balancing the number of topics removes the
skew in the probabilities.  As training set size increases,
adjusting the ratio has less effect, because it has less
likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low.  The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.

I suspect that by the time the corpora got that large, adjusting
the training ratio wouldn't make a lick of difference if the
corpora were sampled randomly to achieve the given ratio.  There
would just be too little chance of excluding a topic from the
samples.  Systematically excluding a topic might produce equivalent
results to my ratio tests.

- Alex

From richie@entrian.com  Fri Nov  8 00:17:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 00:17:25 +0000
Subject: [Spambayes] SMTP proxy questions
Message-ID: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

-- 
Richie Hindle
richie@entrian.com


From anthony@interlink.com.au  Fri Nov  8 00:30:09 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 08 Nov 2002 11:30:09 +1100
Subject: [Spambayes] SMTP proxy questions 
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com> 
Message-ID: <200211080030.gA80UAf11390@localhost.localdomain>


> I've discussed this with Tim S, and he's going off the SMTP proxy idea
> while I'm still broadly in favour of it.  What do people think - do
> non-Outlook users want to forward messages to 'spam' and 'ham' to train the
> system, or use an HTML UI?

I'd have to say I don't like the idea. There's too many potential places
where it can all go horribly horribly pear-shaped, and too many rat-holes
that the various email clients can screw up with.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From jbublitz@nwinternet.com  Fri Nov  8 01:15:29 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <XFMail.021107171529.jbublitz@nwinternet.com>

On 08-Nov-02 Richie Hindle wrote:
> Or perhaps there's another way.  The only other option I'd
> thought of was to add two hyperlinks to the end of the message,
> "This is spam" and "This is ham" (in ways that would work for
> both HTML and plain-text messages, in both HTML and plain-text
> email clients).  They'd link to the HTML interface and tell it
> the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more
> intrusive than the .sig that mailman adds.

What about adding a MIME object to the msg with the Spambayes info
(text/spambayes?) - or will forwarding lose that info too? The
email module should be able to do this.
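
A rough sketch of what that could look like with the standard email package
(Python 3 spelling).  The text/spambayes subtype and the "cache-id" payload
are only the hypothetical ones floated here, and the sketch says nothing
about whether a given mail client keeps the part when forwarding.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def attach_spambayes_info(body_text, info_text):
    """Wrap a plain-text body and add a text/spambayes part next to it."""
    outer = MIMEMultipart()
    outer.attach(MIMEText(body_text))
    outer.attach(MIMEText(info_text, _subtype="spambayes"))
    return outer

def extract_spambayes_info(msg):
    """Return the payload of the first text/spambayes part, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/spambayes":
            return part.get_payload().strip()
    return None
```

Serializing with as_string() and parsing the result back with
message_from_string() is a cheap way to check that the part survives a round
trip through the library itself.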

Jim


From tim.one@comcast.net  Fri Nov  8 04:07:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:07:18 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEPMCHAB.tim.one@comcast.net>

FYI, that option is gone now.

From tim.one@comcast.net  Fri Nov  8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPOCHAB.tim.one@comcast.net>

The original names made more sense when we had half a dozen competing
schemes.

Current                         Proposed
-------                         --------
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Note:  unknown_word_prob is what the Bayesian prob adjustment moves toward,
more strongly the less evidence backs up a counting spamprob estimate (the
fewer the msgs a word has been seen in, the more the adjustment pushes the
spamprob toward unknown_word_prob; for a word that's never been seen before,
this reduces to unknown_word_prob exactly).
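
A small sketch of the adjustment just described, in Python 3, using the
proposed option names as parameters.  The function name adjusted_spamprob is
illustrative, and the 0.45 default for unknown_word_strength is an assumption
made for this example; check your own options before relying on it.

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """By-counting spamprob pulled toward unknown_word_prob, more strongly
    the fewer msgs the word has been seen in."""
    n = hamcount + spamcount
    if n == 0:
        # Never-seen word: reduces to unknown_word_prob exactly.
        return unknown_word_prob
    # Work with percentages, not raw counts.
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    prob = spamratio / (hamratio + spamratio)
    s, x = unknown_word_strength, unknown_word_prob
    return (s * x + n * prob) / (s + n)
```

A word seen in one spam and no ham comes out around 0.84 with these
defaults; seen in 50 spams and no ham it comes out above 0.99, but still
short of 1.0.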

We've always set it to 0.5 by default, and previous tests never showed
benefit from changing that.

We've gotten better since then, though, and it's possible to deduce "a more
correct" value.  For example, take the mean of all the by-counting spamprobs
in your database, across words that have appeared in at least 10 msgs (so
that there's reason to have *some* confidence in the by-counting guess).
That's then an estimate of the spamprob a new word will eventually get over
time.

Across 3 databases I tried this on, it turned out to be a little over 0.5,
from 0.513 (my home personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to
find this guess for one of your databases; then if it differs from 0.5, try
a before-and-after test just changing that much.  If there's any promise
here, update_probabilities() could easily be changed to compute and use this
automatically.

"""
import cPickle as pickle
f = file('fat.pik', 'rb')  # your database pickle goes here

c = pickle.load(f)
f.close()
w = c.wordinfo

def guessx():
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    print n, probsum / n

guessx()
"""


From mhammond@skippinet.com.au  Fri Nov  8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEKIHJAA.mhammond@skippinet.com.au>

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts <wink>.

I think that the classes I posted a while ago suffer from the exact reverse
problem as your idea.  My idea was to make a "message store" that is largely
independent of training.  I believe the problem with your design is that it
deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests here.
My focus is towards clients, and specifically the outlook one (if there were
more clients I would be happy to think of them too <wink>).  A lot of the
focus of this group is towards admins rather than individuals (which is just
fine!)  But it seems the current thinking is of a corpus as being a fairly
static, well-controlled set of messages used almost purely for training
purposes.

For client programs, this may not be practical.  The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the
message in the user's real inbox.  This may or may not be what is intended -
a corpus "move" operation is more about changing a message's classification
than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that.  For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes
that into consideration would be good.

> I may be putting too much
> into the base class by demanding that the text of the message be given to
> the constructor - that precludes making FileMessage lazy, and
> only read the
> file when it needs to.]

It also defeats the abstract nature of the class.

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already
exist for this engine are the best ones to use to discover the naivety early
<wink>

> You can then envisage a MailboxCorpus, an OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:
* An abstract "message id"
* A message classification database, as discussed before - basically just a
dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch
training.  It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability
and message databases.

At that level, we really don't need much else - no folders or any other
grouping of messages.  I'm really not too sure there is much value in adding
higher-level concepts such as folders or message store "move" operations -
certainly not at the outset, where there are too many competing
requirements.
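
To make the list above concrete, here is a minimal sketch in Python 3.  All
names (DictMessageStore, train, retrain_corpus, the learn callback) are
hypothetical and stand in for whatever the real classes would be; message
IDs are plain strings and the "probability database" is hidden behind learn.

```python
class DictMessageStore:
    """Message store: returns a message object given its ID."""

    def __init__(self, messages):
        self._messages = dict(messages)

    def get_message(self, msg_id):
        return self._messages[msg_id]

def train(msg_id, is_spam, store, classifications, learn):
    """Training API: update the probability database (via learn) and the
    message classification database (a dict keyed by ID)."""
    learn(store.get_message(msg_id), is_spam)
    classifications[msg_id] = "spam" if is_spam else "ham"

def retrain_corpus(corpus, store, classifications, learn):
    """Bulk training: the corpus only enumerates message IDs; the labels
    come from the classification database."""
    for msg_id in corpus:
        learn(store.get_message(msg_id), classifications[msg_id] == "spam")
```

Nothing here knows about folders or moving mail; a client such as the
Outlook one would supply its own store and decide when to call train.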

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but may be too simple to be necessary.  If the process of
re-training a message in the Outlook GUI becomes:

def RetrainMessageAsSpam():
	# Outlook specific code to get an ID.
	message = message_store.GetMessage(id)
	if not classifier.IsSpam(message):
		classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.  Unfortunately, the
decision to perform the retrain is the complex, but client specific part.
Is this a newly delivered message?  Did the user manually move the message
somewhere?  Did the user click one of our buttons?  Is the user deleting old
ham that we want to train on before it dies forever?

Outlook does this via examining what Outlook event we are seeing, and
looking at meta-data we possibly previously attached to the message.  I'm
not sure this can be encapsulated well at the moment without adding all our
meta-data etc baggage to the base classes.

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...  This is doomed to failure.  Code *must* do something to be
taken seriously.  At the very least, I would expect to see the existing test
driver framework running against these "abstract concepts" <wink>

> Once the framework is there, most of the code needed to implement the
> functionality should already be in the project - code to hook
> into Outlook,
> to train on a message, to parse mbox files, and so on.  It just needs
> hooking into the framework.

See above <wink>.

Mark.


From tim.one@comcast.net  Fri Nov  8 04:50:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:50:42 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAACIAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> The most difficult problem for retraining-by-forwarding is matching the
> forwarded message to one from the cache, after Outlook Express
> has stripped the headers, top-quoted the user's .sig, converted it
> to HTML and added fifteen macro viruses.  Any ideas?

If user can be convinced to forward as an *attachment*, those problems go
away, at least in OE.  You can create a new msg there, select any number of
msgs, drag them to the msg as a group, and OE will create an attachment for
each one.  Unlike Outlook, OE appears to save the original stuff that came
in over the wire (we're finding it's a real hoot in the OL client to try to
guess what the original MIME structure may have been).

> Can the tokeniser help?

If you put in a token unique to each msg, sure <wink>.  Perhaps the "loose
checksum" program Skip checked in could be useful for this.
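
For illustration only, one way a loose checksum could work; this is not the
program Skip checked in, just a guess at the general idea in Python 3.  It
throws away quote markers, indentation, case and whitespace differences
before hashing, so a plainly quoted forward still matches the cached
original.  It would not survive conversion to HTML.

```python
import hashlib
import re

def loose_checksum(text):
    """Hash the text after stripping the decoration forwarding tends to add."""
    lines = []
    for line in text.splitlines():
        line = re.sub(r"^[>\s]+", "", line)           # quote markers, indent
        line = re.sub(r"\s+", " ", line).strip().lower()
        if line:                                      # ignore blank lines
            lines.append(line)
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()
```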


From tim.one@comcast.net  Fri Nov  8 05:06:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:06:43 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEACCIAB.tim.one@comcast.net>

[Richie Hindle]
> A quick note in case someone decides to remove the counts from the
> database:

Neil Schemenauer already does, in his CDB code (neil*.py).  It's a lean
scoring-only database, mapping tokens to *just* spamprobs.  If he went on to
store them as scaled ints, he could almost certainly reduce this to 2 bytes
of prob info per token, and possibly even just 1.
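
A quick sketch of the scaled-int idea in Python 3, not Neil's actual code:
map a spamprob in [0, 1] onto the full range of a 1- or 2-byte unsigned int
and back.  The function names are invented for the example.

```python
def prob_to_scaled_int(prob, nbytes=2):
    """Pack a probability in [0, 1] into nbytes bytes, big-endian."""
    scale = (1 << (8 * nbytes)) - 1
    return int(round(prob * scale)).to_bytes(nbytes, "big")

def scaled_int_to_prob(data):
    """Inverse of prob_to_scaled_int; the width is taken from len(data)."""
    scale = (1 << (8 * len(data))) - 1
    return int.from_bytes(data, "big") / scale
```

The worst-case round-trip error is half a step: about 0.002 with 1 byte and
about 0.000008 with 2 bytes.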

> the HTML front end has a "Word query" feature which will tell you the
> information in the database for a given word - it's interesting to see
> how many more times the word 'Viagra' appears in ham than in spam.  I
> mean the other way round.

What a geek <wink>.


From tim.one@comcast.net  Fri Nov  8 05:48:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:48:25 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-6EAB9700F2A611D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>

[Just van Rossum]
> I think it can be done with almost no extra overhead with a
> caching scheme.  This assumes (probably wrongly <wink>) that
> the cache stays in memory between runs.
> Something like this perhaps:
>
> *** classifier.py   Thu Nov  7 23:03:07 2002
> --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> ***************
> *** 456,459 ****
> --- 456,460 ----
>
>           wordinfoget = self.wordinfo.get
> +         spamprobget = self.spamprobcache.get
>           now = time.time()
>           for word in Set(wordstream):
> ***************
> *** 463,467 ****
>               else:
>                   record.atime = now
> !                 prob = record.spamprob
>               distance = abs(prob - 0.5)
>               if distance >= mindist:
> --- 464,470 ----
>               else:
>                   record.atime = now
> !                 prob = spamprobget(word)
> !                 if prob is None:
> !                     prob = self.calcspamprob(word, record)
>               distance = abs(prob - 0.5)
>               if distance >= mindist:

Sorry, I don't know what this is trying to accomplish.  Like, what is
self.spamprobcache?  There's no such thing now, and the patch doesn't appear
to create one (i.e., this code doesn't run).  Whatever it's supposed to be,
why isn't spamprobcache.get *itself* responsible for returning a spamprob,
instead of making its caller deal with two cases?  If the answer is "it's
supposed to be a dict, so .get ain't that smart", then the memory burden for
a long-running scorer process will zoom, negating one of the benefits people
attached to "real databases" thought they were buying in return for giant
files and slothful performance <wink>.
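
One way to read the two objections above is a cache whose get() computes the
value on a miss and whose size is capped.  This is only a sketch under those
assumptions: SpamProbCache, its max_entries limit, and the compute callable
(standing in for the calcspamprob of the pseudo-patch) are all hypothetical:

```python
from collections import OrderedDict


class SpamProbCache:
    """Bounded cache: get() always returns a spamprob, computing on a miss."""

    def __init__(self, compute, max_entries=10000):
        self._compute = compute          # callable(word, record) -> float
        self._max_entries = max_entries  # cap on memory use
        self._probs = OrderedDict()

    def __len__(self):
        return len(self._probs)

    def get(self, word, record):
        if word in self._probs:
            # Hit: mark the word as recently used.
            self._probs.move_to_end(word)
            return self._probs[word]
        # Miss: compute, remember, and evict the least recently used entry.
        prob = self._probs[word] = self._compute(word, record)
        if len(self._probs) > self._max_entries:
            self._probs.popitem(last=False)
        return prob
```

The caller then deals with one case only, and a long-running scorer keeps at
most max_entries floats in memory.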

Life would be easier if databaseheads trained all they liked as often as
they liked, but refrained from calling update_probabilities() until the end
of the day (or other "quiet time").  The idea that the model should be
updated after every msg trained on is an extreme.
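
The "train now, update later" policy can be sketched as a thin wrapper.  It
assumes a classifier whose learn() takes a third argument that suppresses the
probability update and which has an update_probabilities() method; the
wrapper name DeferredUpdater and the exact learn() signature are assumptions:

```python
class DeferredUpdater:
    """Train as often as you like; recompute probabilities only on flush()."""

    def __init__(self, classifier):
        self.classifier = classifier
        self.dirty = False  # True when training has happened since last flush

    def train(self, tokens, is_spam):
        # Third argument False is assumed to mean "do not update now".
        self.classifier.learn(tokens, is_spam, False)
        self.dirty = True

    def flush(self):
        """Call at the end of the day, or at any other quiet time."""
        if not self.dirty:
            return False
        self.classifier.update_probabilities()
        self.dirty = False
        return True
```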


From tim.one@comcast.net  Fri Nov  8 06:23:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:23:13 -0500
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAICIAB.tim.one@comcast.net>

[Richie Hindle, cogitates about Messages and their Corpus(ora)]

That's the ticket!  Backing off to a more fundamental level looks useful to
me too.  We never even straightened that much out for testing purposes
(msgs.py isn't general enough; for some custom test drivers (never checked
in), I couldn't even reuse the MsgStream class for my *own* directory
structures).

I disagree with Mark's

> If the process of re-training a message in the Outlook GUI becomes:
>
> def RetrainMessageAsSpam():
>     # Outlook specific code to get an ID.
>     message = message_store.GetMessage(id)
>     if not classifier.IsSpam(message):
>         classifier.train(message, is_spam=True)
>
> And not a whole lot else, it doesn't seem worth it.

because it illustrates the point <wink>:  it doesn't look like a correct
re-training method (although it may be, depending on assumptions about where
"id" comes from, and what assorted classifier methods do), and while a
correct method shouldn't be hard, in the absence of a class dedicated to
doing the simple common things that *can* be done in a common way, everyone
will keep screwing it up in their own client code.

> ...
> You might want to run it past Tim Peters, 'cos he's *far* better at this
> kind of thing than I am (though he's also busy).

I have to do more Python and Zope work now, so have to guard my time on
*this* project more jealously than I have.  MarkH and SeanT and JeremyH all
have ideas here too, and I trust you'll sort them out as a harmonious family
bent on world domination.  As a general strategy, the first person to check
code in usually wins <wink -- but take a clue from Mark, and from the
earlier days of this project, and from the pop3 proxy, and sling code more
than talk about it -- refactoring in Python is easy when the need becomes
apparent from real life>.

> ...
> The mark of a good framework is when you write a tiny little class (like
> AutoTrainer above for instance) that contains hardly any code but adds a
> major new feature (in this case, automatic training when moving messages
> around in Outlook).

The client-specific code to hook and track msg movement in Outlook is
relatively massive, so everything else appears a drop in the bucket to Mark.
Nevertheless, if a usable framework for capturing the *common* part of this
stuff were available, removing the 5 lines of code quoted above would help
(the Outlook client, and all others).


From B-Morgan@concentric.net  Fri Nov  8 06:25:30 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Thu, 7 Nov 2002 23:25:30 -0700
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <NABBJOOEOFODEALNMJAJMEOGHBAA.B-Morgan@concentric.net>

As I see it, having pop3proxy keep copies of the messages and using an HTML
UI for training has the least amount of dependency on the email client's
forwarding capabilities (or lack thereof).

I have a severe aversion to opening spam that will probably carry over to
unsure messages, so having a link added to the message body may not do me
much good.

I will, however, go to an HTML UI and examine a message if that UI doesn't
"execute" the HTML.  I don't want to see pretty, raw data is good enough for
me to decide.

I hate to keep mentioning a "rival" project <G>, but popfile's UI seems
pretty close to what I think would work best here.

Regards,

Brad

-----Original Message-----
From: spambayes-bounces@python.org
[mailto:spambayes-bounces@python.org]On Behalf Of Richie Hindle
Sent: Thursday, November 07, 2002 5:17 PM
To: spambayes@python.org
Subject: [Spambayes] SMTP proxy questions


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

--
Richie Hindle
richie@entrian.com




From tim.one@comcast.net  Fri Nov  8 06:46:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:46:14 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAKCIAB.tim.one@comcast.net>

[Moore, Paul]
> ...
> I'm assuming (based on a message I recall seeing recently) that it's
> possible to "correct" training - ie, if I train the classifier that a
> specific message is spam, I can later say "no it isn't, it's ham".

That's right, and at the level of classifier.py it's a two-step process:
unlearn() as spam, then learn() as ham.  It actually doesn't matter which
order those are done in, but I won't admit to that <wink>.
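
The two-step correction can be sketched as follows.  ToyClassifier is a
stand-in that only counts tokens per class, and the learn(tokens, is_spam) /
unlearn(tokens, is_spam) signatures are assumptions modelled on the
description above, not a copy of classifier.py:

```python
from collections import Counter


class ToyClassifier:
    """Stand-in that only keeps per-class token counts."""

    def __init__(self):
        self.counts = {True: Counter(), False: Counter()}

    def learn(self, tokens, is_spam):
        self.counts[is_spam].update(set(tokens))

    def unlearn(self, tokens, is_spam):
        self.counts[is_spam].subtract(set(tokens))


def retrain(classifier, tokens, was_spam, is_spam):
    """Move one message's evidence from its old class to its new class."""
    tokens = list(tokens)  # both passes must see the same tokens
    if was_spam == is_spam:
        return False       # nothing to correct
    classifier.unlearn(tokens, was_spam)
    classifier.learn(tokens, is_spam)
    return True
```

With plain counts like these, the unlearn and learn steps touch different
counters, which is why their order does not matter in this toy version.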

> Assuming that this is so, is it not reasonable to train dynamically
> on an "assume I got it right" basis?

Depending on context, it *may* be.

> In other words, whenever the addin filters a message as ham or spam,
> automatically train on that basis as well. Then, if the user sees a
> mistake, he corrects it, which automatically retrains the classifier
> (manually deleting as spam or moving a message already does this).

Assuming a conscientious user, and a client that knows enough about what the
user is doing, that should work fine.

> This will keep the database right up to date, and all the user has to
> do is correct any bad decisions the classifier makes (which he should
> be doing anyway).
>
> I've ignored database growth issues, but other than that, is there any
> other problem with this approach?

Doubtless hundreds, but why quibble <wink>.  A misclassified msg will have
bad effects at once if the training gets reflected into the probabilities at
once, so it gets less appealing the less zealous the user is about
correcting mistakes right away.  That can be mitigated by doing the day's
training into a distinct dict, or not calling update_probabilities() in a
single dict, until "the end of the day", when the user has (presumably)
corrected all the day's mistakes they're going to correct.  But if the model
updating is going to be delayed anyway, then it makes as much sense to delay
doing any training on "the day's" msgs until the end of the day.
Determining what "the end of the day" means is a puzzle then too.  For
example, maybe I left my email client running and went on a week-long
vacation.  I'm not going to look over 700 presumed spam when I get back,
I'll just delete it.  But if ham was in there, I've now let it train in the
wrong direction, and that will hurt.

In other contexts, the scheme doesn't get off the ground.  For example, for
python.org use, nobody is going to review msgs claimed to be spam.  A system
feeding on its own judgments is going to reinforce its own mistakes too, so
the "conscientious, timely, reviewing human" bit is important.


From tim.one@comcast.net  Fri Nov  8 07:20:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 02:20:18 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMEFCHJAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> The key limitation of this scheme, as Tim also alludes to, is that this
> never correctly classifies ham.  However, I actually see this
> incremental training more as a "get smarter now" than a "just get
> smarter" technique - ie, a user sees a mis-classified Spam, by re-
> training they are increasing the chances that the next similar mail
> will be handled correctly.  Instant feedback, especially while the user
> is getting started.
>
> ie, it is indeed "mistake based training", but that may still prove
> useful in addition to ongoing training.

I sure agree it's *very* useful at the start, and expect it will continue to
be useful over time.

> I can't help thinking that we are somehow underestimating our own
> tool here.

I'm going to try an experiment:  I'm going to wipe my home database and
start over from scratch, training first on one ham and one spam, then only
on mistakes and unsures.  This should be fun <wink>.

> As is common when people first use this tool, spam is generally
> found in the ham set and vice-versa.  Because of this, I know that my
> Inbox is spam free (but less sure about my other "ham" folders).  I'm
> also sure that my Spam folder has no ham.  This should remain true
> while I continue to use the tool.

How do you know your Spam folder has no ham?  I know mine doesn't because I
routinely score it, sort on the score, and stare at "the wrong end".  I find
ham there as often as not, *usually* apparently due to mousing error when
dragging a training ham into the Ham folder and overshooting the mark.

> So surely we can exploit this somehow.  Off the top of my head:
> * Assume we don't trust the last 2 days of mail (as the user may not
> yet have sorted them).  Anything in the "good" and "spam" folders older
> than this can be assumed correctly classified, and able to be trained
> on.

Provided the user has already done a decent amount of training, then as Paul
Moore suggested it could even work to trust ham-vs-spam decisions
immediately, and let user corrections undo those as needed.  A well-trained
system should be pretty robust against a few misclassifications over the
short term.

> * A process could go through all ham and spam trained on, and score each
> message.  Any "suspect" messages are presented in a list (much like the
> Outlook "Find Message" result list).  The user can indicate that the
> message is correct (and the system will remember, never asking about
> this message again) or is indeed incorrectly classified.  If incorrect,
> it will be moved, and incrementally trained as per now.  (I can also
> picture a whitelist kicking in here; if incorrect, offer to add user to
> whitelist.  If user in the whitelist, assume ham thereby meaning mail
> from this person can never again be spam)

Tell us about the mistakes *you* see.  I feel like we're designing a
solution to a hypothetical problem otherwise.  The only "mistake" I
routinely see is that my cigarettes-via-web advertising keeps getting
knocked back into Unsure territory.  That doesn't bother me enough to do
anything about it, but if it bothers you enough <wink> then, yes, a
whitelist would solve that one.

> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder.  The user
should already know what to do about things in Unsure, so it's nothing new
to them.  Moving a msg out of Unsure could be taken as a positive sign that
the user has classified such a msg once and for all (well, until they move
it again, anyway).

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.


From just@letterror.com  Fri Nov  8 07:54:04 2002
From: just@letterror.com (Just van Rossum)
Date: Fri,  8 Nov 2002 08:54:04 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>
Message-ID: <r01050400-1021-4B54CB90F2EF11D68CC8003065D5E7E4@[10.0.0.23]>

Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme.  This assumes (probably wrongly <wink>) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py   Thu Nov  7 23:03:07 2002
> > --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> 
> Sorry, I don't know what this is trying to accomplish.  Like, what is
> self.spamprobcache?  There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run). 

Tim, don't be such a programmer <wink>. But ok, I promise I'll never post
pseudocode as a patch again...

> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases? 

I thought I was doing your performance needs a favor <wink>.

> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",

That's the answer.

> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance <wink>.

Right. If a float takes up 20 bytes in memory (just a guess, no time to look),
then for a database of 100000 words (that's roughly the size of my personal db)
the memory burden is 100000 * (8 + 20), almost three megs.

Just in case the higher memory usage is not an issue, there's a simpler
approach: don't store spamprob in the db, but call bayes.update_probabilities()
on startup. update_probabilities() takes about 2 seconds on my lowly 400MHz PPC
on my db (hm, that's using pickle, so will be a lot more when using a db :-( ).
You can tell I'm thinking mostly about long running processes...

I guess you're right, one size doesn't fit all. One last idea for this morning:
how about splitting the db in a training db (storing hamcount and spamcount) and
a classifying db (storing only spamprob)?
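
A rough sketch of that split, assuming the training db maps each word to a
(hamcount, spamcount) pair and the classifying db maps each word to a single
float.  The function names are hypothetical, and naive_spamprob is a plain
ratio estimate used only as a placeholder for whatever the real classifier
computes:

```python
def naive_spamprob(hamcount, spamcount, nham, nspam):
    """Placeholder estimate: share of the spam ratio in the combined ratios."""
    hamratio = hamcount / nham if nham else 0.0
    spamratio = spamcount / nspam if nspam else 0.0
    total = hamratio + spamratio
    return spamratio / total if total else 0.5


def build_classifying_db(training_db, nham, nspam):
    """Reduce {word: (hamcount, spamcount)} to {word: spamprob}."""
    return {
        word: naive_spamprob(hamcount, spamcount, nham, nspam)
        for word, (hamcount, spamcount) in training_db.items()
    }
```

Only the training process needs the counts; a scorer could load the smaller
classifying db and never see them.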

> Life would be easier if databaseheads trained all they liked as often as
> they liked, but refrained from calling update_probabilities() until the end
> of the day (or other "quiet time").  The idea that the model should be
> updated after every msg trained on is an extreme.

Good points.

Just

From richie@entrian.com  Fri Nov  8 08:06:33 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 08:06:33 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
References: <B9EFE999.5C289%francois.granger@free.fr>
	<r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <rirmsuse1blns4r3h9apiibvcluabnd9g7@4ax.com>


[Just]
> the web interface of pop3proxy.py is pretty good and useful, the only
> downside is that it saves the database after each training

That's now fixed (at least partly) along with some other bits:

 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome command.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Fri Nov  8 09:15:24 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 04:15:24 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBFCIAB.tim.one@comcast.net>

[Tim]
> ...
> I'm going to try an experiment:  I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures.  This should be fun <wink>.

It is!  The msg from me I'm replying to here scored 94 (solid spam).  I've
now got 5 ham and 5 spam in my training set, most of the new ones from
Unsures.  The latest spam was a blatant false negative, from Hapax City:

'*H*'                          0.998601
'*S*'                          8.60833e-005
'can'                          0.0652174
'have'                         0.0652174
"don't"                        0.0918367
'never'                        0.0918367
'number'                       0.0918367
'one'                          0.0918367
'what'                         0.0918367
'"the'                         0.155172   ham hapaxes from here
'able'                         0.155172
'about'                        0.155172
'against'                      0.155172
'also'                         0.155172
'any'                          0.155172
'anything'                     0.155172
'back'                         0.155172
'because'                      0.155172
'been'                         0.155172
'check'                        0.155172
'even'                         0.155172
'find'                         0.155172
'found'                        0.155172
'heard'                        0.155172
'how'                          0.155172
'into'                         0.155172
"it's"                         0.155172
'more'                         0.155172
'needed'                       0.155172
'other'                        0.155172
'out'                          0.155172
'own'                          0.155172
'people'                       0.155172
'skip:a 10'                    0.155172
'skip:i 10'                    0.155172
'special'                      0.155172
'subject:.'                    0.155172
'subject:: '                   0.155172
'their'                        0.155172
'them.'                        0.155172
'they'                         0.155172
'those'                        0.155172
'time'                         0.155172
'time.'                        0.155172
'unsubscribe'                  0.155172
'until'                        0.155172
'useful'                       0.155172
'using'                        0.155172   to here
'and'                          0.275281
'for'                          0.275281
'subject: '                    0.275281
'you'                          0.275281
'from'                         0.355072
'not'                          0.355072
'off'                          0.355072
'our'                          0.355072
'when'                         0.355072
'new'                          0.644928
'see'                          0.644928
'url:gif'                      0.724719
'url:www'                      0.724719
'call'                         0.844828   spam hapaxes from here
'contact'                      0.844828
'credit'                       0.844828
'email.'                       0.844828
'every'                        0.844828
'further'                      0.844828
'header:Received:2'            0.844828
'made'                         0.844828
'more!'                        0.844828
'most'                         0.844828
'now'                          0.844828
'plus,'                        0.844828
'receive'                      0.844828
'search'                       0.844828
'skip:1 10'                    0.844828
'url:jpg'                      0.844828   to here
'email'                        0.908163

I think I've established that 5+5 isn't enough for great results <snort>.
However, 80% of its decisions have been correct so far!
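
The clue values in the listing can be reproduced with a Robinson-style
adjustment of the raw estimate toward a prior.  The sketch below assumes a
prior strength s = 0.45, a prior probability x = 0.5, and 5 ham plus 5 spam
trained; those settings are assumptions that happen to match the numbers
shown, and adjusted_spamprob is a hypothetical name:

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Pull the raw estimate p toward the prior x, weighted by evidence n."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    p = spamratio / (hamratio + spamratio)  # raw estimate from the counts
    n = hamcount + spamcount                # number of msgs the word was seen in
    return (s * x + n * p) / (s + n)


# A word seen in exactly one ham (a ham hapax) and in one spam (a spam hapax):
ham_hapax = adjusted_spamprob(1, 0, nham=5, nspam=5)
spam_hapax = adjusted_spamprob(0, 1, nham=5, nspam=5)
```

Under these assumptions a ham hapax comes out near 0.155172 and a spam hapax
near 0.844828, so with so little training almost every clue is a hapax and
each one counts as fairly strong evidence.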


From tdickenson@devmail.geminidataloggers.co.uk  Fri Nov  8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed.  A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of
unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two
separate mailboxes. These mailboxes are used for overnight batch training,
then deleted. Messages marked 'Unsure' do not take part in this automatic
training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am
manually inspecting the unsure folder, and manually adding them to the
appropriate training mailboxes. Initially about 3% of mails were 'unsure',
but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the
same as 'ham', and stop all manual training. It will be interesting to see if
the training remains stable.
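
The overnight batch step described above might look roughly like this.  It is
a sketch that assumes the two mailboxes are in mbox format and that training
is done by a caller-supplied learn(message, is_spam) callable; batch_train
and that callable are hypothetical, not part of the setup described:

```python
import mailbox
import os


def batch_train(learn, ham_path, spam_path):
    """Feed every message in the two mbox files to learn(), then delete them."""
    counts = {}
    for path, is_spam in ((ham_path, False), (spam_path, True)):
        trained = 0
        if os.path.exists(path):
            box = mailbox.mbox(path)
            try:
                for message in box:
                    learn(message, is_spam)
                    trained += 1
            finally:
                box.close()
            os.remove(path)  # the mailbox is only a one-night queue
        counts[is_spam] = trained
    return counts
```

Messages marked 'Unsure' never reach either file, so they stay out of the
automatic training unless they are added by hand.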


From bkc@murkworks.com  Fri Nov  8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way.  The only other option I'd thought of was
> to add two hyperlinks to the end of the message, "This is spam" and "This
> is ham" (in ways that would work for both HTML and plain-text messages, in
> both HTML and plain-text email clients).  They'd link to the HTML interface
> and tell it the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more intrusive
> than the .sig that mailman adds.

If you do this, what's to keep spammers from also adding similar looking URLs?

A busy person might not notice any difference, could click through and confirm their 
email address...

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From barry@python.org  Fri Nov  8 15:04:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 10:04:56 -0500
Subject: [Spambayes] SMTP proxy questions
References: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
	<XFMail.021107171529.jbublitz@nwinternet.com>
Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> What about adding a MIME object to the msg with the Spambayes
    JB> info (text/spambayes?) - or will forwarding lose that info
    JB> too? The email module should be able to do this.

Of course that would have to be text/x-spambayes :)

-Barry
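
For illustration, the MIME idea could be sketched with the email package as
below.  This uses the present-day Python 3 layout of that package rather than
what was available at the time, wraps the original message instead of
editing it in place, and the function name and the "cache-id" payload are
hypothetical.  Whether a mail client keeps the extra part when forwarding is
still the open question:

```python
import email
from email.mime.message import MIMEMessage
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def wrap_with_spambayes_info(raw_message, info):
    """Return a multipart message: the original plus a text/x-spambayes part."""
    original = email.message_from_string(raw_message)
    wrapper = MIMEMultipart("mixed")
    # Copy the headers a reader expects to see on the outer message.
    for name in ("From", "To", "Subject", "Date"):
        if original[name] is not None:
            wrapper[name] = original[name]
    wrapper.attach(MIMEMessage(original))           # original, untouched
    wrapper.attach(MIMEText(info, "x-spambayes"))   # experimental subtype
    return wrapper.as_string()
```

Usage would be along the lines of
wrap_with_spambayes_info(raw, "cache-id: 42\n"), where raw is the message
text as received.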


From randy.diffenderfer@eds.com  Fri Nov  8 17:21:25 2002
From: randy.diffenderfer@eds.com (Diffenderfer, Randy)
Date: Fri, 8 Nov 2002 12:21:25 -0500 
Subject: [Spambayes] SMTP proxy questions
Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com>

|>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
|
|    JB> What about adding a MIME object to the msg with the Spambayes
|    JB> info (text/spambayes?) - or will forwarding lose that info
|    JB> too? The email module should be able to do this.
|
|Of course that would have to be text/x-spambayes :)
|
|-Barry

While a fair portion of messages may very well be MIME compliant, this
wouldn't work without some serious munging around for non-MIME messages, as
well as being very problematic for the many deformed MIME (read very NON
compliant :-) ) messages floating around out there!

Just an observation...


From jbublitz@nwinternet.com  Fri Nov  8 17:10:33 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL>
Message-ID: <XFMail.021108091033.jbublitz@nwinternet.com>

On 08-Nov-02 Barry A. Warsaw wrote:
> 
>>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
> 
>     JB> What about adding a MIME object to the msg with the
> Spambayes
>     JB> info (text/spambayes?) - or will forwarding lose that
> info
>     JB> too? The email module should be able to do this.
> 
> Of course that would have to be text/x-spambayes :)

<searching for an excuse for my embarrassing oversight>
Well - there's application/ms-excel or some such. Isn't spambayes
just as good? :)

Point taken.

Jim


From barry@python.org  Fri Nov  8 17:33:53 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 12:33:53 -0500
Subject: [Spambayes] SMTP proxy questions
References: <15819.53912.407893.819241@gargle.gargle.HOWL>
	<XFMail.021108091033.jbublitz@nwinternet.com>
Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> <searching for an excuse for my embarrassing oversight> Well -
    JB> there's application/ms-excel or some such. Isn't spambayes
    JB> just as good? :)

It depends on whether you hold the IETF and IANA in as high regard as
Microsoft does <wink>.

http://www.iana.org/assignments/media-types/

-Barry

From lists@morpheus.demon.co.uk  Fri Nov  8 21:07:45 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:07:45 +0000
Subject: [Spambayes] Outlook plugin - training
References: <n2m-g.u1iue32h.fsf@morpheus.demon.co.uk>
	<BIEJKCLHCIOIHAGOKOLHMEDGDOAA.tim@zope.com>
Message-ID: <n2m-g.wunn3h3y.fsf@morpheus.demon.co.uk>

"Tim Peters" <tim@zope.com> writes:

[About the plugin code...]
> I'm more lost than not in it myself!

That makes me feel better :-)

[About bothering with leaving list traffic out]
> Don't worry about it before you try it.  I suggest trying it because I'm not
> sure it's possible to *stop* the system now from scoring all incoming msgs
> (the "new msg in Inbox" filter appears to trigger for every one, regardless
> of whether the RW decides to move it; after that it may just be a race
> between the RW and the addin deciding where to move each).

OK, I've switched over. I now have one Spam folder, one Potential Spam
folder, and the rest are Ham (actually, some historic archive folders
I've left out, but that's just because I never use them any
more). We'll see how it goes.

>> Of course, I know that the classifier *really* works by magic, and
>> so my intuition is useless :-)
>
> It's more that unless you know exactly how the math works, your intuition is
> simply baseless here, carried over from some other experience.  Do *you*
> have trouble distinguishing personal and work email from spam?  There you
> go, and you can't even compute inverse chi-squared probabilities to 14
> significant digits on demand in your head <wink>.

How do *you* know I can't compute inverse chi-squared probabilities in
my head? Oh, hang on - you wanted me to get the right answer, didn't
you? :-)

> What's to manage?  I get about 600 emails per day, and about 1% end
> up in Unsure (about 6 -- actually less than that, lately; the system
> is learning).

My ratio is still a lot worse than that. But as I say, my training
corpus is still quite small. But you're right - managing a few mails
isn't hard. It's just that the overall results are *so* much better
than the old home-grown solution I used that I became instantly spoiled
:-)

Seriously, I've said this before, but what you guys have developed
here is *phenomenally* good. I've reached the point where I look
forward to getting spam, just because I enjoy so much seeing it
automatically appear in the spam folder :-)

>> My instinctive reaction is that I want "Spam" and "Not Spam" buttons,
>> and then I read or delete the message in situ.
>
> MarkH has since implemented this in the Unsure folder.

Time for a CVS update, I guess...

> I still think you're making life too complicated.  Is list traffic
> spam?  If so, call it spam.  If not, call it ham.

Sounds sensible. I think that all the troubles I've had in the past
trying to manage spam have left me with an instinctive feeling that
the problem is complicated. This leads to looking for complicated
solutions.

But you're right. The spam/ham distinction itself is a simple yes/no,
so the setup should be, too.

But permit me to drag my feet a little, as I throw away all my
cherished preconceptions :-)

More seriously, I'm putting this point into my spambayes notes
folder. I suspect it's something a lot of new users will have to get
used to.

Thanks for the comments,
Paul.

-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Fri Nov  8 21:12:17 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:12:17 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
Message-ID: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>

I've noticed a couple of strange effects with the Outlook plugin used
against an Exchange server. The main one is that when I start up the
client in the morning, there are a lot of overnight messages in my
inbox. They don't seem to get filtered. I suspect this is to do with
Outlook not firing the "new mail" event on stuff that's in the
Exchange store when the client starts up. But I'll need to test this.

Unfortunately, the Exchange server is at work, and I can only do any
serious hacking on this at home, so I'm running a batch cycle (code at
home, take into work, try out, take bugs home, and repeat). So it'll
take me a while to make any progress.

I'll report back when I get more details.

Paul (Off to look at Outlook events in MSDN, and to write a simple
"log the events and see what is going on" plugin to test with)

-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Fri Nov  8 21:52:20 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 9 Nov 2002 08:52:20 +1100
Subject: [Spambayes] Outlook plugin plus Exchange
In-Reply-To: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>

> I've noticed a couple of strange effects with the Outlook plugin used
> against an Exchange server. The main one is that when I start up the
> client in the morning, there are a lot of overnight messages in my
> inbox. They don't seem to get filtered. I suspect this is to do with
> Outlook not firing the "new mail" event on stuff that's in the
> Exchange store when the client starts up. But I'll need to test this.

I am working on code that optionally processes "missed" messages at startup.
It looks like I can list all unread, unscored mail in my 1000+ item inbox
very quickly, so this should be feasible.

> Paul (Off to look at Outlook events in MSDN, and to write a simple
> "log the events and see what is going on" plugin to test with)

Check out the Outlook plugin in the win32com\demos directory - probably a
good place to start.

Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a
decent job at killing them.

Mark.


From francois.granger@free.fr  Fri Nov  8 23:25:51 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Sat, 9 Nov 2002 00:25:51 +0100
Subject: [Spambayes] pop3proxy
Message-ID: <a05100312b9f1f8357982@[192.168.1.11]>

Thanks to Richie Hindle, it now works on MacOS 9.

Excellent job !
-- 
Le courrier électronique est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies.
Pour des courriers propres : http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim.one@comcast.net  Fri Nov  8 23:33:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 18:33:50 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEBFCIAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>

[Tim]
> ...
> I'm going to try an experiment:  I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures.  This should be fun <wink>.
> ...

After enduring the first round of gross mistakes, when I got up today I did
this:

while some ham in my inbox scores above 0.20 (my ham_cutoff):
    pick the highest-scoring ham in the inbox
    add it to the ham training set
    train on it
    rescore the inbox

These are false positives and unsures the classifier would have had if these
msgs had come in after I started the experiment.  There were about 700 msgs
in the inbox.
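
The loop above can be sketched as a small function.  This is only an
illustration under stated assumptions: train_worst_first, score and
train are hypothetical names, not part of the spambayes code, and the
inbox is assumed to hold messages already known to be ham.

```python
def train_worst_first(inbox, score, train, ham_cutoff=0.20, max_rounds=1000):
    """Train on the highest-scoring ham until none scores above ham_cutoff.

    inbox -- list of messages known to be ham
    score -- callable: message -> spam score in [0.0, 1.0]
    train -- callable: message -> None; trains the classifier on one ham
    Returns the messages trained on, in training order.
    """
    trained = []
    if not inbox:
        return trained
    for _ in range(max_rounds):
        # Rescore the whole inbox after every single training step.
        scores = [(score(msg), index) for index, msg in enumerate(inbox)]
        worst_score, worst_index = max(scores)
        if worst_score <= ham_cutoff:
            break  # no ham scores above the cutoff any more
        worst = inbox[worst_index]
        train(worst)
        trained.append(worst)
    return trained
```

Training on one message at a time and rescoring keeps the training set
small, because one trained message often pulls many similar messages
below the cutoff.  max_rounds is only a safety net for a classifier that
never converges.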

Other than that, I've left it mistake-driven and unsure-driven on live
incoming email.  Spam that's correctly classified simply gets deleted (no
training on it), ditto ham.

It's been a light spam day, but hundreds of msgs have come in since then and
I haven't seen a mistake or unsure in about 5 hours, although plenty of ham
gets near ham_cutoff and plenty of spam near spam_cutoff.  Total training
data now:  just 45 ham and 20 spam.

Scores remain grossly hapax-driven, but that's actually enough to classify
most of my email correctly:  a small number of subjects and senders and
mailing lists overwhelmingly dominate my ham mix, and one email account
accounts for the vast bulk of my spam.  Removing the hapaxes from the
database dropped the # of words from 5500 to about 1700.  Rescoring the
inbox with this reduced database then pushed about 5% of the msgs back into
Unsure.

So (no surprise here) hapaxes are vital with little training data.  That
also means that as soon as one of those words shows up in the other kind of
email, it changes from a strong clue to neutral, *provided that* I actually
train on the new email.  I'm not training now unless there's a
mistake/unsure, so the hapaxes remain strong clues (even when they point in
the wrong direction).  BTW, when there are mistakes/unsures, I'm not
training on all of them:  as I did when I got up, I train the worst example
then rescore, one at a time, until no mistakes/unsures remain.

I'm never going to get sub-0.1% error rates this way, but if this is the
best it ever got, I'd be quite happy with it for my personal email.
Something to ponder?  If so, you can get away with a very small database,
and while hapaxes must not be removed blindly in this extreme scheme, using
the atime field could (I suspect) be very effective in slashing the
already-small database size (lots of hapaxes will never be seen again even
if you train on everything; the WordInfo atime field tells you when a word
was last used at all).
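
A minimal sketch of atime-based pruning, assuming a record with
hamcount, spamcount and atime attributes.  The WordInfo class here is a
stand-in written for the example, and the idle threshold is a guess that
would need testing; nothing below is the project's actual code.

```python
import time
from dataclasses import dataclass


@dataclass
class WordInfo:
    # Hypothetical stand-in for the classifier's per-word record.
    hamcount: int
    spamcount: int
    atime: float  # last time the word was used at all


def prune_stale_hapaxes(wordinfo, max_idle_seconds, now=None):
    """Delete hapaxes not used within max_idle_seconds; return how many."""
    if now is None:
        now = time.time()
    stale = [word for word, rec in wordinfo.items()
             if rec.hamcount + rec.spamcount == 1      # seen in one msg only
             and now - rec.atime > max_idle_seconds]   # and not used lately
    for word in stale:
        del wordinfo[word]
    return len(stale)
```

A hapax that keeps getting used in scoring has a fresh atime and
survives, so the strong clues a small database depends on are kept,
while words that were seen once and never again age out.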


From rob@hooft.net  Fri Nov  8 23:49:59 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 00:49:59 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>
Message-ID: <3DCC4DA7.80401@hooft.net>

Tim Peters wrote:
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?  If so, you can get away with a very small database,
> and while hapaxes must not be removed blindly in this extreme scheme, using
> the atime field could (I suspect) be very effective in slashing the
> already-small database size (lots of hapaxes will never be seen again even
> if you train on everything; the WordInfo atime field tells you when a word
> was last used at all).

Tim,

This seems to imply that you're still playing with the idea that hapaxes 
could be "slashed" from the database when using the "old" train-on-all
procedure. I don't see how that can ever work, as all words pass through 
the hapax stage at some point. Or do you mean to slash "old" hapaxes 
only? And what is "old"?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim@fourstonesExpressions.com  Sat Nov  9 00:55:07 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 08 Nov 2002 18:55:07 -0600
Subject: [Spambayes] Persisting a pickled bayes database
Message-ID: <MJDARN31HEKEJDK21NLB972UTMGVTFE.3dcc5ceb@riven>

I can see the nice createbayes function in hammie, but I don't see any 
persistence function anywhere.  I do see several places where code to write a 
pickled bayes database is hard coded, and I understand the PersistentBayes 
thing.  I might be missing something...

I've been using a simple class to handle creating and persisting my bayes 
databases.  I *think* this stuff should probably go somewhere, but beats me 
where... classifier?  doesn't make much sense there... hammie?  Any ideas 
anybody?

Here's the class...  (kinda a dumb name  ;))

import pickle

import hammie


class BayesHelper:
    '''helper class for bayes databases'''

    def __init__(self, db_name, useDB):
        '''create (or open) the bayes database named db_name'''
        self.db_name = db_name
        self.useDB = useDB
        self.bayes = hammie.createbayes(db_name, useDB)

    # no __del__() method, because we don't *necessarily* want to persist

    def persist(self):
        '''store the bayes database'''
        # Only the pickle case needs an explicit write; with useDB the
        # database is assumed to persist itself.
        if not self.useDB:
            fp = open(self.db_name, 'wb')
            try:
                pickle.dump(self.bayes, fp, 1)
            finally:
                fp.close()


- TimS


From popiel@wolfskeep.com  Fri Nov  8 00:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 16:06:27 -0800
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: Message from "Tim Peters" <tim@zope.com> 
	<BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
References: <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com>

In message:  <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>
             "Tim Peters" <tim@zope.com> writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
>
>I still need to try to make sense of those tests.  A real complication is
>that more than one thing changes when trying to test ratios:  it's not just
>the ratio that changes, it's the absolute number of each trained on too.

True.

>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
>and 10000 spam.  The ratios are identical.  Do we expect the error rates to
>be identical too?  I don't, but haven't tried it.

I have tried this, and the effects of ratio were diminished
as the training set size increased.  For details, see
http://www.wolfskeep.com/~popiel/spambayes/ratio2 .  The
tests were done with gary-combining, not chi-square, so I
really ought to rerun them.

>I expect the latter would do better than the former, despite the identical
>ratios, simply because more msgs allow better spamprob estimates.

It depended on what the ratio in question was... for 1:4
ham:spam, increased training set size hurt instead of helped,
in the ranges that I was able to test.  For 1:1, increased
training helped instead of hurt.

>Something missing in "the ratio tests" is a rationale (even an
>after-the-fact one) for believing there's some aspect of the system that's
>sensitive to the ratio.  The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham::spam training ratio
>has no direct effect on the spamprobs calculated.

Eh, I have a perfectly good rationale for believing that
something is sensitive to the ratio: the tests I've run
show such a sensitivity.  What's missing is a theory on
_why_ there's a sensitivity. ;-)

I don't think the following theory is perfectly phrased, but
it seems plausible to me:

Perhaps the number of topics discussed in ham is greater
than that in spam.  Thus, the average percentage of ham
messages containing a particular significant ham word is
systematically lower than the average probability of a
particular significant spam word appearing in spam messages.
As the training set size increases, the percentage difference
becomes more consistent and pronounced.  Since we're then
combining the percentages, we systematically skew slightly
due to the differing averages.

Changing the ratio of ham to spam has the effect of changing
the number of topics discussed, particularly when the training
set size is small and random chance can exclude all instances
of a given topic.  Balancing the number of topics removes the
skew in the probabilities.  As training set size increases,
adjusting the ratio has less effect, because it has less
likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low.  The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.

I suspect that by the time the corpora got that large, adjusting
the training ratio wouldn't make a lick of difference if the
corpora were sampled randomly to achieve the given ratio.  There
would just be too little chance of excluding a topic from the
samples.  Systematically excluding a topic might produce equivalent
results to my ratio tests.

- Alex

From richie@entrian.com  Fri Nov  8 00:17:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 00:17:25 +0000
Subject: [Spambayes] SMTP proxy questions
Message-ID: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

-- 
Richie Hindle
richie@entrian.com


From anthony@interlink.com.au  Fri Nov  8 00:30:09 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 08 Nov 2002 11:30:09 +1100
Subject: [Spambayes] SMTP proxy questions 
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com> 
Message-ID: <200211080030.gA80UAf11390@localhost.localdomain>


> I've discussed this with Tim S, and he's going off the SMTP proxy idea
> while I'm still broadly in favour of it.  What do people think - do
> non-Outlook users want to forward messages to 'spam' and 'ham' to train the
> system, or use an HTML UI?

I'd have to say I don't like the idea. There are too many potential places
where it can all go horribly horribly pear-shaped, and too many rat-holes
that the various email clients can screw up with.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From jbublitz@nwinternet.com  Fri Nov  8 01:15:29 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Thu, 07 Nov 2002 17:15:29 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <XFMail.021107171529.jbublitz@nwinternet.com>

On 08-Nov-02 Richie Hindle wrote:
> Or perhaps there's another way.  The only other option I'd
> thought of was to add two hyperlinks to the end of the message,
> "This is spam" and "This is ham" (in ways that would work for
> both HTML and plain-text messages, in both HTML and plain-text
> email clients).  They'd link to the HTML interface and tell it
> the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more
> intrusive than the .sig that mailman adds.

What about adding a MIME object to the msg with the Spambayes info
(text/spambayes?) - or will forwarding lose that info too? The
email module should be able to do this.
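
A rough sketch of that idea with the standard email package.  The
text/x-spambayes subtype, the cache-id line and both function names are
invented for the example, and it assumes a single-part plain-text
original.  Whether a forwarding client keeps the extra part is still the
open question.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

PREFIX = "cache-id: "  # hypothetical line format inside the extra part


def wrap_with_spambayes_part(raw_message, cache_id):
    """Return the message re-wrapped as multipart with an info part added."""
    original = message_from_string(raw_message)
    wrapper = MIMEMultipart("mixed")
    # Copy only the headers a client displays; others are dropped here.
    for name in ("From", "To", "Subject", "Date", "Message-ID"):
        if original[name] is not None:
            wrapper[name] = original[name]
    wrapper.attach(MIMEText(original.get_payload(), "plain"))
    wrapper.attach(MIMEText(PREFIX + str(cache_id) + "\n", "x-spambayes"))
    return wrapper.as_string()


def find_cache_id(raw_message):
    """Return the cache id from an info part, or None if there is none."""
    msg = message_from_string(raw_message)
    for part in msg.walk():
        if part.get_content_type() == "text/x-spambayes":
            for line in part.get_payload().splitlines():
                if line.startswith(PREFIX):
                    return line[len(PREFIX):].strip()
    return None
```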

Jim


From tim.one@comcast.net  Fri Nov  8 04:07:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:07:18 -0500
Subject: [Spambayes] Proposing to drop retain_pure_html_tags
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEPMCHAB.tim.one@comcast.net>

FYI, that option is gone now.

From tim.one@comcast.net  Fri Nov  8 04:29:17 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:29:17 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEFOCGAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPOCHAB.tim.one@comcast.net>

The original names made more sense when we had half a dozen competing
schemes.

Current                         Proposed
-------                         --------
robinson_probability_x          unknown_word_prob
robinson_probability_s          unknown_word_strength
robinson_minimum_prob_strength  minimum_prob_strength


Note:  unknown_word_prob is what the Bayesian prob adjustment moves toward,
more strongly the less evidence backs up a counting spamprob estimate (the
fewer the msgs a word has been seen in, the more the adjustment pushes the
spamprob toward unknown_word_prob; for a word that's never been seen before,
this reduces to unknown_word_prob exactly).
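
The adjustment described in the note can be written out as a short
function.  This is a sketch of Robinson's formula using the proposed
names; adjusted_spamprob is a made-up name, and the 0.45 default for
unknown_word_strength is an assumption, not something stated in this
message.

```python
def adjusted_spamprob(hamcount, spamcount, nham, nspam,
                      unknown_word_prob=0.5, unknown_word_strength=0.45):
    """Blend the by-counting spamprob with the prior for unknown words."""
    n = hamcount + spamcount  # number of msgs the word has been seen in
    if n == 0:
        return unknown_word_prob  # never seen: exactly the prior
    # Work with percentages, so the ham:spam training ratio cancels out.
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    prob = spamratio / (hamratio + spamratio)
    s, x = unknown_word_strength, unknown_word_prob
    # The smaller n is, the harder the result is pulled toward x.
    return (s * x + n * prob) / (s + n)
```

With these defaults a word seen in one spam and no ham gets about 0.845
rather than 1.0, and the value only approaches 1.0 as the word shows up
in more spam.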

We've always set it to 0.5 by default, and previous tests never showed
benefit from changing that.

We've gotten better since then, though, and it's possible to deduce "a more
correct" value.  For example, take the mean of all the by-counting spamprobs
in your database, across words that have appeared in at least 10 msgs (so
that there's reason to have *some* confidence in the by-counting guess).
That's then an estimate of the spamprob a new word will eventually get over
time.

Across 3 databases I tried this on, it turned out to be a little over 0.5,
from 0.513 (my home personal classifier) to 0.540 (fat c.l.py test).

If someone has time for a controlled experiment, run the attached code to
find this guess for one of your databases; then if it differs from 0.5, try
a before-and-after test just changing that much.  If there's any promise
here, update_probabilities() could easily be changed to compute and use this
automatically.

"""
import cPickle as pickle
f = file('fat.pik', 'rb')  # your database pickle goes here

c = pickle.load(f)
f.close()
w = c.wordinfo

def guessx():
    # Mean by-counting spamprob over words seen in at least 10 msgs.
    nham = float(c.nham or 1.0)
    nspam = float(c.nspam or 1.0)
    n = 0
    probsum = 0.0
    for rec in w.itervalues():
        if rec.hamcount + rec.spamcount >= 10:
            hamratio = rec.hamcount / nham
            spamratio = rec.spamcount / nspam
            prob = spamratio / (spamratio + hamratio)
            probsum += prob
            n += 1
    if n:
        print n, probsum / n
    else:
        print "no word has appeared in at least 10 msgs"

guessx()
"""


From mhammond@skippinet.com.au  Fri Nov  8 04:48:54 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 8 Nov 2002 15:48:54 +1100
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEKIHJAA.mhammond@skippinet.com.au>

> Laughing and pointing should be directed towards me rather than Tim.

None of that, but some thoughts <wink>.

I think that the classes I posted a while ago suffer from the exact reverse
problem as your idea.  My idea was to make a "message store" that is largely
independent of training.  I believe the problem with your design is that it
deals with the training at the expense of the message store.

Obviously, but worth mentioning, is that there are competing interests here.
My focus is towards clients, and specifically the outlook one (if there were
more clients I would be happy to think of them too <wink>).  A lot of the
focus of this group is towards admins rather than individuals (which is just
fine!)  But it seems the current thinking is of a corpus as being a fairly
static, well-controlled set of messages used almost purely for training
purposes.

For client programs, this may not be practical.  The corpus is a more
dynamic set of messages - and worse, actually *is* the user's set of
messages rather than a collection of message copies.

For example, "moving" a message in a corpus may actually mean moving the
message in the user's real inbox.  This may or may not be what is intended -
a corpus "move" operation is more about changing a message's classification
than it is about physically moving pieces of mail around.

> A Corpus wouldn't know how to create Message objects, nor would a Message
> object know how to create itself - classes *derived from* them would know
> how to do that.  For instance (totally untested code, probably full of
> typos) -
>
> class Message:

Jeremy and I both posted real code, so starting with something that takes
that into consideration would be good.

> I may be putting too much
> into the base class by demanding that the text of the message be given to
> the constructor - that precludes making FileMessage lazy, and
> only read the
> file when it needs to.]

It also defeats the abstract nature of the class.

> 'Corpus' works the same way; again, the details may be naive, but this is
> the general idea:

I'm hoping I don't sound grumpy, but again, the few systems that already
exist for this engine are the best ones to use to discover the naivety early
<wink>

> You can then envisage a MailboxCorpus, and OutlookFolderCorpus, an
> IMAPFolderCorpus, a POP3AccountCorpus, a PigeonMessagingCorpus and so on.

I can't quite imagine that at the moment, as per my comments at the top.

Off the top of my head, I believe we need:
* An abstract "message id"
* A message classification database, as discussed before - basically just a
dictionary, keyed by ID, holding either "spam" or "ham".
* A "corpus" becomes just an enumerator of message IDs for bulk/batch
training.  It has no move etc operations.
* A "message store" is capable of returning a message object given its ID.
* The training API simply takes message objects and updates the probability
and message databases.
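
The pieces listed above might look like this in miniature.  It is a
sketch only: DictMessageStore, ListCorpus and train_corpus are
hypothetical names, the message ID is taken to be a plain string, and
learn stands in for whatever the classifier's training call turns out
to be.

```python
class DictMessageStore:
    """A message store: returns a message object given its ID."""

    def __init__(self, messages):
        self._messages = dict(messages)

    def get_message(self, msg_id):
        return self._messages[msg_id]


class ListCorpus:
    """A corpus: just an enumerator of message IDs, with no move etc."""

    def __init__(self, msg_ids):
        self._msg_ids = list(msg_ids)

    def __iter__(self):
        return iter(self._msg_ids)


def train_corpus(corpus, store, classifications, learn, label):
    """Train on every ID in corpus; return how many messages were trained.

    classifications -- dict keyed by ID, holding "spam" or "ham"
    learn           -- callable: (message, is_spam) -> None
    """
    count = 0
    for msg_id in corpus:
        if classifications.get(msg_id) == label:
            continue  # already trained this way; don't count it twice
        learn(store.get_message(msg_id), label == "spam")
        classifications[msg_id] = label
        count += 1
    return count
```

Nothing here knows about folders or moving mail, so a client such as
the Outlook plugin could back the store with the user's real messages
instead of copies.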

At that level, we really don't need much else - no folders or any other
grouping of messages.  I'm really not too sure there is much value in adding
higher-level concepts such as folders or message store "move" operations -
certainly not at the outset, where there are too many competing
requirements.

> Yes - this could work using observer objects registered with Corpus
> objects:

This could work, but may be too simple to be necessary.  If the process of
re-training a message in the Outlook GUI becomes:

def RetrainMessageAsSpam():
    # Outlook specific code to get an ID.
    message = message_store.GetMessage(id)
    if not classifier.IsSpam(message):
        classifier.train(message, is_spam=True)

And not a whole lot else, it doesn't seem worth it.  Unfortunately, the
decision to perform the retrain is the complex, but client specific part.
Is this a newly delivered message?  Did the user manually move the message
somewhere?  Did the user click one of our buttons?  Is the user deleting old
ham that we want to train on before it dies forever?

Outlook does this via examining what Outlook event we are seeing, and
looking at meta-data we possibly previously attached to the message.  I'm
not sure this can be encapsulated well at the moment without adding all our
meta-data etc baggage to the base classes.

> Most of the *new* code that's needed is defining the abstract concepts and
> their interfaces, rather than writing code that actually *does* anything -
> it's building a framework.

*cough* ummm...  This is doomed to failure.  Code *must* do something to be
taken seriously.  At the very least, I would expect to see the existing test
driver framework running against these "abstract concepts" <wink>

> Once the framework is there, most of the code needed to implement the
> functionality should already be in the project - code to hook
> into Outlook,
> to train on a message, to parse mbox files, and so on.  It just needs
> hooking into the framework.

See above <wink>.

Mark.


From tim.one@comcast.net  Fri Nov  8 04:50:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 07 Nov 2002 23:50:42 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAACIAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> The most difficult problem for retraining-by-forwarding is matching the
> forwarded message to one from the cache, after Outlook Express
> has stripped the headers, top-quoted the users .sig, converted it
> to HTML and added fifteen macro viruses.  Any ideas?

If the user can be convinced to forward as an *attachment*, those problems go
away, at least in OE.  You can create a new msg there, select any number of
msgs, drag them to the msg as a group, and OE will create an attachment for
each one.  Unlike Outlook, OE appears to save the original stuff that came
in over the wire (we're finding it's a real hoot in the OL client to try to
guess what the original MIME structure may have been).

> Can the tokeniser help?

If you put in a token unique to each msg, sure <wink>.  Perhaps the "loose
checksum" program Skip checked in could be useful for this.
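
One way to picture a "loose" checksum is sketched below.  This is not
Skip's program, only a guess at the general technique: reduce the text
to its lowercase words, so quote markers, rewrapping and simple HTML
markup don't change the result, then hash what is left.  The name
loose_checksum is made up for the example.

```python
import hashlib
import re


def loose_checksum(text):
    """Hash only the words of text, ignoring markup, case and layout."""
    text = re.sub(r"<[^>]*>", " ", text)            # drop HTML-style tags
    words = re.findall(r"[a-z0-9]+", text.lower())  # keep bare words only
    return hashlib.sha1(" ".join(words).encode("utf-8")).hexdigest()
```

A quoted or HTML-converted copy of a message then hashes the same as
the cached original.  Text that the client adds, such as a top-quoted
.sig, still changes the checksum, so a real matcher would need something
more forgiving than a single hash.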


From tim.one@comcast.net  Fri Nov  8 05:06:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:06:43 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <5tjlsu8ak2a734sjb4hosp28qrvp6fdm13@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEACCIAB.tim.one@comcast.net>

[Richie Hindle]
> A quick note in case someone decides to remove the counts from the
> database:

Neil Schemenauer already does, in his CDB code (neil*.py).  It's a lean
scoring-only database, mapping tokens to *just* spamprobs.  If he went on to
store them as scaled ints, he could almost certainly reduce this to 2 bytes
of prob info per token, and possibly even just 1.
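
To make the scaled-int idea concrete, here is a small sketch that stores
a spamprob in 2 bytes or 1.  pack_prob and unpack_prob are hypothetical
helpers and have nothing to do with Neil's actual CDB code.

```python
import struct

_FORMATS = {1: ">B", 2: ">H"}  # unsigned 8-bit and 16-bit integers


def pack_prob(prob, nbytes=2):
    """Scale a probability in [0.0, 1.0] to an unsigned int of nbytes."""
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must be in [0.0, 1.0]")
    top = (1 << (8 * nbytes)) - 1  # 255 or 65535
    return struct.pack(_FORMATS[nbytes], int(round(prob * top)))


def unpack_prob(data):
    """Recover the (quantized) probability from 1 or 2 packed bytes."""
    top = (1 << (8 * len(data))) - 1
    return struct.unpack(_FORMATS[len(data)], data)[0] / float(top)
```

Two bytes keep the rounding error under 1e-5.  One byte allows an error
of up to about 0.002, which may or may not matter for words whose
spamprob sits very close to 0 or 1.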

> the HTML front end has a "Word query" feature which will tell you the
> information in the database for a given word - it's interesting to see
> how many more times the word 'Viagra' appears in ham than in spam.  I
> mean the other way round.

What a geek <wink>.


From tim.one@comcast.net  Fri Nov  8 05:48:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 00:48:25 -0500
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-6EAB9700F2A611D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>

[Just van Rossum]
> I think it can be done with almost no extra overhead with a
> caching scheme.  This assumes (probably wrongly <wink>) that
> the cache stays in memory between runs.
> Something like this perhaps:
>
> *** classifier.py   Thu Nov  7 23:03:07 2002
> --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> ***************
> *** 456,459 ****
> --- 456,460 ----
>
>           wordinfoget = self.wordinfo.get
> +         spamprobget = self.spamprobcache.get
>           now = time.time()
>           for word in Set(wordstream):
> ***************
> *** 463,467 ****
>               else:
>                   record.atime = now
> !                 prob = record.spamprob
>               distance = abs(prob - 0.5)
>               if distance >= mindist:
> --- 464,470 ----
>               else:
>                   record.atime = now
> !                 prob = spamprobget(word)
> !                 if prob is None:
> !                     prob = self.calcspamprob(word, record)
>               distance = abs(prob - 0.5)
>               if distance >= mindist:

Sorry, I don't know what this is trying to accomplish.  Like, what is
self.spamprobcache?  There's no such thing now, and the patch doesn't appear
to create one (i.e., this code doesn't run).  Whatever it's supposed to be,
why isn't spamprobcache.get *itself* responsible for returning a spamprob,
instead of making its caller deal with two cases?  If the answer is "it's
supposed to be a dict, so .get ain't that smart", then the memory burden for
a long-running scorer process will zoom, negating one of the benefits people
attached to "real databases" thought they were buying in return for giant
files and slothful performance <wink>.
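
For illustration, a cache along the lines asked for here could compute
the spamprob itself on a miss and cap its own size, so the caller has
one case to handle and memory stays bounded.  SpamProbCache and its
compute argument are invented for this sketch; they are not part of the
patch or of classifier.py.

```python
from collections import OrderedDict


class SpamProbCache:
    """Bounded cache whose get() always returns a spamprob."""

    def __init__(self, compute, maxsize=10000):
        self._compute = compute    # callable: (word, record) -> spamprob
        self._maxsize = maxsize
        self._cache = OrderedDict()

    def get(self, word, record):
        try:
            prob = self._cache[word]
            self._cache.move_to_end(word)        # mark as recently used
        except KeyError:
            prob = self._cache[word] = self._compute(word, record)
            if len(self._cache) > self._maxsize:
                self._cache.popitem(last=False)  # evict least recently used
        return prob

    def invalidate(self, word):
        """Forget a word after training changes its counts."""
        self._cache.pop(word, None)
```

Training would have to call invalidate() for every word it touches,
which is the bookkeeping cost of not recomputing probabilities in one
batch.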

Life would be easier if databaseheads trained all they liked as often as
they liked, but refrained from calling update_probabilities() until the end
of the day (or other "quiet time").  The idea that the model should be
updated after every msg trained on is an extreme.


From tim.one@comcast.net  Fri Nov  8 06:23:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:23:13 -0500
Subject: [Spambayes] Corpus module (was: Upgrade problem)
In-Reply-To: <fljlsusfv2tcnmiv8a0jurqnc9fn8mn7q7@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAICIAB.tim.one@comcast.net>

[Richie Hindle, cogitates about Messages and their Corpus(ora)]

That's the ticket!  Backing off to a more fundamental level looks useful to
me too.  We never even straightened that much out for testing purposes
(msgs.py isn't general enough; for some custom test drivers (never checked
in), I couldn't even reuse the MsgStream class for my *own* directory
structures).

I disagree with Mark's

> If the process of re-training a message in the Outlook GUI becomes:
>
> def RetrainMessageAsSpam():
>     # Outlook specific code to get an ID.
>     message = message_store.GetMessage(id)
>     if not classifier.IsSpam(message):
>         classifier.train(message, is_spam=True)
>
> And not a whole lot else, it doesn't seem worth it.

because it illustrates the point <wink>:  it doesn't look like a correct
re-training method (although it may be, depending on assumptions about where
"id" comes from, and what assorted classifier methods do), and while a
correct method shouldn't be hard, in the absence of a class dedicated to
doing the simple common things that *can* be done in a common way, everyone
will keep screwing it up in their own client code.
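
One reading of what a shared re-training helper might do is sketched
below: remember how each message was last trained, and undo that
training before training the other way.  This is an assumption about
what "correct" means here, not a description of existing code.
Retrainer, learn and unlearn are hypothetical names.

```python
class Retrainer:
    """Train each message at most once per classification."""

    def __init__(self, learn, unlearn):
        self._learn = learn        # callable: (tokens, is_spam) -> None
        self._unlearn = unlearn    # callable: (tokens, is_spam) -> None
        self._classifications = {}  # message id -> is_spam it was trained as

    def retrain(self, msg_id, tokens, is_spam):
        """Return True if the databases changed, False if nothing to do."""
        previous = self._classifications.get(msg_id)
        if previous == is_spam:
            return False  # already trained this way; don't count it twice
        tokens = list(tokens)  # may be a generator; it is needed twice
        if previous is not None:
            self._unlearn(tokens, previous)  # undo the old training first
        self._learn(tokens, is_spam)
        self._classifications[msg_id] = is_spam
        return True
```

The decision is keyed on how the message was trained before, not on how
the classifier scores it now, so a message the classifier already gets
right can still be added to the training data.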

> ...
> You might want to run it past Tim Peters, 'cos he's *far* better at this
> kind of thing than I am (though he's also busy).

I have to do more Python and Zope work now, so have to guard my time on
*this* project more jealously than I have.  MarkH and SeanT and JeremyH all
have ideas here too, and I trust you'll sort them out as a harmonious family
bent on world domination.  As a general strategy, the first person to check
code in usually wins <wink -- but take a clue from Mark, and from the
earlier days of this project, and from the pop3 proxy, and sling code more
than talk about it -- refactoring in Python is easy when the need becomes
apparent from real life>.

> ...
> The mark of a good framework is when you write a tiny little class (like
> AutoTrainer above for instance) that contains hardly any code but adds a
> major new feature (in this case, automatic training when moving messages
> around in Outlook).

The client-specific code to hook and track msg movement in Outlook is
relatively massive, so everything else appears a drop in the bucket to Mark.
Nevertheless, if a usable framework for capturing the *common* part of this
stuff were available, removing the 5 lines of code quoted above would help
(the Outlook client, and all others).


From B-Morgan@concentric.net  Fri Nov  8 06:25:30 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Thu, 7 Nov 2002 23:25:30 -0700
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <NABBJOOEOFODEALNMJAJMEOGHBAA.B-Morgan@concentric.net>

As I see it, having pop3proxy keep copies of the messages and using an HTML
UI for training has the least amount of dependency on the email client's
forwarding capabilities (or lack thereof).

I have a severe aversion to opening spam that will probably carry over to
unsure messages, so having a link added to the message body may not do me
much good.

I will, however, go to an HTML UI and examine a message if that UI doesn't
"execute" the HTML.  I don't want to see pretty, raw data is good enough for
me to decide.

I hate to keep mentioning a "rival" project <G>, but popfile's UI seems
pretty close to what I think would work best here.

Regards,

Brad

-----Original Message-----
From: spambayes-bounces@python.org
[mailto:spambayes-bounces@python.org]On Behalf Of Richie Hindle
Sent: Thursday, November 07, 2002 5:17 PM
To: spambayes@python.org
Subject: [Spambayes] SMTP proxy questions


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

--
Richie Hindle
richie@entrian.com


_______________________________________________
Spambayes mailing list
Spambayes@python.org
http://mail.python.org/mailman/listinfo/spambayes


From tim.one@comcast.net  Fri Nov  8 06:46:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 01:46:14 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB8861992D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEAKCIAB.tim.one@comcast.net>

[Moore, Paul]
> ...
> I'm assuming (based on a message I recall seeing recently) that it's
> possible to "correct" training - ie, if I train the classifier that a
> specific message is spam, I can later say "no it isn't, it's ham".

That's right, and at the level of classifier.py it's a two-step process:
unlearn() as spam, then learn() as ham.  It actually doesn't matter which
order those are done in, but I won't admit to that <wink>.
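
A minimal sketch of the two-step correction described above, assuming a toy
count-based classifier.  ToyClassifier and correct_to_ham are illustrative
names only; this is not the actual classifier.py API, just the same idea of
subtracting a message's tokens from one side and adding them to the other.

```python
from collections import Counter


class ToyClassifier:
    """Toy stand-in for a token-count classifier (hypothetical, for illustration)."""

    def __init__(self):
        self.nspam = 0               # number of spam messages trained on
        self.nham = 0                # number of ham messages trained on
        self.spamcount = Counter()   # token -> number of spam msgs containing it
        self.hamcount = Counter()    # token -> number of ham msgs containing it

    def learn(self, tokens, is_spam):
        self._adjust(tokens, is_spam, +1)

    def unlearn(self, tokens, is_spam):
        self._adjust(tokens, is_spam, -1)

    def _adjust(self, tokens, is_spam, delta):
        counts = self.spamcount if is_spam else self.hamcount
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        # Each distinct token counts once per message.
        for tok in set(tokens):
            counts[tok] += delta
            if counts[tok] == 0:
                del counts[tok]


def correct_to_ham(clf, tokens):
    """Undo an earlier 'spam' training of this message, then train it as ham."""
    clf.unlearn(tokens, True)
    clf.learn(tokens, False)
```

Because both steps only add or subtract counts, the final state is the same
whichever of the two is applied first.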

> Assuming that this is so, is it not reasonable to train dynamically
> on an "assume I got it right" basis?

Depending on context, it *may* be.

> In other words, whenever the addin filters a message as ham or spam,
> automatically train on that basis as well. Then, if the user sees a
> mistake, he corrects it, which automatically retrains the classifier
> (manually deleting as spam or moving a message already does this).

Assuming a conscientious user, and a client that knows enough about what the
user is doing, that should work fine.

> This will keep the database right up to date, and all the user has to
> do is correct any bad decisions the classifier makes (which he should
> be doing anyway).
>
> I've ignored database growth issues, but other than that, is there any
> other problem with this approach?

Doubtless hundreds, but why quibble <wink>.  A misclassified msg will have
bad effects at once if the training gets reflected into the probabilities at
once, so it gets less appealing the less zealous the user is about
correcting mistakes right away.  That can be mitigated by doing the day's
training into a distinct dict, or not calling update_probabilities() in a
single dict, until "the end of the day", when the user has (presumably)
corrected all the day's mistakes they're going to correct.  But if the model
updating is going to be delayed anyway, then it makes as much sense to delay
doing any training on "the day's" msgs until the end of the day.
Determining what "the end of the day" means is a puzzle then too.  For
example, maybe I left my email client running and went on a week-long
vacation.  I'm not going to look over 700 presumed spam when I get back,
I'll just delete it.  But if ham was in there, I've now let it train in the
wrong direction, and that will hurt.
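
The "delay training until the end of the day" idea can be sketched as a small
queue of provisional labels that user corrections overwrite before anything
reaches the classifier.  This is only a sketch under assumptions:
DeferredTrainer, record() and flush() are hypothetical names, and train is
whatever callable actually does the training.

```python
class DeferredTrainer:
    """Hold provisional ham/spam labels; train only when flush() is called."""

    def __init__(self, train):
        self.train = train    # callable(msg_id, is_spam); supplied by the caller
        self.pending = {}     # msg_id -> latest label (True means spam)

    def record(self, msg_id, is_spam):
        # Used both for the filter's own decision and for a later user
        # correction: the most recent label for a message wins.
        self.pending[msg_id] = is_spam

    def flush(self):
        """Train on every pending message once, then forget them."""
        count = len(self.pending)
        for msg_id, is_spam in self.pending.items():
            self.train(msg_id, is_spam)
        self.pending.clear()
        return count
```

Deciding when to call flush() is exactly the "end of the day" puzzle described
above; the sketch does not solve that, it only keeps mistakes out of the model
until then.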

In other contexts, the scheme doesn't get off the ground.  For example, for
python.org use, nobody is going to review msgs claimed to be spam.  A system
feeding on its own judgments is going to reinforce its own mistakes too, so
the "conscientious, timely, reviewing human" bit is important.


From tim.one@comcast.net  Fri Nov  8 07:20:18 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 02:20:18 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPMEFCHJAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>

[Mark Hammond]
> ...
> The key limitation of this scheme, as Tim also alludes to, is that this
> never correctly classifies ham.  However, I actually see this
> incremental training more as a "get smarter now" than a "just get
> smarter" technique - ie, a user sees a mis-classified Spam, by re-
> training they are increasing the chances that the next similar mail
> will be handled correctly.  Instant feedback, especially while the user
> is getting started.
>
> ie, it is indeed "mistake based training", but that may still prove
> useful in addition to ongoing training.

I sure agree it's *very* useful at the start, and expect it will continue to
be useful over time.

> I can't help thinking that we are somehow underestimating our own
> tool here.

I'm going to try an experiment:  I'm going to wipe my home database and
start over from scratch, training first on one ham and one spam, then only
on mistakes and unsures.  This should be fun <wink>.

> As is common when people first use this tool, spam is generally
> found in the ham set and vice-versa.  Because of this, I know that my
> Inbox is spam free (but less sure about my other "ham" folders).  I'm
> also sure that my Spam folder has no ham.  This should remain true
> while I continue to use the tool.

How do you know your Spam folder has no ham?  I know mine doesn't because I
routinely score it, sort on the score, and stare at "the wrong end".  I find
ham there as often as not, *usually* apparently due to mousing error when
dragging a training ham into the Ham folder and overshooting the mark.

> So surely we can exploit this somehow.  Off the top of my head:
> * Assume we don't trust the last 2 days of mail (as the user may not
> yet have sorted them).  Anything in the "good" and "spam" folders older
> than this can be assumed correctly classified, and able to be trained
> on.

Provided the user has already done a decent amount of training, then as Paul
Moore suggested it could even work to trust ham-vs-spam decisions
immediately, and let user corrections undo those as needed.  A well-trained
system should be pretty robust against a few misclassifications over the
short term.

> * A process could go through all ham and spam trained on, and score each
> message.  Any "suspect" messages are presented in a list (much like the
> Outlook "Find Message" result list).  The user can indicate that the
> message is correct (and the system will remember, never asking about
> this message again) or is indeed incorrectly classified.  If incorrect,
> it will be moved, and incrementally trained as per now.  (I can also
> picture a whitelist kicking in here; if incorrect, offer to add user to
> whitelist.  If user in the whitelist, assume ham thereby meaning mail
> from this person can never again be spam)

Tell us about the mistakes *you* see.  I feel like we're designing a
solution to a hypothetical problem otherwise.  The only "mistake" I
routinely see is that my cigarettes-via-web advertising keeps getting
knocked back into Unsure territory.  That doesn't bother me enough to do
anything about it, but if it bothers you enough <wink> then, yes, a
whitelist would solve that one.

> I can picture this working in the background, and simply indicating to
> the user that there are "conflicts" to be resolved at their leisure.

Or maybe we could just move those back to the Unsure folder.  The user
should already know what to do about things in Unsure, so it's nothing new
to them.  Moving a msg out of Unsure could be taken as a positive sign that
the user has classified such a msg once and for all (well, until they move
it again, anyway).

> Further, I imagine that as we build better training data for each
> message store, the number of "conflicts" actually found would
> generally be zero - ie, the system would find that all 2 day and
> older mail correctly classifies.

I expect that's true.


From just@letterror.com  Fri Nov  8 07:54:04 2002
From: just@letterror.com (Just van Rossum)
Date: Fri,  8 Nov 2002 08:54:04 +0100
Subject: [Spambayes] Upgrade problem
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEAFCIAB.tim.one@comcast.net>
Message-ID: <r01050400-1021-4B54CB90F2EF11D68CC8003065D5E7E4@[10.0.0.23]>

Tim Peters wrote:

> [Just van Rossum]
> > I think it can be done with almost no extra overhead with a
> > caching scheme.  This assumes (probably wrongly <wink>) that
> > the cache stays in memory between runs.
> > Something like this perhaps:
> >
> > *** classifier.py   Thu Nov  7 23:03:07 2002
> > --- classifier.py.hack  Fri Nov  8 00:04:05 2002
> > ***************
> > *** 456,459 ****
> > --- 456,460 ----
> >
> >           wordinfoget = self.wordinfo.get
> > +         spamprobget = self.spamprobcache.get
> >           now = time.time()
> >           for word in Set(wordstream):
> > ***************
> > *** 463,467 ****
> >               else:
> >                   record.atime = now
> > !                 prob = record.spamprob
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> > --- 464,470 ----
> >               else:
> >                   record.atime = now
> > !                 prob = spamprobget(word)
> > !                 if prob is None:
> > !                     prob = self.calcspamprob(word, record)
> >               distance = abs(prob - 0.5)
> >               if distance >= mindist:
> 
> Sorry, I don't know what this is trying to accomplish.  Like, what is
> self.spamprobcache?  There's no such thing now, and the patch doesn't appear
> to create one (i.e., this code doesn't run). 

Tim, don't be such a programmer <wink>. But ok, I promise I'll never post
pseudocode as a patch again...

> Whatever it's supposed to be,
> why isn't spamprobcache.get *itself* responsible for returning a spamprob,
> instead of making its caller deal with two cases? 

I thought I was doing your performance needs a favor <wink>.

> If the answer is "it's
> supposed to be a dict, so .get ain't that smart",

That's the answer.

> then the memory burden for
> a long-running scorer process will zoom, negating one of the benefits people
> attached to "real databases" thought they were buying in return for giant
> files and slothful performance <wink>.

Right. If a float takes up 20 bytes in memory (just a guess, no time to look),
then for a database of 100000 words (that's roughly the size of my personal db)
the memory burden is 100000 * (8 + 20), almost three megs.
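
For comparison, here is one way the cache itself could be made responsible for
returning a spamprob, as the quoted question suggests.  A minimal Python 3
sketch: SpamProbCache and the compute callable are hypothetical names, not
part of classifier.py.

```python
class SpamProbCache(dict):
    """Dict that computes and remembers a value the first time a key is missed."""

    def __init__(self, compute):
        super().__init__()
        self.compute = compute   # callable(word) -> spam probability

    def __missing__(self, word):
        # Called by cache[word] when word is absent; store the result so the
        # computation happens at most once per word.
        prob = self[word] = self.compute(word)
        return prob
```

Note that only cache[word] triggers __missing__; a plain cache.get(word) still
returns None on a miss, which is the "dict .get ain't that smart" case.  The
memory cost discussed above is unchanged: every word looked up stays in
memory.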

Just in case the higher memory usage is not an issue, there's a simpler
approach: don't store spamprob in the db, but call bayes.update_probabilities()
on startup. update_probabilities() takes about 2 seconds on my lowly 400MHz PPC
on my db (hm, that's using pickle, so will be a lot more when using a db :-( ).
You can tell I'm thinking mostly about long running processes...

I guess you're right, one size doesn't fit all. One last idea for this morning:
how about splitting the db in a training db (storing hamcount and spamcount) and
a classifying db (storing only spamprob)?

> Life would be easier if databaseheads trained all they liked as often as
> they liked, but refrained from calling update_probabilities() until the end
> of the day (or other "quiet time").  The idea that the model should be
> updated after every msg trained on is an extreme.

Good points.

Just

From richie@entrian.com  Fri Nov  8 08:06:33 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 08:06:33 +0000
Subject: [Spambayes] Upgrade problem
In-Reply-To: <r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
References: <B9EFE999.5C289%francois.granger@free.fr>
	<r01050400-1021-281375DCF23011D68CC8003065D5E7E4@[10.0.0.23]>
Message-ID: <rirmsuse1blns4r3h9apiibvcluabnd9g7@4ax.com>


[Just]
> the web interface of pop3proxy.py is pretty good and useful, the only
> downside is that it saves the database after each training

That's now fixed (at least partly) along with some other bits:

 o The database is now saved (optionally) on exit, rather than after each
   message you train with.  There should be explicit save/reload commands,
   but they can come later.
 o It now keeps two mbox files of all the messages that have been used to
   train via the web interface - thanks to Just for the patch.
 o All the sockets now use async - the web interface used to freeze
   whenever the proxy was awaiting a response from the POP3 server.  That's
   now fixed.
 o It now copes with POP3 servers that don't issue a welcome command.
 o The training form now appears in the training results, so you can train
   on another message without having to go back to the Home page.

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Fri Nov  8 09:15:24 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 08 Nov 2002 04:15:24 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEBFCIAB.tim.one@comcast.net>

[Tim]
> ...
> I'm going to try an experiment:  I'm going to wipe my home database and
> start over from scratch, training first on one ham and one spam, then
> only on mistakes and unsures.  This should be fun <wink>.

It is!  The msg from me I'm replying to here scored 94 (solid spam).  I've
now got 5 ham and 5 spam in my training set, most of the new ones from
Unsures.  The latest spam was a blatant false negative, from Hapax City:

'*H*'                          0.998601
'*S*'                          8.60833e-005
'can'                          0.0652174
'have'                         0.0652174
"don't"                        0.0918367
'never'                        0.0918367
'number'                       0.0918367
'one'                          0.0918367
'what'                         0.0918367
'"the'                         0.155172   ham hapaxes from here
'able'                         0.155172
'about'                        0.155172
'against'                      0.155172
'also'                         0.155172
'any'                          0.155172
'anything'                     0.155172
'back'                         0.155172
'because'                      0.155172
'been'                         0.155172
'check'                        0.155172
'even'                         0.155172
'find'                         0.155172
'found'                        0.155172
'heard'                        0.155172
'how'                          0.155172
'into'                         0.155172
"it's"                         0.155172
'more'                         0.155172
'needed'                       0.155172
'other'                        0.155172
'out'                          0.155172
'own'                          0.155172
'people'                       0.155172
'skip:a 10'                    0.155172
'skip:i 10'                    0.155172
'special'                      0.155172
'subject:.'                    0.155172
'subject:: '                   0.155172
'their'                        0.155172
'them.'                        0.155172
'they'                         0.155172
'those'                        0.155172
'time'                         0.155172
'time.'                        0.155172
'unsubscribe'                  0.155172
'until'                        0.155172
'useful'                       0.155172
'using'                        0.155172   to here
'and'                          0.275281
'for'                          0.275281
'subject: '                    0.275281
'you'                          0.275281
'from'                         0.355072
'not'                          0.355072
'off'                          0.355072
'our'                          0.355072
'when'                         0.355072
'new'                          0.644928
'see'                          0.644928
'url:gif'                      0.724719
'url:www'                      0.724719
'call'                         0.844828   spam hapaxes from here
'contact'                      0.844828
'credit'                       0.844828
'email.'                       0.844828
'every'                        0.844828
'further'                      0.844828
'header:Received:2'            0.844828
'made'                         0.844828
'more!'                        0.844828
'most'                         0.844828
'now'                          0.844828
'plus,'                        0.844828
'receive'                      0.844828
'search'                       0.844828
'skip:1 10'                    0.844828
'url:jpg'                      0.844828   to here
'email'                        0.908163

I think I've established that 5+5 isn't enough for great results <snort>.
However, 80% of its decisions have been correct so far!


From tdickenson@devmail.geminidataloggers.co.uk  Fri Nov  8 10:52:32 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Fri, 8 Nov 2002 10:52:32 +0000
Subject: [Spambayes] Re: unsupervised training
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEAMCIAB.tim.one@comcast.net>
Message-ID: <200211081052.32567.tdickenson@devmail.geminidataloggers.co.uk>

On Friday 08 November 2002 7:20 am, Tim Peters wrote:

> Provided the user has already done a decent amount of training, then as
> Paul Moore suggested it could even work to trust ham-vs-spam decisions
> immediately, and let user corrections undo those as needed.  A well-trained
> system should be pretty robust against a few misclassifications over the
> short term.

For the last two weeks I have been using a setup that uses this type of
unsupervised training.

I have a procmail filter that sends a copy of all incoming ham and spam to two
separate mailboxes. These mailboxes are used for overnight batch training,
then deleted. Messages marked 'Unsure' do not take part in this automatic
training.

I perform separate filtering for spam and 'unsure' in my mua. So far I am
manually inspecting the unsure folder, and manually adding them to the
appropriate training mailboxes. Initially about 3% of mails were 'unsure',
but this has dropped to less than 1% after 2 weeks.

Starting next week I plan to change the mua filtering to treat 'unsure' the
same as 'ham', and stop all manual training. It will be interesting to see if
the training remains stable.
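
An overnight batch job of the kind described above might look roughly like
this in Python 3.  It is a sketch under assumptions: batch_train and the train
callable are hypothetical, the two mailboxes are assumed to be mbox files, and
the actual training call is left to the caller.

```python
import mailbox


def batch_train(ham_path, spam_path, train):
    """Train on every message in two mbox files, emptying each as we go.

    train is a caller-supplied callable(message, is_spam).
    Returns the number of messages trained on.
    """
    trained = 0
    for path, is_spam in ((ham_path, False), (spam_path, True)):
        box = mailbox.mbox(path)
        try:
            box.lock()                    # keep the delivery agent out meanwhile
            for key in list(box.keys()):
                train(box[key], is_spam)
                box.discard(key)          # "then deleted"
                trained += 1
        finally:
            box.close()                   # flushes the removals and unlocks
    return trained
```

Messages marked 'Unsure' never reach either file in the setup described, so
they are excluded from this loop without any extra code.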


From bkc@murkworks.com  Fri Nov  8 14:51:15 2002
From: bkc@murkworks.com (Brad Clements)
Date: Fri, 08 Nov 2002 09:51:15 -0500
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
Message-ID: <3DCB8912.18340.2FB5F81@localhost>

On 8 Nov 2002 at 0:17, Richie Hindle wrote:

> Or perhaps there's another way.  The only other option I'd thought of was
> to add two hyperlinks to the end of the message, "This is spam" and "This
> is ham" (in ways that would work for both HTML and plain-text messages, in
> both HTML and plain-text email clients).  They'd link to the HTML interface
> and tell it the cache ID of the message.  Adding content to emails is way
> more intrusive (and difficult) than adding headers.  But no more intrusive
> than the .sig that mailman adds.

If you do this, what's to keep spammers from also adding similar looking URLs?

A busy person might not notice any difference, could click through and confirm their 
email address...

Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From barry@python.org  Fri Nov  8 15:04:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 10:04:56 -0500
Subject: [Spambayes] SMTP proxy questions
References: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>
	<XFMail.021107171529.jbublitz@nwinternet.com>
Message-ID: <15819.53912.407893.819241@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> What about adding a MIME object to the msg with the Spambayes
    JB> info (text/spambayes?) - or will forwarding lose that info
    JB> too? The email module should be able to do this.

Of course that would have to be text/x-spambayes :)

-Barry
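
For a message built from scratch, attaching such a part with the email package
is short.  A minimal Python 3 sketch: attach_spambayes_info is a hypothetical
helper, and the content of the info part (a cache ID here) is an assumption.
It does not address the harder cases of rewriting an existing non-MIME or
malformed message, nor whether a forwarding client keeps the part.

```python
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def attach_spambayes_info(body_text, info):
    """Return a multipart message: the body plus a text/x-spambayes part."""
    outer = MIMEMultipart()                       # multipart/mixed container
    outer.attach(MIMEText(body_text, 'plain'))    # the original text body
    outer.attach(MIMEText(info, 'x-spambayes'))   # experimental subtype
    return outer


msg = attach_spambayes_info("Hello there.\n", "cache-id: 42")
```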


From randy.diffenderfer@eds.com  Fri Nov  8 17:21:25 2002
From: randy.diffenderfer@eds.com (Diffenderfer, Randy)
Date: Fri, 8 Nov 2002 12:21:25 -0500 
Subject: [Spambayes] SMTP proxy questions
Message-ID: <8AA870658244D4119AF600508BDF0A360C6BC295@usahm014.exmi01.exch.eds.com>

|>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
|
|    JB> What about adding a MIME object to the msg with the Spambayes
|    JB> info (text/spambayes?) - or will forwarding lose that info
|    JB> too? The email module should be able to do this.
|
|Of course that would have to be text/x-spambayes :)
|
|-Barry

While a fair portion of messages may very well be MIME compliant, this
wouldn't work without some serious munging around for non-MIME messages, as
well as being very problematic for the many deformed MIME (read very NON
compliant :-) ) messages floating around out there!

Just an observation...


From jbublitz@nwinternet.com  Fri Nov  8 17:10:33 2002
From: jbublitz@nwinternet.com (Jim Bublitz)
Date: Fri, 08 Nov 2002 09:10:33 -0800 (PST)
Subject: [Spambayes] SMTP proxy questions
In-Reply-To: <15819.53912.407893.819241@gargle.gargle.HOWL>
Message-ID: <XFMail.021108091033.jbublitz@nwinternet.com>

On 08-Nov-02 Barry A. Warsaw wrote:
> 
>>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:
> 
>     JB> What about adding a MIME object to the msg with the
> Spambayes
>     JB> info (text/spambayes?) - or will forwarding lose that
> info
>     JB> too? The email module should be able to do this.
> 
> Of course that would have to be text/x-spambayes :)

<searching for an excuse for my embarrassing oversight>
Well - there's application/ms-excel or some such. Isn't spambayes
just as good? :)

Point taken.

Jim


From barry@python.org  Fri Nov  8 17:33:53 2002
From: barry@python.org (Barry A. Warsaw)
Date: Fri, 8 Nov 2002 12:33:53 -0500
Subject: [Spambayes] SMTP proxy questions
References: <15819.53912.407893.819241@gargle.gargle.HOWL>
	<XFMail.021108091033.jbublitz@nwinternet.com>
Message-ID: <15819.62849.101901.822699@gargle.gargle.HOWL>


>>>>> "JB" == Jim Bublitz <jbublitz@nwinternet.com> writes:

    JB> <searching for an excuse for my embarrassing oversight> Well -
    JB> there's application/ms-excel or some such. Isn't spambayes
    JB> just as good? :)

It depends on whether you hold the IETF and IANA in as high regard as
Microsoft does <wink>.

http://www.iana.org/assignments/media-types/

-Barry

From lists@morpheus.demon.co.uk  Fri Nov  8 21:07:45 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:07:45 +0000
Subject: [Spambayes] Outlook plugin - training
References: <n2m-g.u1iue32h.fsf@morpheus.demon.co.uk>
	<BIEJKCLHCIOIHAGOKOLHMEDGDOAA.tim@zope.com>
Message-ID: <n2m-g.wunn3h3y.fsf@morpheus.demon.co.uk>

"Tim Peters" <tim@zope.com> writes:

[About the plugin code...]
> I'm more lost than not in it myself!

That makes me feel better :-)

[About bothering with leaving list traffic out]
> Don't worry about it before you try it.  I suggest trying it because I'm not
> sure it's possible to *stop* the system now from scoring all incoming msgs
> (the "new msg in Inbox" filter appears to trigger for every one, regardless
> of whether the RW decides to move it; after that it may just be a race
> between the RW and the addin deciding where to move each).

OK, I've switched over. I now have one Spam folder, one Potential Spam
folder, and the rest are Ham (actually, some historic archive folders
I've left out, but that's just because I never use them any
more). We'll see how it goes.

>> Of course, I know that the classifier *really* works by magic, and
>> so my intuition is useless :-)
>
> It's more that unless you know exactly how the math works, your intuition is
> simply baseless here, carried over from some other experience.  Do *you*
> have trouble distinguishing personal and work email from spam?  There you
> go, and you can't even compute inverse chi-squared probabilities to 14
> significant digits on demand in your head <wink>.

How do *you* know I can't compute inverse chi-squared probabilities in
my head? Oh, hang on - you wanted me to get the right answer, didn't
you? :-)

> What's to manage?  I get about 600 emails per day, and about 1% end
> up in Unsure (about 6 -- actually less than that, lately; the system
> is learning).

My ratio is still a lot worse than that. But as I say, my training
corpus is still quite small. But you're right - managing a few mails
isn't hard. It's just that the overall results are *so* much better
than the old home-grown solution I used that I became instantly spoiled
:-)

Seriously, I've said this before, but what you guys have developed
here is *phenomenally* good. I've reached the point where I look
forward to getting spam, just because I enjoy so much seeing it
automatically appear in the spam folder :-)

>> My instinctive reaction is that I want "Spam" and "Not Spam" buttons,
>> and then I read or delete the message in situ.
>
> MarkH has since implemented this in the Unsure folder.

Time for a CVS update, I guess...

> I still think you're making life too complicated.  Is list traffic
> spam?  If so, call it spam.  If not, call it ham.

Sounds sensible. I think that all the troubles I've had in the past
trying to manage spam have left me with an instinctive feeling that
the problem is complicated. This leads to looking for complicated
solutions.

But you're right. The spam/ham distinction itself is a simple yes/no,
so the setup should be, too.

But permit me to drag my feet a little, as I throw away all my
cherished preconceptions :-)

More seriously, I'm putting this point into my spambayes notes
folder. I suspect it's something a lot of new users will have to get
used to.

Thanks for the comments,
Paul.

-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Fri Nov  8 21:12:17 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 08 Nov 2002 21:12:17 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
Message-ID: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>

I've noticed a couple of strange effects with the Outlook plugin used
against an Exchange server. The main one is that when I start up the
client in the morning, there are a lot of overnight messages in my
inbox. They don't seem to get filtered. I suspect this is to do with
Outlook not firing the "new mail" event on stuff that's in the
Exchange store when the client starts up. But I'll need to test this.

Unfortunately, the Exchange server is at work, and I can only do any
serious hacking on this at home, so I'm running a batch cycle (code at
home, take into work, try out, take bugs home, and repeat). So it'll
take me a while to make any progress.

I'll report back when I get more details.

Paul (Off to look at Outlook events in MSDN, and to write a simple
"log the events and see what is going on" plugin to test with)

-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Fri Nov  8 21:52:20 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 9 Nov 2002 08:52:20 +1100
Subject: [Spambayes] Outlook plugin plus Exchange
In-Reply-To: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>

> I've noticed a couple of strange effects with the Outlook plugin used
> against an Exchange server. The main one is that when I start up the
> client in the morning, there are a lot of overnight messages in my
> inbox. They don't seem to get filtered. I suspect this is to do with
> Outlook not firing the "new mail" event on stuff that's in the
> Exchange store when the client starts up. But I'll need to test this.

I am working on code that optionally processes "missed" messages at startup.
It looks like I can list all unread, unscored mail in my 1000+ item inbox
very quickly, so this should be feasible.

> Paul (Off to look at Outlook events in MSDN, and to write a simple
> "log the events and see what is going on" plugin to test with)

Check out the Outlook plugin in the win32com\demos directory - probably a
good place to start.

Or if anyone gets lots of KLEZ mail via Outlook, I have a plugin that does a
decent job at killing them.

Mark.


From tim@fourstonesExpressions.com  Sat Nov  9 00:55:07 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 08 Nov 2002 18:55:07 -0600
Subject: [Spambayes] Persisting a pickled bayes database
Message-ID: <MJDARN31HEKEJDK21NLB972UTMGVTFE.3dcc5ceb@riven>

I can see the nice createbayes function in hammie, but I don't see any 
persistence function anywhere.  I do see several places where code to write a 
pickled bayes database is hard coded, and I understand the PersistentBayes 
thing.  I might be missing something...

I've been using a simple class to handle creating and persisting my bayes 
databases.  I *think* this stuff should probably go somewhere, but beats me 
where... classifier?  doesn't make much sense there... hammie?  Any ideas 
anybody?

Here's the class...  (kinda a dumb name  ;))

import pickle

import hammie   # from the spambayes source tree

class BayesHelper:
    '''helper class for bayes databases'''

    def __init__(self, db_name, useDB):
        '''create or open the bayes database named db_name'''
        self.db_name = db_name
        self.useDB = useDB
        self.bayes = hammie.createbayes(db_name, useDB)

    # no __del__() method, because we don't *necessarily* want to persist

    def persist(self):
        '''store the bayes database'''
        # Only the pickle-based store is dumped by hand here.
        if not self.useDB:
            fp = open(self.db_name, 'wb')
            pickle.dump(self.bayes, fp, 1)
            fp.close()


- TimS


From tim.one@comcast.net  Sat Nov  9 18:35:43 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 13:35:43 -0500
Subject: [Spambayes] Persisting a pickled bayes database
In-Reply-To: <MJDARN31HEKEJDK21NLB972UTMGVTFE.3dcc5ceb@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCGENECIAB.tim.one@comcast.net>

[Tim Stone]
> I can see the nice createbayes function in hammie, but I don't see any
> persistence function anywhere.  I do see several places where code
> to write a pickled bayes database is hard coded, and I understand the
> PersistentBayes thing.  I might be missing something...

Just experience with idiomatic Python persistence.  The persistence was all
in DBDict.__init__'s:

        self.hash = anydbm.open(dbname, 'c')

The tradition in Python is that "a persistent database" supplies an
interface much like a Python dict, but persists almost purely by magic.

For example, here's a brief Python session:

C:\Code\python\PCbuild>python
Python 2.3a0 (#29, Nov  8 2002, 10:51:55) [MSC 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import anydbm
>>> d = anydbm.open('example.dat', 'n')
>>> d['an'] = 'example'
>>> # and quit Python at this point

Then in another session:

>>> import anydbm
>>> d = anydbm.open('example.dat')
>>> print d
<bsddb.bsddb object at 0x0064E158>
>>> print d.keys()
['an']
>>> print d['an']
example
>>>

Note that anydbm used bsddb as the underlying database mechanism on my box.
It may use some other database mechanism on some other box (it depends on
what it finds available).  I could have used bsddb directly instead, of
course, but then my code would require that bsddb be available.  anydbm uses
whatever it can scrounge up.
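
For readers on Python 3, where anydbm became the standard dbm package, the same two-session experiment can be sketched roughly as below. This is an illustration rather than project code: the scratch directory and the file name are assumptions made so the example cleans up easily, and in Python 3 keys and values come back as bytes.

```python
import dbm
import os
import tempfile

def write_then_reopen():
    """Store one key, close the database, then reopen it and read the key back."""
    workdir = tempfile.mkdtemp()              # assumption: scratch directory
    path = os.path.join(workdir, "example")   # the backend may add its own suffix

    # "Session" one: flag 'n' always creates a new, empty database.
    db = dbm.open(path, "n")
    db["an"] = "example"
    db.close()                                # closing flushes the data to disk

    # "Session" two: the default flag 'r' opens the existing file read-only.
    db = dbm.open(path)
    try:
        return list(db.keys()), db["an"]
    finally:
        db.close()

keys, value = write_then_reopen()
print(keys, value)
```

As with anydbm, which backend does the storing depends on what the installation has available.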

Subclassing the builtin dict type can give a similar "by magic" facility;
e.g., here's temp.py:

"""
import cPickle as pickle
import os

class PDict(dict):
    def __init__(self, fname):
        self.fname = fname
        if os.path.exists(fname):
            f = file(fname, 'rb')
            guts = pickle.load(f)
            f.close()
            self.update(guts)
        self.is_open = True

    def close(self):
        if self.is_open:
            f = file(self.fname, 'wb')
            pickle.dump(self, f, 1)
            f.close()
            self.is_open = False

    def __del__(self):
        self.close()
"""

That just adds a few methods to a regular dict, arranging to dump its value
to a pickle when .close() is called or when it becomes unreachable.  It's
intended that .close() be called explicitly, though (by-magic shutdown
semantics are never something to bet your life on).
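
A Python 3 rendition of the same idea might look like the sketch below. It is a variation, not the temp.py above: the class name PickledDict and the scratch file are hypothetical, the __del__ hook is left out on purpose, and a context manager makes the explicit .close() hard to forget. It also dumps a plain dict copy, an assumption of this sketch, so that loading the pickle never recreates an instance of the class.

```python
import os
import pickle
import tempfile

class PickledDict(dict):
    """Hypothetical Python 3 take on PDict: a dict saved to a pickle on close()."""

    def __init__(self, fname):
        super().__init__()
        self.fname = fname
        if os.path.exists(fname):
            with open(fname, "rb") as f:
                self.update(pickle.load(f))
        self.is_open = True

    def close(self):
        if self.is_open:
            with open(self.fname, "wb") as f:
                # Dump a plain dict copy, so loading yields an ordinary dict.
                pickle.dump(dict(self), f)
            self.is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()          # explicit, deterministic shutdown

path = os.path.join(tempfile.mkdtemp(), "example.pck")   # assumption: scratch file
with PickledDict(path) as d:
    d["another"] = "example"

with PickledDict(path) as d2:
    reloaded = dict(d2)
print(reloaded)
```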

Then in one Python session:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d['another'] = 'example'

and in another:

>>> from temp import PDict
>>> d = PDict('example.pck')
>>> d
{'another': 'example'}
>>>

In your example helper class, you decided you don't necessarily want to
persist.  That may or may not be a useful ability, but "the usual" simple
Python database facilities don't give you a choice about that:  they commit
changes to disk *as* mutations occur.  In DB terms, they view each mutation
as a transaction.  The ZODB-based stuff Jeremy is doing is different that
way:  changes to a ZODB db have to be explicitly committed.  That's what the

    get_transaction().commit()

lines in the pspam directory are doing.  ZODB is much more of "a real
database" than these other gimmicks, by which I mean it has an explicit and
pretty rich transactional model and API.
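
The difference between "commit on every mutation" and "commit only when told to" can be illustrated without ZODB. The sketch below swaps in the standard-library sqlite3 module purely as a stand-in with an explicit commit/rollback API; the table name and the in-memory database are assumptions, and none of this is the pspam code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")    # assumption: throwaway in-memory database
conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT)")
conn.commit()

conn.execute("INSERT INTO kv VALUES ('an', 'example')")
conn.rollback()                        # never committed, so the insert is discarded
after_rollback = conn.execute("SELECT k, v FROM kv").fetchall()

conn.execute("INSERT INTO kv VALUES ('an', 'example')")
conn.commit()                          # the explicit commit makes the change stick
after_commit = conn.execute("SELECT k, v FROM kv").fetchall()
conn.close()
print(after_rollback, after_commit)
```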


From tim.one@comcast.net  Sat Nov  9 20:00:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 09 Nov 2002 15:00:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCC4DA7.80401@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMENJCIAB.tim.one@comcast.net>

[Tim]
> I'm never going to get sub-0.1% error rates this way, but if this is the
> best it ever got, I'd be quite happy with it for my personal email.
> Something to ponder?  If so, you can get away with a very small
> database, and while hapaxes must not be removed blindly in this extreme
> scheme, using the atime field could (I suspect) be very effective in
> slashing the already-small database size (lots of hapaxes will never be
> seen again even if you train on everything; the WordInfo atime field
> tells you when a word was last used at all).

BTW, I'm still doing this experiment, and my total training data is up to 45
ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
and FN are both rare now, and the Unsure rate is about 5% overall and
visibly falling.  The Unsure spam are more surprising than the Unsure ham,
but that may be more psychological than real.  For example, it took about 24
hours before I got my first Nigerian spam, and it was shocking to see it
score at the low end of the Unsure range.

Looking at the internals is scary.  I have entire folders that are called
ham seemingly because the mailing list they come from has a few lexical
conventions unique to it, and the hapaxes from the single training msg from
that list save almost all of that list's msgs from Unsure status.

In the msg of Rob's I'm replying to, these are all ham hapaxes:

'database'                     0.155172
'database,'                    0.155172
'ever'                         0.155172
'idea'                         0.155172
'quite'                        0.155172
'scheme,'                      0.155172
'seen'                         0.155172
'subject:Outlook'              0.155172
'subject:Spambayes'            0.155172
'subject:plugin'               0.155172
'subject:training'             0.155172
'tells'                        0.155172
'words'                        0.155172

and they slug it out with these spam hapaxes:

'away'                         0.844828
'effective'                    0.844828
'field'                        0.844828
'mean'                         0.844828
'word'                         0.844828

That 'word' is a strong spam clue but 'words' a strong ham clue should tell
us something about how robust this is <wink>.
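
The two hapax values above are what Gary Robinson's adjustment, f(w) = (s*x + n*p) / (s + n), gives for a word seen in exactly one message. The sketch below is a reconstruction, not the classifier's code: x = 0.5 is the neutral prior, and s = 0.45 is an assumption chosen because it reproduces the numbers in the listing.

```python
def robinson_adjusted(p, n, s=0.45, x=0.5):
    """Blend the raw spam probability p (from n messages) with prior x of strength s."""
    return (s * x + n * p) / (s + n)

# A hapax seen only in ham has raw p = 0; one seen only in spam has raw p = 1.
ham_hapax = robinson_adjusted(p=0.0, n=1)
spam_hapax = robinson_adjusted(p=1.0, n=1)
print(round(ham_hapax, 6), round(spam_hapax, 6))
```

Under these assumptions a single training message moves a word from 0.5 all the way to about 0.155 or 0.845, which is why one message can swing the verdict for a whole folder.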

[Rob Hooft]
> This seems to imply that you're still playing with the idea that hapaxes
> could be "slashed" from the database when using the "old" train-on-all
> procedure. I don't see how that can ever work, as all words pass through
> the hapax stage at some point. Or do you mean to slash "old" hapaxes
> only?

Well, training has no effect on scoring until update_probabilities() is
called, and in a batch-training context I mean hapax from
update_probabilities's POV.  Of course hamcounts or spamcounts for new words
start life at 1, but when doing batch training I don't mean to look at the
counts until the probabilities are updated.  At that point, a hapax is a
word that was seen in only one msg from the entire batch of new msgs.

Here's a quick test, based on unpublished general python.org email (we can't
publish the ham because it includes some personal email; GregW was working
on making the spam collection available, but I haven't heard about that in a
week; ditto his very large python.org virus collection).

In each case, it trains on 2,741 ham and 948 spam, then predicts the same
numbers of each.  The "all" column includes hapaxes (wrt counts at the *end*
of training).  The gt1 column threw away words at the end of training where
spamcount+hamcount <= 1; i.e., it retained only words that appeared more
than once, the non-hapaxes.   The gt2 column retained only words that
appeared more than twice; and so on.  ham_cutoff was 0.20 here, and
spam_cutoff 0.90.
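
The pruning behind the gt1 ... gt6 columns amounts to a filter on the summed counts after training. A minimal sketch, assuming a plain dict of (spamcount, hamcount) pairs rather than the project's WordInfo records, with made-up words and counts:

```python
def prune_rare_words(counts, min_total):
    """Keep only the words whose spamcount + hamcount exceeds min_total.

    counts maps word -> (spamcount, hamcount); this layout is an assumption
    for the sketch, not the record format the project stores.
    """
    return {word: (spam, ham)
            for word, (spam, ham) in counts.items()
            if spam + ham > min_total}

# Illustrative counts only.
counts = {"word": (1, 0), "words": (0, 1), "python": (0, 7),
          "money": (5, 0), "free": (2, 1)}
gt1 = prune_rare_words(counts, 1)   # drops the hapaxes
gt3 = prune_rare_words(counts, 3)   # keeps words seen more than three times
print(sorted(gt1), sorted(gt3))
```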

filename:      all     gt1     gt2     gt3     gt4     gt5     gt6
ham:spam: 2741:948 2741:948 2741:948 2741:948 2741:948 2741:948 2741:948
fp total:        1       0       1       0       0       0       0
fp %:         0.04    0.00    0.04    0.00    0.00    0.00    0.00
fn total:        2       2       2       1       2       3       4
fn %:         0.21    0.21    0.21    0.11    0.21    0.32    0.42
unsure t:       81      87      89      82      98      96     100
unsure %:     2.20    2.36    2.41    2.22    2.66    2.60    2.71
real cost:  $28.20  $19.40  $29.80  $17.40  $21.60  $22.20  $24.00
best cost:  $22.20  $17.60  $20.00  $15.40  $16.80  $17.40  $22.40
h mean:       0.81    0.86    0.87    0.72    0.67    0.64    0.65
h sdev:       6.05    6.18    6.17    5.42    5.13    4.94    5.11
s mean:      98.00   97.66   97.54   97.38   97.03   96.62   96.52
s sdev:       9.26   10.22   10.37   10.62   11.19   12.49   12.61
mean diff:   97.19   96.80   96.67   96.66   96.36   95.98   95.87
k:            6.35    5.90    5.84    6.03    5.90    5.51    5.41

# retained
     words:  74327   36437   23877   16143   12798   10719    9157

So while hapaxes are vital with very little training data, even with "just"
about 4K training msgs they didn't buy anything in this test, and neither
did words that appeared only two or three times, and it doesn't appear to be
touchy (all of these columns show excellent results!).

> And what is "old"?

That remains a good question, and a good answer may differ between personal
email and bulk email applications.  A problem I see coming up in my personal
email is that some correspondents only show up once a year, and the hapaxes
they generate remain valuable clues, but only once a year.  General
python.org email doesn't appear to suffer anything like that (so long as
personal email is kept out of the python.org mix).


From rob@hooft.net  Sat Nov  9 22:24:52 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 09 Nov 2002 23:24:52 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCMENJCIAB.tim.one@comcast.net>
Message-ID: <3DCD8B34.6040903@hooft.net>

---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Tim]

>>I'm never going to get sub-0.1% error rates this way, but if this is the
>>best it ever got, I'd be quite happy with it for my personal email. 

> BTW, I'm still doing this experiment, and my total training data is up to 45
> ham and 38 spam, out of a total of about 1,700 msgs processed so far.  FP
> and FN are both rare now, and the Unsure rate is about 5% overall and
> visibly falling. 

I just added a testdriver to CVS that simulates your behaviour as I 
understand it: It will train on the first 30 messages, plus on all 
misclassified and all unsure messages. It is called "weaktest.py", and 
uses the good-old-Data/{Sp|H}am hierarchy.
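
The training policy just described can be sketched as a loop. This is not the source of weaktest.py: ToyClassifier is a hypothetical stand-in with a crude majority vote, and only the policy (train on the startup messages, then on mistakes and unsures) follows the description. The 0.20 and 0.90 cutoffs are the defaults mentioned in this thread.

```python
class ToyClassifier:
    """Hypothetical stand-in for the real classifier: a per-token majority vote."""

    def __init__(self):
        self.spam, self.ham = {}, {}

    def learn(self, tokens, is_spam):
        counts = self.spam if is_spam else self.ham
        for tok in set(tokens):
            counts[tok] = counts.get(tok, 0) + 1

    def score(self, tokens):
        known = [t for t in set(tokens) if t in self.spam or t in self.ham]
        if not known:
            return 0.5                       # nothing to go on: dead centre
        spammy = sum(self.spam.get(t, 0) > self.ham.get(t, 0) for t in known)
        return spammy / len(known)

def weak_train(stream, clf, startup=30, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on the first `startup` msgs, then only on mistakes and unsures."""
    stats = {"trained": 0, "fp": 0, "fn": 0, "unsure": 0}
    for i, (tokens, is_spam) in enumerate(stream):
        if i < startup:
            outcome = "unsure"               # startup msgs are counted as unsure
        else:
            s = clf.score(tokens)
            if s < ham_cutoff:
                outcome = "fn" if is_spam else "ok"
            elif s > spam_cutoff:
                outcome = "ok" if is_spam else "fp"
            else:
                outcome = "unsure"
        if outcome != "ok":
            stats[outcome] += 1
            stats["trained"] += 1
            clf.learn(tokens, is_spam)
    return stats

stream = [(["hi"], False), (["buy"], True), (["hi"], False), (["buy", "hi"], True)]
stats = weak_train(stream, ToyClassifier(), startup=2)
print(stats)
```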

I think we should test its performance at different Options settings.

It may not even be very realistic to train on fp's, as I think in my 
private E-mail I won't even check the spam folder very thoroughly at all.

Anyway, a default run for me now gives:

   100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
   200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
   300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
   400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
   500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
   600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
   700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
   800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
   900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
  1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
  1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
  1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
  1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
  1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
  1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
  1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
  1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
  1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
  1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
  2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
  2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
  2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
  2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
  2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
  2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
  2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
  2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
  2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
  2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
  3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
fp: Data/Ham/Set2/n05250.txt score:0.9312
  3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
  3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
  3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
  3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
  3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
  3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
  3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
  3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
  3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
  4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
  4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
  4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
  4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
  4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
  4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
  4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
  4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
  4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
  4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
fp: Data/Ham/Set1/n01590.txt score:0.9092
  5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
  5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
  5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
  5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
  5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
  5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
  5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
  5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
  5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
  5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
  6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
  6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
  6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
  6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
  6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
  6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
Total messages 6540 (4800 ham and 1740 spam)
Total unsure (including 30 startup messages): 336 (5.1%)
Trained on 178 ham and 162 spam
fp: 2 fn: 2
Total cost: $89.20

(This is on 3 out of my 10 test directories).

Interesting to note so far:
  * The "Total cost" is much higher than for train-on-all schemes,
    but it is only due to Unsures; fp and fn are still small.
  * The database growth doesn't decay with time after a while;
    it can be described as:
       nwords = 9200 + 1.6 * nmessages
    or alternatively:
       nwords = 5700 + 40 * ntrained
    ..as can be seen in the attached png's
  * The training set is almost balanced, even though I scored
    many more ham than spam
  * The unsure rate drops over time:
         0- 1000: 11.2% (minus 3.0% to be fair)
      1000- 2000:  5.4%
      2000- 3000:  4.3%
      3000- 4000:  4.0%
      4000- 5000:  3.9%
      5000- 6000:  3.3%
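
The per-window unsure rates and the two linear fits can be re-derived from the log. The sketch below copies the cumulative unsure counts from the 1000-message marks of the run above; the function names are made up for illustration.

```python
# Cumulative "unsure" counts copied from the log, at each 1000-message mark.
cumulative_unsure = {0: 0, 1000: 112, 2000: 166, 3000: 209,
                     4000: 249, 5000: 288, 6000: 321}

def unsure_rates(cumulative, step=1000):
    """Percentage of unsure messages within each window of `step` messages."""
    marks = sorted(cumulative)
    return [round(100.0 * (cumulative[hi] - cumulative[lo]) / step, 1)
            for lo, hi in zip(marks, marks[1:])]

def words_from_messages(nmessages):
    """First fit: database size as a function of messages processed."""
    return 9200 + 1.6 * nmessages

def words_from_trained(ntrained):
    """Second fit: database size as a function of messages trained on."""
    return 5700 + 40 * ntrained

rates = unsure_rates(cumulative_unsure)
print(rates)
# The log shows 19444 words at 6500 messages, with 178H+161S trained.
print(words_from_messages(6500), words_from_trained(178 + 161))
```

Both fits land within about one percent of the logged 19444 words at the 6500-message mark.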

Rob

--
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words1.png
Type: image/png
Size: 12191 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words1-0001.png

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words2.png
Type: image/png
Size: 12807 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021109/85c3f3b5/words2-0001.png

---------------------- multipart/mixed attachment--


From rob@hooft.net  Sat Nov  9 23:46:02 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 00:46:02 +0100
Subject: [Spambayes] More experiments with weaktest.py
Message-ID: <3DCD9E3A.4040809@hooft.net>

These were results of weaktest with default parameters:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 336 (5.1%)
   Trained on 178 ham and 162 spam
   fp: 2 fn: 2
   Total cost: $89.20
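
The "Total cost" line is consistent with charging $10 per fp, $1 per fn and $0.20 per unsure. Those weights are inferred here, not quoted from the test driver; they reproduce every total reported in this message.

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Dollar cost of a run; the three weights are inferred, not quoted."""
    return round(fp * fp_weight + fn * fn_weight + unsure * unsure_weight, 2)

default_run = total_cost(fp=2, fn=2, unsure=336)
print("$%.2f" % default_run)
```

With these weights the 336 unsures account for $67.20 of the $89.20, so the cost of this scheme is dominated by the Unsure rate rather than by outright mistakes.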

If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical 
(spam_cutoff is 90 by default):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 442 (6.8%)
   Trained on 292 ham and 152 spam
   fp: 2 fn: 0
   Total cost: $108.40

So the database grows by 30% but it didn't help my cost. The training 
set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to 
the default 20:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 213 ham and 101 spam
   fp: 7 fn: 3
   Total cost: $133.80

This reduces the database by only 10%, but at very high fp cost. Same
2:1 unbalance in the training set.
Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 933 (14.3%)
   Trained on 497 ham and 437 spam
   fp: 0 fn: 1
   Total cost: $187.60

OK, so that didn't work either. How about setting it to 0.2?

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 304 (4.6%)
   Trained on 134 ham and 177 spam
   fp: 2 fn: 5
   Total cost: $85.80

Hm. That is slightly better. Funny, we are suddenly training on more 
spam than ham.... Back to 0.1 anyway ---the differences are too small--- 
and set robinson_probability_x = 0.3 (default is 0.5):

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 602 (9.2%)
   Trained on 54 ham and 616 spam
   fp: 1 fn: 67
   Total cost: $197.40

Very interesting: this changes the training ratio to 1:12, at huge cost!
(more than one in three spams was not recognized solidly as such).
Wonder what this could do if changed together with the cutoff....
Let's move it back to 0.5, and try "robinson_probability_s = 0.3":

   Total messages 6540 (4800 ham and 1740 spam)
   Total unsure (including 30 startup messages): 348 (5.3%)
   Trained on 237 ham and 120 spam
   fp: 7 fn: 2
   Total cost: $141.60

Ouf.

I am back with the defaults, but I'd still like to do an automated 
optimization of everything simultaneously. Might try that.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From trebor@animeigo.com  Sun Nov 10 00:32:36 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Sat, 9 Nov 2002 19:32:36 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <E18AYxq-0006sT-00@mail.python.org>
References: <E18AYxq-0006sT-00@mail.python.org>
Message-ID: <a05200f03b9f34909a00b@[192.168.1.103]>

Hello everyone,

Just a quick note to introduce myself; I ran the session at that 
Hacker's conference that Guido mentioned, and passed on the 
suggestion of checking out Bill Y's combinatorial approach.

I've been playing with rules-based techniques for almost a year (see 
http://www.madoverlord.com/projects/told.t for details) and toying 
with bayesian  systems for only the last couple of months, on and 
off.  So no expert in that regard; I have mostly replicated the early 
work you guys have done (skimmed the archive today).

I'm particularly impressed with the chi-square work, it looks very 
interesting (but more stats for my poor stats-challenged mind to work 
on; not to mention that now I'm going to have to get around to 
cramming python in there with all the other languages that have 
accumulated over the years...).  Also, it's nice the way you're 
testing a lot of variants, I've been crossing things off my "try 
this" list all afternoon.

Couple of comments (bear in mind, I haven't grabbed the source yet, 
and only skimmed the archive, so if this repeats things you've 
already tried, sorry).  This is just stuff that's been in my mind 
recently, plus stuff stimulated by my skim.

* The great headers debate; suggest you put both machine and human 
readable opinions in the header, eg:

	X-SpamBayes-Rating: 9 (Very Spammy)
	X-SpamBayes-Rating: 7 (Somewhat Spammy)
	X-SpamBayes-Rating: 5 (Unsure)
	X-SpamBayes-Rating: 3 (Probably Innocent)
	X-SpamBayes-Rating: 0 (The Finest Ham)

The reason being, many mailreaders can use a finer discriminant than 
(yes,no,beats me) in ranking spam.  A common strategy (which I like 
myself) is to start an email at neutral priority and bump it up and 
down based on various triggers, whitelists, whatever, then sort the 
inbox by the final priority.

A cute hack I used in TOLD was to output the result like this:

	X-SpamBayes-Rating: 0123456789 (Very Spammy)
	X-SpamBayes-Rating: 012345 (Unsure)

This permits a mailreader with limited filtering tools (like Eudora) 
to classify multiple results with a single rule (such as "if an 
X-SpamBayes-Rating header contains the string 12345678, set priority 
to double-low", which catches both 8 and 9 rated emails).
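
A small sketch of the cumulative-digit trick, assuming ratings run from 0 to 9 as in the examples above. The function names are hypothetical; the header name and the labels are the ones proposed in this message.

```python
def rating_header(rating, label):
    """Build a header whose value lists every digit from 0 up to the rating."""
    if not 0 <= rating <= 9:
        raise ValueError("rating must be between 0 and 9")
    digits = "".join(str(d) for d in range(rating + 1))
    return "X-SpamBayes-Rating: %s (%s)" % (digits, label)

def at_least(header, threshold):
    """The single substring rule: true when the rating is >= threshold."""
    needle = "".join(str(d) for d in range(threshold + 1))
    return needle in header

very_spammy = rating_header(9, "Very Spammy")
unsure = rating_header(5, "Unsure")
print(very_spammy)
print(unsure)
print(at_least(very_spammy, 8), at_least(unsure, 8))
```

Because each rating's digit string is a prefix of every higher rating's string, one "contains" rule acts as a threshold test.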

BTW, being pedantic, "rating" is a better word to use, it is more 
precisely what the discriminator is doing, is the same in all flavors 
of english, and is shorter.  "Score" might be even better.  ;^)

* Hashing to a 32-bit token is very fast, saves a ton of memory, and 
the number of collisions using the python hash (I appealed for hash 
functions on the hackers-l and Guido was kind enough to send me the 
source) is low.  About 1100 collisions out of 3.3 million unique 
tokens on a training set I was using.  CRC32, of all things, is 
actually slightly better, but only by a hair.  So this kind of 
hashing probably won't have much effect on the statistical results.

* Bill Y's byte bucket system has a lot of problems, but there are 
probably some data reduction techniques that would work well.  One 
that occurred to me on the way back from Hackers would be simply to 
keep a 1-byte count of ham/spam hits for each token, and when the ham 
or spam count is about to wrap, cut each count in half, rounding up 
the other value; ie:

	// increment ham count for bucket i
	// apologies, my pseudocode is a bizarre mixture of
	// now-unknown languages

	if (ham[i]=255)
	   {
	      ham[i]=128;
	      spam[i]=(spam[i]/2)+(spam[i]%2)
    	   }
	else
	   ham[i]++;

The nice thing about this is that it would bias in favor of more 
recent email; things would "age".  But note this means when building 
the original database you have to feed it ham and spam in small 
chunks, or use a greater resolution before cramming it into 
individual bytes.
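
In Python the same saturating update could be sketched as follows. The bytearray storage and the function name are assumptions for illustration; the halving rule is the one in the pseudocode above.

```python
def bump(counts_a, counts_b, i):
    """Increment counts_a[i]; on saturation, halve both counts instead.

    counts_a and counts_b are bytearrays of per-bucket hit counts, for
    example (ham, spam) when recording a ham hit. The names are illustrative.
    """
    if counts_a[i] == 255:
        counts_a[i] = 128                      # (255 + 1) / 2
        counts_b[i] = (counts_b[i] + 1) // 2   # halve, rounding up
    else:
        counts_a[i] += 1

ham = bytearray([255, 7])
spam = bytearray([3, 9])
bump(ham, spam, 0)    # bucket 0 was saturated, so both counts are halved
bump(ham, spam, 1)    # bucket 1 simply increments
print(list(ham), list(spam))
```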

* I was playing a week or two back with 1 and 2 token groups, and 
found that a useful technique was, for each new token, to only 
consider the most deviant result.  So if the individual word was .99 
spam, and the two word phrase was .95, it would only consider the .99 
result.  This would probably help with Bill Y's combinatorial scheme. 
Dunno if you've tried this; it prevents a particularly spammy or 
hammy sequence from dominating the results (I was only considering 
the 16 or so most deviant results in my bayesian calc, at least on my 
corpus, more than that didn't really help).
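
One way to read the "most deviant result" rule is sketched below: for each token, compare the single-word probability with the probability of the two-word phrase ending at that token, keep whichever lies farther from 0.5, then use only the strongest few clues overall. The probability table, the 0.5 default for unknown entries and the function name are all assumptions made for the example.

```python
def most_deviant(tokens, prob, top_n=16):
    """Pick one clue per token position, then keep the top_n strongest.

    prob maps a word, or a (previous, current) word pair, to a spam
    probability; unknown entries count as a neutral 0.5.
    """
    clues = []
    previous = None
    for tok in tokens:
        candidates = [(tok, prob.get(tok, 0.5))]
        if previous is not None:
            pair = (previous, tok)
            candidates.append((pair, prob.get(pair, 0.5)))
        # Keep whichever candidate lies farthest from the neutral 0.5.
        clues.append(max(candidates, key=lambda c: abs(c[1] - 0.5)))
        previous = tok
    clues.sort(key=lambda c: abs(c[1] - 0.5), reverse=True)
    return clues[:top_n]

prob = {"free": 0.99, ("totally", "free"): 0.95, "meeting": 0.10,
        ("staff", "meeting"): 0.02}
clues = most_deviant(["totally", "free", "staff", "meeting"], prob)
print(clues[:2])
```

Here the word "free" (0.99) beats the phrase "totally free" (0.95), while the phrase "staff meeting" (0.02) beats the word "meeting" (0.10), so each position contributes a single clue.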

* My personal bias (as I think Guido mentioned) is for a multifaceted 
approach, using Bayesian, rules-based (attacking things that bayesian 
isn't good at, like looking for obfuscated url structures), DNSBL, 
and whitelisting heuristics to generate an overall ranking.  So a 
hammy mail from a guy in your address book would bubble up to highest 
priority, whereas something spammy from him would stay neutral. 
There's lots of room for cooperation between the various approaches 
and multiple agents means it's less likely that a spam will get by. 
In particular, whitelisting heuristics can almost eliminate false 
positives.

* Finally, if anyone needs more spam, I get over 300 a day (I've been 
around a while!) and have a cleaned corpus of over 130MB of spam and 
foreign email.  Also, given all the legit web-marketing email I get 
because of the url registration work I've done, I've got tons of the 
spammiest ham you could imagine.

Best
R

-- 
-----------------------------------------------------------------------
http://madoverlord.com/    World Domination - a fun family activity!

From tim.one@comcast.net  Sun Nov 10 06:55:59 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 01:55:59 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f03b9f34909a00b@[192.168.1.103]>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>

[Robert Woodhead]
> Hello everyone,

Hi!

> Just a quick note to introduce myself; I ran the session at that
> Hacker's conference that Guido mentioned, and passed on the
> suggestion of checking out Bill Y's combinatorial approach.

You can find test results for that in the list archives.  Bottom line is
that it did worse than what we're doing now, to such an extent that I'm the
only one who appeared to try it (my reports weren't encouraging).  I may
have misimplemented the idea, but I don't think so.  The results were in
line with earlier experiments we tried on gimmicks that systematically
generate highly correlated "words".  Such things appear to learn a lot
faster than word unigrams, but we've always found (so far) that unigrams
soon enough overcome that, and then go on to win.  What we're missing is any
practical approach to a scheme that can suck out phrases without identifying
them by hand first, and without generating highly correlated phrases
(overlapping word n-grams are highly correlated, of course, and Bill carries
that to extremes).

Something I didn't report on is later experiments using chi-combining
instead of Bill's "add up the raw counts".  chi-combining worked better.  I
know Bill has gone on to do a more Bayesian-like combination method, but I
expect that to do worse than what we've got now for the same reasons we gave
up on Paul Graham's combining scheme, but more so:  the word independence
assumption is bogus, and feeding the Bayesian calcs highly correlated words
grossly over- or under- estimates the true probability as a result.  In the
end you get a scheme that claims certainty even when it's dead wrong, and
although it's not dead wrong often, it is dead wrong at a non-zero rate.

Revealing:  fiddle our chi2.py to use whatever combining scheme Bill is
using now, and feed it vectors of *random* probabilities.  Most of the code
needed for that, and to display a histogram of results, is already there.
Try it with Graham's combining scheme and you'll find that scores are almost
always very near 0 or very near 1 even when the inputs are random and
uniformly distributed.  I expect that can only get worse by doing "chain
rule" calcs on probs that are highly correlated to begin with.
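
The random-input experiment can be approximated without chi2.py. The sketch below is a stand-alone reconstruction, not the project's code: it uses the naive-Bayes product form prod(p) / (prod(p) + prod(1-p)) for Graham-style combining, and the vector length of 50, the trial count and the seed are arbitrary assumptions.

```python
import math
import random

def graham_combine(probs):
    """Graham-style combining: prod(p) / (prod(p) + prod(1 - p)), via log-odds."""
    log_odds = sum(math.log(p) - math.log(1.0 - p) for p in probs)
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)                 # avoids overflow for very negative sums
    return e / (1.0 + e)

def extreme_fraction(trials=2000, nclues=50, seed=12345):
    """Share of random-input scores that land below 0.05 or above 0.95."""
    rng = random.Random(seed)              # fixed seed: an arbitrary choice
    extreme = 0
    for _ in range(trials):
        # Clamp away from 0 and 1 so the logarithms stay finite.
        probs = [min(max(rng.random(), 1e-12), 1.0 - 1e-12)
                 for _ in range(nclues)]
        score = graham_combine(probs)
        if score < 0.05 or score > 0.95:
            extreme += 1
    return extreme / trials

graham_extreme = extreme_fraction()
print(graham_extreme)
```

Under these assumptions the large majority of scores fall in the two extreme bins, even though every input vector is pure noise.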

The internal chi-combining S and H scores are uniformly distributed given
random inputs, so chi-combining doesn't infer certainty by chance any more
often than it "should" infer certainty by chance.  That appears to be what
makes it far more robust against embarrassing mistakes, and it reliably
pumps out a score near 0.5 given a highly ambiguous input msg (many strong
ham and many strong spam clues -- we call that "cancellation disease" here,
and chi-combining doesn't infer certainty when it happens; all other schemes
did, and didn't do better than chance when it happened).
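
A compact sketch of chi-combining, written from the description in this thread rather than copied from the source tree: S and H each come from a chi-squared tail probability with 2n degrees of freedom, and the final score is (S - H + 1) / 2. The random-vector settings (50 clues, 2000 trials, fixed seed) and the 20-versus-20 "cancellation" message are assumptions made for the demonstration.

```python
import math
import random

def chi2Q(x2, v):
    """P(chi-square with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Return (S, H, score) for a list of word spam probabilities."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H, (S - H + 1.0) / 2.0

rng = random.Random(12345)                 # fixed seed: an arbitrary choice
scores = []
for _ in range(2000):
    probs = [min(max(rng.random(), 1e-12), 1.0 - 1e-12) for _ in range(50)]
    scores.append(chi_combine(probs)[2])
mean_score = sum(scores) / len(scores)
extreme = sum(s < 0.05 or s > 0.95 for s in scores) / len(scores)

# Cancellation: equal numbers of strong ham and strong spam clues.
_, _, mixed = chi_combine([0.01] * 20 + [0.99] * 20)
print(mean_score, extreme, mixed)
```

With random inputs the scores centre on 0.5 and rarely reach the extremes, and the evenly mixed message scores almost exactly 0.5 because S and H are both close to 1 and cancel.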

> I've been playing with rules-based techniques for almost a year (see
> http://www.madoverlord.com/projects/told.t for details) and toying
> with bayesian  systems for only the last couple of months, on and
> off.  So no expert in that regard; I have mostly replicated the early
> work you guys have done (skimmed the archive today).
>
> I'm particularly impressed with the chi-square work, it looks very
> interesting (but more stats for my poor stats-challenged mind to work
> on;

So copy and paste <wink>.

> not to mention that now I'm going to have to get around to
> cramming python in there with all the other languages that have
> accumulated over the years...).

In return, you can throw twelve other languages out <0.7 wink>.

> Also, it's nice the way you're testing a lot of variants, I've been
> crossing things off my "try this" list all afternoon.

Testing has pretty much run out of steam here, though.  My error rates are
so low now I couldn't measure an improvement in a convincing way even if one
were to be made, and the same is true of a few others here too.  We appear
to be fresh out of big algorithmic wins, so are pushing on to wrestling with
deployment issues.

BTW, download the source code and read the comments in tokenizer.py:  the
results of many early experiments are given there in comment blocks.

> Couple of comments (bear in mind, I haven't grabbed the source yet,
> and only skimmed the archive, so if this repeats things you've
> already tried, sorry).  This is just stuff that's been in my mind
> recently, plus stuff stimulated by my skim.
>
> * The great headers debate; suggest you put both machine and human
> readable opinions in the header, eg:
>
> 	X-SpamBayes-Rating: 9 (Very Spammy)
> 	X-SpamBayes-Rating: 7 (Somewhat Spammy)
> 	X-SpamBayes-Rating: 5 (Unsure)
> 	X-SpamBayes-Rating: 3 (Probably Innocent)
> 	X-SpamBayes-Rating: 0 (The Finest Ham)
>
> The reason being, many mailreaders can use a finer discriminant than
> (yes,no,beats me) in ranking spam.  A common strategy (which I like
> myself) is to start an email at neutral priority and bump it up and
> down based on various triggers, whitelists, whatever, then sort the
> inbox by the final priority.

Spoken like someone who worked on a rule-based system <wink>.  We have three
categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
believe that a finer distinction than that can be quantitatively justified
(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's
what I mean by "can't measure an improvement anymore", and a finer-grained
scheme isn't going to touch those 2 mistakes; one of them is formally ham
because it was sent by a real person, but consists of a one-line comment
followed by a quote of an entire Nigerian scam spam -- nothing useful is
ever going to *call* that one ham, and it scores as spam *almost* as solidly
as an original Nigerian spam).

> A cute hack I used in TOLD was to output the result like this:
>
> 	X-SpamBayes-Rating: 0123456789 (Very Spammy)
> 	X-SpamBayes-Rating: 012345 (Unsure)
>
> This permits a mailreader with limited filtering tools (like Eudora)
> to classify multiple results with a single rule (such as "if an
> X-SpamBayes-Rating header contains the string 12345678, set priority
> to double-low", which catches both 8 and 9 rated emails).
>
> BTW, being pedantic, "rating" is a better word to use, it is more
> precisely what the discriminator is doing, is the same in all flavors
> of english, and is shorter.  "Score" might be even better.  ;^)

"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
"not spam" was my invention, and since that one caught on big, I'm not
fighting to the death for any others <wink>.

> * Hashing to a 32-bit token is very fast, saves a ton of memory,
> and the number of collisions using the python hash (I appealed for hash
> functions on the hackers-l and Guido was kind enough to send me the
> source) is low.  About 1100 collisions out of 3.3 million unique
> tokens on a training set I was using.

That's significantly better than you could expect from a truly random hash
function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
should leave 3298733 buckets occupied on average, with an sdev of 35.58
buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
mean.
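
Those figures can be checked with a few lines of arithmetic. The sketch below uses the exact expectation for the number of occupied buckets and, as an assumption, the standard large-table approximation for the variance; the function name is made up for the example.

```python
import math

def occupancy_stats(n, m):
    """Mean and approximate sdev of occupied buckets for n balls in m buckets."""
    # The mean is exact: m * (1 - (1 - 1/m)**n), evaluated in a stable way.
    mean = -m * math.expm1(n * math.log1p(-1.0 / m))
    # The variance uses the usual large-m approximation with x = n / m.
    x = n / m
    var = m * math.exp(-x) * (1.0 - (1.0 + x) * math.exp(-x))
    return mean, math.sqrt(var)

n, m = 3_300_000, 2 ** 32
mean, sdev = occupancy_stats(n, m)
expected_collisions = n - mean             # about 1267 for a random hash
z = (expected_collisions - 1100) / sdev    # how unusual 1100 collisions would be
print(round(mean), round(sdev, 2), round(z, 1))
```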

> CRC32, of all things, is actually slightly better,

With sparse occupancy (3.3e6 out of 4.3e9 buckets is sparse) they may be
comparable.  PythonLabs ran large-scale statistical tests a few years ago on
this.  The Python string hash produced 32-bit numbers indistinguishable from
random (on the #-of-collision basis) as far as we pushed it; crc32 broke
down *very* badly as occupancy increased, with collision rates hundreds of
sdevs worse than random.  So I can't recommend crc32 for general string
hashing (and the Python docs indeed warn about this now), but can recommend
Python's string hash.  By coincidence, it turns out that Python's string
hash is very similar to what later became "the standard" Fowler-Noll-Vo
string hash, which may be the most widely tested "seems as good as random"
fast string hash now:

    http://www.isthe.com/chongo/tech/comp/fnv/

> but only by a hair.  So this kind of hashing probably won't have much
> effect on the statistical results.

Since we're sticking to unigrams, we don't have an insane database burden.
We also (by default) limit ourselves to looking at no more than 150 words
per msg.  So I'm not sure saving some bytes of string storage is "worth it"
for us, and it's very nice that we can get back the exact list of words that
went into computing a score later.  A pile of hash codes wouldn't give the
same loving effect <wink>.

> * Bill Y's byte bucket system has a lot of problems, but there are
> probably some data reduction techniques that would work well.  One
> that occurred to me on the way back from Hackers would be simply to
> keep a 1-byte count of ham/spam hits for each token, and when the ham
> or spam count is about to wrap, cut each count in half, rounding up
> the other value; ie:
>
> 	// increment ham count for bucket i
> 	// apologies, my pseudocode is a bizarre mixture of
> 	// now-unknown languages
>
> 	if (ham[i]=255)
> 	   {
> 	      ham[i]=128;
> 	      spam[i]=(spam[i]/2)+(spam[i]%2)
>     	   }
> 	else
> 	   ham[i]++;
>
> The nice thing about this is that it would bias in favor of more
> recent email; things would "age".  But note this means when building
> the original database you have to feed it ham and spam in small
> chunks, or use a greater resolution before cramming it into
> individual bytes.

Except I didn't get good enough results from his approach to justify
pursuing it here, even leaving the hash codes at the full 32 bits.  When I
went on to squash them to fit in a million buckets, a few false positives
popped up that were just too bad to bear (two can be found in the list
archives):  ham that was so obviously ham that no system that called them
spam would be acceptable to most people.
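
For reference, the one-byte aging counter from the quoted pseudocode might
look like this in Python.  It is a sketch of the quoted idea only, not
anything in the spambayes code; the function name and the use of
bytearrays are assumptions made for illustration.

```python
MAX_COUNT = 255  # largest value a one-byte counter can hold


def increment_ham(ham, spam, i):
    """Count one more ham hit for bucket i.

    When the ham counter is about to wrap, both counters are roughly
    halved (the spam counter rounding up), so older evidence "ages".
    """
    if ham[i] == MAX_COUNT:
        ham[i] = 128
        spam[i] = (spam[i] + 1) // 2   # same as spam/2 + spam%2
    else:
        ham[i] += 1


# One byte per bucket, as in the quoted description.
ham = bytearray([255, 3])
spam = bytearray([7, 9])
increment_ham(ham, spam, 0)   # bucket 0 is full: both counts get halved
increment_ham(ham, spam, 1)   # bucket 1 just counts up
```

A matching increment_spam would swap the roles of the two arrays.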

> * I was playing a week or two back with 1 and 2 token groups, and
> found that a useful technique was, for each new token, to only
> consider the most deviant result.  So if the individual word was .99
> spam, and the two word phrase was .95, it would only consider the .99
> result.  This would probably help with Bill Y's combinatorial scheme.

It could be a viable approach to the problem mentioned above:  a scheme to
suck out more than one word that doesn't systematically generate mounds of
nearly redundant (highly correlated) clues.  We're clearly missing info by
never looking at bigrams (or beyond) now, and that continues to bother me
(even if it doesn't seem to be bothering the error rates <wink>).
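
A sketch of the "most deviant result" idea from the quoted text, assuming
each new token comes with a spamprob for the word alone and one for the
two-word phrase ending in it.  The names and the 16-clue limit follow the
quoted description; none of this is existing spambayes code.

```python
def deviation(prob):
    """Distance of a spamprob from the neutral value 0.5."""
    return abs(prob - 0.5)


def pick_clues(per_token_probs, max_clues=16):
    """Keep one spamprob per token, then the strongest few overall.

    per_token_probs holds, for each token, the candidate spamprobs
    (e.g. unigram and bigram).  Only the most deviant candidate of
    each token survives, so correlated clues are not counted twice.
    """
    best = [max(probs, key=deviation) for probs in per_token_probs if probs]
    return sorted(best, key=deviation, reverse=True)[:max_clues]


# The example from the quoted text: word at .99, phrase at .95.
clues = pick_clues([(0.99, 0.95), (0.60, 0.05), (0.50,)], max_clues=2)
```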

> Dunno if you've tried this; it prevents a particularly spammy or
> hammy sequence from dominating the results (I was only considering
> the 16 or so most deviant results in my bayesian calc, at least on my
> corpus, more than that didn't really help).

There's too much I don't know about everything you're doing to say much
about that.  *All* the biases in Graham's original scheme eventually went
away in this project, and things like clamping the spamprobs into [.01,
0.99] turned out to make it systematically useless to try to use more than
16 words under Graham-combining (it just caused more "cancellation disease",
and so caused more wildly wrong mistakes).  We use 150 now, but IIRC we
generally stopped seeing strong benefits after hitting about 40.  That 40
was better than 16 very much relied on removing all the biases, though (no
"ham boosts", no prob clamping, no minimum word count, no giving unknown
words spamprobs above 0.5 to favor ham, no doubling the ham count when
computing a spam prob, etc).

> * My personal bias (as I think Guido mentioned) is for a multifaceted
> approach, using Bayesian, rules-based (attacking things that bayesian
> isn't good at, like looking for obfuscated url structures), DNSBL,
> and whitelisting heuristics to generate an overall ranking.  So a
> hammy mail from a guy in your address book would bubble up to highest
> priority, whereas something spammy from him would stay neutral.

I'm not sure we really need it.  For example, *lots* of spam has been
discussed on this mailing list, so much so that the python.org email admin
had to castrate SpamAssassin for msgs to this list address else it kept
blocking ordinary list traffic.  My personal email classifier never calls
anything here spam, though, nor does it call the originals of the spams
posted here ham.

I do worry a little about obfuscated HTML.  We strip almost all HTML tags
by default for a reason I've harped on enough <wink>:  all HTML decorations
have very high spamprobs, and counting more than one of them as "a clue"
fools almost every combining scheme into believing the msg containing them
is spam (if you know a msg contains both <br> and <p>, it's not really more
likely to be spam than if you just know it contains <br>!).  So we blind the
classifier to HTML decorations now.

But a spam I forwarded here a week or so ago exploited that:  the spam was
interleaved with size=1 white-on-white news stories and tech mailing list
postings.  The classifier *did* see those, but didn't see the HTML
decorations hiding them.  This was a cancellation-disease-by-construction
kind of msg, and chi-combining scored it near 0.5 as a result (solidly
Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
popular technique, we'll have to take more HTML blinders off the classifier.

> There's lots of room for cooperation between the various approaches
> and multiple agents means it's less likely that a spam will get by.
> In particular, whitelisting heuristics can almost eliminate false
> positives.

I'll let you know if I ever see one <wink>.  Seriously, one of the apps I've
especially got in mind is filtering the high-volume mailing lists on
python.org.  The only kind of FP I see there now in tests is administrative
requests to *-request addresses, which typically consist of a one word
"subscribe" or "unsubscribe" (themselves words with high spamprobs!),
followed by 6KB of employer-generated HTML disclaimers, and/or a forwarded
spam or conference announcement the sender didn't like.  There's still a
very low FP rate even on those, but text analysis simply can't be expected
to nail them every time.  Under SpamAssassin, those recipient addresses are
given strong ham boosts by the python.org email admin.

> * Finally, if anyone needs more spam, I get over 300 a day (I've been
> around a while!) and have a cleaned corpus of over 130MB of spam and
> foreign email.  Also, given all the legit web-marketing email I get
> because of the url registration work I've done, I've got tons of the
> spammiest ham you could imagine.

Wasn't Paul Graham collecting corpora?  Yup, still is:

    http://www.paulgraham.com/spamarchives.html

Getting vast quantities of spam isn't a problem anymore, but getting vast
quantities of ham is.  Since your spammy ham is presumably business-related,
I assume you can't share it.  Or can you?  Mixing spam and ham from
different sources also causes worlds of problems (indeed, we still (by
default) ignore most of the header lines partly for that reason, else the
system gets great results for bogus reasons).


From tim.one@comcast.net  Sun Nov 10 07:27:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 02:27:38 -0500
Subject: [Spambayes] More experiments with weaktest.py
In-Reply-To: <3DCD9E3A.4040809@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEPOCIAB.tim.one@comcast.net>

[Rob Hooft]
> These were results of weaktest with default parameters:

Very interesting!  I'll have to try that too.  Note that in my live email
experiment here, I'm (except for the very start) also scoring/training msgs
in (with small lapses) the order they arrive.  It's been reported before
that this helps; although I still haven't run a controlled experiment on
that, my *impression* is that it does help.

>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 336 (5.1%)
>    Trained on 178 ham and 162 spam
>    fp: 2 fn: 2
>    Total cost: $89.20
>
> If I set the "ham_cutoff" to 10 from 20 to make things more symmetrical
> (spam_cutoff is 90 by default):

The asymmetry is intentional:  most people hate FP more than FN, so by
default I made it harder for a thing to get called spam.  In test after test
we've also seen that spam has a tighter score distribution than ham, which
is a more formal justification for setting the spam cutoff closer to its
endpoint than the ham cutoff.  Setting ham_cutoff as low as 10 is for the
truly paranoid <0.9 wink>.
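
As a sketch, the three-way decision those two cutoffs imply looks like
this.  Scores are on a 0-1 scale here (so 20 and 90 become 0.20 and 0.90);
how a score exactly equal to a cutoff is treated is an assumption, not
something stated in this thread.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a score in [0, 1] to 'ham', 'unsure' or 'spam'.

    The unsure band is deliberately lopsided: a message needs a much
    more extreme score to be called spam than to be called ham.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


for score in (0.05, 0.15, 0.66, 0.9312):
    print(score, classify(score), classify(score, ham_cutoff=0.10))
```

Lowering ham_cutoff to 0.10 only widens the unsure band at the ham end;
the spam side is untouched.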

>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 442 (6.8%)
>    Trained on 292 ham and 152 spam
>    fp: 2 fn: 0
>    Total cost: $108.40
>
> So the database grows by 30% but it didn't help my cost. The training
> set is now unbalanced 2:1. Set spam_cutoff to 80 and ham_cutoff back to
> the default 20:
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 304 (4.6%)
>    Trained on 213 ham and 101 spam
>    fp: 7 fn: 3
>    Total cost: $133.80
>
> This reduces the database by only 10%, but at very high fp cost. Same
> 2:1 unbalance in the training set.
> Back to the default 20:90 then, and set the minimum_prob_strength to 0.0:
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 933 (14.3%)
>    Trained on 497 ham and 437 spam
>    fp: 0 fn: 1
>    Total cost: $187.60
>
> OK, so that didn't work either. How about setting it to 0.2?
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 304 (4.6%)
>    Trained on 134 ham and 177 spam
>    fp: 2 fn: 5
>    Total cost: $85.80
>
> Hm. That is slightly better. Funny, we are suddenly training on more
> spam than ham.... Back to 0.1 anyway ---the differences are too small---
> and set robinson_probability_x = 0.3 (default is 0.5):
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 602 (9.2%)
>    Trained on 54 ham and 616 spam
>    fp: 1 fn: 67
>    Total cost: $197.40
>
> Very interesting: this changes the training ratio to 1:12, at huge cost!
> (less than one in three spams was recognized solidly as such).

Note that in calculations I reported a day or two ago, the measured mean of
spamprobs across 3 different corpora was > 0.5, but not by a lot.  .3 moves
it outside the range minimum_prob_strength ignores, so now every "new word"
is instantly taken as a ham clue, where before all new words were ignored by
default.  So that it grossly inflated the FN rate isn't surprising;
everything that will *eventually* become a hapax is initially taken to be a
ham clue, even when it's never been seen before.
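
A small sketch of why that happens, assuming the usual form of Robinson's
adjustment, (s*x + n*p) / (s + n), and assuming a word counts as a clue
when its spamprob is at least minimum_prob_strength away from 0.5.  The
value s=0.45 below is only an illustrative choice.

```python
def adjusted_spamprob(p, n, s, x=0.5):
    """Blend the observed spamprob p (seen n times) with the prior x.

    s is the weight given to the prior; with n == 0 the result is x.
    """
    return (s * x + n * p) / (s + n)


def is_clue(prob, minimum_prob_strength=0.1):
    """True if the word is far enough from neutral to be counted."""
    return abs(prob - 0.5) >= minimum_prob_strength


# A never-seen word (n == 0) under the default x and under x = 0.3.
default_new = adjusted_spamprob(p=0.0, n=0, s=0.45, x=0.5)
shifted_new = adjusted_spamprob(p=0.0, n=0, s=0.45, x=0.3)
print(default_new, is_clue(default_new))   # neutral, ignored
print(shifted_new, is_clue(shifted_new))   # counted as a ham clue
```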

> Wonder what this could do if changed together with the cutoff....
> Let's move it back to 0.5, and try "robinson_probability_s = 0.3":
>
>    Total messages 6540 (4800 ham and 1740 spam)
>    Total unsure (including 30 startup messages): 348 (5.3%)
>    Trained on 237 ham and 120 spam
>    fp: 7 fn: 2
>    Total cost: $141.60
>
> Ouf.

I hope you're at least gaining some respect for how much work went into
picking the defaults <wink>.

> I am back with the defaults, but I'd still like to do an automated
> optimization of everything simultaneously. Might try that.

Now *that* could be a useful system regardless of scheme.  I've tended to do
hill-climbing across one dimension at a time, occasionally moving batches of
params random amounts at once (to see whether that kicks it out of a
stubborn local minimum).
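
A minimal sketch of hill-climbing one dimension at a time.  The cost
function here is a toy stand-in (a real one would be a full test run
returning its total cost), and the occasional random kick to escape a
local minimum is left out; every name below is hypothetical.

```python
def hill_climb(cost, params, steps, max_rounds=100):
    """Greedy coordinate-wise descent.

    params maps option names to starting values, steps maps the same
    names to step sizes.  Each round tries one step up and one step
    down per option and keeps any move that lowers the cost.
    """
    params = dict(params)
    best = cost(params)
    for _ in range(max_rounds):
        improved = False
        for name, step in steps.items():
            for delta in (step, -step):
                trial = dict(params)
                trial[name] = params[name] + delta
                trial_cost = cost(trial)
                if trial_cost < best:
                    params, best = trial, trial_cost
                    improved = True
                    break
        if not improved:
            break   # no single-option move helps: a local minimum
    return params, best


def toy_cost(p):
    """Stand-in for an expensive test run; minimum at a=3, b=-1."""
    return (p["a"] - 3) ** 2 + (p["b"] + 1) ** 2


found, found_cost = hill_climb(toy_cost, {"a": 0, "b": 0}, {"a": 1, "b": 1})
```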


From tim.one@comcast.net  Sun Nov 10 07:52:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 02:52:42 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCD8B34.6040903@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAACJAB.tim.one@comcast.net>

[Rob Hooft]
> I just added a testdriver to CVS that simulates your behaviour as I
> understand it: It will train on the first 30 messages,

I trained on 1 of each at the start.  If I were to do it over, I'd start
with an empty database <wink>.

> plus on all misclassified and all unsure messages.

Since I'm doing this real-time on my live email, I've been training "on the
worst" (farthest away from correct) msg that arrives in a batch, then
rescoring all the ones that arrived in the batch, then training the worst
remaining, ... until all new ham is below ham_cutoff and all new spam above
spam_cutoff.  I don't know that it matters, just being clear(er).  As things
turned out, this worst-at-a-time training never managed to push one of the
remaining mistakes/unsures into the correct category, *except* for cases
where I got more than one copy of a spam from different accounts at the same
time.  Then it always pushed the copies into scoring near 1.0, since the
hapaxes in the training copy are abundant.
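
The worst-at-a-time loop described above can be sketched as follows.  The
score/train callables and the ToyClassifier are hypothetical stand-ins so
the example runs on its own; they are not the spambayes classifier.

```python
def train_worst_first(batch, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on the worst-scoring message, rescore, and repeat.

    batch is a list of (msg, is_spam) pairs.  Stops once every
    remaining ham scores below ham_cutoff and every remaining spam
    above spam_cutoff.  Returns the messages trained on, in order.
    """
    def error(item):
        msg, is_spam = item
        s = score(msg)
        return (1.0 - s) if is_spam else s   # distance from correct end

    def is_wrong(item):
        msg, is_spam = item
        s = score(msg)
        return s <= spam_cutoff if is_spam else s >= ham_cutoff

    trained, remaining = [], list(batch)
    while remaining:
        wrong = [item for item in remaining if is_wrong(item)]
        if not wrong:
            break
        worst = max(wrong, key=error)
        train(*worst)
        trained.append(worst[0])
        remaining.remove(worst)
    return trained


class ToyClassifier:
    """Scores by the share of words previously seen in spam."""

    def __init__(self):
        self.spam_words, self.ham_words = set(), set()

    def train(self, msg, is_spam):
        (self.spam_words if is_spam else self.ham_words).update(msg.split())

    def score(self, msg):
        words = msg.split()
        s = sum(w in self.spam_words for w in words)
        h = sum(w in self.ham_words for w in words)
        return 0.5 if s + h == 0 else s / (s + h)
```

With two near-identical spams in one batch, training on the first is
enough to push the second near 1.0, which mirrors the duplicate-copy case
described above.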

> It is called "weaktest.py", and uses the good-old-Data/{Sp|H}am
> hierarchy.
>
> I think we should test its performance at different Options settings.
>
> It may not even be very realistic to train on fp's, as I think in my
> private E-mail I won't even check the spam folder very thoroughly at all.

But I will (and do), and my primary interest here is to see how bad things
can get if a user takes mistake-based training to an extreme.  Despite that
it's heavily hapax-driven, it appears to do very well when judged by error
rate.

I've been doing it long enough now, though, that it doesn't do so well
subjectively:  the Unsures are too often bizarre.  For example, I sent a
long reply here to Robert Woodhead, and the copy I got back showed up as
Unsure, with H=1 and S=0.66.  There were a lot of accidental spam hapaxes in
that msg!  Training on it as ham then eliminated about 30 spam hapaxes
(they're now neutral, having been seen in one ham and one spam each).

So it's no different from my POV than the cases where people have sent me
"surprising msgs" in the past, and my carefully trained slice-of-life
classifier (regularly trained on a sampling of correctly classified msgs
too) at the time had no trouble nailing them as ham or spam, with lots of
non-hapax evidence to back it up.

IOW, I'm still sticking to what I guessed before I started this:
mistake-driven training will appear to work well over the short term, but
it's brittle, and is brittle because of its reliance on hapaxes.

> Anyway, a default run for me now gives:
>
>    100 trained:31H+16S wrds:4203 fp:0 fn:0 unsure:47
>    200 trained:35H+25S wrds:6997 fp:0 fn:0 unsure:60
>    300 trained:38H+29S wrds:7503 fp:0 fn:0 unsure:67
>    400 trained:41H+32S wrds:8503 fp:0 fn:0 unsure:73
>    500 trained:45H+38S wrds:8887 fp:0 fn:0 unsure:83
>    600 trained:48H+39S wrds:9010 fp:0 fn:0 unsure:87
>    700 trained:57H+41S wrds:9484 fp:0 fn:0 unsure:98
>    800 trained:63H+43S wrds:9837 fp:0 fn:0 unsure:106
>    900 trained:63H+45S wrds:9936 fp:0 fn:0 unsure:108
>   1000 trained:67H+45S wrds:10001 fp:0 fn:0 unsure:112
>   1100 trained:72H+47S wrds:10268 fp:0 fn:0 unsure:119
>   1200 trained:72H+53S wrds:10386 fp:0 fn:0 unsure:125
>   1300 trained:77H+56S wrds:11178 fp:0 fn:0 unsure:133
>   1400 trained:81H+58S wrds:11546 fp:0 fn:0 unsure:139
>   1500 trained:85H+60S wrds:11734 fp:0 fn:0 unsure:145
>   1600 trained:87H+62S wrds:12023 fp:0 fn:0 unsure:149
>   1700 trained:89H+63S wrds:12161 fp:0 fn:0 unsure:152
>   1800 trained:93H+65S wrds:12287 fp:0 fn:0 unsure:158
>   1900 trained:93H+68S wrds:12449 fp:0 fn:0 unsure:161
>   2000 trained:96H+70S wrds:12637 fp:0 fn:0 unsure:166
>   2100 trained:100H+70S wrds:12742 fp:0 fn:0 unsure:170
>   2200 trained:103H+72S wrds:12984 fp:0 fn:0 unsure:175
>   2300 trained:105H+73S wrds:13047 fp:0 fn:0 unsure:178
>   2400 trained:108H+74S wrds:13220 fp:0 fn:0 unsure:182
>   2500 trained:111H+78S wrds:13407 fp:0 fn:0 unsure:189
>   2600 trained:112H+79S wrds:13485 fp:0 fn:0 unsure:191
>   2700 trained:115H+81S wrds:13647 fp:0 fn:0 unsure:196
>   2800 trained:118H+84S wrds:13797 fp:0 fn:0 unsure:202
>   2900 trained:120H+84S wrds:13845 fp:0 fn:0 unsure:204
>   3000 trained:123H+86S wrds:14131 fp:0 fn:0 unsure:209
> fp: Data/Ham/Set2/n05250.txt score:0.9312
>   3100 trained:128H+87S wrds:14327 fp:1 fn:0 unsure:214
>   3200 trained:129H+90S wrds:14430 fp:1 fn:0 unsure:218
>   3300 trained:132H+91S wrds:14633 fp:1 fn:0 unsure:222
>   3400 trained:133H+93S wrds:14923 fp:1 fn:1 unsure:224
>   3500 trained:133H+94S wrds:14937 fp:1 fn:1 unsure:225
>   3600 trained:133H+98S wrds:15023 fp:1 fn:1 unsure:229
>   3700 trained:135H+102S wrds:15463 fp:1 fn:1 unsure:235
>   3800 trained:135H+107S wrds:15627 fp:1 fn:1 unsure:240
>   3900 trained:138H+107S wrds:15786 fp:1 fn:1 unsure:243
>   4000 trained:140H+111S wrds:15951 fp:1 fn:1 unsure:249
>   4100 trained:142H+116S wrds:16115 fp:1 fn:1 unsure:256
>   4200 trained:142H+117S wrds:16124 fp:1 fn:1 unsure:257
>   4300 trained:143H+122S wrds:16251 fp:1 fn:1 unsure:263
>   4400 trained:143H+126S wrds:16366 fp:1 fn:1 unsure:267
>   4500 trained:144H+130S wrds:16434 fp:1 fn:1 unsure:272
>   4600 trained:144H+134S wrds:16599 fp:1 fn:1 unsure:276
>   4700 trained:146H+135S wrds:16664 fp:1 fn:1 unsure:279
>   4800 trained:147H+135S wrds:16682 fp:1 fn:1 unsure:280
>   4900 trained:149H+138S wrds:16911 fp:1 fn:1 unsure:285
> fp: Data/Ham/Set1/n01590.txt score:0.9092
>   5000 trained:151H+140S wrds:17257 fp:2 fn:1 unsure:288
>   5100 trained:153H+141S wrds:17390 fp:2 fn:1 unsure:291
>   5200 trained:155H+142S wrds:17747 fp:2 fn:1 unsure:294
>   5300 trained:156H+143S wrds:18095 fp:2 fn:1 unsure:296
>   5400 trained:159H+147S wrds:18205 fp:2 fn:1 unsure:303
>   5500 trained:160H+147S wrds:18230 fp:2 fn:1 unsure:304
>   5600 trained:163H+147S wrds:18334 fp:2 fn:1 unsure:307
>   5700 trained:163H+150S wrds:18410 fp:2 fn:1 unsure:310
>   5800 trained:165H+150S wrds:18455 fp:2 fn:1 unsure:312
>   5900 trained:168H+151S wrds:18671 fp:2 fn:1 unsure:316
>   6000 trained:170H+154S wrds:18764 fp:2 fn:1 unsure:321
>   6100 trained:170H+155S wrds:18787 fp:2 fn:1 unsure:322
>   6200 trained:170H+156S wrds:18791 fp:2 fn:1 unsure:323
>   6300 trained:174H+157S wrds:19095 fp:2 fn:1 unsure:328
>   6400 trained:176H+161S wrds:19398 fp:2 fn:2 unsure:333
>   6500 trained:178H+161S wrds:19444 fp:2 fn:2 unsure:335
> Total messages 6540 (4800 ham and 1740 spam)
> Total unsure (including 30 startup messages): 336 (5.1%)
> Trained on 178 ham and 162 spam
> fp: 2 fn: 2
> Total cost: $89.20
>
> (This is on 3 out of my 10 test directories).
>
> Interesting to note so far:
>   * The "Total cost" is much higher than for train-on-all schemes,
>     but it is only due to Unsures; fp and fn are still small.

That matches my experience too, although I started with 1 ham and 1 spam and
had high FP and FN rates over the first few hours.
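
The "Total cost" lines quoted in this thread are all consistent with
charging $10 per fp, $1 per fn and $0.20 per unsure.  Those three prices
are inferred from the reported totals, not stated in the thread, so treat
them as an assumption in this sketch.

```python
def total_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Total cost in dollars of one test run."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


# Default weaktest run quoted above: fp 2, fn 2, 336 unsure.
cost = total_cost(fp=2, fn=2, unsure=336)
unsure_share = 336 * 0.20 / cost
print(f"total ${cost:.2f}, of which {unsure_share:.0%} is from unsures")
```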

>   * The database growth doesn't decay with time after a while;
>     it can be described as:
>        nwords = 9200 + 1.6 * nmessages
>     or alternatively:
>        nwords = 5700 + 40 * ntrained
>     ..as can be seen in the attached png's

I expect that's mostly because there are still (relatively) few total msgs
trained on.
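
The two linear fits quoted above can be checked against the quoted run
log.  A sketch, using three rows copied from that log; the helper names
are made up for the example.

```python
def nwords_from_messages(nmessages):
    """Quoted fit: database size as a function of messages scored."""
    return 9200 + 1.6 * nmessages


def nwords_from_trained(ntrained):
    """Quoted fit: database size as a function of messages trained on."""
    return 5700 + 40 * ntrained


# (messages seen, ham + spam trained, words in database), copied from
# three lines of the quoted run log.
LOG_ROWS = [
    (1000, 67 + 45, 10001),
    (3000, 123 + 86, 14131),
    (6500, 178 + 161, 19444),
]


def relative_errors(rows):
    """Relative error of each fit for each log row."""
    out = []
    for nmessages, ntrained, nwords in rows:
        by_msgs = abs(nwords_from_messages(nmessages) - nwords) / nwords
        by_trained = abs(nwords_from_trained(ntrained) - nwords) / nwords
        out.append((nmessages, by_msgs, by_trained))
    return out


for nmessages, by_msgs, by_trained in relative_errors(LOG_ROWS):
    print(f"{nmessages:5d}  {by_msgs:6.1%}  {by_trained:6.1%}")
```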

>   * The training set is almost balanced, even though I scored
>     many more ham than spam

Curiously, same here!  I get about 500 ham and 100 spam per day, but my
training database now has 47 ham and 41 spam.  It does well, except when it
sucks <wink>.

>   * The unsure rate drops over time:

I haven't measured that, but it's clearly been so here too (as I said
before).

>          0- 1000: 11.2% (minus 3.0% to be fair)
>       1000- 2000:  5.4%
>       2000- 3000:  4.3%
>       3000- 4000:  4.0%
>       4000- 5000:  3.9%
>       5000- 6000:  3.3%

Proving what I've always suspected:  over time, all msgs are repetitions of
ones you've seen before <0.9 wink>.
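
Those per-thousand rates follow directly from the cumulative "unsure"
column of the quoted run log.  A sketch that recomputes them, with the
counts copied from the log lines at 1000 through 6000 messages:

```python
# Cumulative unsure counts from the quoted log, keyed by messages seen.
CUMULATIVE_UNSURE = {
    1000: 112, 2000: 166, 3000: 209,
    4000: 249, 5000: 288, 6000: 321,
}


def unsure_rates(cumulative, window=1000):
    """Percentage of unsures within each window of messages."""
    rates = []
    previous = 0
    for end in sorted(cumulative):
        count = cumulative[end] - previous
        rates.append((end - window, end, round(100 * count / window, 1)))
        previous = cumulative[end]
    return rates


for start, end, rate in unsure_rates(CUMULATIVE_UNSURE):
    print(f"{start:5d}-{end:5d}: {rate:4.1f}%")
```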


From tim.one@comcast.net  Sun Nov 10 08:36:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 03:36:10 -0500
Subject: [Spambayes] My first non-personal personal false positive
In-Reply-To: <B9EFE93D.5C286%francois.granger@free.fr>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEAFCJAB.tim.one@comcast.net>

[Tim, asks for help on a Spanish Unsure]

[François Granger]
> Here are the most probable English equivalents of the Spanish words.
> 'using', 'page', 'have', 'click', 'much', 'but', 'know', 'with',
> 'good', 'this', 'Hi', 'that', 'here', 'the', 'for'
>
> This illustrates the need for properly balanced training sets and
> re-raises the question of language discrimination.

It really doesn't raise it for me:  this was in my personal email, and since
I couldn't read the msg anyway, it may as well have been spam.  I get way
too much email to bother more than 2 seconds with something I can't read.  I
only looked at this one because I'm paying heavy attention to everything the
automatic classifier calls spam.  If I weren't using this system, I would
have thrown out that msg at once.

If I were someone who got any quantity of Spanish ham, the system would have
scored it as ham.  As is, the only Spanish I get is in Spanish spam, so the
system correctly judged it for my personal email mix.

> At least prior language discrimination would allow for a different
> database for each language

Whether that would improve results is a testable hypothesis; I've already
said I doubt it would be helpful, and have no motivation to try such an
experiment myself.

> or for a systematic "unsure" flag for not trained languages.

But I *do* train on Spanish -- and Russian, and Turkish, and Chinese, and
Japanese, and German, and French, and Polish (at least):  in my email mix,
they're all used in spam, aren't used in my ham, and are spam to me because
they're unreadable by me.

> If you put my messages in a Ham training set, you will flag French spams
> as ham because of my French sig ;-)

Nope, the system isn't that stupid (or, rather, it is <wink>).  What it will
do is knock down the spamprobs of those words.  Despite that I've got French
spam in my training data, your msg here -- including the French sig -- got a
solid ham score, with H=1 (to six significant digits) and S=1.1e-11.  The
strongest spam word in fact came from your sig, spamprob('est')=0.84.  It
didn't matter, because I could actually read most of what you wrote, and it
wasn't trying to sell me Viagra <wink>.

> All these words should rate around 0.5 since they are among the
> most common ones in this language.

If I got any French ham, they would rate around 0.5, but for my personal
email it's Just Fine that they're considered spam words.  It wouldn't be OK
for python.org use, but python.org gets a non-trivial amount of non-English
ham, so it trains there accordingly.

> Le courrier est un moyen de communication. Les gens devraient
> se poser des questions sur les implications politiques des choix (ou non
> choix) de leurs outils et technologies. Pour des courriers propres :
> <http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>

Indeed <wink>.


From rob@hooft.net  Sun Nov 10 11:09:28 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 12:09:28 +0100
Subject: [Spambayes] Introducing myself
References: <E18AYxq-0006sT-00@mail.python.org>
	<a05200f03b9f34909a00b@[192.168.1.103]>
Message-ID: <3DCE3E68.2060101@hooft.net>

Robert Woodhead wrote:

> * My personal bias (as I think Guido mentioned) is for a multifaceted 
> approach, using Bayesian, rules-based (attacking things that bayesian 
> isn't good at, like looking for obfuscated url structures), DNSBL, and 
> whitelisting heuristics to generate an overall ranking.  So a hammy mail 
> from a guy in your address book would bubble up to highest priority, 
> whereas something spammy from him would stay neutral. There's lots of 
> room for cooperation between the various approaches and multiple agents 
> means it's less likely that a spam will get by. In particular, 
> whitelisting heuristics can almost eliminate false positives.

I think our very good experience with the bayesian classifier would 
"forbid" using whitelisting. Once a whitelisted feature "leaks" into 
the spam community, it will be useless.

But there is a bayesian solution to it: Make the tokenizer recognize the 
feature that you want to whitelist or blacklist, and emit a new token to 
that effect.

    From:<in-address-book>  --> Will have a low spamprob
    url:numeric-host        --> Will have a high spamprob

We're already doing something like that for a number of the SpamAssassin 
tests (e.g. mime-type tokens). This approach still uses a purely 
bayesian classifier, and it will follow reality automatically.
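
A sketch of what such feature tokens could look like. The function name, 
the address-book argument and the regular expression are all hypothetical 
and not taken from the spambayes tokenizer; only the two token names follow 
the example above.

```python
import re

# Crude test for a URL whose host is a dotted-quad IP address.  The
# lookahead rejects hosts that merely start with digits, such as
# 1.2.3.4.example.com.
NUMERIC_HOST = re.compile(
    r"https?://\d{1,3}(?:\.\d{1,3}){3}(?![\w-]|\.\w)", re.IGNORECASE
)


def extra_tokens(sender, body, address_book):
    """Yield synthetic tokens for features we might want to "whitelist".

    The classifier learns their spamprobs from training like any other
    token, so nothing is hard-wired as good or bad.
    """
    if sender.lower() in address_book:
        yield "from:in-address-book"
    if NUMERIC_HOST.search(body):
        yield "url:numeric-host"


known = {"friend@example.com"}
tokens = list(extra_tokens("Friend@example.com",
                           "see http://192.0.2.7/offer", known))
```

Because the tokens go through normal training, a feature that leaks into 
the spam community would simply drift toward a neutral or spammy spamprob.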

I'd like to note that a lot of what you were saying and what was in
Tim's response (and mine here) is only valid in a train-on-all scheme. 
i.e. like we've been using until a week ago....

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sun Nov 10 12:11:46 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:11:46 +0100
Subject: [Spambayes] More experiments with weaktest.py
References: <LNBBLJKPBEHFEDALKOLCIEPOCIAB.tim.one@comcast.net>
Message-ID: <3DCE4D02.6060907@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
> 
>>These were results of weaktest with default parameters:
> 
> 
> Very interesting!  I'll have to try that too.  Note that in my live email
> experiment here, I'm (except for the very start) also scoring/training msgs
> in (with small lapses) the order they arrive.  It's been reported before
> that this helps; although I still haven't run a controlled experiment on
> that, my *impression* is that it does help.

I toyed with the idea, but that would involve parsing all messages once 
before starting, and sorting them on date. Putting them in a set to 
"randomize" the order is much easier, so I was lazy.

> Setting ham_cutoff as low as 10 is for the
> truly paranoid <0.9 wink>.

Very much so. For my "production" systems, I have ham_cutoff at 40...

> I hope you're at least gaining some respect for how much work went into
> picking the defaults <wink>.

I was just arriving when it happened. But that was on a completely 
different classifier, so I'm still convinced these need to be thoroughly 
tested.

>>I am back with the defaults, but I'd still like to do an automated
>>optimization of everything simultaneously. Might try that.

> Now *that* could be a useful system regardless of scheme.  I've tended to do
> hill-climbing across one dimension at a time, occasionally moving batches of
> params random amounts at once (to see whether that kicks it out of a
> stubborn local minimum).

Hm. That sounds so enthusiastic that I just might commit what I have 
gone through this night. Some more info:

  * No, I have not used a "Simulated Annealing" or "Threshold Accepting"
    yet. Please keep in mind that each step in the optimization takes
    between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
    work PC). This would be way too costly. Just minimization it will be.
  * I tried to use "Simplex optimization" (let a multidimensional
    triangle walk through phase space) on the "Total cost" parameter.
    This was simply disastrous. Phase space consists of plateau regions
    that are exactly flat, joined by huge ridges. Think about that one
    spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
    bang to the cost. This field is impossible to optimize.
  * I designed a new "Flex cost" field. That one does away with the
    "unsure cost". The cost of a message is 0.0 at its own cutoff, and
    increases linearly towards its "false" cost at the other cutoff,
    and increases further to the other end. Hm. Unreadable. A table:

           Score    Spam with this   Ham with this
                      score costs     score costs
            0.00         $ 1.29          $ 0.00
            0.20         $ 1.00          $ 0.00
            0.55         $ 0.50          $ 5.00
            0.90         $ 0.00          $10.00
            1.00         $ 0.00          $11.43

     This field is much more smooth than the total cost field, so I was
     hoping that pure minimization will do. Obviously, the flex cost is
     much, much higher than the total cost because unsures are so much
     more expensive. The flex cost field will also be less sensitive to
     the {sp|h}am_cutoff parameters than the total cost field, because
     there are no sudden cost jumps.
   * Results are not great; I need to experiment more before reporting
     on them.
   * I just committed:
      weaktest.py: introduction of the flexcost measure
      optimize.py: simplex optimization (needs Numeric python; sorry)
      weakloop.py: run weaktest.py repeatedly under simplex optimization
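
The "Flex cost" table above is reproduced by the following sketch. It is 
written from the description and the table only, not copied from 
weaktest.py, so the function name and signature are assumptions.

```python
def flex_cost(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
              fp_cost=10.0, fn_cost=1.0):
    """Cost of one message under the flex-cost measure.

    The cost is zero at the message's own cutoff, rises linearly to
    the full "false" cost at the other cutoff, and keeps rising at
    the same slope beyond it.
    """
    width = spam_cutoff - ham_cutoff
    if is_spam:
        return max(0.0, (spam_cutoff - score) / width) * fn_cost
    return max(0.0, (score - ham_cutoff) / width) * fp_cost


print(" Score   Spam costs   Ham costs")
for score in (0.00, 0.20, 0.55, 0.90, 1.00):
    print(f"  {score:.2f}     ${flex_cost(score, True):5.2f}"
          f"      ${flex_cost(score, False):5.2f}")
```

There is no separate unsure cost: a message between the cutoffs just pays 
a fraction of its "false" cost, which is what removes the sudden jumps.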

Regards,

Rob Hooft
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sun Nov 10 12:28:44 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 10 Nov 2002 13:28:44 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCCEAACJAB.tim.one@comcast.net>
Message-ID: <3DCE50FC.3050005@hooft.net>

---------------------- multipart/mixed attachment
Tim Peters wrote:
> [Rob Hooft]
> 
>>I just added a testdriver to CVS that simulates your behaviour as I
>>understand it: It will train on the first 30 messages,
> 
> 
> I trained on 1 of each at the start.  If I were to do it over, I'd start
> with an empty database <wink>.

This is easy enough to change, but I left it at 30 for now.

> Since I'm doing this real-time on my live email, I've been training "on the
> worst" (farthest away from correct) msg that arrives in a batch, then
> rescoring all the ones that arrived in the batch, then training the worst
> remaining, ... until all new ham is below ham_cutoff and all new spam above
> spam_cutoff.  I don't know that it matters, just being clear(er).  As things
> turned out, this worst-at-a-time training never managed to push one of the
> remaining mistakes/unsures into the correct category, *except* for cases
> where I got more than one copy of a spam from different accounts at the same
> time.  Then it always pushed the copies into scoring near 1.0, since the
> hapaxes in the training copy are abundant.

But I'm doing exactly the same, except that my batch size is always 1 ;-)

>>It may not even be very realistic to train on fp's, as I think in my
>>private E-mail I won't even check the spam folder very thoroughly at all.

> But I will (and do), and my primary interest here is to see how bad things
> can get if a user takes mistake-based training to an extreme.  Despite that
> it's heavily hapax-driven, it appears to do very well when judged by error
> rate.

Hm. There are so few fp/fn's relative to unsures (at least after 30 
messages initial training), that it wouldn't matter much (I think).

>>  * The database growth doesn't decay with time after a while;
>>    it can be described as:
>>       nwords = 9200 + 1.6 * nmessages
>>    or alternatively:
>>       nwords = 5700 + 40 * ntrained
>>    ..as can be seen in the attached png's
> 
> 
> I expect that's mostly because there are still (relatively) few total msgs
> trained on.

Hm, it is more like a sqrt after more messages. See attached image which 
has a sqrt X axis. The fit fits the data even at the lowest end.

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
A non-text attachment was scrubbed...
Name: words3.png
Type: image/png
Size: 13675 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021110/b5905d0f/words3-0001.png

---------------------- multipart/mixed attachment--


From lists@morpheus.demon.co.uk  Sun Nov 10 14:31:30 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 10 Nov 2002 14:31:30 +0000
Subject: [Spambayes] Outlook plugin plus Exchange
References: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
	<LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>
Message-ID: <n2m-g.d6pdqywt.fsf@morpheus.demon.co.uk>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> I am working on code that optionally processes "missed" messages at startup.
> It looks like I can list all unread, unscored mail in my 1000+ item inbox
> very quickly, so this should be feasible.

That sounds like the best option. I haven't had a chance to check
Exchange yet, but with an IMAP store there are no "New mail" events
triggered when I start Outlook with new mail in the IMAP inbox. I'd
expect Exchange to be the same. (I didn't write a new addin, the
spambayes addin does log when it gets a NewMail event, which I can see
via win32traceutil...)

I'll be interested to see the code, in any case, as when I tried to
list unread mail for another project, I couldn't get it to be fast
:-(

Paul.

-- 
This signature intentionally left blank

From trebor@animeigo.com  Sun Nov 10 21:59:28 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Sun, 10 Nov 2002 16:59:28 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEPMCIAB.tim.one@comcast.net>
Message-ID: <a05200f19b9f46cfba659@[192.168.1.103]>

[my apologies if some of the suggestions/comments below have been previously
discussed, I'm still getting up to speed on the list]

>  > I'm particularly impressed with the chi-square work, it looks very
>>  interesting (but more stats for my poor stats-challenged mind to work
>>  on;
>
>So copy and paste <wink>.

Heh, call me old fashioned, but I actually like to know how things 
work, rather than relying on black magic.  ;^)

>  > not to mention that now I'm going to have to get around to
>>  cramming python in there with all the other languages that have
>>  accumulated over the years...).
>
>In return, you can throw twelve other languages out <0.7 wink>.

Why would I ever want to do that?  You never know when you'll need to 
be able to remember PL/C, JPL, APL, TUTOR, etc., etc., etc.  Though I 
pray I never have to remember NOVA MOBOL ("Language of Kings") ;^)

>Testing has pretty much run out of steam here, though.  My error rates are
>so low now I couldn't measure an improvement in a convincing way even if one
>were to be made, and the same is true of a few others here too.  We appear
>to be fresh out of big algorithmic wins, so are pushing on to wrestling with
>deployment issues.

Indeed.  And you also have to start worrying about the metagame; 
assuming your system goes into widespread deployment, what will the 
intelligent spammer (oxymoron) responses be?

>BTW, download the source code and read the comments in tokenizer.py:  the
>results of many early experiments are given there in comment blocks.

Will be doing this over the next day or so.

>Spoken like someone who worked on a rule-based system <wink>.  We have three
>categories:  Ham, Unsure, and Spam, and I haven't seen anything to make me
>believe that a finer distinction than that can be quantitatively justified
>(but my primary test data makes 2 mistakes out of 34,000 msgs now -- that's
>what I mean by "can't measure an improvement anymore", and a finer-grained
>scheme isn't going to touch those 2 mistakes; one of them is formally ham
>because it was sent by a real person, but consists of a one-line comment
>followed by a quote of an entire Nigerian scam spam -- nothing useful is
>ever going to *call* that one ham, and it scores as spam *almost* as solidly
>as an original Nigerian spam).

Ah, but there are more considerations.  First, many people's training 
sets may not be as distinct as yours, so the results might be more 
blurry.  Second, future versions of the software might end up 
including other recognizers in the mix (for example, DNSBL, url 
heuristics, whitelists, stamping systems, etc), so adding a bit of 
flexibility at the start doesn't cost you anything, but could end up 
saving everyone a lot of work down the road.  Since most existing 
mailreader filter schemes are relatively primitive, more than 10 
levels of discrimination isn't going to be all that useful.  But only 
3 would seem to me to be too few.  In a 1-9 scheme, the current 3 
levels would map to (say), 2,5,8.

It's just a syntactic difference, but it gives you precious wiggle room.

>"Score" is my favorite, but isn't catching on.  I believe the word "ham" for
>"not spam" was my invention, and since that one caught on big, I'm not
>fighting to the death for any others <wink>.

Hey, why quit when you're on a roll?

>
>>  * Hashing to a 32-bit token is very fast, saves a ton of memory,
>>  and the number of collisions using the python hash (I appealed for hash
>>  functions on the hackers-l and Guido was kind enough to send me the
>>  source) is low.  About 1100 collisions out of 3.3 million unique
>>  tokens on a training set I was using.
>
>That's significantly better than you could expect from a truly random hash
>function, so is fishy.  Tossing 3.3M balls into 2**32 buckets at random
>should leave 3298733 buckets occupied on average, with an sdev of 35.58
>buckets.  Getting 1100 collisions is about 4.7 sdevs fewer than the random
>mean.

I may have gotten the # of tokens wrong.  Currently my test runs are 
using 3.3M tokens but it may have been fewer when I was doing the 
hash tests.  Maybe 2.3-2.4M tokens at that time?  Anyway, thanks for 
the info about the relative merits of CRC32 and the Python hash; I'd 
been told CRC32 was bad and so was really surprised when it was 
marginally better.
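
The occupancy figures quoted above can be checked with a few lines of
arithmetic.  This is only a minimal sketch, assuming an ideal uniform 32-bit
hash; the function names are made up for illustration, and the standard
deviation uses a Poisson approximation rather than the exact variance, so it
lands close to (not exactly on) the quoted 35.58.

```python
import math

def expected_occupied(n_items, n_buckets):
    """Expected number of non-empty buckets after throwing n_items
    uniformly at random into n_buckets: N * (1 - (1 - 1/N)**n)."""
    # expm1/log1p keep precision when 1/N is tiny.
    return -n_buckets * math.expm1(n_items * math.log1p(-1.0 / n_buckets))

def collision_z_score(n_items, n_buckets, observed_collisions):
    """How many (approximate) standard deviations the observed collision
    count falls below the random-hash expectation."""
    expected = n_items - expected_occupied(n_items, n_buckets)
    sdev = math.sqrt(expected)  # Poisson approximation (an assumption)
    return (expected - observed_collisions) / sdev

occupied = expected_occupied(3_300_000, 2**32)
z = collision_z_score(3_300_000, 2**32, 1100)
print(round(occupied), round(z, 1))
```

Rerunning it with 2.3M or 2.4M tokens instead of 3.3M shows how sensitive
the expected collision count is to the token count.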

>Since we're sticking to unigrams, we don't have an insane database burden.
>We also (by default) limit ourselves to looking at no more than 150 words
>per msg.  So I'm not sure saving some bytes of string storage is "worth it"
>for us, and it's very nice that we can get back the exact list of words that
>went into computing a score later.  A pile of hash codes wouldn't give the
>same loving effect <wink>.

Well, unless I'm missing something, you've got to keep track of every 
token you've ever seen, and you've got to look up every token you 
encounter to determine if it's significant enough to consider in the 
final calc.  If so, assuming the final calc isn't exponential, 
reducing the lookup time/resources can be a big win performance-wise.

Note that since you have the text of the token before you hash it, 
you can keep that around for significant tokens and display it later. 
The only reason to hash is for speed of access to the probability 
data.  The cost of the hashing is the inevitable collisions, which 
blur the probabilities for colliding tokens.

>Except I didn't get good enough results from his approach to justify
>pursuing it here, even leaving the hash codes at the full 32 bits.  When I
>went on to squash them to fit in a million buckets, a few false positives
>popped up that were just too bad to bear (two can be found in the list
>archives):  ham that was so obviously ham that no system that called them
>spam would be acceptable to most people.

I wasn't commenting on the phrase system, or even hashing, but rather 
on data reduction to reduce the memory footprint required of the 
statistical tables (ie: using 1 byte frequency counts vs. 4 byte 
ones).

Also, a cautionary note: just because the current system doesn't 
generate any horrible false positives on your corpora doesn't mean it 
won't do so on Joe Schmoe's.  Or my slightly smelly ham.

>  > * I was playing a week or two back with 1 and 2 token groups, and
>>  found that a useful technique was, for each new token, to only
>>  consider the most deviant result.  So if the individual word was .99
>>  spam, and the two word phrase was .95, it would only consider the .99
>>  result.  This would probably help with Bill Y's combinatorial scheme.
>
>It could be a viable approach to the problem mentioned above:  a scheme to
>suck out more than one word that doesn't systematically generate mounds of
>nearly redundant (highly correlated) clues.  We're clearly missing info by
>never looking at bigrams (or beyond) now, and that continues to bother me
>(even if it doesn't seem to be bothering the error rates <wink>).
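
The "most deviant result" idea quoted above is small enough to sketch.  This
is an illustration only, assuming each candidate (the single word, the
two-word phrase) already has a spam probability; the helper name is
hypothetical and nothing like it is claimed to exist in the codebase.

```python
def most_deviant(probs):
    """Return the probability farthest from the neutral value 0.5."""
    return max(probs, key=lambda p: abs(p - 0.5))

# Word scored .99, two-word phrase scored .95: only .99 is kept.
print(most_deviant([0.99, 0.95]))
```

Keeping one clue per position is what avoids feeding the combiner several
highly correlated clues for the same piece of text.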

Right; and, related to the metagame, you've got to consider responses 
by the spammers.  The initial attempt to defeat these kind of 
recognizers is going to try and exploit cancellation disease, 
probably by having a spammy preamble and a very hammy postscript.

So one possible approach would be to gradually degrade the 
significance of a token the further along in the email it is (both 
during training and recognition).  But of course, then you'll have to 
watch for html email that loads the front of the message with 
invisible ham.  So a parser that spits out only the tokens a human is 
going to see is indicated.

>  > * My personal bias (as I think Guido mentioned) is for a multifaceted
>>  approach, using Bayesian, rules-based (attacking things that bayesian
>>  isn't good at, like looking for obfuscated url structures), DNSBL,
>>  and whitelisting heuristics to generate an overall ranking.  So a
>>  hammy mail from a guy in your address book would bubble up to highest
>>  priority, whereas something spammy from him would stay neutral.
>
>I'm not sure we really need it.  For example, *lots* of spam has been
>discussed on this mailing list, so much so that the python.org email admin
>had to castrate SpamAssassin for msgs to this list address else it kept
>blocking ordinary list traffic.  My personal email classifier never calls
>anything here spam, though, nor does it call the originals of the spams
>posted here ham.

Beware the One True Path.  There is strength in diversity.

Or, as the noted philosopher D. Vader put it, "Don't be too proud of 
this technological terror you have created."  As you will recall, 
those rebel scum managed to craft a nasty false positive.

>
>I do worry a little about obfuscated HTML.  We strip almost all HTML tags
>by default for a reason I've harped on enough <wink>:  all HTML decorations
>have very high spamprobs, and counting more than one of them as "a clue"
>fools almost every combining scheme into believing the msg containing them
>is spam (if you know a msg contains both <br> and <p>, it's not really more
>likely to be spam than if you just know it contains <br>!).  So we blind the
>classifier to HTML decorations now.
>
>But a spam I forwarded here a week or so ago exploited that:  the spam was
>interleaved with size=1 white-on-white news stories and tech mailing list
>postings.  The classifier *did* see those, but didn't see the HTML
>decorations hiding them.  This was a cancellation-disease-by-construction
>kind of msg, and chi-combining scored it near 0.5 as a result (solidly
>Unsure).  It's the only spam of that kind I've seen so far; if it becomes a
>popular technique, we'll have to take more HTML blinders off the classifier.

That's a classic example of metagaming.  Seems to me, the strength of 
the spambayes recognizer is in recognizing the semantics (the spammy 
meaning of the message), not the syntactics.  So train it only on 
what a human would see reading the message.  Have another recognizer 
(either rules-based, bayesian, whatever works) that deals with the 
syntactics, and picks up on the html decoration tricks.  In other 
words, one that looks at what the message says, and another that 
looks at how it is presented.  This will prevent that particular kind 
of simple cancellation attacks.

And that wraps back to the "more responses" suggestion above.  How do 
you rate a hammy message with spammy html ornaments?  Might not "a 
little hammy" be a better response than "beats me, boss!"?

>
>>  There's lots of room for cooperation between the various approaches
>>  and multiple agents means it's less likely that a spam will get by.
>>  In particular, whitelisting heuristics can almost eliminate false
>>  positives.
>
>I'll let you know if I ever see one <wink>.

You will.  And it will be the one email that you really, really 
needed to read.  Murphy's Law guarantees that it will happen.  In 
fact, it typically happens (in my painful personal experience) soon 
after you make comments like the above.

>Getting vast quantities of spam isn't a problem anymore, but getting vast
>quantities of ham is.  Since your spammy ham is presumably business-related,
>I assume you can't share it.  Or can you?

Probably not.  Unless I could process them and just give you the 
tokens and frequencies in some useable format.  I'll see what I can 
do next week, gotta get python up and running along on my Mac.  Also 
gotta get the battlebot finished or my kids will hurt me.

>   Mixing spam and ham from
>different sources also causes worlds of problems (indeed, we still (by
>default) ignore most of the header lines partly for that reason, else the
>system gets great results for bogus reasons).

I do the same, I'm currently just looking at the subject line.

At 12:09 PM +0100 11/10/02, Rob Hooft wrote:
>I think our very good experience with the bayesian classifier would 
>"forbid" to use whitelisting. Once a whitelisted feature "leaks" 
>into the spam community, it will be useless.

Not if the whitelist heuristics are based on the individual user's 
environment, as opposed to global features.

>But there is a bayesian solution to it: Make the tokenizer recognize 
>the feature that you want to whitelist or blacklist, and emit a new 
>token to that effect.
>
>    From:<in-address-book>  --> Will have a low spamprob
>    url:numeric-host        --> Will have a high spamprob
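
A rough sketch of the synthetic-token idea quoted above, assuming a plain
(non-multipart) message and a per-user set of known addresses.  The function
name, the token spellings and the URL pattern are all made up for
illustration; this is not the project's tokenizer.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Hypothetical pattern: an http(s) URL whose host is a dotted-quad number.
NUMERIC_HOST_URL = re.compile(
    r"https?://\d{1,3}(?:\.\d{1,3}){3}(?=[:/\s]|$)", re.IGNORECASE)

def synthetic_tokens(msg_text, address_book):
    """Yield extra tokens describing features of the message, so the
    classifier can learn their spamprobs like any other word."""
    msg = message_from_string(msg_text)
    _, addr = parseaddr(msg.get("From", ""))
    if addr.lower() in address_book:
        yield "from:in-address-book"
    body = msg.get_payload()
    # get_payload() returns a str only for non-multipart messages.
    if isinstance(body, str) and NUMERIC_HOST_URL.search(body):
        yield "url:numeric-host"

sample = "From: Boss <boss@example.com>\n\nSee http://10.1.2.3/offer now\n"
print(list(synthetic_tokens(sample, {"boss@example.com"})))
```

Because the tokens are trained like any other, how strongly they count is
learned per user rather than hard-wired.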


While this is a useful approach, there is (IMHO) a need for users to 
be able to override, or at least modulate, the bayesian results in 
certain circumstances.  The classic example would be your boss 
forwarding a 419 scam to you with the comment "Looks good, I'm going 
to invest in this, what do you think?".  The spamminess might 
overwhelm the low spamprob From:<in-address-book>

A (paranoid) user needs to be able to tell the system "I don't care 
how spammy an email looks, if it's got this feature, I've got to at 
least glance at it with the Mk.1 Eyeball Recognition System".  Note 
that this doesn't mean that it should be declared "clean as the 
driven snow", just "might not be a pile of decomposing lunchmeat"

Yeah, this means that every spam going into Microsoft will eventually 
be from "billg@microsoft.com", but the consequences of this might be 
interesting.  Or at least, amusing.

best,R

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."

From tim.one@comcast.net  Mon Nov 11 00:59:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 19:59:05 -0500
Subject: [Spambayes] More experiments with weaktest.py
In-Reply-To: <3DCE4D02.6060907@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net>

[Tim, notes that his mistake-only training works in the order msgs
 come in]

[Rob Hooft]
> I toyed with the idea, but that would involve parsing all messages once
> before starting, and sorting them on date. Putting them in a set to
> "randomize" the order is much easier, so I was lazy.

That's fine.  For purposes of comparing this against previous tests, I
expect it's even good, since they were randomized too.

> ...
> Hm. That sounds so enthusiastic that I just might commit what I have
> gone through this night.

You did, and I thank you!  Note that there were already three Simplex pkgs
linked from

    http://www.python.org/topics/scicomp/numbercrunching.html

but I know how much fun it is to write such stuff again <wink>.

> Some more info:
>
>   * No, I have not used a "Simulated Annealing" or "Threshold Accepting"
>     yet. Please keep in mind that each step in the optimization takes
>     between 3 minutes (1 set on my home PC) and 15 minutes (10 sets on my
>     work PC). This would be way too costly. Just minimization it will be.

Understood.

>   * I tried to use "Simplex optimization" (let a multidimensional
>     triangle walk through phase space) on the "Total cost" parameter.
>     This was simply disastrous. Phase space consists of plateau regions
>     that are exactly flat, joined by huge ridges. Think about that one
>     spam that goes from a 0.11 to a 0.09 score: it will add $9.80 in one
>     bang to the cost. This field is impossible to optimize.

Yes, it's a sum of step functions in the end, and at every point "the
derivative" is either 0 or infinite, depending on where you are and which
direction you look.  Making a new "smooth" cost measure was thoroughly
appropriate:

>   * I designed a new "Flex cost" field. That one does away with the
>     "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>     increases linearly towards its "false" cost at the other cutoff,
>     and increases further to the other end. Hm. Unreadable.

The code is clear enough, though.  What I didn't understand is why each term
in the flexcost is divided by the difference between the (fixed per run)
cutoff levels:   / (SPC - HC).  That seems to systematically penalize, e.g.,
ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
In the limit, if someone wanted a binary classifier (ham_cutoff ==
spam_cutoff), any mistake would be charged an infinite penalty.

> A table:
>
>            Score    Spam with this   Ham with this
>                       score costs     score costs
>             0.00         $ 1.29          $ 0.00

It's hard to see where that comes from.  Assuming ham_cutoff is 0.2 and
spam_cutoff 0.9, and so a spam scoring 0.0 works out to $1 *
(.9-0.0)/(.9-.2) ?

>             0.20         $ 1.00          $ 0.00
>             0.55         $ 0.50          $ 5.00
>             0.90         $ 0.00          $10.00
>             1.00         $ 0.00          $11.43
>
>      This field is much more smooth than the total cost field, so I was
>      hoping that pure minimization will do. Obviously, the flex cost is
>      much, much higher than the total cost because unsures are so much
>      more expensive. The flex cost field will also be less sensitive to
>      the {sp|h}am_cutoff parameters than the total cost field, because
>      there are no sudden cost jumps.

Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a
DivideByZero exception, so it's sure sensitive there <wink>.  I suspect it
might work better if the "/(SPC-HC)" business were simply removed?
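
For reference, here is a minimal sketch that reproduces the quoted table.
The formula is reconstructed from the table's numbers, assuming ham_cutoff
0.2, spam_cutoff 0.9, a $10 false-positive cost and a $1 false-negative
cost; it is not copied from weaktest.py, and the names are illustrative.

```python
FP_COST = 10.00  # assumed cost of a false positive
FN_COST = 1.00   # assumed cost of a false negative

def flex_cost(score, is_spam, ham_cutoff=0.2, spam_cutoff=0.9):
    """Cost is 0 at the message's own cutoff and grows linearly, reaching
    the full 'false' cost at the other cutoff and continuing beyond it."""
    width = spam_cutoff - ham_cutoff  # the "/(SPC-HC)" term; 0 if cutoffs coincide
    if is_spam:
        return FN_COST * max(0.0, (spam_cutoff - score) / width)
    return FP_COST * max(0.0, (score - ham_cutoff) / width)

for score in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f  $%5.2f  $%5.2f" % (score,
                                    flex_cost(score, True),
                                    flex_cost(score, False)))
```

With ham_cutoff == spam_cutoff the division raises ZeroDivisionError, which
is the binary-classifier problem described above.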

>    * Results are not great; I need to experiment more before reporting
>      on them.
>    * I just committed:
>       weaktest.py: introduction of the flexcost measure
>       optimize.py: simplex optimization (needs Numeric python; sorry)
>       weakloop.py: run weaktest.py repeatedly under simplex optimization

I've been running weakloop.py over two sets of my c.l.py data while typing
this.  That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total
msgs.  It's been thru the whole business about 25 times now.  At the start,

Trained on 88 ham and 66 spam
fp: 0 fn: 0
Total cost: $30.80
Flex cost: $212.3120
x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31

It's having a hard time doing better than that.  The best so far seems to be

Trained on 82 ham and 66 spam
fp: 0 fn: 0
Total cost: $29.60
Flex cost: $200.0924
x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09

which is so close to the starting point that it's hard to believe it's
finding something "real".  It *does* seem to be in a nasty local minimum,
though, as the next attempt was:

Trained on 118 ham and 69 spam
fp: 1 fn: 0
Total cost: $47.20
Flex cost: $344.7334
x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73

I'm afraid it looks like it's eventually going to converge on the most
delicate possible settings that barely manage to avoid that 1 FP.
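
The three "Total cost" figures above are consistent with a simple weighted
count.  The sketch below is an inference from those numbers, not a reading
of the code: it assumes weights of $10 per fp, $1 per fn and $0.20 per
unsure, and that every trained-on message that was not an fp or fn was an
unsure.

```python
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20  # assumed weights

def total_cost(n_trained, fp, fn):
    """Total cost if each trained-on message was an fp, an fn or an unsure."""
    unsure = n_trained - fp - fn
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST

# (ham trained + spam trained, fp, fn) for the three runs shown above
for n_trained, fp, fn in ((88 + 66, 0, 0), (82 + 66, 0, 0), (118 + 69, 1, 0)):
    print("$%.2f" % total_cost(n_trained, fp, fn))
```

Under those assumptions the single FP in the third run accounts for $10 of
its $47.20.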


From tim.one@comcast.net  Mon Nov 11 01:17:46 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 20:17:46 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCE50FC.3050005@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECGCJAB.tim.one@comcast.net>

[Tim]
>> ... my primary interest here is to see how bad things can get if
>> a user takes mistake-based training to an extreme.  Despite that
>> it's heavily hapax-driven, it appears to do very well when judged by
>> error rate.

[Rob Hooft]
> Hm. There are so few fp/fn's relative to unsures (at least after 30
> messages initial training), that it wouldn't matter much (I think).

As I tried to explain later, the psychological impact of the Unsures isn't
attractive, though -- they remain bizarre to human eyes.  When I got up
today, I got 6 new Unsure spam:  human growth hormone, gay porn, life
insurance, mortgage rates, a msg that made no sense (empty except for a
Yahoo auto-generated sig), and Genuine Leather Jackets.  It's not picking up
on general "this is advertising" clues, or even on general "this is gay
porn" clues.  Indeed, "XXX" is still a hapax!  This particular HGH spam will
never get through again, because training it found 80(!) hapaxes unique to
it.  It's not going to do much to stop other HGH spam, though -- this one
was especially chatty, and added words like 'forget', 'hair', 'lose', 'lost'
and 'anywhere' to the collection of (what are now, after training on it)
spam hapaxes -- just as previous HGH spam trained on didn't stop this one.
To my eyes, I had already told it about HGH spam, and I'm irked that it
showed me another one.  Ditto gay porn, ditto life insurance, etc.


[on database growth as a function of # of msgs]
> Hm, it is more like a sqrt after more messages. See attached image which
> has a sqrt X axis. The fit fits the data even at the lowest end.

Cool!  That was a dramatic graph indeed.  Soon there will be no mysteries
remaining <wink>.


From tim.one@comcast.net  Mon Nov 11 02:00:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 10 Nov 2002 21:00:20 -0500
Subject: [Spambayes] Proposing to rename some fundamental options
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEPOCHAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECJCJAB.tim.one@comcast.net>

[Tim]
> The original names made more sense when we had half a dozen competing
> schemes.
>
> Current                         Proposed
> -------                         --------
> robinson_probability_x          unknown_word_prob
> robinson_probability_s          unknown_word_strength
> robinson_minimum_prob_strength  minimum_prob_strength

This renaming has been done.  It should have no effect on pickles or
databases (i.e., no need to retrain).


From anthony@interlink.com.au  Mon Nov 11 02:22:26 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Mon, 11 Nov 2002 13:22:26 +1100
Subject: [Spambayes] helping push the ham score for "nigeria" higher.
Message-ID: <200211110222.gAB2MQB11817@localhost.localdomain>


apologies for the marginal relevance, but it entertained me :)

http://news.bbc.co.uk/1/hi/world/africa/2423283.stm
"I am writing to you in the hope that you are under god and well. My naming
is Professor Isoun Turner, and I am having hope you can assist. We are 
having a communications sattelite worth $15 millon US dollars that needs
to be launched, but we need to find an international  launch pad"


From tim.one@comcast.net  Mon Nov 11 05:42:51 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 00:42:51 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f19b9f46cfba659@[192.168.1.103]>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>

[Robert Woodhead]
> ...
> Heh, call me old fashioned, but I actually like to know how things
> work, rather than relying on black magic.  ;^)

You'll like this code, then!  We hate "mystery knobs", and everything has a
purpose.  A purpose may not make sense, but at least it has one.

> ...
> Indeed.  And you also have to start worrying about the metagame;
> assuming your system goes into widespread deployment, what will the
> intelligent spammer (oxymoron) responses be?

I expect to get rich by selling spammer software to defeat this latest round
of classifiers, so it's not that I can't tell you what their responses will
be, it's that I don't want to reveal trade secrets <wink>.  Indeed, if there
are technically savvy spammers, they're subscribed to this list (and others
like it).

> ...
> Ah, but there are more considerations.  First, many people's training
> sets may not be as distinct as yours, so the results might be more
> blurry.

Of all the things this project has done I find lacking in other projects,
this is the part I think gives this project its clearest advantage:  we have
a statistically sound testing framework, more than one person testing on
more than one corpus, people are beat up for running sloppy tests, and major
algorithm improvements have been vetted by many here on their own data, and
publicly reported results.  Winners survived and losers got purged from the
codebase, and no single test corpus ruled that.  Even for people with a
single test corpus, the testing framework slices-and-dices it into multiple
runs, so that results specific to a quirk of one subset can't be mistaken
for "the truth".  The project's TESTING.txt talks more about this.

My tech mailing-list data turned out to be easier than most people's,
seemingly because almost all forms of advertising, and of HTML, are despised
on tech mailing lists.  But I've got other, harder test data too, and at
least one person here (hi, Anthony!) has a flatly horrid corpus.

> Second, future versions of the software might end up including other
> recognizers in the mix (for example, DNSBL, url heuristics, whitelists,
> stamping systems, etc), so adding a bit of flexibility at the start
> doesn't cost you anything, but could end up saving everyone a lot of
> work down the road.

We'll define a stable API for accessing this system.  If people want to
combine it with other systems, that's fine, and Python excels at playing
nice with other systems.  If someone wants to add, e.g., a DNSBL gimmick to
*this* codebase, they should write a new module to do so.  I don't want
fundamentally different approaches mixed into one module, let alone one
function.

> Since most existing mailreader filter schemes are relatively primitive,
> more than 10 levels of discrimination isn't going to be all that useful.
> But only 3 would seem to me to be too few.  In a 1-9 scheme, the
> current 3 levels would map to (say), 2,5,8.

Let me clarify:  I don't object to defining a billion levels, the problem is
that I've seen no evidence that the algorithm in use here *can* provide more
than 3 meaningful levels.  chi-combining usually gives extreme scores.  The
median spam score is (to 6 significant digits) 1.0; the median ham score is
on the order of 1e-10.  The difference between, e.g., 1e-20 and 1e-5 appears
meaningless, despite that it's 15 orders of magnitude.  When chi doesn't
give an extreme score, it tends to give one near 0.5, and which side of 0.5
it lies on doesn't appear to have strong correlation with whether a thing is
ham or spam.  The system is saying "I'm lost!" then, and it is.

In effect, it's a 1-bit classifier but with a very useful middle ground.
That it only gives about 1 bit of info follows from that the underlying math
is a statistical accept/reject test (a two-outcome decision).  Well, it's
actually two accept/reject tests under the covers (one for ham, one for
spam), and that's where the middle ground comes from (they both accept or
both reject).
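
To make the "two tests under the covers" concrete, here is a rough sketch of
chi-squared combining as I understand the scheme: one statistic measures how
collectively spammy the clues look, the other how collectively hammy, and
the final score is their rescaled difference.  It is written from the
description rather than copied from the classifier, the names are
illustrative, and it assumes every word probability lies strictly between 0
and 1.

```python
import math

def chi2q(x2, dof):
    """P(chi-squared with an even number of degrees of freedom >= x2)."""
    assert dof % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)  # guard against rounding slightly above 1

def chi_combine(probs):
    """Combine per-word spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    spam_stat = -2.0 * sum(math.log(1.0 - p) for p in probs)
    ham_stat = -2.0 * sum(math.log(p) for p in probs)
    S = 1.0 - chi2q(spam_stat, 2 * n)  # near 1 when clues look spammy
    H = 1.0 - chi2q(ham_stat, 2 * n)   # near 1 when clues look hammy
    return (S - H + 1.0) / 2.0

print(chi_combine([0.99] * 10))                # strongly spammy clues
print(chi_combine([0.01] * 10))                # strongly hammy clues
print(chi_combine([0.99] * 10 + [0.01] * 10))  # both kinds at once
```

When both tests fire, or neither does, S and H cancel and the score falls
near 0.5, which is the Unsure middle ground.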

If we were to call our middle ground 5, what good would that do anyone else?
It doesn't mean we judge the odds of a msg being spam at 1 in 2.  It means
we have no idea.  It certainly doesn't mean what, e.g., a 5 coming out of
SpamAssassin means.  "Unsure" means what it says.  If, in the future, a new
and better algorithm comes along with 6 meaningful digits, then I expect a
new X- header would be defined to report it.

> It's just a syntactic difference, but it gives you precious wiggle room.

I'll leave more on this to people adding headers (the client I'm using
doesn't use headers, but does attach integer score (in 0-100) metadata to
msgs).

[on hash collisions]
> ...
> I may have gotten the # of tokens wrong.  Currently my test runs are
> using 3.3M tokens but it may have been fewer when I was doing the
> hash tests.  Maybe 2.3-2.4M tokens at that time?  Anyway, thanks for
> the info about the relative merits of CRC32 and the Python hash; I'd
> been told CRC32 was bad and so was really surprised when it was
> marginally better.

Hard to say.  Neither CRC32 nor Python's string hash makes any effort toward
being "cryptographically secure", and Python's string hash is in fact and
deliberately "better than random" in some common cases:

>>> hash('x1')
739453787
>>> hash('x2')
739453784
>>> hash('x3')
739453785
>>> hash('x4')
739453790
>>>

That is, it's very regular in a way that most often yields fewer 32-bit
collisions than a truly random hash function would yield when fed input
strings with regularities.  That eventually breaks down if you throw enough
strings at it -- but it doesn't get "worse than random" then either, so far
as it's ever been pushed.

> ...
> Well, unless I'm missing something, you've got to keep track of every
> token you've ever seen,

So far we have, but there's slow-motion work in progress on database
pruning.

> and you've got to look up every token you encounter to determine if
> it's significant enough to consider in the final calc.

Yes, and that will probably always be true.

> If so, assuming the final calc isn't exponential, reducing the lookup
> time/resources can be a big win performance-wise.

I don't believe so.  When using a Python dict as "the database", the time
for scoring a msg is minor compared to the time taken by parsing and
tokenization, and especially compared to the time just to get the msg *into*
the system (whether that's file I/O, or socket I/O, or some email pkg's
programming API, or whatever -- that part is the bottleneck when using a
dict; when not using a dict, database access time may become a burden, and
most databases in use here require string keys even if you're working with
ints -- the database user has to convert the hash code to a string!  Other
databases (like ZODB) could use ints directly as keys, but they're rare.).

> Note that since you have the text of the token before you hash it,
> you can keep that around for significant tokens and display it later.

Good point!  I had overlooked that indeed.

> The only reason to hash is for speed of access to the probability
> data.

Feel free to experiment; as above, I don't have reason to suspect that
switching to hash codes would speed anything here, except for Jeremy's ZODB
database (which could switch to using an IOBTree, which is zippier than an
OOBTree).

> The cost of the hashing is the inevitable collisions, which
> blur the probabilities for colliding tokens.

Another cost is obscuring the code.

> ...
> I wasn't commenting on the phrase system, or even hashing, but rather
> on data reduction to reduce the memory footprint required of the
> statistical tables (ie: using 1 byte frequency counts vs. 4 byte
> ones).

Ours are actually unbounded, but I don't have any problem with the memory
footprint now.  Others do.  It seems more fruitful at this point to
concentrate on ways to reduce the # of tokens, rather than the size burden
per token.  BTW, see the neil*.py files for how one person here builds a
lean scoring-only CDB database -- you can store things any way you like,
provided that the database access function is fiddled to convert to what the
classifier expects to use.  I don't believe such conversion is a significant
time burden, but I haven't run the CDB variant and so haven't timed it
(Neil, do you have gripes about memory or time?  Spit 'em out.).

> Also, a cautionary note: just because the current system doesn't
> generate any horrible false positives on your corpora doesn't mean it
> won't do so on Joe Schmoe's.  Or my slightly smelly ham.

Sure, but I'm a realist:  any non-trivial scheme has a non-zero FP rate.
That's life.  What users choose to do about that isn't for this project to
dictate.  It is our responsibility to say up-front that there will be false
positives, and we do say so.

> ...
> Right; and, related to the metagame, you've got to consider responses
> by the spammers.  The initial attempt to defeat this kind of
> recognizer is going to try and exploit cancellation disease,
> probably by having a spammy preamble and a very hammy postscript.

They can't really defeat this scheme that way.  At best they can hope to
push msgs into Unsure territory.  What constitutes "very hammy" is a
function of each user's database here, and no generic blob of text is going
to score high for hamminess everywhere.  The spam in question happened to
include a news story about the DC-area snipers, and that was very hammy for
*me* because I live in that area and many friends and relatives had
corresponded about the snipers (including forwarding the text of that very
news story, as if we were suffering a news blackout here <wink>).  Even so,
the message ended up as Unsure for me, not as Ham.  That's to the credit of
chi-combining, which is very good about knowing when it's confused.
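
For readers who have not looked at the combining code, the following is a
minimal sketch of chi-squared combining written from memory.  Treat the
function names and exact arithmetic as assumptions, not as a copy of the
project's classifier.  It shows why a message holding equally strong
spam and ham clues lands near 0.5 instead of at either extreme.

```python
from math import exp, log

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-token spamprobs (each strictly between 0 and 1)."""
    n = len(probs)
    if n == 0:
        return 0.5
    ln_ham = sum(log(p) for p in probs)
    ln_spam = sum(log(1.0 - p) for p in probs)
    S = 1.0 - chi2Q(-2.0 * ln_spam, 2 * n)   # evidence for spam
    H = 1.0 - chi2Q(-2.0 * ln_ham, 2 * n)    # evidence for ham
    return (S - H + 1.0) / 2.0

# Ten strongly spammy clues alone, then the same clues plus ten equally
# strong hammy ones (the "spammy preamble, hammy postscript" attack).
spam_score = chi_combine([0.99] * 10)
mixed_score = chi_combine([0.99] * 10 + [0.01] * 10)
print(spam_score, mixed_score)
```

In the mixed case both S and H come out close to 1, so they cancel and
the score sits in the middle: the combiner reports confusion instead of
picking a side.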

> So one possible approach would be to gradually degrade the
> significance of a token the further along in the email it is (both
> during training and recognition).

I think there is reason to believe that spammers have to get your attention
early.  OTOH, many pieces of incriminating evidence also live at the end of
spams ("this is not spam!" blurbs, the explanation that you got this because
you're on an opt-in list run by one of their "partners", references to
various state and federal bills, the "unsubscribe me" URL slash address
harvester, etc).

The white-on-white spam I mentioned before had hammish stuff at the start,
and at the end, and between each pair of paragraphs.

> But of course, then you'll have to watch for html email that loads the
> front of the message with invisible ham.  So a parser that spits out
> only the tokens a human is going to see is indicated.

Yup.  Guido suggested that at the start, but that level of HTML analysis
gets a lot more expensive too.  We'll see.

BTW, on large tests this system scores about 80 msgs/second on my box,
including everything (system time, training, I/O, parsing, tokenizing,
scoring, reporting, recording, and analyzing results -- this is # of msgs
divided by elapsed wall-clock time).  We could afford to get slower, if
necessary.

> ...
> Beware the One True Path.  There is strength in diversity.

Let a thousand classifiers bloom.  If someone here wants to volunteer the
effort to try a different approach, that's always been welcome.  But the
results have been so good sticking to one basic approach that I don't see
that happening.  We ended up doing one thing exceedingly well, and that's a
contribution to diversity too, of a kind you may be undervaluing <wink>.

> Or, as the noted philosopher D. Vader put it, "Don't be too proud of
> this technological terror you have created."  As you will recall,
> those rebel scum managed to craft a nasty false positive.

I don't view an FP as being as costly as needing to build a new Death Star.
For goodness sake, this is email we're talking about -- anyone trusting a
truly critical msg to email is dreaming to begin with.

> ...
> That's a classic example of metagaming.  Seems to me, the strength of
> the spambayes recognizer is in recognizing the semantics (the spammy
> meaning of the message), not the syntactics.

Well, it's got no semantic knowledge at all.  It doesn't even know which
language a msg is written in, let alone what it means, and has no concept of
"word" beyond "stuff that appears between whitespace".  It's very much
focused on purely local lexical structure.
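
A simplified sketch of that "between whitespace" view of a word follows.
The lower-casing, the 3-to-12 character window and the "skip:" summary
token are assumptions based on recollection of tokenizer.py (they are
consistent with clues such as 'work,' and 'skip:u 10' posted to this
list), and the real tokenizer does a good deal more than this.

```python
def tokenize_body_text(text, maxword=12):
    """Yield tokens from plain text by splitting on whitespace only."""
    for word in text.lower().split():
        n = len(word)
        if 3 <= n <= maxword:
            yield word                   # punctuation stays attached
        elif n > maxword:
            # Summarize a long run by its first character and its length
            # rounded down to a multiple of 10.
            yield "skip:%c %d" % (word[0], n // 10 * 10)
        # Words shorter than 3 characters are dropped.

tokens = list(tokenize_body_text(
    "It DOES work, no obligation!  Supercalifragilistic"))
print(tokens)
```

Note that 'work,' and 'work' would be different tokens here; nothing in
this scheme knows that they are the same word.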

> So train it only on what a human would see reading the message.

We get a lot of value out of mining a handful of header lines.  We also get
a lot of value out of tokenizing embedded "invisible" URLs.  The theme here
is that we tokenize "what works", and that's driven by measured error rates;
philosophy doesn't enter into that part.

> Have another recognizer (either rules-based, bayesian, whatever works)
> that deals with the syntactics, and picks up on the html decoration
> tricks.  In other words, one that looks at what the message says, and
> another that looks at how it is presented.  This will prevent that
> particular kind of simple cancellation attacks.

A rule-based system seems more effective to me too against that particular
gimmick.  Also against viruses.

> And that wraps back to the "more responses" suggestion above.  How do
> you rate a hammy message with spammy html ornaments?  Might not "a
> little hammy" be a better response than "beats me, boss!"?

I have no real idea, but fear that presuming "yes" is presuming a lot of
intelligence that systems parsing this header won't actually have.  The
fancier the rating scheme the fancier they have to be too.  In the end, the
user has to decide what to do about everything that's not called ham, no
matter how many or few the non-ham categories.  As a user myself, I've got
no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats
me".  That already gives two categories I have to check, and that's enough.
I do find it useful that my client can sort on the score metadata, and there
are proposals here too to add fancier header lines beyond the basic
spam/ham/unsure one.

[on FPs]
> You will.

Of course I will.

> And it will be the one email that you really, really needed to read.

It doesn't matter -- I review all my spam.  Other people won't, and so it
goes.

> Murphy's Law guarantees that it will happen.  In fact, it typically
> happens (in my painful personal experience) soon  after you make
> comments like the above.

You realize you're overselling badly here, right <wink>?

> ...
> I do the same, I'm currently just looking at the subject line.

Look at tokenize_headers() in tokenizer.py for a number of other
corpus-independent header lines that proved useful to tokenize.  Surprising
but true:  we can get a very good classifier by looking at this handful of
header lines alone.  Or by looking at the body alone.  Looking at both takes
longer <wink>.

> ...
> While this is a useful approach, there is (IMHO) a need for users to
> be able to override, or at least modulate, the bayesian results in
> certain circumstances.  The classic example would be your boss
> forwarding a 419 scam to you with the comment "Looks good, I'm going
> to invest in this, what do you think?".  The spamminess might
> overwhelm the low spamprob From:<in-address-book>

This is akin to my "entire Nigerian scam quote" FP, and it's all but certain
that the spam content would overwhelm the brief "from the boss" clues.
OTOH, if my boss didn't wait for my reply and went ahead and invested
anyway, the subsequent financial disgrace would open the door for me to take
his job.  After all, he relied on me for advice, so who more logical to
succeed him?

two-winners-and-only-one-loser-ly y'rs  - tim


From popiel@wolfskeep.com  Mon Nov 11 06:11:25 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 10 Nov 2002 22:11:25 -0800
Subject: [Spambayes] More experiments with weaktest.py 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net> 
Message-ID: <20021111061126.211B5F4CD@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>I've been running weakloop.py over two sets of my c.l.py data while typing

I've now run weakloop.py over three sets of my private data;
that's 3*200 ham and 3*200 spam, for a total of 1200 messages.

The best few it came up with were:

Trained on 39 ham and 61 spam
fp: 4 fn: 3
Total cost: $61.60
Flex cost: $189.7713
x=0.5040 p=0.1040 s=0.4400 sc=0.902 hc=0.204 189.77

Trained on 38 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.60
Flex cost: $189.9767
x=0.5060 p=0.1060 s=0.4300 sc=0.903 hc=0.206 189.98

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.2842
x=0.5054 p=0.0980 s=0.4436 sc=0.905 hc=0.209 189.28

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.8255
x=0.5033 p=0.0981 s=0.4456 sc=0.903 hc=0.206 189.83

Trained on 37 ham and 61 spam
fp: 4 fn: 2
Total cost: $60.40
Flex cost: $189.8260
x=0.5026 p=0.1000 s=0.4458 sc=0.902 hc=0.207 189.83

There were a few where it trained on a couple more or less ham and
spam... but I had to go hunting for them.  I find it quite interesting
that my ham:spam training ratio here (about 2:3, about where all my
ratio tests have been pointing as a sweet spot) is significantly
different than that reported by others (which has been much closer
to 1:1 or favoring more ham than spam).  I guess my corpus really
is unusual.
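
The "Total cost" figures above can be reproduced with a little
arithmetic.  This sketch assumes $10 per false positive and $1 per false
negative (the values used in the flex cost table elsewhere in this
thread), $0.20 per unsure, and that weaktest trains on every mistake and
every unsure, so that unsures = trained - fp - fn.  The last two points
are assumptions that happen to fit the reported numbers.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Dollar cost of a run under the assumed per-message penalties."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

def unsure_from_training(trained, fp, fn):
    """Assumed: every trained message was a mistake or an unsure."""
    return trained - fp - fn

runs = [  # (ham trained, spam trained, fp, fn, reported total cost)
    (39, 61, 4, 3, 61.60),
    (38, 61, 4, 2, 60.60),
    (37, 61, 4, 2, 60.40),
]
computed = []
for ham, spam, fp, fn, reported in runs:
    unsure = unsure_from_training(ham + spam, fp, fn)
    computed.append(round(total_cost(fp, fn, unsure), 2))
print(computed)
```

Under these assumptions the four false positives account for $40 of each
total and the unsures for roughly $18.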

FWIW, I'm running it again with all 10 of my sets (4000 messages
total) overnight.

- Alex

From popiel@wolfskeep.com  Fri Nov  8 00:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 07 Nov 2002 16:06:27 -0800
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: Message from "Tim Peters" <tim@zope.com> 
	<BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
References: <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com> 
Message-ID: <20021108000627.2B918F5CC@cashew.wolfskeep.com>

In message:  <BIEJKCLHCIOIHAGOKOLHMEDFDOAA.tim@zope.com>
             "Tim Peters" <tim@zope.com> writes:
>[Anthony Baxter]
>> Note that "random sample" is not as trivial as all that, either - if
>> you have a very high ham:spam ratio in your training DB, your accuracy
>> will suffer (see the tests from Alex, myself and others).
>
>I still need to try to make sense of those tests.  A real complication is
>that more than one thing changes when trying to test ratios:  it's not just
>the ratio that changes, it's the absolute number of each trained on too.

True.

>For example, (a) train on 5000 ham and 1000 spam; or, (b) train on 50000 ham
>and 10000 spam.  The ratios are identical.  Do we expect the error rates to
>be identical too?  I don't, but haven't tried it.

I have tried this, and the effects of ratio were diminished
as the training set size increased.  For details, see
http://www.wolfskeep.com/~popiel/spambayes/ratio2 .  The
tests were done with gary-combining, not chi-square, so I
really ought to rerun them.

>I expect the latter would do better than the former, despite the identical
>ratios, simply because more msgs allow better spamprob estimates.

It depended on what the ratio in question was... for 1:4
ham:spam, increased training set size hurt instead of helped,
in the ranges that I was able to test.  For 1:1, increased
training helped instead of hurt.

>Something missing in "the ratio tests" is a rationale (even an
>after-the-fact one) for believing there's some aspect of the system that's
>sensitive to the ratio.  The combining method certainly is not, and the
>spamprob estimation (update_probabilities()) deliberately works with
>percentages instead of raw counts so that the ham::spam training ratio
>has no direct effect on the spamprobs calculated.

Eh, I have a perfectly good rationale for believing that
something is sensitive to the ratio: the tests I've run
show such a sensitivity.  What's missing is a theory on
_why_ there's a sensitivity. ;-)
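
For reference, the percentage-based estimate described in the quoted
text can be sketched as below.  This is written from memory, not copied
from update_probabilities(); the function names are hypothetical, and
the s=0.45, x=0.5 defaults are assumed from the starting values that
weakloop.py prints.

```python
def raw_spamprob(spamcount, hamcount, nspam, nham):
    """Estimate from per-class percentages; token must have been seen."""
    spamratio = spamcount / nspam    # fraction of spam containing the token
    hamratio = hamcount / nham       # fraction of ham containing the token
    return spamratio / (spamratio + hamratio)

def adjusted_spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Pull the raw estimate toward x; the pull fades as evidence grows."""
    p = raw_spamprob(spamcount, hamcount, nspam, nham)
    n = spamcount + hamcount
    return (s * x + n * p) / (s + n)

# A token present in 10% of spam and 1% of ham, at 1:1 and at 5:1 ham:spam.
balanced = raw_spamprob(100, 10, 1000, 1000)
ham_heavy = raw_spamprob(100, 50, 1000, 5000)
print(balanced, ham_heavy)
print(adjusted_spamprob(100, 10, 1000, 1000),
      adjusted_spamprob(100, 50, 1000, 5000))
```

The raw estimate is identical at both ratios, which is the "no direct
effect" claim.  The adjusted value differs only slightly, because the
raw count n enters the adjustment, so any strong ratio sensitivity would
have to arise indirectly.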

I don't think the following theory is perfectly phrased, but
it seems plausible to me:

Perhaps the number of topics discussed in ham is greater
than that in spam.  Thus, the average percentage of ham
messages containing a particular significant ham word is
systematically lower than the average probability of a
particular significant spam word appearing in spam messages.
As the training set size increases, the percentage difference
becomes more consistent and pronounced.  Since we're then
combining the percentages, we systematically skew slightly
due to the differing averages.

Changing the ratio of ham to spam has the effect of changing
the number of topics discussed, particularly when the training
set size is small and random chance can exclude all instances
of a given topic.  Balancing the number of topics removes the
skew in the probabilities.  As training set size increases,
adjusting the ratio has less effect, because it has less
likelihood of eliminating topics of discussion.

I think that would account for my data.

>The total # of spam training msgs does limit how high a spamprob can get,
>and the total # of ham training msgs limits how low.  The *suspicion* I had
>running my large c.l.py test is that it wasn't the ratio that mattered so
>much as the absolute number, and that the error rates didn't "settle down"
>to the 4th digit until I got near 10,000 spam total.

I suspect that by the time the corpora got that large, adjusting
the training ratio wouldn't make a lick of difference if the
corpora were sampled randomly to achieve the given ratio.  There
would just be too little chance of excluding a topic from the
samples.  Systematically excluding a topic might produce equivalent
results to my ratio tests.

- Alex

From richie@entrian.com  Fri Nov  8 00:17:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Fri, 08 Nov 2002 00:17:25 +0000
Subject: [Spambayes] SMTP proxy questions
Message-ID: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com>


[Me]
> Also on my list is to commit Tim Stone's SMTP proxy code, possibly after
> integrating it with the pop3proxy (but I need to discuss that with you,
> Tim, after looking in more detail at the code, hopefully tonight).

I've discussed this with Tim S, and he's going off the SMTP proxy idea
while I'm still broadly in favour of it.  What do people think - do
non-Outlook users want to forward messages to 'spam' and 'ham' to train the
system, or use an HTML UI?

The most difficult problem for retraining-by-forwarding is matching the
forwarded message to one from the cache, after Outlook Express has stripped
the headers, top-quoted the user's .sig, converted it to HTML and added
fifteen macro viruses.  Any ideas?  Can the tokeniser help?

Or perhaps there's another way.  The only other option I'd thought of was
to add two hyperlinks to the end of the message, "This is spam" and "This
is ham" (in ways that would work for both HTML and plain-text messages, in
both HTML and plain-text email clients).  They'd link to the HTML interface
and tell it the cache ID of the message.  Adding content to emails is way
more intrusive (and difficult) than adding headers.  But no more intrusive
than the .sig that mailman adds.

-- 
Richie Hindle
richie@entrian.com


From anthony@interlink.com.au  Fri Nov  8 00:30:09 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 08 Nov 2002 11:30:09 +1100
Subject: [Spambayes] SMTP proxy questions 
In-Reply-To: <ievlsu08r2krkv5n6clac6c5p58uianqph@4ax.com> 
Message-ID: <200211080030.gA80UAf11390@localhost.localdomain>


> I've discussed this with Tim S, and he's going off the SMTP proxy idea
> while I'm still broadly in favour of it.  What do people think - do
> non-Outlook users want to forward messages to 'spam' and 'ham' to train the
> system, or use an HTML UI?

I'd have to say I don't like the idea. There are too many potential places
where it can all go horribly horribly pear-shaped, and too many rat-holes
that the various email clients can screw up with.

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From rob@hooft.net  Mon Nov 11 09:12:57 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Mon, 11 Nov 2002 10:12:57 +0100
Subject: [Spambayes] More experiments with weaktest.py
References: <LNBBLJKPBEHFEDALKOLCMECDCJAB.tim.one@comcast.net>
Message-ID: <3DCF7499.6030705@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
>>...
>>Hm. That sounds so enthusiastic that I just might commit what I have
>>gone through this night.
> 
> 
> You did, and I thank you!  Note that there were already three Simplex pkgs
> linked from
> 
>     http://www.python.org/topics/scicomp/numbercrunching.html
> 
> but I know how much fun it is to write such stuff again <wink>.

Yeah, but on the other hand, all those people didn't have access to my 
module when they wrote theirs, because it wasn't publicized ;-) [Let me 
add that my optimize code dates from late 1997]

>>  * I designed a new "Flex cost" field. That one does away with the
>>    "unsure cost". The cost of a message is 0.0 at its own cutoff, and
>>    increases linearly towards its "false" cost at the other cutoff,
>>    and increases further to the other end. Hm. Unreadable.
> 
> 
> The code is clear enough, though.  What I didn't understand is why each term
> in the flexcost is divided by the difference between the (fixed per run)
> cutoff levels:   / (SPC - HC).  That seems to systematically penalize, e.g.,
> ham_cutoff=.4 and spam_cutoff=0.8 compared to ham_cutoff=0.1 and
> spam_cutoff=0.9 (the former divides every term by 0.4, the latter by 0.8).
> In the limit, if someone wanted a binary classifier (ham_cutoff ==
> spam_cutoff), any mistake would be charged an infinite penalty.

You're right.
> 
> 
>>A table:
>>
>>           Score    Spam with this   Ham with this
>>                      score costs     score costs
>>            0.00         $ 1.29          $ 0.00
> 
> 
> It's hard to see where that comes from.  Assuming ham_cutoff is 0.2 and
> spam_cutoff 0.9, and so a spam scoring 0.0 works out to $1 *
> (.9-0.0)/(.9-.2) ?

Yes.

> 
>>            0.20         $ 1.00          $ 0.00
>>            0.55         $ 0.50          $ 5.00
>>            0.90         $ 0.00          $10.00
>>            1.00         $ 0.00          $11.43

But you're right that it would be better to make:

            Score    Spam with this   Ham with this
                       score costs     score costs
             0.00         $ 1.00          $ 0.00
             0.20         $ 1.00          $ 0.00
             0.55         $ 0.50          $ 5.00
             0.90         $ 0.00          $10.00
             1.00         $ 0.00          $10.00

i.e. both functions consist of 3 linear segments rather than 2.
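
Both tables can be reproduced with a short sketch.  This is not the
committed weaktest code; flex_cost() and its parameter names are
hypothetical, and the cutoffs (0.2 and 0.9) and costs ($10 per false
positive, $1 per false negative) are the ones assumed in the discussion
above.  With clamp=False it gives the original two-segment behaviour,
with clamp=True the proposed three-segment one.

```python
def flex_cost(score, is_spam, ham_cutoff=0.20, spam_cutoff=0.90,
              fp_cost=10.00, fn_cost=1.00, clamp=True):
    """Cost of one message: 0 at its own cutoff, full cost at the other."""
    # Equal cutoffs make width zero and raise ZeroDivisionError, which is
    # the binary-classifier problem raised in the quoted text.
    width = spam_cutoff - ham_cutoff
    if is_spam:
        frac = (spam_cutoff - score) / width
        full = fn_cost
    else:
        frac = (score - ham_cutoff) / width
        full = fp_cost
    frac = max(frac, 0.0)            # no cost on the correct side
    if clamp:
        frac = min(frac, 1.0)        # never charge more than the full cost
    return full * frac

for score in (0.00, 0.20, 0.55, 0.90, 1.00):
    print("%.2f  spam $%5.2f  ham $%5.2f" % (
        score,
        flex_cost(score, is_spam=True),
        flex_cost(score, is_spam=False)))
```

Clamping removes the extra penalty beyond the far cutoff, but the
division by the cutoff difference is still there.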

> Well, if ham_cutoff==spam_cutoff, then (as above) any mistake will cause a
> DivideByZero exception, so it's sure sensitive there <wink>.  I suspect it
> might work better if the "/(SPC-HC)" business were simply removed?

That would no longer satisfy the constraints I put in.

> I've been running weakloop.py over two sets of my c.l.py data while typing
> this.  That's 2*2000 = 4000 ham, and 2*1400 = 2800 spam, for 6800 total
> msgs.  It's been thru the whole business about 25 times now.  At the start,
> 
> Trained on 88 ham and 66 spam
> fp: 0 fn: 0
> Total cost: $30.80
> Flex cost: $212.3120
> x=0.5000 p=0.1000 s=0.4500 sc=0.900 hc=0.200 212.31
> 
> It's having a hard time doing better than that.  The best so far seems to be
> 
> Trained on 82 ham and 66 spam
> fp: 0 fn: 0
> Total cost: $29.60
> Flex cost: $200.0924
> x=0.5011 p=0.1026 s=0.4515 sc=0.901 hc=0.205 200.09
> 
> which is so close to the starting point that it's hard to believe it's
> finding something "real".  It *does* seem to be in a nasty local minimum,
> though, as the next attempt was:
> 
> Trained on 118 ham and 69 spam
> fp: 1 fn: 0
> Total cost: $47.20
> Flex cost: $344.7334
> x=0.4989 p=0.1038 s=0.4531 sc=0.900 hc=0.209 344.73
> 
> I'm afraid it looks like it's eventually going to converge on the most
> delicate possible settings that barely manage to avoid that 1 FP.

This is exactly what I found so far, even with my complete data set. It 
is too delicate to work. Now this could be due to 2 things:

  1. The flexcost is still causing lots of false minima
  2. The weaktest is causing lots of false minima

I suspect the latter, because it contains lots of "yes/no" decisions 
that may tumble the other way with minimal changes in the parameters.

My conclusion is to stop this, and try the optimization on something 
like timtest.py but with the flexcost as target function. Or maybe 
change weaktest such that it trains on all messages in the process. That 
would simulate the "optimal" strategy of a user that has to start from 
nothing.

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From msergeant@startechgroup.co.uk  Mon Nov 11 09:49:38 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 11 Nov 2002 09:49:38 +0000
Subject: [Spambayes] Introducing myself
References: <E18AYxq-0006sT-00@mail.python.org>
	<a05200f03b9f34909a00b@[192.168.1.103]>
Message-ID: <3DCF7D32.4090209@startechgroup.co.uk>

Robert Woodhead said the following on 10/11/02 00:32:

> * My personal bias (as I think Guido mentioned) is for a multifaceted 
> approach, using Bayesian, rules-based (attacking things that bayesian 
> isn't good at, like looking for obfuscated url structures), DNSBL, 
> and whitelisting heuristics to generate an overall ranking.  So a 
> hammy mail from a guy in your address book would bubble up to highest 
> priority, whereas something spammy from him would stay neutral. 
> There's lots of room for cooperation between the various approaches 
> and multiple agents means it's less likely that a spam will get by. 
> In particular, whitelisting heuristics can almost eliminate false 
> positives.

That's the approach SpamAssassin now takes, fwiw (including the bayesian 
stuff). All done in 2.50 CVS.

> * Finally, if anyone needs more spam, I get over 300 a day (I've been 
> around a while!) and have a cleaned corpus of over 130MB of spam and 
> foreign email.  Also, given all the legit web-marketing email I get 
> because of the url registration work I've done, I've got tons of the 
> spammiest ham you could imagine.

I'm always looking for more corpuses. Stick the data on an FTP/HTTP 
server somewhere (password protect if you need to). Or contact me 
privately if that's not possible.

Matt.


From papaDoc@videotron.ca  Mon Nov 11 13:03:40 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Mon, 11 Nov 2002 08:03:40 -0500
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>
Message-ID: <3DCFAAAC.4020807@videotron.ca>

Hi,

Can someone define what a hapax is?

>Scores remain grossly hapax-driven, but that's actually enough to classify
>most of my email correctly:  a small number of subjects and senders and
>mailing lists overwhelmingly dominate my ham mix, and one email account
>accounts for the vast bulk of my spam.  Removing the hapaxes from the
>database dropped the # of words from 5500 to about 1700.  Rescoring the
>inbox with this reduced database then pushed about 5% of the msgs back into
>Unsure.
>
>So (no surprise here) hapaxes are vital with little training data.  That
>also means that as soon as one of those words shows up in the other kind of
>email, it changes from a strong clue to neutral, *provided that* I actually
>train on the new email.  I'm not training now unless there's a
>mistake/unsure, so the hapaxes remain strong clues (even when they point in
>the wrong direction).  BTW, when there are mistakes/unsures, I'm not
>training on all of them:  as I did when I got up, I train the worst example
>then rescore, one at a time, until no mistakes/unsures remain.
>  
>

papaDoc

P.S. Someday I will contribute to the code but first I need to learn python.



From bkc@murkworks.com  Mon Nov 11 13:22:53 2002
From: bkc@murkworks.com (Brad Clements)
Date: Mon, 11 Nov 2002 08:22:53 -0500
Subject: [Spambayes] Exchange integration
Message-ID: <3DCF67F7.16091.91EB9C8@localhost>

Just musing here hoping someone can jump in with good comments.

--

I'm thinking about running spambayes inside Exchange 5.5 (or Exchange 2000).

At first, I thought I'd use the Event service, but in 5.5 it's async and MS even says "don't 
use this to filter all your messages".

In Exchange 2000 apparently there's a synchronous event service, but I don't have 
Exchange 2000.

So it looks like I need to create some kind of MAPI hook or preprocessor or mailbox 
assistant.. I'm not sure which.

Anyone know? And, can I do this all in Python via COM or do I need some "real" C to 
hook in?

Finally, why does MS make it so hard to find the info you want? 


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From Paul.Moore@atosorigin.com  Mon Nov 11 14:24:24 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 11 Nov 2002 14:24:24 -0000
Subject: [Spambayes] Some more experiences with the Outlook plugin
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com>

I've now had the Outlook plugin running for about a week, and I'm
starting to get a feel for using it. The following is my "user
interface" experience. It's a slightly unrealistic combination of
"what I actually did" and "what I realised afterwards I should have
done", but it is what I would use as notes telling a new user how
to set the system up, and as such it picks up on a few interesting
issues:

1. To start with, configure the plugin to define one "Spam" folder and
   one "Unsure" folder, and define all other folders as "Ham". [1]
2. Train the classifier on whatever you have available. This will
   usually be massively overbalanced in favour of ham (few people
   collect their spam) but it *will* make a start. [2]
3. Run with this for a while, incrementally training on mistakes and
   unsures. Keep all of the spam!
4. Periodically, retrain the full database on all the collected ham
   and spam.


Notes:

[1] I got this wrong at the start - the key point to stress here is
    that *everything* that isn't spam is ham - by definition. Trying
    to "help" the classifier by telling it to ignore messages which
    you "know" are ham is actually detrimental - if you know, let the
    classifier find out!

[2] I'm getting pretty good results now (but see below), with 5661
    ham and 303 spam, but even with under 100 spam (admittedly with
    less ham, as I made the "exclude some ham" mistake) I was getting
    visible benefits.

Other points:

* The collection I end up with is still biased - there are a lot of
  ham messages which I just read and delete, and they are probably
  somehow "similar". While I could retain these, this would require
  a much more significant change to my way of working.

* Results still seem to be pretty much hapax based (if I understand the
  term and its usage). Looking at the clues for a message often shows
  some pretty bizarre tokens showing up as *either* sort of clue. (One
  message showed 'yet' as a ham clue with a probability of 0.000877364!)

* Following on from this, I also see Tim's behaviour of surprising
  unsure cases (or worse, false negatives!). Worst case recently was a
  message which scored as solid ham. I trained on it as "Spam", and
  rescored it. It still scored 5 - solid ham. My immediate reaction was
  "But I just *told* you it's spam!". I know that isn't how the classifier
  works, but even so it was unsettling. FWIW, I attach the spam clues for
  this one (I don't know if they make any sense in isolation, but it can't
  hurt...)

* I don't know how long it will be before I start grudging the use of
  disk space to store spam. At that point, the nasty question of whether
  I keep it, or risk being unable to recreate my database, becomes
  important.

I need to look at how to get some more information out of the classifier,
to try to understand how much of the good results I see are down to luck
(hapaxes, I guess - which makes me think of "happy accidents" rather than
its real meaning...) and hence is fragile, and how much is actually solid.
Can anyone point me at the right part of the code to read to find this?

Paul.

------------------------------- Clues for that message I mentioned

Spam Score: 0.0531684


'*H*'                          1
'*S*'                          0.106337
'(and'                         0.00044603
'looking'                      0.000489716
'added'                        0.000613999
'work,'                        0.00120032
'group'                        0.00138504
'saying'                       0.00196592
'possible,'                    0.00254381
'up,'                          0.00260266
'said,'                        0.00306331
'thing.'                       0.00372208
'first.'                       0.00420954
'but,'                         0.00530035
'posting'                      0.00585176
'number.'                      0.00600801
'exist.'                       0.00617284
'enough.'                      0.00634697
'mind.'                        0.0065312
'skip:= 70'                    0.00738916
'negative'                     0.00764007
'links,'                       0.00764007
'month.'                       0.00884086
'info,'                        0.0100223
'value,'                       0.0104895
'tends'                        0.0110024
'hook'                         0.0110024
'them?'                        0.0136778
'continues'                    0.0145631
'large,'                       0.0167286
'invite'                       0.0196507
'to:2**1'                      0.0234065
'experiences'                  0.0238095
'submitting'                   0.0238095
'answering'                    0.0266272
'cost,'                        0.0266272
'listen'                       0.0302013
'chuck'                        0.0412844
'this.'                        0.0478427
'club'                         0.0505618
'there:'                       0.0505618
'agree'                        0.0621736
'confirm'                      0.0689037
'to:'                          0.069921
'kind'                         0.0736897
'but'                          0.079886
'resident'                     0.0918367
'intended'                     0.0939539
'might'                        0.104747
'create'                       0.113245
'otherwise'                    0.116335
'soon'                         0.117769
"there's"                      0.120957
'actually'                     0.121474
'had'                          0.123378
'skip:u 10'                    0.134221
'having'                       0.134288
'done'                         0.135374
'there'                        0.142693
'still'                        0.147177
'doing'                        0.150927
'going'                        0.151564
'sweet'                        0.155172
'ads,'                         0.155172
'insult'                       0.155172
'subject:COMPUTER'             0.155172
'does'                         0.158656
'they'                         0.16113
'need'                         0.163861
'week,'                        0.164396
'blank'                        0.168753
'pass'                         0.173864
'thing'                        0.182002
'also'                         0.193781
'work'                         0.194529
'based'                        0.199869
"don't"                        0.209846
'same'                         0.211326
'different'                    0.214242
'just'                         0.215541
'expect'                       0.215606
'result'                       0.217003
'them'                         0.219267
'can'                          0.220053
"that's"                       0.221224
'meaning'                      0.221593
'have'                         0.22283
'put'                          0.228286
'after'                        0.23066
'each'                         0.231785
'then'                         0.237896
'check'                        0.240837
"what's"                       0.241936
'it.'                          0.24197
'been'                         0.241975
'most'                         0.246523
"we'll"                        0.250462
'opportunity'                  0.754514
'luck,'                        0.765135
'e-mail'                       0.768274
'p.s.'                         0.769646
'computer,'                    0.770262
'address'                      0.771157
'wish'                         0.776725
'increase'                     0.781917
'"this'                        0.793163
'spam'                         0.794529
'unknowingly'                  0.798255
'continuing'                   0.801283
'line.'                        0.805659
'header:Return-Path:1'         0.80863
'url:com'                      0.82533
'member,'                      0.826136
'incredibly'                   0.826136
'offering'                     0.826485
'removal'                      0.837909
'membership.'                  0.844828
'home-based'                   0.844828
'cash.'                        0.844828
'list,'                        0.847218
'washington'                   0.851451
'compliance'                   0.862362
'10-12'                        0.86708
'site,'                        0.872105
'sincerely,'                   0.873162
'intelligence'                 0.877181
'emails'                       0.879528
'money'                        0.895718
'companies'                    0.898816
'hearing'                      0.900058
'money.'                       0.900837
'opportunity,'                 0.908163
'reward'                       0.908163
'classified'                   0.913944
'skip:& 10'                    0.914039
'obligation'                   0.925726
'internet.'                    0.92719
'header:Received:7'            0.933081
'ability.'                     0.934783
'screening'                    0.934783
'&quot;this'                   0.949438
'header:MiME-Version:1'        0.949438
'consumer'                     0.950295
'e-mail,'                      0.955625
'income'                       0.955625
'residents'                    0.958716
'washington,'                  0.958716
'marketing'                    0.959139
'opt-in'                       0.966259
'subject:your'                 0.973253
'click'                        0.974006
'"remove"'                     0.985437

From popiel@wolfskeep.com  Mon Nov 11 14:50:13 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Mon, 11 Nov 2002 06:50:13 -0800
Subject: [Spambayes] Outlook plugin - training 
In-Reply-To: Message from papaDoc <papaDoc@videotron.ca> 
   of "Mon, 11 Nov 2002 08:03:40 EST." <3DCFAAAC.4020807@videotron.ca> 
References: <LNBBLJKPBEHFEDALKOLCCEHLCIAB.tim.one@comcast.net>
	<3DCFAAAC.4020807@videotron.ca> 
Message-ID: <20021111145013.5AEBFF58B@cashew.wolfskeep.com>

In message:  <3DCFAAAC.4020807@videotron.ca>
             papaDoc <papaDoc@videotron.ca> writes:
>
>Can someone define what is an hapaxe !

Sure.  Merriam Webster says:

hapax legomenon: noun:
  1. a word or form occurring only once in a document or corpus
plural: hapax legomena
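
For the classifier, that amounts to a token seen exactly once across the
whole training corpus.  A minimal sketch of picking them out, assuming plain
whitespace splitting stands in for the real tokenizer; find_hapaxes is a
hypothetical helper, not part of the project:

```python
from collections import Counter

def find_hapaxes(messages):
    """Return the set of tokens that occur exactly once across all messages."""
    counts = Counter()
    for text in messages:
        # Assumption: whitespace splitting is a stand-in for real tokenizing.
        counts.update(text.lower().split())
    return {tok for tok, n in counts.items() if n == 1}

corpus = ["sweet deal on ads", "no deal", "ads ads ads"]
hapaxes = find_hapaxes(corpus)   # 'deal' and 'ads' repeat, so they are excluded
```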

>P.S. Someday I will contribute to the code but first I need to learn python.

There's a lot of ways to contribute (testing, documentation, etc.)
without knowing the language, if you're interested...

- Alex

From popiel@wolfskeep.com  Mon Nov 11 14:55:11 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Mon, 11 Nov 2002 06:55:11 -0800
Subject: [Spambayes] Exchange integration 
In-Reply-To: Message from "Brad Clements" <bkc@murkworks.com> 
   of "Mon, 11 Nov 2002 08:22:53 EST." <3DCF67F7.16091.91EB9C8@localhost> 
References: <3DCF67F7.16091.91EB9C8@localhost> 
Message-ID: <20021111145511.5DC84F58B@cashew.wolfskeep.com>

In message:  <3DCF67F7.16091.91EB9C8@localhost>
             "Brad Clements" <bkc@murkworks.com> writes:
>
>Finally, why does MS make it so hard to find the info you want? 

A bit off topic ;-), but they just have a _LOT_ of information,
much of it written by people trained to dumb-down the tech so
that it's acceptable to the masses.  The good stuff (for techies
like us) is drowned out in a sea of end-user docs, and the
indexing tools don't know how to rate the stuff by technical
thoroughness.

MS isn't _trying_ to make stuff hard to find... it's just that
by trying to make it accessible for _everyone_, they make it
difficult for anyone to find stuff at an appropriate level.

- Alex (not normally an MS apologist...)

From tim.one@comcast.net  Mon Nov 11 16:48:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 11:48:27 -0500
Subject: [Spambayes] Outlook plugin - training
In-Reply-To: <3DCD8B34.6040903@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEGKCJAB.tim.one@comcast.net>

[Rob Hooft]
> ...
> It may not even be very realistic to train on fp's, as I think in my
> private E-mail I won't even check the spam folder very thoroughly at all.

FYI, here's my base weaktest run:

Total messages 6800 (4000 ham and 2800 spam)
Total unsure (including 30 startup messages): 124 (1.8%)
Trained on 57 ham and 68 spam
fp: 1 fn: 0
Total cost: $34.80
Flex cost: $193.3770

Here's the same thing, but even weaker, fiddling the code *not* to train on
false positives (so the only ham ever trained on is however much appeared in
the first 30 startup msgs, and later Unsure ham):

Total messages 6800 (4000 ham and 2800 spam)
Total unsure (including 30 startup messages): 123 (1.8%)
Trained on 57 ham and 66 spam
fp: 1 fn: 0
Total cost: $34.60
Flex cost: $199.3106

And one more time, not only not training on FP, but starting with an empty
database (no startup msgs).

Total messages 6800 (4000 ham and 2800 spam)
Total unsure (NO startup messages): 123 (1.8%)
Trained on 57 ham and 67 spam
fp: 4 fn: 1
Total cost: $65.60
Flex cost: $174.5831

All four FP were among the first 30.

Since even my sisters <wink> could be talked into training on 10 msgs at the
start:

Total messages 6800 (4000 ham and 2800 spam)
Total unsure (10 startup messages): 115 (1.7%)
Trained on 50 ham and 66 spam
fp: 0 fn: 1
Total cost: $24.00
Flex cost: $124.9315

Now for another extreme:  after 10 startup msgs, the system trains itself on
its own decisions, except that:

1. Unsures are correctly classified by the user.
2. False negatives are correctly classified by the user.

But false positives are trained on *as spam*, assuming the user never looks
at their spam folder.  That takes a long time to run, because
update_probabilities() is called after every msg.  After 2,100 msgs,

 2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26

and the unsures are growing very slowly now (at 1400 msgs there were 25
unsures).

So one more twist:  as above (train on self-decisions, but spam below
spam_cutoff is corrected by the user, and FP gets trained on as spam), but
only update probabilities for each of the first 50 msgs, and every 50th msg
thereafter:  at 2,100 msgs, it was up to 29 unsure.  At the end,

Total messages 6800 (4000 ham and 2800 spam)
Total unsure (10 startup messages): 48 (0.7%)
Trained on 4000 ham and 2800 spam
fp: 0 fn: 0
Total cost: $9.60
Flex cost: $104.3355

It would have been more interesting had there been an FP, eh?

One conclusion is that, so far as error rates go, on this data it doesn't
much matter how training is done, but by any cost measure lots of training
is better than little (due to unsures).
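
The "Total cost" lines above are consistent with charging $10 per false
positive, $1 per false negative and $0.20 per unsure.  Those weights are
inferred from the reported figures rather than stated in this message, so
take this as a sketch for checking the arithmetic (the flex cost is not
reproduced here):

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Dollar cost of a run; the default weights are an inferred assumption."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight

# (fp, fn, unsure) as reported for each run above.
runs = {
    "base weaktest":           (1, 0, 124),
    "no training on fp":       (1, 0, 123),
    "empty database":          (4, 1, 123),
    "10 startup msgs":         (0, 1, 115),
    "self-trained, every 50":  (0, 0, 48),
}
costs = {name: round(total_cost(*counts), 2) for name, counts in runs.items()}
```

With these weights a single fp costs as much as 50 unsures, which is why the
empty-database run is the most expensive despite an identical unsure count.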


From nas@python.ca  Mon Nov 11 17:33:40 2002
From: nas@python.ca (Neil Schemenauer)
Date: Mon, 11 Nov 2002 09:33:40 -0800
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>
References: <a05200f19b9f46cfba659@[192.168.1.103]>
	<LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>
Message-ID: <20021111173340.GA22411@glacier.arctrix.com>

Tim Peters wrote:
> [...] I haven't run the CDB variant and so haven't timed it
> (Neil, do you have gripes about memory or time?  Spit 'em out.).

It works fine for me on the old 200 MHz machine I use for a mail
server.  I retrain very rarely so I don't care if it takes a bit extra
time to rebuild the DB.

  Neil 

From tim.one@comcast.net  Mon Nov 11 17:37:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 12:37:27 -0500
Subject: [Spambayes] Exchange integration
In-Reply-To: <3DCF67F7.16091.91EB9C8@localhost>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEHDCJAB.tim.one@comcast.net>

[Brad Clements]
> I'm thinking about running spambayes inside Exchange 5.5 (or
> Exchange 2000).
>
> At first, I thought I'd use the Event service, but in 5.5 it's
> async and MS even says "don't use this to filter all your messages".
>
> In Exchange 2000 apparently there's a synchronous event service,
> but I don't have Exchange 2000.

I've got no version of Exchange at all, and neither does Mark Hammond, so
you're pretty much on your own wrt the people here.  Jump in <wink>.

> So it looks like I need to create some kind of MAPI hook or
> preprocessor or mailbox assistant.. I'm not sure which.
>
> Anyone know? And, can I do this all in Python via COM or do I
> need some "real" C to hook in?

Study the project's Outlook2000 directory.  There's a quite sophisticated
Outlook 2000 addin there, written in Python + MarkH's win32 extensions.  It
uses a mix of MAPI and the Outlook object model.  It used to use CDO too,
but I think Mark found ways to get rid of all that (CDO isn't installed by
default for IMO Outlook installs, so it required the user to dig out their
Office CD and install CDO first).

> Finally, why does MS make it so hard to find the info you want?

They don't -- they only make it hard to find the info *you* want.  You must
have done something to piss off Bill.  Beyond that, MAPI is a massive and
excruciatingly low-level API.  The MSDN SDK MAPI docs are extensive, and so
are web resources trying to make sense of it all (e.g., expect to spend a
lot of time staring at <http://www.slipstick.com>).


From trebor@animeigo.com  Mon Nov 11 18:10:19 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Mon, 11 Nov 2002 13:10:19 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCMEDCCJAB.tim.one@comcast.net>
Message-ID: <a05200f1ab9f5529c591c@[192.168.1.101]>

>  > If so, assuming the final calc isn't exponential, reducing the lookup
>>  time/resources can be a big win performance-wise.
>
>I don't believe so.  When using a Python dict as "the database", the time
>for scoring a msg is minor compared to the time taken by parsing and
>tokenization, and especially compared to the time just to get the msg *into*
>the system (whether that's file I/O, or socket I/O, or some email pkg's
>programming API, or whatever -- that part is the bottleneck when using a
>dict; when not using a dict, database access time may become a burden, and
>most databases in use here require string keys even if you're working with
>ints -- the database user has to convert the hash code to a string!  Other
>databases (like ZODB) could use ints directly as keys, but they're rare.).

Oh, I'd roll my own, probably using an in-memory hash table scheme. 
If you're hashing to a nice, randomly distributed 32-bit key, you'd 
effectively take the database out of the equation.

I think most of the reason I lean this way is that I'm thinking about 
actual implementations (as opposed to testing), and with bayesian, 
you want to do this as close to each individual user as possible 
(right in the mailreader, via a plugin).  It seems to me that you're 
at the point where testing the effects of data reduction techniques 
would be fruitful.  Once I get up and running on the code (just paid 
the tithe to O'Reilly) I'll test it out.

One thing that occurred to me: now that you have something that seems 
to work pretty well, have you considered backtracking on particular 
features to see how much they contribute; for example, going to a 
trivial state machine parser to spit out tokens?

>
>>  Note that since you have the text of the token before you hash it,
>>  you can keep that around for significant tokens and display it later.
>
>Good point!  I had overlooked that indeed.

Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!") 
have lots of tricks.  We don't so much write code as remember it and 
retype it.

>  > The cost of the hashing is the inevitable collisions, which
>>  blur the probabilities for colliding tokens.
>
>Another cost is obscuring the code.

Not really; it doesn't really matter what the format of a token 
coming out of the parser is, does it?  You might need an extra data 
structure to take care of the hashed token/string token 
correspondences but you only need touch that at the end of the parser 
and in the diagnostic output.
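
A sketch of that arrangement, assuming zlib.crc32 as the 32-bit hash (any
well-distributed hash would do) and hypothetical names throughout.  The
scoring code would see only integer keys, while a side table remembers the
text for diagnostics:

```python
import zlib

class HashedTokens:
    """Map string tokens to 32-bit keys, remembering the text for display."""

    def __init__(self):
        self.text_for = {}   # hash key -> set of token strings that produced it

    def key(self, token):
        h = zlib.crc32(token.encode("utf-8")) & 0xFFFFFFFF
        self.text_for.setdefault(h, set()).add(token)
        return h

    def describe(self, h):
        # Colliding tokens show up together, which makes the blur visible.
        return "|".join(sorted(self.text_for.get(h, {"?"})))

table = HashedTokens()
k = table.key("opt-in")
```

Only the two ends touch the side table: the parser calls key() and the
diagnostic output calls describe().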

>They can't really defeat this scheme that way.  At best they can hope to
>push msgs into Unsure territory.

That is good enough, because it means the human has to look at it. 
Which is what spammers want to have happen.

>   What constitutes "very hammy" is a
>function of each user's database here, and no generic blob of text is going
>to score high for hamminess everywhere.

True; then it becomes a game of finding generic messages that are 
likely to evaluate as hammy enough to the average recognizer.  And 
the meta-response is to send out multiple emails with differently 
tuned slices of ham.

I hereby, btw, coin the term "Dagwood" (or perhaps it should be 
Wooddag?) to mean an email containing artfully sliced amounts of ham, 
spam, and html condiments.  ;^)

>  > So one possible approach would be to gradually degrade the
>>  significance of a token the further along in the email it is (both
>>  during training and recognition).
>
>I think there is reason to believe that spammers have to get your attention
>early.  OTOH, many pieces of incriminating evidence also live at the end of
>spams ("this is not spam!" blurbs, the explanation that you got this because
>you're on an opt-in list run by one of their "partners", references to
>various state and federal bills, the "unsubscribe me" URL slash address
>harvester, etc).

Might have to be a U-shaped function then.  Or it may turn out that 
ignoring the stuff at the end doesn't cost much but reduces false 
positives on new (legit) mailing lists.  I'm just throwing out ideas 
for possible tests.
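
One possible shape for such a test, as a sketch only: a quadratic weight
that is 1.0 at both ends of the message and dips to a floor in the middle.
The function name, the floor of 0.25 and the quadratic form are all
arbitrary assumptions, and nothing here has been measured:

```python
def position_weight(index, total, floor=0.25):
    """Weight for the token at `index` out of `total` tokens (U-shaped)."""
    if total <= 1:
        return 1.0
    x = index / (total - 1)      # 0.0 at the first token, 1.0 at the last
    return floor + (1.0 - floor) * (2.0 * x - 1.0) ** 2

weights = [position_weight(i, 11) for i in range(11)]
```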

>Yup.  Guido suggested that at the start, but that level of HTML analysis
>gets a lot more expensive too.  We'll see.

Well, what you'd need is a hacked HTML renderer that output sets that 
look like (token,size,color,background) and ignored words that were 
too small or hard to read.

>
>BTW, on large tests this system scores about 80 msgs/second on my box,
>including everything (system time, training, I/O, parsing, tokenizing,
>scoring, reporting, recording, and analyzing results -- this is # of msgs
>divided by elapsed wall-clock time).  We could afford to get slower, if
>necessary.

And the machines will get faster.  Eventually.

>  > Beware the One True Path.  There is strength in diversity.
>
>Let a thousand classifiers bloom.  If someone here wants to volunteer the
>effort to try a different approach, that's always been welcome.  But the
>results have been so good sticking to one basic approach that I don't see
>that happening.  We ended up doing one thing exceedingly well, and that's a
>contribution to diversity too, of a kind you may be undervaluing <wink>.

I was somewhat teasing you.

>
>>  Or, as the noted philosopher D. Vader put it, "Don't be too proud of
>>  this technological terror you have created."  As you will recall,
>>  those rebel scum managed to craft a nasty false positive.
>
>I don't view an FP as being as costly as needing to build a new Death Star.
>For goodness sake, this is email we're talking about -- anyone trusting a
>truly critical msg to email is dreaming to begin with.

Unfortunately, in the real world, this happens all too often.  Keep 
in mind that the readers of this list are not the typical users of 
the resulting software techniques.

>Well, it's got no semantic knowledge at all.  It doesn't even know which
>language a msg is written in, let alone what it means, and has no concept of
>"word" beyond "stuff that appears between whitespace".  It's very much
>focused on purely local lexical structure.

OK, I was being fuzzy in my use of semantics and syntactics.  Mea Culpa.

>  > So train it only on what a human would see reading the message.
>
>We get a lot of value out of mining a handful of header lines.  We also get
>a lot of value out of tokenizing embedded "invisible" URLs.  The theme here
>is that we tokenize "what works", and that's driven by measured error rates;
>philosophy doesn't enter into that part.

Well, I'm thinking of the metagame.  What are the spammer responses 
to a truly effective bayesian filter?  Obviously, remove those 
features that are typical of spam.  What features cannot be removed 
without making the spam useless as a commercial message?  The actual 
words visible to the reader.

This is what led me to decide, in my testing, to use a simple parser 
that extracted alphanumerics with a few permitted interior 
punctuation characters (like . and '), and which handled tokens with 
interior comments properly.
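
A guess at what such a parser might look like, not the code actually used
in that testing.  It assumes "interior comments" means HTML comments
spliced into the middle of a word, and the names are hypothetical:

```python
import re

# Remove HTML comments first so "V<!-- x -->iagra" rejoins into one word.
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
# Alphanumeric runs, allowing '.' or "'" only between alphanumerics.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[.'][A-Za-z0-9]+)*")

def simple_tokens(text):
    text = COMMENT_RE.sub("", text)
    return [w.lower() for w in WORD_RE.findall(text)]

tokens = simple_tokens("Get V<!-- x -->iagra at www.example.com, don't wait!")
```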

An interesting test would be to train the system, then run a test 
with a parser that only outputs the simple tokens (simulating a 
spammer response) and see how well it does.

>I have no real idea, but fear that presuming "yes" is presuming a lot of
>intelligence that systems parsing this header won't actually have.  The
>fancier the rating scheme the fancier they have to be too.  In the end, the
>user has to decide what to do about everything that's not called ham, no
>matter how many or few the non-ham categories.  As a user myself, I've got
>no use at all for distinctions beyond "I'm pretty sure it's spam" and "beats
>me".  That already gives two categories I have to check, and that's enough.
>I do find it useful that my client can sort on the score metadata, and there
>are proposals here too to add fancier header lines beyond the basic
>spam/ham/unsure one.

Fair enough.  Optional fancier header lines would do the job as well.

>  > Murphy's Law guarantees that it will happen.  In fact, it typically
>>  happens (in my painful personal experience) soon  after you make
>>  comments like the above.
>
>You realize you're overselling badly here, right <wink>?

If anything, the opposite. <smirk>!

>This is akin to my "entire Nigerian scam quote" FP, and it's all but certain
>that the spam content would overwhelm the brief "from the boss" clues.
>OTOH, if my boss didn't wait for my reply and went ahead and invested
>anyway, the subsequent financial disgrace would open the door for me to take
>his job.  After all, he relied on me for advice, so who more logical to
>succeed him?

Unfortunately, he invested your pension money.  Ooops.  ;^)

R

-- 

Woodhead's Law: "The further you are from your server,  the more likely
it is to crash."

From trebor@animeigo.com  Mon Nov 11 18:10:12 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Mon, 11 Nov 2002 13:10:12 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <3DCF7D32.4090209@startechgroup.co.uk>
References: <E18AYxq-0006sT-00@mail.python.org>
 <a05200f03b9f34909a00b@[192.168.1.103]>
 <3DCF7D32.4090209@startechgroup.co.uk>
Message-ID: <a05200f00b9f5a28504bf@[192.168.1.101]>

>I'm always looking for more corpuses. Stick the data on an FTP/HTTP 
>server somewhere (password protect if you need to). Or contact me 
>privately if that's not possible.

Should be up by the time you read this, a 30M zipped file containing 
a Macintosh Eudora Mailbox of mixed english and foreign spam. 
Represents the last few months of receipts, but nothing more current 
than a couple of weeks ago.

http://www.madoverlord.com/data/spam.zip

Let me know if you have troubles grabbing it.
-- 
Robert Woodhead, Webslave & Mad Overlord    http://selfpromotion.com/
Located in the Hurricane capitol of the US, Wilmington NC.  Lucky me!

From db3l@fitlinxx.com  Mon Nov 11 22:28:01 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 11 Nov 2002 17:28:01 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
References: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
	<LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>
	<n2m-g.d6pdqywt.fsf@morpheus.demon.co.uk>
Message-ID: <r0hn0ofpwr2.fsf@ctwd0222.corp.fitlinxx.com>

Paul Moore <lists@morpheus.demon.co.uk> writes:

> That sounds like the best option. I haven't had a chance to check
> Exchange yet, but with an IMAP store there are no "New mail" events
> triggered when I start Outlook with new mail in the IMAP inbox. I'd
> expect Exchange to be the same.  (...)

I'm based on an Exchange server, and yes, the behavior is the same -
no events fire.  I think Mark's in-progress approach of scanning for
unread and unscanned messages on startup is reasonable.  I'm not quite
sure how Outlook processes client rules on startup, but it does
have the "feeling" that it simply starts an execution of the rules
against Inbox, so the spambayes addin would just be following along.

-- David


From tim@fourstonesExpressions.com  Mon Nov 11 22:35:29 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 11 Nov 2002 16:35:29 -0600
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <r0hn0ofpwr2.fsf@ctwd0222.corp.fitlinxx.com>
Message-ID: <MCAHB09QN93QKU2UA5265CAD8829.3dd030b1@riven>

11/11/2002 4:28:01 PM, David Bolen <db3l@fitlinxx.com> wrote:

>Paul Moore <lists@morpheus.demon.co.uk> writes:
>
>> That sounds like the best option. I haven't had a chance to check
>> Exchange yet, but with an IMAP store there are no "New mail" events
>> triggered when I start Outlook with new mail in the IMAP inbox. I'd
>> expect Exchange to be the same.  (...)
>
>I'm based on an Exchange server, and yes, the behavior is the same -
>no events fire.  I think Mark's in-progress approach of scanning for
>unread and unscanned messages on startup is reasonable.  I'm not quite
>sure how Outlook processes client rules on startup, but it does
>have the "feeling" that it simply starts an execution of the rules
>against Inbox, so the spambayes addin would just be following along.

The whole problem I see with this is that µ$0pht could and most likely will 
screw all these machinations up with the next release of Outlook or 
Exchange... They have this great history of not caring if their api changes, 
or system behavior changes, are backward compatible.  If we're having this 
level of difficulty now, get ready...  :(

- TimS

>
>-- David
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Mon Nov 11 23:24:09 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 18:24:09 -0500
Subject: [Spambayes] A couple of small tokenizer experiments.
In-Reply-To: <200211040950.gA49oU809201@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCCELFCJAB.tim.one@comcast.net>

[Anthony Baxter
 Sent: Monday, November 04, 2002 4:51 AM
]
> First experiment was to make the URL tokenizer look for the string
> 'mailman' in the URL. If it was found, simply push the clue "url:
> Mailman URL" onto the clue-pile. This was an attempt to remove the
> many many related clues that get bolted onto the occasional spam that
> makes it past Greg to the python.org mailservers. It's something of a
> violation of "stupid beats smart", but I'd noticed that the mailman
> footer from spam via mailman lists was always providing a bunch of
> clues that were making life harder.

Indeed they do.

>
> --- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> +++ tokenizer.py        4 Nov 2002 06:59:37 -0000
> @@ -931,6 +931,11 @@
>          new_text.append(text[i : start])
>          new_text.append(' ')
>
> +        if guts.find('mailman') != -1:
> +            pushclue("url: Mailman URL")
> +            i = end
> +            break

Can you try this again replacing "break" with "continue"?  I can't believe
you intended break here -- it means that the first time we see a Mailman URL
in a msg, we stop looking for embedded URLs period.  Spam could easily
exploit that.
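
A toy illustration of the difference, not the tokenizer's actual loop (which
walks the text by index); the regex and function name are assumptions:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def url_clues(text, stop_at_mailman):
    clues = []
    for match in URL_RE.finditer(text):
        url = match.group(0)
        if "mailman" in url:
            clues.append("url: Mailman URL")
            if stop_at_mailman:
                break      # as in the patch: every later URL goes unseen
            continue       # suggested change: skip only this one URL
        clues.append("url:" + url)
    return clues

text = "http://mail.python.org/mailman/listinfo/x http://spam.example/buy"
with_break = url_clues(text, stop_at_mailman=True)
with_continue = url_clues(text, stop_at_mailman=False)
```

With break, a spammer only has to put a Mailman-looking URL first to hide
every URL that follows it.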

>> ham:spam:  11192:1826
>>                   11192:1826

You realize you've got a very high ratio of ham to spam, right?

> ...
> Next I tried tokenizing the To: line.  I parsed it properly, then
> decoded the real name and split the words. I also added a token for
> the RHS and LHS of the email @ sign.

We don't tokenize To: now because it gives good results for bad reasons on
mixed-source corpora.  It would be good to have an option to tokenize it.
It appears that your code also tokenized Cc:; also fine.  I would rather see
the code added to the loop currently cracking "from" lines:

        for field in ('from',):

so that we tokenize all address thingies in a uniform way.  The option would
control the list of field names looped over there (default just from:,
optionally also to: and cc:).
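
Roughly what that could look like, as a sketch: the token formats and the
address_tokens name are made up for illustration, and the fields argument
stands in for the proposed option.

```python
from email import message_from_string
from email.utils import getaddresses

def address_tokens(msg, fields=("from",)):
    """Yield tokens for every address header named in `fields`."""
    for field in fields:
        for name, addr in getaddresses(msg.get_all(field, [])):
            for word in name.lower().split():
                yield "%s:name:%s" % (field, word)
            if "@" in addr:
                lhs, rhs = addr.lower().rsplit("@", 1)
                yield "%s:lhs:%s" % (field, lhs)
                yield "%s:rhs:%s" % (field, rhs)

msg = message_from_string(
    "From: Alice Smith <alice@example.com>\n"
    "To: bob@example.org\n"
    "\n"
    "body\n"
)
default_tokens = list(address_tokens(msg))
all_tokens = list(address_tokens(msg, fields=("from", "to", "cc")))
```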

> ...
> The final test was to decode the Subject header if it's encoded, and
> tokenize that, rather than the encoded form.
>
> --- tokenizer.py        1 Nov 2002 16:10:13 -0000       1.60
> +++ tokenizer.py        4 Nov 2002 09:45:25 -0000
> @@ -1071,6 +1078,10 @@
>          # especially significant in this context.  Experiment showed a small
>          # but real benefit to keeping case intact in this specific context.
>          x = msg.get('subject', '')
> +        # Subject decoding.
> +        x, subjcharset = email.Header.decode_header(x)[0]

Why is this tokenizing only "the first" piece of the Subject line?


> +        if subjcharset is not None:
> +            yield 'subjectcharset:' + subjcharset
>          for w in subject_word_re.findall(x):
>              for t in tokenize_word(w):
>                  yield 'subject:' + t


I changed this to loop over all the Subject parts, and saw some minor good
effects on marginal msgs, so I'll check this one in without further ado.  It
wasn't much of a win for you either, but it's cheap so why not.  In my
personal email "subjectcharset:unknown" shows up a lot for some reason (but
only in spam).
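
A sketch of looping over all the parts, not the checked-in code.  It uses
email.header.decode_header, the current spelling of the
email.Header.decode_header call in the patch; WORD_RE is a simplified
stand-in for subject_word_re, tokenize_word is left out, and a malformed
header could still make decode_header raise:

```python
import re
from email.header import decode_header

WORD_RE = re.compile(r"\w+")   # assumption: simplified subject word pattern

def tokenize_subject(raw_subject):
    for part, charset in decode_header(raw_subject):
        if charset is not None:
            yield "subjectcharset:" + charset
        if isinstance(part, bytes):
            try:
                part = part.decode(charset or "ascii", "replace")
            except LookupError:
                # Charset name Python does not know; keep the bytes readable.
                part = part.decode("latin-1")
        for w in WORD_RE.findall(part):
            yield "subject:" + w   # case is kept intact, as in the original

encoded = list(tokenize_subject("=?iso-8859-1?q?cheap?= =?utf-8?q?pills?="))
plain = list(tokenize_subject("Plain Subject"))
```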


> My remaining 6 fns are:
>
> a brazilian spam-ish thing: (*H* 0.633859 *S* 0.20342 = 0.28478)
> ...
> -----------------
> Received: from localhost (localhost.localdomain [127.0.0.1])
>         by localhost.localdomain (8.11.6/8.11.6) with ESMTP id
> g8RNZhh05864
>         for <anthony@localhost>; Sat, 28 Sep 2002 09:35:44 +1000
> Received: from mail.interlink.com.au [203.9.111.130]
>         by localhost with POP3 (fetchmail-5.9.0)
>         for anthony@localhost (single-drop); Sat, 28 Sep 2002
> 09:35:44 +1000 (EST)
> Received: from mediterraneo.rjnet.com.br (root@[200.152.115.30])
>         by valdez.interlink.com.au (8.11.6/8.11.2) with ESMTP id
> g8RNZJc28230
>         for <anthony@interlink.com.au>; Sat, 28 Sep 2002 09:35:20 +1000
> Received: from locutus.rjnet.com.br (root@locutus.rjnet.com.br
> [200.222.31.10])
>         by mediterraneo.rjnet.com.br (8.11.4/8.11.4) with ESMTP
> id g8RNNc801901;
>         Fri, 27 Sep 2002 20:23:38 -0300
> Received: from localhost ([200.222.39.21])
>         by locutus.rjnet.com.br (8.11.2/8.11.2) with ESMTP id
> g8RMqEN00464;
>         Fri, 27 Sep 2002 19:52:14 -0300

> DATA
> -----------------
> I plan to try something like tokenizing the oldest three received
> lines (to hopefully avoid the previous issues with mail.python.org
> blowing numbers to hell) to see if that will help this one.

Did you try that yet?  I'm slow to reply not because I'm uninterested, but
because I'm 244 msgs behind on this mailing list alone now <wink/sigh>.

> The "iron citadel" python-list spam
> (*H* 0.999999, *S* 0.038123 = 0.01906)

DAMNED good spam!

> A base64d MP3 spam sent via zope-dev
> (*H* 0.993904, *S* 0.187868 = 0.0969820429397)
> which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
> also the various mailman type clues (although that's better with the
> first patch, above)
>
> Someone spamming Linux CDs via a list at 4thought
> (*H* 1, *S* 0.207177 = 0.103588442478)
>
> A short porn spam sent via python-list
> (*H* 0.817004, *S* 0.618399 = 0.400697521022)
>
> A wierd german spam for some sort of expert systems (in english).
> (*H* 0.997132, *S* 0.84965 = 0.426259133645)

It's Weird that you have cutoffs arranged such that a number near .40 isn't
Unsure for you.  That may (or may not) be related to the lopsidedness of
your data (> 6 ham per spam).
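
For reading the triples quoted above: each final number is consistent with
(S - H + 1) / 2, where *H* and *S* are the ham and spam indicators.  That
relation is inferred from the quoted figures, so treat this as a sketch for
checking them:

```python
def combined_score(H, S):
    """Final score in [0, 1] from the ham (H) and spam (S) indicators."""
    return (S - H + 1.0) / 2.0

quoted = [
    (0.633859, 0.20342, 0.28478),          # brazilian spam-ish thing
    (0.999999, 0.038123, 0.01906),         # "iron citadel"
    (0.817004, 0.618399, 0.400697521022),  # short porn spam
    (0.997132, 0.84965, 0.426259133645),   # german expert systems spam
]
errors = [abs(combined_score(H, S) - score) for H, S, score in quoted]
```

Small differences remain because H and S are printed rounded.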


From spambayes@djl.freeuk.com  Mon Nov 11 23:54:27 2002
From: spambayes@djl.freeuk.com (David Leftley)
Date: Mon, 11 Nov 2002 23:54:27 +0000
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <r0hn0ofpwr2.fsf@ctwd0222.corp.fitlinxx.com>
References: <n2m-g.u1ir3gwe.fsf@morpheus.demon.co.uk>
	<LCEPIIGDJPKCOIHOBJEPMENEHJAA.mhammond@skippinet.com.au>
	<n2m-g.d6pdqywt.fsf@morpheus.demon.co.uk>
	<r0hn0ofpwr2.fsf@ctwd0222.corp.fitlinxx.com>
Message-ID: <76f0tu4o9e3a9p6ictc29kvv1u6bhict22@4ax.com>

On 11 Nov 2002 17:28:01 -0500, David Bolen <db3l@fitlinxx.com> wrote:
>I'm based on an Exchange server, and yes, the behavior is the same -
>no events fire.  I think Mark's in-progress approach of scanning for
>unread and unscanned messages on startup is reasonable.

With Exchange, though, it's not just on startup that the plugin
doesn't notice new messages. I've only been playing with it for a
couple of days, so I'm still not exactly sure in which circumstances
it fails, but here's what I observed from today's mail:

External e-mail was in every case processed immediately on arrival by
the plugin.

Internal e-mail (i.e. sent through Exchange) is never picked up
immediately by the plugin.
- In some cases these messages were classified when the next e-mail
(whether external or internal) arrived. When this happened, the first
message was also (annoyingly) marked as unread when it was classified.
- in another case, the plugin classified an e-mail when I went to my
Calendar and opened the details of an existing (from before I
installed the plugin) meeting.
- I replied to a couple of e-mails before further messages came in.
The plugin never got around to classifying those messages.

And, while I'm reporting the quirks of the Outlook plugin, I have 3
messages (out of my spam corpus of c. 2000) that the plugin refuses to
classify. If I attempt to score the contents of a folder containing
one of these messages, scoring simply stops at that point - the
progress bar disappears, and the remaining messages are left unscored.

Apart from those little things, though, this software rocks! Keep up
the good work, guys!

David.


From tim.one@comcast.net  Tue Nov 12 00:21:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 19:21:20 -0500
Subject: [Spambayes] Some more experiences with the Outlook plugin
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEELJCJAB.tim.one@comcast.net>

[Moore, Paul]

You might want to play along with "the other" training strategy we're
trying:  last week I wiped my database and started over from scratch,
training it *only* on mistakes and unsures.  It's been thru a few thousand
msgs since then, but so far I've trained it on only 51 ham and 55 spam.  The
Unsures are weird, but the Unsure rate is falling, and it makes very few
outright mistakes now (BTW, I have ham_cutoff at 20 and spam_cutoff at 80 in
the Outlook client).
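
In sketch form, with the 20/80 cutoffs on the 0-100 scale.  How the
boundaries are treated and the function names are assumptions, not the
client's code:

```python
def classify(score, ham_cutoff=20, spam_cutoff=80):
    """Three-way verdict for a score on a 0-100 scale."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

def needs_training(score, is_spam):
    """Train only when the verdict is unsure or outright wrong."""
    truth = "spam" if is_spam else "ham"
    return classify(score) != truth

verdicts = [classify(s) for s in (12, 50, 95)]
```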

> ...
> 1. To start with, configure the plugin to define one "Spam" folder and
>    one "Unsure" folder, and define all other folders as "Ham". [1]

> [1] I got this wrong at the start - the key point to stress here is
>     that *everything* that isn't spam is ham - by definition. Trying
>     to "help" the classifier by telling it to ignore messages which
>     you "know" are ham is actually detrimental - if you know, let the
>     classifier find out!

We don't have a way to train on a random sample now, and that's going to be
a killer for some people (e.g., Sean True has 2 gigabytes of ham).

> 2. Train the classifier on whatever you have available. This will
>    usually be massively overbalanced in favour of ham (few people
>    collect their spam) but it *will* make a start. [2]

> [2] I'm getting pretty good results now (but see below), with 5661
>     ham and 303 spam, but even with under 100 spam (admittedly with
>     less ham, as I made the "exclude some ham" mistake) I was getting
>     visible benefits.

My guess is that you'd do better by striving for no more than a 3:1
imbalance in either direction.  There are reasons to despise the "purely
mistake-based training" described at the top, but it seems to keep the
training sets in rough balance very naturally.

> 3. Run with this for a while, incrementally training on mistakes and
>    unsures.

Training on those is vital no matter what else you do.

> Keep all of the spam!

I'm afraid that one won't fly over time, except for researchers.  And people
boldly using unstable pre-alpha code <wink>.

> 4. Periodically, retrain the full database on all the collected ham
>    and spam.

That shouldn't be necessary when the code is complete and stable.


> Notes:
>
>
>
> Other points:
>
> * The collection I end up with is still biased - there are a lot of
>   ham messages which I just read and delete, and they are probably
>   somehow "similar". While I could retain these, this would require
>   a much more significant change to my way of working.

Keep working the way you like!  The client should eventually be able to
deduce what's ham by watching you throw away things without first calling
them spam.

> * Results still seem to be pretty much hapax based (if I understand the
>   term and its usage). Looking at the clues for a message often shows
>   some pretty bizarre tokens showing up as *either* sort of clue. (One
>   message showed 'yet' as a ham clue with a probability of 0.000877364!)

hapax means a word that appeared only once in your entire training corpus.
In the list you gave below, there are very few hapaxes (I recognize them
from the probabilities; I should probably add code to the client to display
the raw counts too):

> 'sweet'                        0.155172  these 4 appeared in one ham
> 'ads,'                         0.155172
> 'insult'                       0.155172
> 'subject:COMPUTER'             0.155172

> 'membership.'                  0.844828  these 3 appeared in one spam,
> 'home-based'                   0.844828  presumably itself since you
> 'cash.'                        0.844828  said you trained on it
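
For the curious, both hapax figures fall out of a Robinson-style smoothed estimate. This is a sketch under assumptions: the constants S = 0.45 and X = 0.5 are not stated anywhere in this thread, but they do reproduce the two probabilities quoted above. The function name is illustrative.

```python
S = 0.45   # assumed weight given to the prior
X = 0.5    # assumed spamprob for a word never seen before


def spamprob(n, p):
    """Pull the raw estimate p toward X; the pull weakens as the
    word is seen in more training messages (n)."""
    return (S * X + n * p) / (S + n)


print(round(spamprob(1, 0.0), 6))   # word seen in exactly one ham
print(round(spamprob(1, 1.0), 6))   # word seen in exactly one spam
```

Any word whose probability is exactly one of those two values was seen once; anything further from 0.5 has more evidence behind it.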


> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!).

I expect for a very different reason, though:  your 18:1 ham:spam imbalance.
This implies words can get spamprobs much closer to 0 than they can get to
1.  There's just not enough spam to *justify* spamprobs closer to 1 than
there is enough ham to justify spamprobs closer to 0.  Let's look at the 3
most extreme words on both ends of your listing:

> '(and'                         0.00044603
> 'looking'                      0.000489716
> 'added'                        0.000613999

> 'subject:your'                 0.973253
> 'click'                        0.974006
> '"remove"'                     0.985437

'(and' is nearly "33 times closer" to 0 than '"remove"' is to 1, and that
makes the accidental appearance of a ham word in spam much more powerful
than the systematic appearance of a spam word in spam.  If you only had 300
ham in your training set, it would be much harder for a word to get a very
low spamprob; contrarily, if you had 5500 spam in your training, it would be
much easier for a word to get a very high spamprob.  As is, your strong ham
words are much more powerful than your strong spam words, and almost *must*
be.
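
The arithmetic can be checked directly. The first ratio uses only the figures quoted above; the second part is a sketch that assumes a Robinson-style smoothed estimate with constants S = 0.45 and X = 0.5 (assumptions, not stated in this thread) and the 5661 ham / 303 spam counts reported earlier:

```python
# Figures quoted in the listing above.
strongest_ham = 0.00044603    # '(and'
strongest_spam = 0.985437     # '"remove"'

# How many times closer the ham word is to 0 than the spam word is to 1.
observed = (1.0 - strongest_spam) / strongest_ham

# Assumed smoothing constants and the reported training-set sizes.
S, X = 0.45, 0.5
NHAM, NSPAM = 5661, 303

# Best case at each end: a word in every ham and no spam, and vice versa.
floor = (S * X) / (S + NHAM)
ceiling = (S * X + NSPAM) / (S + NSPAM)
limit = (1.0 - ceiling) / floor

print(observed, limit)
```

Under those assumptions the best possible ham word can get about as many times closer to 0 than the best possible spam word can get to 1 as there are ham messages per spam message, so the asymmetry is built into the training data rather than into any particular word.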

Anthony Baxter here routinely runs with a ridiculous <wink> ham:spam ratio
too, but you're even way beyond him (his is about 6:1).  This brings out
effects I've never seen before.

>   Worst case recently was a message which scored as solid ham.  I
>   trained on it as "Spam", and rescored it. It still scored 5 - solid
>   ham.

That's because you're *not* hapax-driven.  If you were, the score would have
shot up to 100 (maybe 99).  All ham contains spam words, and my guess is
you've got so much more ham than spam that it's drowning out the spam.
That's picturesque but inaccurate <wink>.  A more accurate speculation
was given above.

>   My immediate reaction was "But I just *told* you it's spam!". I know
>   that isn't how the classifier works, but even so it was unsettling.
>   FWIW, I attach the spam clues for this one (I don't know if they make
>   any sense in isolation, but it can't hurt...)

No more than what I copied above.  If you like, send me the original (as an
attachment), and I'll score it under my well-trained classifier (the one I
parked last week when starting the mistake-only training experiment).  That
one was trained on about 2 thousand recent spam.

If that works better for me than for you, then I'd like to try another
experiment, shipping you just the stronger-than-hapax spam words from that
classifier, along with a bit of code you can run to *merge* that into your
own classifier.  That would be an experiment in "seeding" a classifier,
something we haven't gotten a good start on here yet.

> * I don't know how long it will be before I start grudging the use of
>   disk space to store spam. At that point, the nasty question of
>   whether I keep it, or risk being unable to recreate my database,
>   becomes important.

At 300 measly spam saved, I should remind you that a gigabyte of disk space
costs less than the value of your time worrying about it <wink>.

> I need to look at how to get some more information out of the
> classifier, to try to understand how much of the good results I see
> are down to luck (hapaxes, I guess - which makes me think of "happy
> accidents" rather than its real meaning...)

Cool!  When hapaxes work, they *are* happy accidents!  I like it.

> and hence is fragile, and how much is actually solid.  Can anyone point
> me at the right part of the code to read to find this?

classifier.py contains all the code for probability estimation and scoring.


From tim.one@comcast.net  Tue Nov 12 00:30:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 19:30:01 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <76f0tu4o9e3a9p6ictc29kvv1u6bhict22@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCELLCJAB.tim.one@comcast.net>

[David Leftley]
> ...
> And, while I'm reporting the quirks of the Outlook plugin, I have 3
> messages (out of my spam corpus of c. 2000) that the plugin refuses to
> classify. If I attempt to score the contents of a folder containing
> one of these messages, scoring simply stops at that point - the
> progress bar disappears, and the remaining messages are left unscored.

Next time that happens, bring up PythonWin and do Tools -> Trace Collector
Debugging Tool.  That will pop up a window showing diagnostic msgs and
tracebacks produced by the Outlook client.  You'll probably find something
"interesting" near the end.  Note that nobody who has done work on the
client has any form of Exchange running, so diagnosis may not lead to a
cure.  Still, can't fix what nobody understands, so it will be a start.

> Apart from those little things, though, this software rocks! Keep up
> the good work, guys!

Tell Redmond -- if they paid Mark to bust his balls on this, I bet he'd grow
a new pair <wink>.


From tim@fourstonesExpressions.com  Tue Nov 12 00:49:52 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 11 Nov 2002 18:49:52 -0600
Subject: [Spambayes] Re: RE: [Spambayes-checkins] website docs.ht,1.3,1.4
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOELMCJAB.tim.one@comcast.net>
Message-ID: <TNPJJYSSQXTGBURB4B96ZUROFC7.3dd05030@riven>

11/11/2002 6:40:44 PM, Tim Peters <tim.one@comcast.net> wrote:

>> ! <dt>hapax, hapax legomenon <dd>a word or form occurring only once in a
>> ! document or corpus. (plural is hapax legomena)
>>   </dl>
>
>Ya, but even I'm not that anal -- I usually say hapaxes.  hapaxora would be
>a hoot too <wink>.

Hapax driven, alternate defn: Typical mode of inter-gender communication, as 
in:

"Husband: Beer"
"Wife: No"
"Husband: Now"
"Wife: NOT!!!"
"Husband: Please?"
"Wife: Dreamer"
"Husband: <expletive>"
"Wife: Idiot"
"Husband: What?"
"Wife: LISTEN!"

- TimS

etc. etc.
>
>
>_______________________________________________
>Spambayes-checkins mailing list
>Spambayes-checkins@python.org
>http://mail.python.org/mailman/listinfo/spambayes-checkins
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Tue Nov 12 01:27:03 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 20:27:03 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f1ab9f5529c591c@[192.168.1.101]>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>

[Robert Woodhead]
> ...
> It seems to me that you're at the point where testing the effects of
> data reduction techniques would be fruitful.

Bootstrapping a classifier, connecting to a gazillion quirky email clients,
and testing training strategies are all current high priorities.  Saving
memory wouldn't buy me anything in the Outlook client I'm using, or in the
high-volume python.org application.  But, as I said, other people are keener
on that, and I expect that reducing the sheer number of tokens is a more
effective approach (in part because it ties into effective training
strategies over time -- the database will just keep growing (albeit at a
slackening pace) without active pruning, and whether a token takes one byte
or 50).

> Once I get up and running on the code (just paid  the tithe to O'Reilly)
> I'll test it out.

It's all yours <wink>.

> One thing that occurred to me: now that you have something that seems
> to work pretty well, have you considered backtracking on particular
> features to see how much they contribute; for example, going to a
> trivial state machine parser to spit out tokens?

In theory, all prior decisions should be revisited after every change.  I
haven't done anything like that lately, though, in part because no previous
"let's revisit this!" experiment ever paid off.

Note that the bulk of the body tokenizer couldn't be simpler:

1. Convert to lowercase.
2. Split on whitespace.

Well, we *could* skip #1, but previous experiments found that it didn't give
better error rates but did increase the database size.  It did change the
*kinds* of errors, though, and in particular conference announcements had a
hard time getting thru when case was preserved (they're trying to sell you a
conference, and often SCREAM ABOUT IT).
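
Those two steps amount to very little code. A minimal sketch (the function name is illustrative, and the real tokenizer does more than this, e.g. for headers and URLs):

```python
def tokenize_body(text):
    """Fold case, then split on runs of whitespace."""
    return text.lower().split()


tokens = tokenize_body("Register NOW for the\tPython Conference")
print(tokens)
```

With case folded, a screaming NOW and a quiet now become the same token, which is the database-size saving mentioned above.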

> ...
> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!")

They had 6 or 9 when I was a lad, depending on how you set the control bit
for the Univac 1108's 36-bit words.

> have lots of tricks.  We don't so much write code as remember it and
> retype it.

You don't want to bet on who's older here <wink>.

> ...
> Not really; it doesn't really matter what the format of a token
> coming out of the parser is, does it?

The classifier is happy with any immutable and hashable Python object, i.e.
anything that can be used as a Python dict key.  But people grafting various
databases onto this have stronger requirements, and they're not always
clear.  As I mentioned last time, most "lightweight" databases require
string keys, so any switch away from strings would break those systems.
It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
there's a clear win in return.
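
To illustrate the dict-key point: a sketch in which tokens of several hashable types are counted in a plain dict, plus a hypothetical adapter (to_string_key is not part of the project) for a store that accepts only string keys:

```python
# Any immutable, hashable object can serve as a token (dict key).
counts = {}
tokens = ["click", ("subject", "your"), 42, frozenset(["a", "b"])]
for token in tokens:
    counts[token] = counts.get(token, 0) + 1

# A mutable object such as a list is rejected by the dict itself.
try:
    counts[["not", "hashable"]] = 1
    list_ok = True
except TypeError:
    list_ok = False


def to_string_key(token):
    """Hypothetical adapter for a string-keyed database."""
    return token if isinstance(token, str) else repr(token)


print(list_ok, to_string_key(("subject", "your")))
```

An adapter like this is exactly the kind of extra layer every string-keyed backend would need if tokens stopped being strings.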

> ...
> True; then it becomes a game of finding generic messages that are
> likely to evaluate as hammy enough to the average recognizer.  And
> the meta-response is to send out multiple emails with differently
> tuned slices of ham.

They can try.  Spam doesn't need to be stopped, though, it merely has to be
made more costly to send than it brings back.

Last week Jeremy and Guido here both reported a *very* effective technique:
spam was sent to them as replies to mailing-list postings (not this mailing
list <wink>) they had made, including a full quote of the msg they had
posted.  That was guaranteed to have lots of ham words for them, and the
Subject line was the expected "Re:" followed by their own subject line.

I doubt they're going to get a response rate high enough to be able to
afford this scheme over time, at least not on tech mailing lists.  We'll
see; if they can, it's going to be hard to beat.

> I hereby, btw, coin the term "Dagwood" (or perhaps it should be
> Wooddag?) to mean an email containing artfully sliced amounts of ham,
> spam, and html condiments.  ;^)

Cool!  Dagwood it is.

> ...
> Well, what you'd need is a hacked HTML renderer that output sets that
> look like (token,size,color,background) and ignored words that were
> too small or hard to read.

Sure.  I expect the quickest path would be to feed the source thru a
text-only browser, and stare at its output.  That seems mondo expensive,
though.

>> For goodness sake, this is email we're talking about -- anyone
>> trusting a truly critical msg to email is dreaming to begin with.

> Unfortunately, in the real world, this happens all too often.  Keep
> in mind that the readers of this list are not the typical users of
> the resulting software techniques.

I do, but it's still not my problem <0.5 wink>.  All non-trivial systems
have non-zero FP rates, and that's a fact of life.  You're keen on
whitelists, but they wouldn't do a thing to stop any of the false positives
I've seen, and so on; a multitude of schemes may reduce the overall error
rates if they're combined intelligently, but they're not going to reach an
error rate of 0.  Not even with human review (as has become obvious to
everyone who's run a good system over their supposedly clean ham and spam
collections).  At some point, learning that Santa Claus isn't actually a
white man is a part of growing up <wink>.

show-me-an-isp-that-guarantees-email-delivery-and-we'll-get-
    rich-shorting-its-stock-ly y'rs  - tim


From tim@fourstonesExpressions.com  Tue Nov 12 01:32:18 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 11 Nov 2002 19:32:18 -0600
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>
Message-ID: <FDD0USKH71KSN1XWSRM6JDWXW64VR.3dd05a22@riven>

11/11/2002 7:27:03 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Robert Woodhead]
>> ...
>> It seems to me that you're at the point where testing the effects of
>> data reduction techniques would be fruitful.
>
>Bootstrapping a classifier, connecting to a gazillion quirky email clients,
>and testing training strategies are all current high priorities.  Saving
>memory wouldn't buy me anything in the Outlook client I'm using, or in the
>high-volume python.org application.  But, as I said, other people are keener
>on that, and I expect that reducing the sheer number of tokens is a more
>effective approach (in part because it ties into effective training
>strategies over time -- the database will just keep growing (albeit at a
>slackening pace) without active pruning, and whether a token takes one byte
>or 50).
>
>> Once I get up and running on the code (just paid  the tithe to O'Reilly)
>> I'll test it out.
>
>It's all yours <wink>.
>
>> One thing that occurred to me: now that you have something that seems
>> to work pretty well, have you considered backtracking on particular
>> features to see how much they contribute; for example, going to a
>> trivial state machine parser to spit out tokens?
>
>In theory, all prior decisions should be revisited after every change.  I
>haven't done anything like that lately, though, in part because no previous
>"let's revisit this!" experiment ever paid off.
>
>Note that the bulk of the body tokenizer couldn't be simpler:
>
>1. Convert to lowercase.
>2. Split on whitespace.

This makes me wonder what happens if someone spams you with various devices 
like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
w
h
i
t
e
s
p
a
c
e

- TimS

>
>Well, we *could* skip #1, but previous experiments found that it didn't give
>better error rates but did increase the database size.  It did change the
>*kinds* of errors, though, and in particular conference announcements had a
>hard time getting thru when case was preserved (they're trying to sell you a
>conference, and often SCREAM ABOUT IT).
>
>> ...
>> Yeah, we old farts ("When I was a lad, the bytes only had 6 bits!")
>
>They had 6 or 9 when I was a lad, depending on how you set the control bit
>for the Univac 1108's 36-bit words.
>
>> have lots of tricks.  We don't so much write code as remember it and
>> retype it.
>
>You don't want to bet on who's older here <wink>.
>
>> ...
>> Not really; it doesn't really matter what the format of a token
>> coming out of the parser is, does it?
>
>The classifier is happy with any immutable and hashable Python object, i.e.
>anything that can be used as a Python dict key.  But people grafting various
>databases onto this have stronger requirements, and they're not always
>clear.  As I mentioned last time, most "lightweight" databases require
>string keys, so any switch away from strings would break those systems.
>It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
>there's a clear win in return.
>
>> ...
>> True; then it becomes a game of finding generic messages that are
>> likely to evaluate as hammy enough to the average recognizer.  And
>> the meta-response is to send out multiple emails with differently
>> tuned slices of ham.
>
>They can try.  Spam doesn't need to be stopped, though, it merely has to be
>made more costly to send than it brings back.
>
>Last week Jeremy and Guido here both reported a *very* effective technique:
>spam was sent to them as replies to mailing-list postings (not this mailing
>list <wink>) they had made, including a full quote of the msg they had
>posted.  That was guaranteed to have lots of ham words for them, and the
>Subject line was the expected "Re:" followed by their own subject line.
>
>I doubt they're going to get a response rate high enough to be able to
>afford this scheme over time, at least not on tech mailing lists.  We'll
>see; if they can, it's going to be hard to beat.
>
>> I hereby, btw, coin the term "Dagwood" (or perhaps it should be
>> Wooddag?) to mean an email containing artfully sliced amounts of ham,
>> spam, and html condiments.  ;^)
>
>Cool!  Dagwood it is.
>
>> ...
>> Well, what you'd need is a hacked HTML renderer that output sets that
>> look like (token,size,color,background) and ignored words that were
>> too small or hard to read.
>
>Sure.  I expect the quickest path would be to feed the source thru a
>text-only browser, and stare at its output.  That seems mondo expensive,
>though.
>
>>> For goodness sake, this is email we're talking about -- anyone
>>> trusting a truly critical msg to email is dreaming to begin with.
>
>> Unfortunately, in the real world, this happens all too often.  Keep
>> in mind that the readers of this list are not the typical users of
>> the resulting software techniques.
>
>I do, but it's still not my problem <0.5 wink>.  All non-trivial systems
>have non-zero FP rates, and that's a fact of life.  You're keen on
>whitelists, but they wouldn't do a thing to stop any of the false positives
>I've seen, and so on; a multitude of schemes may reduce the overall error
>rates if they're combined intelligently, but they're not going to reach an
>error rate of 0.  Not even with human review (as has become obvious to
>everyone who's run a good system over their supposedly clean ham and spam
>collections).  At some point, learning that Santa Claus isn't actually a
>white man is a part of growing up <wink>.
>
>show-me-an-isp-that-guarantees-email-delivery-and-we'll-get-
>    rich-shorting-its-stock-ly y'rs  - tim
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From anthony@interlink.com.au  Tue Nov 12 01:36:28 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 12:36:28 +1100
Subject: [Spambayes] A couple of small tokenizer experiments. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCELFCJAB.tim.one@comcast.net> 
Message-ID: <200211120136.gAC1aTs09777@localhost.localdomain>


>>> Tim Peters 
> Can you try this again replacing "break" with "continue"?  I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period.  Spam could easily
> exploit that.

Woopsie. I knew that :)
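
The patch under discussion isn't reproduced in this message, so the following is only a sketch of the difference between the two keywords. The regex, the is_mailman_url test and the token format are all hypothetical:

```python
import re

URL_RE = re.compile(r"https?://\S+")


def is_mailman_url(url):
    # Hypothetical test for list-management links.
    return "/mailman/listinfo/" in url


def url_tokens(text, stop_at_mailman=False):
    tokens = []
    for url in URL_RE.findall(text):
        if is_mailman_url(url):
            if stop_at_mailman:
                break       # the bug: later URLs are never examined
            continue        # the fix: skip just this one URL
        tokens.append("url:" + url)
    return tokens


msg = ("http://mail.python.org/mailman/listinfo/spambayes "
       "http://spam.example.com/buy-now")
print(url_tokens(msg, stop_at_mailman=True))
print(url_tokens(msg))
```

With break, a spammer only has to put a Mailman-looking link first to hide every URL that follows it.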


> >> ham:spam:  11192:1826
> >>                   11192:1826
> 
> You realize you've got a very high ratio of ham to spam, right?

*nod* It's my full personal test corpus. There's another 600 spam 
that haven't been dropped in. I'm re-running tests at the moment
with smaller amounts.

> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora.  It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine.  I would rather see
> the code added to the loop currently cracking "from" lines:

I've done this now, and am testing it before checking it in.

> Why is this tokenizing only "the first" piece of the Subject line?

Thinko.

> I changed this to loop over all the Subject parts, and saw some minor good
> effects on marginal msgs, so I'll check this one in without further ado.  It
> wasn't much of a win for you either, but it's cheap so why not.  In my
> personal email "subjectcharset:unknown" shows up a lot for some reason (but
> only in spam).

Hm. Dunno about that - Barry might know under what circumstances 
email package gives 'unknown' as a charset. I can't see how that 
could happen.


> > I plan to try something like tokenizing the oldest three received
> > lines (to hopefully avoid the previous issues with mail.python.org
> > blowing numbers to hell) to see if that will help this one.
> Did you try that yet?  I'm not replying in a timely fashion, but that's not
> because I'm not interested; it's just because I'm 244 msgs behind on this
> mailing list alone now <wink/sigh>.

Not yet, no. It's on the stack.

> > A base64d MP3 spam sent via zope-dev
> > (*H* 0.993904, *S* 0.187868 = 0.0969820429397)
> > which got a bunch of hammy clues from "Subject: [Zope-dev] Re: ofpa" and
> > also the various mailman type clues (although that's better with the
> > first patch, above)

I'm going to try a patch to try and strip out mailing list [titles] at
some point, too.
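
Roughly what I have in mind, as an untested idea rather than the eventual patch. The regex and the function name are placeholders:

```python
import re

# Matches a bracketed tag such as "[Zope-dev]" plus trailing whitespace.
LIST_TAG_RE = re.compile(r"\[[^\]]*\]\s*")


def strip_list_tags(subject):
    """Drop bracketed mailing-list tags from a Subject line."""
    return LIST_TAG_RE.sub("", subject)


print(strip_list_tags("[Zope-dev] Re: ofpa"))
```

This would also eat bracketed text that isn't a list tag, so it needs testing before anyone relies on it.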

Anthony

From tim.one@comcast.net  Tue Nov 12 01:53:07 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 20:53:07 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <FDD0USKH71KSN1XWSRM6JDWXW64VR.3dd05a22@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEMBCJAB.tim.one@comcast.net>

[Tim Stone]
> This makes me wonder what happens if someone spams you with
> various devices
> like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
> w
> h
> i
> t
> e
> s
> p
> a
> c
> e

Most of that would be invisible to us, as we ignore "words" with fewer than
3 characters, so they'd get judged mostly on the header lines, and it's not
easy for spam to get by those even in isolation.
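
A sketch of that effect on the first line of the example, assuming nothing more than lowercase, whitespace split, and a 3-character minimum (the function name and the min_len parameter are illustrative, not the tokenizer's actual interface):

```python
def tokenize_body(text, min_len=3):
    """Fold case, split on whitespace, drop very short 'words'."""
    return [w for w in text.lower().split() if len(w) >= min_len]


sample = "c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions"
print(tokenize_body(sample))

# One letter per line fares no better: nothing survives at all.
print(tokenize_body("w\nh\ni\nt\ne"))
```

The single letters vanish entirely, and the fragments that do survive are odd enough that they're unlikely to look like anybody's ham.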

But spammers won't *do* that regardless.  There's A Reason they use giant
fonts and bright colors:  the harder a msg is to read, the lower the
response rate, and they're not immune to economics.

A better strategy is to just have HTML pointing to a .gif or .jpg out on the
web.  They can make that as gaudy as they like and the classifier won't see
any of it.  This seems quite common in Asian spam now, but Guido speculated
(and I think he's right) that this is more because the Asians are fighting
intractable character-set issues.  I'm seeing more of it now in English spam
too, but it's still rare.  For whatever reasons, this system hasn't had any
trouble learning to call such stuff spam (I expect that the special
tokenizing of URLs we do is helping a lot).


From tim@fourstonesExpressions.com  Tue Nov 12 02:03:38 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 11 Nov 2002 20:03:38 -0600
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEMBCJAB.tim.one@comcast.net>
Message-ID: <B761L97EAUQMKLJQPHGCB0ECXSDA.3dd0617a@riven>

Gotcha.  You dudes are on top of things... ;)

Wanna do some ocr stuff on referenced jpgs and gifs?  ;;;)  I know I know... 
bad idea for any of a thousand reasons...

- TimS

11/11/2002 7:53:07 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Tim Stone]
>> This makes me wonder what happens if someone spams you with
>> various devices
>> like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
>> w
>> h
>> i
>> t
>> e
>> s
>> p
>> a
>> c
>> e
>
>Most of that would be invisible to us, as we ignore "words" with fewer than
>3 characters, so they'd get judged mostly on the header lines, and it's not
>easy for spam to get by those even in isolation.
>
>But spammers won't *do* that regardless.  There's A Reason they use giant
>fonts and bright colors:  the harder a msg is to read, the lower the
>response rate, and they're not immune to economics.
>
>A better strategy is to just have HTML pointing to a .gif or .jpg out on the
>web.  They can make that as gaudy as they like and the classifier won't see
>any of it.  This seems quite common in Asian spam now, but Guido speculated
>(and I think he's right) that this is more because the Asians are fighting
>intractable character-set issues.  I'm seeing more of it now in English spam
>too, but it's still rare.  For whatever reasons, this system hasn't had any
>trouble learning to call such stuff spam (I expect that the special
>tokenizing of URLs we do is helping a lot).
>
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Tue Nov 12 02:06:05 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 11 Nov 2002 21:06:05 -0500
Subject: [Spambayes] A couple of small tokenizer experiments.
In-Reply-To: <200211120136.gAC1aTs09777@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEMDCJAB.tim.one@comcast.net>

Quickie:


>> In personal email "subjectcharset:unknown" shows up a lot for some
>> reason (but only in spam).

> Hm. Dunno about that - Barry might know under what circumstances
> email package gives 'unknown' as a charset. I can't see how that
> could happen.

Easy <wink>:  it's my personal email, and the string UNKNOWN is what
*Outlook* delivers.  I think it actually says UNKNOWN as it came in off the
wire!

I get my share of

    Subject: =?Big5?B?pc7BecHIpGq/+g==?=

thingies, but I also get monsters like this:

Subject: =?UNKNOWN?Q?=1B$B!z%-%c%s%Z!=3C%s=3CB=3B=5CCf!*!*=1B=28
        B1=1B$B%/?==?UNKNOWN?Q?%j%C%/!w=1B=28B15=1B$B1=5F!A=1B=28
        B25=1B$B1=5F!z=1B?==?UNKNOWN?Q?=28B?=

That one came in to webmaster@python.org on Friday.  Perhaps they've learned
that Greg will reject a msg just for using an unloved charset, but I doubt
it.

In fact, I see that 'subjectcharset:unknown' is now the single strongest
spam word in my entire mistake-driven (and tiny) training corpus:

'subjectcharset:unknown'       0.934783
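
To see where such a token could come from, here is a sketch using the current standard library's email.header.decode_header, which reports the declared charset of each encoded word in lowercase. The function name is illustrative and this is not necessarily how the project's tokenizer builds the token:

```python
from email.header import decode_header


def subject_charset_tokens(subject):
    """One token per charset declared in the Subject's encoded words."""
    tokens = set()
    for _text, charset in decode_header(subject):
        if charset is not None:       # None means plain, unencoded text
            tokens.add("subjectcharset:" + charset)
    return tokens


print(subject_charset_tokens("=?Big5?B?pc7BecHIpGq/+g==?="))
print(subject_charset_tokens("=?UNKNOWN?Q?=28B?="))
print(subject_charset_tokens("Re: plain old subject"))
```

The charset name is taken verbatim from the header, so whatever string the sending or receiving software wrote there, UNKNOWN included, becomes a clue.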


From trebor@animeigo.com  Tue Nov 12 02:16:13 2002
From: trebor@animeigo.com (Robert Woodhead)
Date: Mon, 11 Nov 2002 21:16:13 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>
Message-ID: <a05200f3fb9f60c3ccb9e@[192.168.1.103]>

>Bootstrapping a classifier, connecting to a gazillion quirky email clients,
>and testing training strategies are all current high priorities.  Saving
>memory wouldn't buy me anything in the Outlook client I'm using, or in the
>high-volume python.org application.  But, as I said, other people are keener
>on that, and I expect that reducing the sheer number of tokens is a more
>effective approach (in part because it ties into effective training
>strategies over time -- the database will just keep growing (albeit at a
>slackening pace) without active pruning, and whether a token takes one byte
>or 50).

My hunch, based on things I've done in the past, is that as the total 
volume of mail increases, the rate of increase in the number of 
unique tokens will approach a limit (that being, the number of 
distinct individual words in the language, though foreign unicode 
gibberish will have an effect).  When I was doing single word 
analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 
300,000 distinct tokens (including the aforementioned gibberish).

It will be interesting to see the results of some data reduction on 
the accuracy of the recogniser.  My WAG is that even some serious 
hashing (down to, say, 20 bit tokens) won't have much effect on 
accuracy because most of the collisions will be between low 
frequency, insignificant tokens.
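
A sketch of what I mean by hashing down to 20 bits. CRC-32 is just a convenient stand-in for whatever hash ends up being used, and the collision estimate assumes the hash spreads tokens uniformly:

```python
import math
import zlib


def hash_token(token, bits=20):
    """Reduce a string token to a small integer key."""
    return zlib.crc32(token.encode("utf-8")) & ((1 << bits) - 1)


# Counts are kept per hashed key instead of per string.
counts = {}
for tok in ["click", "remove", "click"]:
    key = hash_token(tok)
    counts[key] = counts.get(key, 0) + 1

# Expected share of n distinct tokens that land in an
# already-occupied bucket when there are m buckets.
n, m = 300_000, 1 << 20
occupied = m * (1.0 - math.exp(-n / m))
shared = (n - occupied) / n
print(counts[hash_token("click")], round(shared, 2))
```

So a noticeable fraction of tokens would share a bucket at that size; the open question is whether the ones that collide matter to the score.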

>In theory, all prior decisions should be revisited after every change.  I
>haven't done anything like that lately, though, in part because no previous
>"let's revisit this!" experiment ever paid off.

Well, usually the time to check by chopping out particular components 
is when you've got it running so well that adding things doesn't help 
you.

>They had 6 or 9 when I was a lad, depending on how you set the control bit
>for the Univac 1108's 36-bit words.

You had use of a Univac?  You lucky, lucky bastard!  I had to use a 
CARDIAC, and share the eraser used to wipe out the core.  And had to 
walk 5 miles, uphill, in the snow, barefoot, to do that!  ;^)

>You don't want to bet on who's older here <wink>.

Old Fartdom is not measured in chronological years; it is an 
existential state of being.  I became one the day I heard a young 
programmer complain that half a gigabyte of ram simply wasn't enough 
memory!  ;^)

>The classifier is happy with any immutable and hashable Python object, i.e.
>anything that can be used as a Python dict key.  But people grafting various
>databases onto this have stronger requirements, and they're not always
>clear.  As I mentioned last time, most "lightweight" databases require
>string keys, so any switch away from strings would break those systems.
>It's pre-alpha code, but still I'm not keen to rock anyone's boat unless
>there's a clear win in return.

Point taken; my point (maybe not expressed clearly) was that if you 
go to a hashing/data reduction scheme, then you just keep the entire 
thing in memory.  Or you graft a mock db interface onto the data 
structure for compatibility during testing (which is probably what 
I'll try).

>  > True; then it becomes a game of finding generic messages that are
>>  likely to evaluate as hammy enough to the average recognizer.  And
>>  the meta-response is to send out multiple emails with differently
>>  tuned slices of ham.
>
>They can try.  Spam doesn't need to be stopped, though, it merely has to be
>made more costly to send than it brings back.

Note that the multiple emails can be madlib'd.  They have access to 
more processor and bandwidth over time as well, alas.

>Last week Jeremy and Guido here both reported a *very* effective technique:
>spam was sent to them as replies to mailing-list postings (not this mailing
>list <wink>) they had made, including a full quote of the msg they had
>posted.  That was guaranteed to have lots of ham words for them, and the
>Subject line was the expected "Re:" followed by their own subject line.

Ouch, that's evil.  Maybe the solution for that is to look at the 
message and the quotation individually?  But that can be metagamed 
too.

>
>>  I hereby, btw, coin the term "Dagwood" (or perhaps it should be
>>  Wooddag?) to mean an email containing artfully sliced amounts of ham,
>>  spam, and html condiments.  ;^)
>
>Cool!  Dagwood it is.

What, we're agreeing on something?!  I must be doing something wrong! 
Wait a minute, you agreed with me.  What's wrong with you?  A fever 
perhaps?  ;^)

At 8:33 PM -0500 11/11/02, Tim Stone wrote:
>This makes me wonder what happens if someone spams you with various devices
>like c o n v e r t i n g wor ds into var ious c.o.m.b in a.tions of
>w
>h
>i
>t
>e
>s
>p
>a
>c
>e

I'd say that,
af
ter
be ing
forc ed
to rea d it,
if he was sell
ing t
yle
nol or
ibu
pr
of
en
I'
d
proba bl y b u y
som e!

(just not from him)

R

From piersh@friskit.com  Tue Nov 12 02:41:37 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 11 Nov 2002 18:41:37 -0800
Subject: [Spambayes] Re: Outlook plugin plus Exchange
Message-ID: <9891913C5BFE87429D71E37F08210CB9183A08@zeus.sfhq.friskit.com>

This is a multi-part message in MIME format.

---------------------- multipart/mixed attachment
I see the same thing on a few messages in my corpus. I believe it's
something weird to do with the way outlook splits out the MIME headers.

Attached is a dump of the exception.

Piers.

> -----Original Message-----
> From: Tim Peters [mailto:tim.one@comcast.net]
> Sent: Monday, November 11, 2002 4:30 PM
> To: David Leftley
> Cc: spambayes@python.org
> Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange
>
>
> [David Leftley]
> > ...
> > And, while I'm reporting the quirks of the Outlook plugin, I have 3
> > messages (out of my spam corpus of c. 2000) that the plugin refuses to
> > classify. If I attempt to score the contents of a folder containing
> > one of these messages, scoring simply stops at that point - the
> > progress bar disappears, and the remaining messages are left unscored.
>
> Next time that happens, bring up PythonWin and do Tools ->
> Trace Collector Debugging Tool.  That will pop up a window
> showing diagnostic msgs and tracebacks produced by the
> Outlook client.  You'll probably find something "interesting"
> near the end.  Note that nobody who has done work on the
> client has any form of Exchange running, so diagnosis may not
> lead to a cure.  Still, can't fix what nobody understands, so
> it will be a start.
>
> > Apart from those little things, though, this software rocks! Keep up
> > the good work, guys!
>
> Tell Redmond -- if they paid Mark to bust his balls on this,
> I bet he'd grow a new pair <wink>.
>
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
>

---------------------- multipart/mixed attachment
RkFJTEVEIHRvIGNyZWF0ZSBlbWFpbC5tZXNzYWdlIGZyb206ICAnWC1NUy1NYWlsLUdpYmJlcmlz
aDogTWljcm9zb2Z0IE1haWwgSW50ZXJuZXQgSGVhZGVycyBWZXJzaW9uIDIuMFxyXG5SZWNlaXZl
ZDogZnJvbSBpbmV0LW1haWw3Lm9yYWNsZS5jb20gKFsyMDkuMjQ2LjEwLjE3MV0pIGJ5IHpldXMu
c2ZocS5mcmlza2l0LmNvbSB3aXRoIE1pY3Jvc29mdCBTTVRQU1ZDKDUuMC4yMTk1LjQ0NTMpO1xy
XG5cdCBTYXQsIDEzIEFwciAyMDAyIDAzOjE5OjAxIC0wNzAwXHJcblJlY2VpdmVkOiBmcm9tIGJs
YXN0ZXItc210cC5vcmFjbGUuY29tIChlYmxhc3QwMS5vcmFjbGVlYmxhc3QuY29tIFsxNDguODcu
OS4xMV0pXHJcblx0YnkgaW5ldC1tYWlsNy5vcmFjbGUuY29tIChTd2l0Y2gtMi4yLjEvU3dpdGNo
LTIuMi4wKSB3aXRoIEVTTVRQIGlkIGczREE4R1YzMDA2NVxyXG5cdGZvciBQSUVSU0hARlJJU0tJ
VC5DT007IFNhdCwgMTMgQXByIDIwMDIgMDM6MDg6MTYgLTA3MDBcclxuRGF0ZTogU2F0LCAxMyBB
cHIgMjAwMiAwMzowODoxNiAtMDcwMFxyXG5NZXNzYWdlLUlkOiA8MjAwMjA0MTMxMDA4LmczREE4
R1YzMDA2NUBpbmV0LW1haWw3Lm9yYWNsZS5jb20+XHJcblN1YmplY3Q6IE9yYWNsZSBVbml2ZXJz
aXR5IGlTZW1pbmFyc1xyXG5Gcm9tOiBPcmFjbGUgQ29ycG9yYXRpb248cmVwbGllc0BvcmFjbGVl
Ymxhc3QuY29tPlxyXG5UbzogUElFUlNIQEZSSVNLSVQuQ09NXHJcblJlcGx5LVRvOiByZXBsaWVz
QG9yYWNsZWVibGFzdC5jb21cclxuQ29udGVudC1UcmFuc2Zlci1FbmNvZGluZzogOGJpdFxyXG5N
SU1FLVZlcnNpb246IDEuMFxyXG5Db250ZW50LVR5cGU6IG11bHRpcGFydC9hbHRlcm5hdGl2ZTtc
clxuICAgIGJvdW5kYXJ5PSJuZXh0X3BhcnRfb2ZfbWVzc2FnZSJcclxuUmV0dXJuLVBhdGg6IHJl
cGxpZXNAb3JhY2xlZWJsYXN0LmNvbVxyXG5YLU9yaWdpbmFsQXJyaXZhbFRpbWU6IDEzIEFwciAy
MDAyIDEwOjE5OjAxLjA5MzggKFVUQykgRklMRVRJTUU9W0ExRDhGOTIwOjAxQzFFMkQ0XVxyXG5c
clxuLS1uZXh0X3BhcnRfb2ZfbWVzc2FnZVxyXG5vZl9tZXNzYWdlXHJcbmdlXHJcblxyXG4tLW5l
eHRfcGFydF9vZl9tZXNzYWdlXHJcbkNvbnRlbnQtVHlwZTogdGV4dC9odG1sXHJcblxyXG5cblxy
XG5cclxuPGh0bWw+XHJcbjxoZWFkPlxyXG48TUVUQSBIVFRQLUVRVUlWPSJDb250ZW50LVR5cGUi
IENPTlRFTlQ9InRleHQvaHRtbDsgY2hhcnNldD1pc28tODg1OS0xIj5cclxuXHJcbjx0aXRsZT5y
ZW1pbmRlcjwvdGl0bGU+XHJcblxyXG48c3R5bGUgdHlwZT0idGV4dC9jc3MiPlxyXG5cdHAgeyAg
Zm9udC1mYW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWY7IGZvbnQt
c2l6ZTogMTFweDsgZm9udC1zdHlsZTogbm9ybWFsOyBsaW5lLWhlaWdodDogMThweDsgZm9udC13
ZWlnaHQ6IG5vcm1hbH1cclxuXHR1bCB7ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhl
bHZldGljYSwgc2Fucy1zZXJpZjsgZm9udC1zaXplOiAxMXB4OyBmb250LXN0eWxlOiBub3JtYWw7
IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogbm9ybWFsOyBsaXN0LXN0eWxlLXR5cGU6
IGRpc2N9XHJcblx0Lm5vYm9sZHR4dCB7ICBmb250OiBub3JtYWwgMTFweC8xNHB4IFZlcmRhbmEs
IEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWZ9XHJcblx0LmxpdmVidG50eHQgeyAgZm9udC1m
YW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNhbnMtc2VyaWY7IGZvbnQtc2l6ZTog
MTFweDsgZm9udC1zdHlsZTogYm9sZDsgbGluZS1oZWlnaHQ6IDE0cHg7IGZvbnQtd2VpZ2h0OiBu
b3JtYWw7IGNvbG9yOiAjRkZGRkZGfVxyXG5cdGIgeyAgZm9udDogYm9sZCAxMXB4LzE0cHggVmVy
ZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZiB9XHJcblx0aSB7ICBmb250LWZhbWls
eTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9udC1zaXplOiAxMXB4
OyBmb250LXN0eWxlOiBpdGFsaWM7IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogbm9y
bWFsfVxyXG5cdGEgeyAgZm9udC1mYW1pbHk6IFZlcmRhbmEsIEFyaWFsLCBIZWx2ZXRpY2EsIHNh
bnMtc2VyaWY7IGZvbnQtc2l6ZTogMTFweDsgZm9udC1zdHlsZTogYm9sZDsgbGluZS1oZWlnaHQ6
IDE4cHg7IGZvbnQtd2VpZ2h0OiBub3JtYWw7IGNvbG9yOiAjRkYwMDAwfVxyXG5cdC50aXRsZSB7
ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9u
dC1zaXplOiAxNHB4OyBmb250LXN0eWxlOiBib2xkOyBsaW5lLWhlaWdodDogMjBweDsgZm9udC13
ZWlnaHQ6IGJvbGR9XHJcblx0LnN1YnRpdGxlIHsgIGZvbnQtZmFtaWx5OiBWZXJkYW5hLCBBcmlh
bCwgSGVsdmV0aWNhLCBzYW5zLXNlcmlmOyBmb250LXNpemU6IDEycHg7IGZvbnQtc3R5bGU6IGJv
bGQ7IGxpbmUtaGVpZ2h0OiAxOHB4OyBmb250LXdlaWdodDogYm9sZH1cclxuXHQuYm90bGluayB7
ICBmb250LWZhbWlseTogVmVyZGFuYSwgQXJpYWwsIEhlbHZldGljYSwgc2Fucy1zZXJpZjsgZm9u
dC1zaXplOiAxMXB4OyBsaW5lLWhlaWdodDogMTRweDsgZm9udC13ZWlnaHQ6IGJvbGQ7IGNvbG9y
OiAjRkZGRkZGfVxyXG48L3N0eWxlPlxyXG48L2hlYWQ+XHJcbjxib2R5IGJnY29sb3I9IiNGRkZG
RkYiPlxyXG5cclxuXHJcbjwhLS0gT3JhY2xlIExvZ28gLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAw
IiBib3JkZXI9IjAiIGNlbGxwYWRkaW5nPSIwIiBjZWxsc3BhY2luZz0iMCI+XHJcbiAgPHRyPiBc
clxuICAgIDx0ZCBoZWlnaHQ9IjQwIiBiZ2NvbG9yPSIjRkYwMDAwIiB2YWxpZ249InRvcCIgd2lk
dGg9IjUwMCI+PGltZyBzcmM9Imh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFy
cy9FVFMtb3JhY2xlTG9nby5naWYiIHdpZHRoPSI1MDAiIGhlaWdodD0iNDAiPjwvdGQ+XHJcbiAg
PC90cj5cclxuPC9UQUJMRT5cclxuXHJcblxyXG48IS0tICoqQkVHSU4gIEZsYXNoIE1lZGlhICBI
RVJFKiogLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAwIiBib3JkZXI9IjAiIGNlbGxzcGFjaW5nPSIw
IiBjZWxscGFkZGluZz0iMCI+XHJcbiAgPHRyPiBcclxuICAgIDx0ZCBjb2xzcGFuPSI1IiBhbGln
bj0iY2VudGVyIiBoZWlnaHQ9IjEwMCIgd2lkdGg9IjUwMCIgYmdjb2xvcj0iI0ZGMDAwMCI+IFxy
XG4gICAgICA8aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJz
LzAyMDU5OW91X2VtMS5naWYiIHdpZHRoPSI1MDAiIGhlaWdodD0iMTAwIj4gPC90ZD5cclxuICA8
L3RyPlxyXG48L1RBQkxFPlxyXG48IS0tIEVORCBGbGFzaCBNZWRpYSAtLT5cclxuXHJcblxyXG48
IS0tIGRlbGluZWF0b3IgLS0+XHJcbjxUQUJMRSB3aWR0aD0iNTAwIiBib3JkZXI9IjAiIGNlbGxz
cGFjaW5nPSIwIiBjZWxscGFkZGluZz0iMCI+XHJcbiAgPHRyPiBcclxuICAgIDx0ZCBhbGlnbj0i
cmlnaHQiIGhlaWdodD0iMTAiIGJnY29sb3I9IiNGRjAwMDAiIHZhbGlnbj0idG9wIiB3aWR0aD0i
NTAwIj48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VU
Uy1zcGFjZXIuZ2lmIiB3aWR0aD0iMSIgaGVpZ2h0PSIxMCI+PC90ZD5cclxuICA8L3RyPlxyXG48
L1RBQkxFPlxyXG5cclxuXHJcbjwhLS0gKipCRUdJTiAgQm9keSBDb250ZW50ICAqKk1ha2UgY2hh
bmdlcyB0byBjb3B5IEhFUkUqKiAtLT5cclxuPFRBQkxFIHdpZHRoPSI1MDAiIGJvcmRlcj0iMCIg
Y2VsbHBhZGRpbmc9IjAiIGNlbGxzcGFjaW5nPSIwIj5cclxuICA8dHI+IFxyXG4gICAgPHRkIHJv
d3NwYW49IjMiIGJnY29sb3I9InJlZCI+PGltZyBzcmM9Imh0dHA6Ly93d3cub3JhY2xlLmNvbS9z
dGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZiIgd2lkdGg9IjEiIGhlaWdodD0iMSI+PC90
ZD5cclxuICAgIDx0ZCByb3dzcGFuPSIzIiB2YWxpZ249InRvcCI+PGltZyBzcmM9Imh0dHA6Ly93
d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZiIgd2lkdGg9IjMw
IiBoZWlnaHQ9IjEiPjwvdGQ+XHJcbiAgICA8dGQgYWxpZ249ImxlZnQiIHZhbGlnbj0idG9wIj48
aW1nIG5hbWU9InRpdGxlSGVhZCIgc3JjPSJodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vf
c2VtaW5hcnMvcmVtaW5kX3RpdGxlaGVhZC5naWYiIHdpZHRoPSIzMzciIGhlaWdodD0iNTAiIGJv
cmRlcj0iMCI+PC90ZD5cclxuXHQ8dGQgcm93c3Bhbj0iMiIgdmFsaWduPSJ0b3AiPjxpbWcgc3Jj
PSJodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWYi
IHdpZHRoPSIxMCIgaGVpZ2h0PSIxMDAlIj48L3RkPlxyXG5cdDx0ZCByb3dzcGFuPSIyIiBhbGln
bj0icmlnaHQiIHZhbGlnbj0idG9wIj48YnI+XHJcbiAgICAgPGEgaHJlZj0iaHR0cDovL3d3dy5v
cmFjbGUuY29tL2dvLz8mU3JjPTEyOTYxOTcmQWN0PTgiPjxpbWcgc3JjPSJodHRwOi8vd3d3Lm9y
YWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvcmVtaW5kLWNhbGxidXR0b24uZ2lmIiB3aWR0aD0i
ODEiIGhlaWdodD0iMTAyIiBib3JkZXI9IjAiPjwvYT48L3RkPlxyXG4gICAgPHRkIHJvd3NwYW49
IjMiIHZhbGlnbj0idG9wIj48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291
X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmIiB3aWR0aD0iMjUiIGhlaWdodD0iMSI+PC90ZD5cclxu
ICAgIDx0ZCByb3dzcGFuPSIzIiBiZ2NvbG9yPSJyZWQiPjxpbWcgc3JjPSJodHRwOi8vd3d3Lm9y
YWNsZS5jb20vc3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWYiIHdpZHRoPSIxIiBoZWln
aHQ9IjEiPjwvdGQ+XHJcbiAgPC90cj5cclxuICA8dHI+IFxyXG4gICAgPHRkIGhlaWdodD0iMTAw
JSIgdmFsaWduPSJ0b3AiIHdpZHRoPSIzMzciPiBcclxuICAgICAgPHA+PGI+RG9uJiMxNDY7dCBN
aXNzIE5leHQgV2VlayYjMTQ2O3MgaVNlbWluYXJzIGZyb20gT3JhY2xlIFVuaXZlcnNpdHkuPC9i
PiBcclxuICAgICAgPC9wPlxyXG4gICAgICA8cD4gRG9uJiMxNDY7dCBmb3JnZXQgYWJvdXQgdGhl
IEZSRUUgaVNlbWluYXJzIGFuZCBsaXZlIGNoYXQgY29taW5nIHVwIG5leHQgXHJcbiAgICAgICAg
d2VlayBmcm9tIE9yYWNsZSBVbml2ZXJzaXR5ISBXaXRoIHRoZSBsYXRlc3QgaW5mb3JtYXRpb24g
b24gT3JhY2xlIGNlcnRpZmljYXRpb24gXHJcbiAgICAgICAgYW5kIHRlY2hub2xvZ3ksIHRoZXNl
IGZpdmUgZXZlbnRzIHByb3ZpZGUgdW5pcXVlIGtub3dsZWRnZSBhbmQgZ3VpZGVkIFxyXG4gICAg
ICAgIHRyYWluaW5nIHVuYXZhaWxhYmxlIGFueXdoZXJlIGVsc2UuIEVhY2ggc2VtaW5hciBpbmNs
dWRlcyBhIDE1LW1pbnV0ZSBcclxuICAgICAgICBtaW5pLWxlc3NvbiBhbmQgUSZhbXA7QSBzZXNz
aW9uIHdpdGggYW4gT3JhY2xlIFVuaXZlcnNpdHkgaW5zdHJ1Y3Rvci48L3A+XHJcblxyXG4gICAg
ICA8cD4gSWYgeW91IGhhdmVuJiMxNDY7dCBhbHJlYWR5LCA8YSBocmVmPSJodHRwOi8vd3d3Lm9y
YWNsZS5jb20vZ28vPyZTcmM9MTI5NjE5NyZBY3Q9OCI+Y2xpY2sgaGVyZTwvYT4gdG8gcmVnaXN0
ZXIgXHJcbiAgICAgICAgZm9yIHRoZSBmdWxsIHdlZWsgb2YgZXZlbnRzLjwvcD5cclxuICAgICAg
PHA+IDxpPk1vbmRheSwgQXByaWwgMTUsIDIwMDIgJiMxNTA7IDEwOjAwIGEubS4gUERUPC9pPjxi
cj5cclxuICAgICAgICA8Yj5PcmFjbGU5PGk+aTwvaT4gJiMxNTA7IFRyYWluaW5nIGZvciBDZXJ0
aWZpY2F0aW9uLjwvYj4gS2ljayBvZmYgeW91ciBcclxuICAgICAgICB3ZWVrIHdpdGggYSAzMC1t
aW51dGUgc2Vzc2lvbiBvbiB0aGUgY29tcG9uZW50cywgdmFsdWUsIGFuZCBzdGVwcyBpbiB0aGUg
XHJcbiAgICAgICAgT3JhY2xlIENlcnRpZmljYXRpb24gcHJvY2Vzcy48L3A+XHJcbiAgICAgIDxw
PiA8aT5UdWVzZGF5LCBBcHJpbCAxNiwgMjAwMiAmIzE1MDsgODowMCBhLm0uIFBEVDwvaT48YnI+
XHJcbiAgICAgICAgPGI+T3JhY2xlOTxpPmk8L2k+IE5ldyBGZWF0dXJlcy48L2I+IFVwZGF0ZSB5
b3VyIGtub3dsZWRnZSBhbmQgaG9uZSB5b3VyIFxyXG4gICAgICAgIHNraWxscyBvbiB0aGUgbGF0
ZXN0IGZlYXR1cmVzIGFuZCBvcHRpb25zIGZvdW5kIGluIE9yYWNsZTk8aT5pPC9pPiB3aXRoIFxy
XG4gICAgICAgIGFkdmljZSBmcm9tIG91ciB0cmFpbmluZyBleHBlcnRzLiA8YnI+XHJcbiAgICAg
IDwvcD5cclxuICAgICAgPHA+PGk+V2VkbmVzZGF5LCBBcHJpbCAxNywgMjAwMiAmIzE1MDsgMTA6
MDAgYS5tLiBQRFQ8L2k+PGJyPlxyXG4gICAgICAgIDxiPk9yYWNsZTk8aT5pPC9pPiBTZWN1cml0
eSBUcmFpbmluZyBmb3IgQ2VydGlmaWNhdGlvbi48L2I+ICBMZWFybiBob3cgeW91IGNhbiBtZWV0
IHlvdXIgYnVzaW5lc3MgbmVlZHMgaW4gXHJcbiAgICB0aGUgcmFwaWRseSBjaGFuZ2luZyB3b3Js
ZCBvZiBoaWdoLXRlY2ggc2VjdXJpdHkuIDxicj5cclxuICAgICAgPC9wPlxyXG4gICAgICA8cD48
aT5UaHVyc2RheSwgQXByaWwgMTgsIDIwMDIgJiMxNTA7IDEyOjAwIHAubS4gUERUPC9pPjxicj5c
clxuICAgICAgICA8Yj5QZXJmb3JtYW5jZSBUdW5pbmcgJiMxNTA7IFRpcHMgZnJvbSB0aGUgRXhw
ZXJ0cy48L2I+IExlYXJuIGJlc3QgcHJhY3RpY2VzIFxyXG4gICAgICAgIGFuZCB0ZWNobmljYWwg
dGlwcyBvbiBob3cgdG8gZ2V0IHRoZSBtb3N0IGZyb20gT3JhY2xlJiMxNDY7cyBEYXRhYmFzZSBc
clxuICAgICAgICBzb2x1dGlvbi48YnI+XHJcbiAgICAgIDwvcD5cclxuICAgICAgPHA+PGk+RnJp
ZGF5LCBBcHJpbCAxOSwgMjAwMiAmIzE1MDsgMTI6MDAgcC5tLiBQRFQ8L2k+PGJyPlxyXG4gICAg
ICAgIDxiPkNlcnRpZmljYXRpb246IEFuIE9wZW4gRm9ydW0gd2l0aCBPcmFjbGUgQ2VydGlmaWNh
dGlvbiBQcm9ncmFtIEd1cnVzLCBcclxuICAgICAgICBNaWtlIFNlcnBlIGFuZCBKaW0gRGlJYW5u
aS48L2I+IEludGVyYWN0IGRpcmVjdGx5IHdpdGggT3JhY2xlIENlcnRpZmljYXRpb24gXHJcbiAg
ICAgICAgUHJvZ3JhbSBleHBlcnRzIHRvIGFuc3dlciBhbnkgcmVtYWluaW5nIHF1ZXN0aW9ucyB5
b3UgbWF5IGhhdmUgYWJvdXQgdGhlIFxyXG4gICAgICAgIHByb2dyYW0sIGN1cnJpY3VsdW0sIG9y
IHRyYWluaW5nIG9wdGlvbnMuIDxicj48L3A+XHJcbiAgICAgIDxwPjxhIGhyZWY9Imh0dHA6Ly93
d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04Ij5DbGljayBoZXJlPC9hPiB0byBy
ZWdpc3Rlci48L3A+XHJcbiAgICAgIDwvdGQ+XHJcbiAgPC90cj5cclxuICA8dHI+XHJcblx0PHRk
IGNvbHNwYW49IjMiIGhpZWdodD0iMTAwJSIgdmFsaWduPSJ0b3AiIGFsaWduPSJsZWZ0XHJcblx0
Ij48aW1nIHNyYz0iaHR0cDovL3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1z
cGFjZXIuZ2lmIiB3aWR0aD0iNDQzIiBoZWlnaHQ9IjIwIj48L3RkPlxyXG4gIDwvdHI+XHJcbjwv
VEFCTEU+XHJcbjwhLS0gRU5EIGJvZHkgY29udGVudCAtLT5cclxuXHJcblxyXG48IS0tICoqQkVH
SU4gIEJvdHRvbSBMaW5rICAtIFx4ZWNjYWxsLXRvLWFjdGlvblx4ZWUgSEVSRSoqIC0tPlxyXG48
VEFCTEUgd2lkdGg9IjUwMCIgYm9yZGVyPSIwIiBjZWxscGFkZGluZz0iMCIgY2VsbHNwYWNpbmc9
IjAiPlxyXG4gIDx0cj4gXHJcbiAgICA8dGQgaGVpZ2h0PSIzMCIgYmdjb2xvcj0iI0ZGMDAwMCIg
dmFsaWduPSJtaWRkbGUiIHdpZHRoPSI1MDAiIGFsaWduPSJjZW50ZXIiPjxhIGhyZWY9Imh0dHA6
Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04IiBjbGFzcz0iYm90bGluayI+
Q2xpY2sgXHJcbiAgICAgIGhlcmUgdG8gdmlldyB5b3VyIEZSRUUgaVNlbWluYXJzLjwvYT48L3Rk
PlxyXG4gIDwvdHI+XHJcbjwvVEFCTEU+XHJcbjwhLS0gRU5EIEJvdHRvbSBMaW5rIC0tPlxyXG5c
clxuPC9ib2R5PlxyXG48L2h0bWw+XHJcbjxwPjxmb250IGZhY2U9IkFyaWFsLCBoZWx2ZXRpY2Ei
IHNpemU9IjEiPlxyXG48YnI+VG8gYmUgcmVtb3ZlZCBmcm9tIE9yYWNsZVwncyBtYWlsaW5nIGxp
c3RzLCBzZW5kIGFuIGVtYWlsIHRvOiBcclxuPGJyPjxhIGhyZWY9Im1haWx0bzp1bnN1YnNjcmli
ZUBvcmFjbGVlYmxhc3QuY29tP3N1YmplY3Q9UkVNT1ZFIE9GIE9SQUNMRSBNQUlMSU5HIExJU1Qg
MTI5ODM5MCZib2R5PVJFTU9WRSBQSUVSU0hARlJJU0tJVC5DT00gIj51bnN1YnNjcmliZUBvcmFj
bGVlYmxhc3QuY29tPC9hPiBcclxuPGJyPndpdGggdGhlIGZvbGxvd2luZyBpbiB0aGUgbWVzc2Fn
ZSBib2R5OiBcclxuPGJyPlJFTU9WRSBQSUVSU0hARlJJU0tJVC5DT01cclxuPGJyPlNUT1AgXHJc
bjxwPlxyXG5bMTI3NTM3My81LzEwNzU0NzAxMl0gXHJcbjwvZm9udD5cclxuPGltZyBzcmM9Imh0
dHA6Ly93d3cub3JhY2xlLmNvbS9lbG9nL3RyYWNrdXJsP2RpPTEyOTgzOTAmc2kxPTEwNzU0NzAx
MiIgYm9yZGVyPTA+IFxyXG5cclxuXHJcblxyXG5cclxuXHJcblxyXG5cclxuXHJcblxyXG5cbiAg
PGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtb3JhY2xlTG9nby5n
aWY+IFx0XHJcbiAgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy8wMjA1
OTlvdV9lbTEuZ2lmPiBcdFxyXG4gIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20vc3RhcnQvb3Vfc2Vt
aW5hcnMvRVRTLXNwYWNlci5naWY+IFx0XHJcbiAgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFy
dC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHQgIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20v
c3RhcnQvb3Vfc2VtaW5hcnMvRVRTLXNwYWNlci5naWY+IFx0ICA8aHR0cDovL3d3dy5vcmFjbGUu
Y29tL3N0YXJ0L291X3NlbWluYXJzL3JlbWluZF90aXRsZWhlYWQuZ2lmPiBcdCAgPGh0dHA6Ly93
d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHRcclxuIDxo
dHRwOi8vd3d3Lm9yYWNsZS5jb20vZ28vPyZTcmM9MTI5NjE5NyZBY3Q9OD4gXHQgICA8aHR0cDov
L3d3dy5vcmFjbGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmPiBcdCAgPGh0
dHA6Ly93d3cub3JhY2xlLmNvbS9zdGFydC9vdV9zZW1pbmFycy9FVFMtc3BhY2VyLmdpZj4gXHRc
clxuXHJcbkRvblx4OTJ0IE1pc3MgTmV4dCBXZWVrXHg5MnMgaVNlbWluYXJzIGZyb20gT3JhY2xl
IFVuaXZlcnNpdHkuIFxyXG5cclxuRG9uXHg5MnQgZm9yZ2V0IGFib3V0IHRoZSBGUkVFIGlTZW1p
bmFycyBhbmQgbGl2ZSBjaGF0IGNvbWluZyB1cCBuZXh0IHdlZWsgZnJvbSBPcmFjbGUgVW5pdmVy
c2l0eSEgV2l0aCB0aGUgbGF0ZXN0IGluZm9ybWF0aW9uIG9uIE9yYWNsZSBjZXJ0aWZpY2F0aW9u
IGFuZCB0ZWNobm9sb2d5LCB0aGVzZSBmaXZlIGV2ZW50cyBwcm92aWRlIHVuaXF1ZSBrbm93bGVk
Z2UgYW5kIGd1aWRlZCB0cmFpbmluZyB1bmF2YWlsYWJsZSBhbnl3aGVyZSBlbHNlLiBFYWNoIHNl
bWluYXIgaW5jbHVkZXMgYSAxNS1taW51dGUgbWluaS1sZXNzb24gYW5kIFEmQSBzZXNzaW9uIHdp
dGggYW4gT3JhY2xlIFVuaXZlcnNpdHkgaW5zdHJ1Y3Rvci5cclxuXHJcbklmIHlvdSBoYXZlblx4
OTJ0IGFscmVhZHksIGNsaWNrIGhlcmUgPGh0dHA6Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0x
Mjk2MTk3JkFjdD04PiAgdG8gcmVnaXN0ZXIgZm9yIHRoZSBmdWxsIHdlZWsgb2YgZXZlbnRzLlxy
XG5cclxuTW9uZGF5LCBBcHJpbCAxNSwgMjAwMiBceDk2IDEwOjAwIGEubS4gUERUXHJcbk9yYWNs
ZTlpIFx4OTYgVHJhaW5pbmcgZm9yIENlcnRpZmljYXRpb24uIEtpY2sgb2ZmIHlvdXIgd2VlayB3
aXRoIGEgMzAtbWludXRlIHNlc3Npb24gb24gdGhlIGNvbXBvbmVudHMsIHZhbHVlLCBhbmQgc3Rl
cHMgaW4gdGhlIE9yYWNsZSBDZXJ0aWZpY2F0aW9uIHByb2Nlc3MuXHJcblxyXG5UdWVzZGF5LCBB
cHJpbCAxNiwgMjAwMiBceDk2IDg6MDAgYS5tLiBQRFRcclxuT3JhY2xlOWkgTmV3IEZlYXR1cmVz
LiBVcGRhdGUgeW91ciBrbm93bGVkZ2UgYW5kIGhvbmUgeW91ciBza2lsbHMgb24gdGhlIGxhdGVz
dCBmZWF0dXJlcyBhbmQgb3B0aW9ucyBmb3VuZCBpbiBPcmFjbGU5aSB3aXRoIGFkdmljZSBmcm9t
IG91ciB0cmFpbmluZyBleHBlcnRzLiBcclxuXHJcblxyXG5XZWRuZXNkYXksIEFwcmlsIDE3LCAy
MDAyIFx4OTYgMTA6MDAgYS5tLiBQRFRcclxuT3JhY2xlOWkgU2VjdXJpdHkgVHJhaW5pbmcgZm9y
IENlcnRpZmljYXRpb24uIExlYXJuIGhvdyB5b3UgY2FuIG1lZXQgeW91ciBidXNpbmVzcyBuZWVk
cyBpbiB0aGUgcmFwaWRseSBjaGFuZ2luZyB3b3JsZCBvZiBoaWdoLXRlY2ggc2VjdXJpdHkuIFxy
XG5cclxuXHJcblRodXJzZGF5LCBBcHJpbCAxOCwgMjAwMiBceDk2IDEyOjAwIHAubS4gUERUXHJc
blBlcmZvcm1hbmNlIFR1bmluZyBceDk2IFRpcHMgZnJvbSB0aGUgRXhwZXJ0cy4gTGVhcm4gYmVz
dCBwcmFjdGljZXMgYW5kIHRlY2huaWNhbCB0aXBzIG9uIGhvdyB0byBnZXQgdGhlIG1vc3QgZnJv
bSBPcmFjbGVceDkycyBEYXRhYmFzZSBzb2x1dGlvbi5cclxuXHJcblxyXG5GcmlkYXksIEFwcmls
IDE5LCAyMDAyIFx4OTYgMTI6MDAgcC5tLiBQRFRcclxuQ2VydGlmaWNhdGlvbjogQW4gT3BlbiBG
b3J1bSB3aXRoIE9yYWNsZSBDZXJ0aWZpY2F0aW9uIFByb2dyYW0gR3VydXMsIE1pa2UgU2VycGUg
YW5kIEppbSBEaUlhbm5pLiBJbnRlcmFjdCBkaXJlY3RseSB3aXRoIE9yYWNsZSBDZXJ0aWZpY2F0
aW9uIFByb2dyYW0gZXhwZXJ0cyB0byBhbnN3ZXIgYW55IHJlbWFpbmluZyBxdWVzdGlvbnMgeW91
IG1heSBoYXZlIGFib3V0IHRoZSBwcm9ncmFtLCBjdXJyaWN1bHVtLCBvciB0cmFpbmluZyBvcHRp
b25zLiBcclxuXHJcblxyXG5DbGljayBoZXJlIDxodHRwOi8vd3d3Lm9yYWNsZS5jb20vZ28vPyZT
cmM9MTI5NjE5NyZBY3Q9OD4gIHRvIHJlZ2lzdGVyLlxyXG5cclxuICA8aHR0cDovL3d3dy5vcmFj
bGUuY29tL3N0YXJ0L291X3NlbWluYXJzL0VUUy1zcGFjZXIuZ2lmPiBcdFxyXG5DbGljayAgPGh0
dHA6Ly93d3cub3JhY2xlLmNvbS9nby8/JlNyYz0xMjk2MTk3JkFjdD04PiBoZXJlIHRvIHZpZXcg
eW91ciBGUkVFIGlTZW1pbmFycy5cdCBcclxuXHJcblxyXG5UbyBiZSByZW1vdmVkIGZyb20gT3Jh
Y2xlXCdzIG1haWxpbmcgbGlzdHMsIHNlbmQgYW4gZW1haWwgdG86IFxyXG51bnN1YnNjcmliZUBv
cmFjbGVlYmxhc3QuY29tIDxtYWlsdG86dW5zdWJzY3JpYmVAb3JhY2xlZWJsYXN0LmNvbT9zdWJq
ZWN0PVJFTU9WRSBPRiBPUkFDTEUgTUFJTElORyBMSVNUIDEyOTgzOTAmYm9keT1SRU1PVkUgUElF
UlNIQEZSSVNLSVQuQ09NPiAgXHJcbndpdGggdGhlIGZvbGxvd2luZyBpbiB0aGUgbWVzc2FnZSBi
b2R5OiBcclxuUkVNT1ZFIFBJRVJTSEBGUklTS0lULkNPTSBcclxuU1RPUCBcclxuXHJcblxyXG5b
MTI3NTM3My81LzEwNzU0NzAxMl0gICA8aHR0cDovL3d3dy5vcmFjbGUuY29tL2Vsb2cvdHJhY2t1
cmw/ZGk9MTI5ODM5MCZzaTE9MTA3NTQ3MDEyPiBcclxuXHJcbicKRXJyb3IgdHJhaW5pbmcgbWVz
c2FnZSAnPE1BUElNc2dTdG9yZU1zZywgKHJlYWQpIGlkPTAwMDAwMDAwMzhBMUJCMTAwNUU1MTAx
QUExQkIwODAwMkIyQTU2QzIwMDAwNDU0RDUzNEQ0NDQyMkU0NDRDNEMwMDAwMDAwMDAwMDAwMDAw
MUI1NUZBMjBBQTY2MTFDRDlCQzgwMEFBMDAyRkM0NUEwQzAwMDAwMDVBNDU1NTUzMDAyRjZGM0Q0
NjcyNjk3MzZCNjk3NDIwNDk2RTYzMkUyRjZGNzUzRDQ2Njk3MjczNzQyMDQxNjQ2RDY5NkU2OTcz
NzQ3MjYxNzQ2OTc2NjUyMDQ3NzI2Rjc1NzAyRjYzNkUzRDUyNjU2MzY5NzA2OTY1NkU3NDczMkY2
MzZFM0Q3MDY5NjU3MjczNjgwMC9FRjAwMDAwMDE5ODI2MkMwQUE2NjExQ0Q5QkM4MDBBQTAwMkZD
NDVBMDYwMDAxMDAwMTAwMDAwMDAwMjk1QTI1MDEwMDAwMDAwMDJCRjBCND4nClRyYWNlYmFjayAo
bW9zdCByZWNlbnQgY2FsbCBsYXN0KToKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXll
c1xPdXRsb29rMjAwMFx0cmFpbi5weSIsIGxpbmUgNjcsIGluIHRyYWluX2ZvbGRlcgogICAgaWYg
dHJhaW5fbWVzc2FnZShtZXNzYWdlLCBpc3NwYW0sIG1ncik6CiAgRmlsZSAiQzpcUHl0aG9uMjJc
c3BhbVxzcGFtYmF5ZXNcT3V0bG9vazIwMDBcdHJhaW4ucHkiLCBsaW5lIDM2LCBpbiB0cmFpbl9t
ZXNzYWdlCiAgICBzdHJlYW0gPSBtc2cuR2V0RW1haWxQYWNrYWdlT2JqZWN0KCkKICBGaWxlICJD
OlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xPdXRsb29rMjAwMFxtc2dzdG9yZS5weSIsIGxpbmUg
NDMxLCBpbiBHZXRFbWFpbFBhY2thZ2VPYmplY3QKICAgIG1zZyA9IGVtYWlsLm1lc3NhZ2VfZnJv
bV9zdHJpbmcodGV4dCkKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxf
X2luaXRfXy5weSIsIGxpbmUgMzksIGluIG1lc3NhZ2VfZnJvbV9zdHJpbmcKICAgIHJldHVybiBQ
YXJzZXIoX2NsYXNzLCBzdHJpY3Q9c3RyaWN0KS5wYXJzZXN0cihzKQogIEZpbGUgIkM6XFB5dGhv
bjIyXHNwYW1cc3BhbWJheWVzXGVtYWlsXFBhcnNlci5weSIsIGxpbmUgNTIsIGluIHBhcnNlc3Ry
CiAgICByZXR1cm4gc2VsZi5wYXJzZShTdHJpbmdJTyh0ZXh0KSwgaGVhZGVyc29ubHk9aGVhZGVy
c29ubHkpCiAgRmlsZSAiQzpcUHl0aG9uMjJcc3BhbVxzcGFtYmF5ZXNcZW1haWxcUGFyc2VyLnB5
IiwgbGluZSA0OCwgaW4gcGFyc2UKICAgIHNlbGYuX3BhcnNlYm9keShyb290LCBmcCkKICBGaWxl
ICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxQYXJzZXIucHkiLCBsaW5lIDIwNiwg
aW4gX3BhcnNlYm9keQogICAgbXNnb2JqID0gc2VsZi5wYXJzZXN0cihwYXJ0KQogIEZpbGUgIkM6
XFB5dGhvbjIyXHNwYW1cc3BhbWJheWVzXGVtYWlsXFBhcnNlci5weSIsIGxpbmUgNTIsIGluIHBh
cnNlc3RyCiAgICByZXR1cm4gc2VsZi5wYXJzZShTdHJpbmdJTyh0ZXh0KSwgaGVhZGVyc29ubHk9
aGVhZGVyc29ubHkpCiAgRmlsZSAiQzpcUHl0aG9uMjJcc3BhbVxzcGFtYmF5ZXNcZW1haWxcUGFy
c2VyLnB5IiwgbGluZSA0NiwgaW4gcGFyc2UKICAgIHNlbGYuX3BhcnNlaGVhZGVycyhyb290LCBm
cCkKICBGaWxlICJDOlxQeXRob24yMlxzcGFtXHNwYW1iYXllc1xlbWFpbFxQYXJzZXIucHkiLCBs
aW5lIDEwNSwgaW4gX3BhcnNlaGVhZGVycwogICAgcmFpc2UgRXJyb3JzLkhlYWRlclBhcnNlRXJy
b3IoCkhlYWRlclBhcnNlRXJyb3I6IE5vdCBhIGhlYWRlciwgbm90IGEgY29udGludWF0aW9uOiBg
YG9mX21lc3NhZ2UnJwo=

---------------------- multipart/mixed attachment--

From anthony@interlink.com.au  Tue Nov 12 02:33:33 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 13:33:33 +1100
Subject: [Spambayes] Introducing myself 
In-Reply-To: <a05200f3fb9f60c3ccb9e@[192.168.1.103]> 
Message-ID: <200211120233.gAC2XXJ10069@localhost.localdomain>


> >Last week Jeremy and Guido here both reported a *very* effective technique:
> >spam was sent to them as replies to mailing-list postings (not this mailing
> >list <wink>) they had made, including a full quote of the msg they had
> >posted.  That was guaranteed to have lots of ham words for them, and the
> >Subject line was the expected "Re:" followed by their own subject line.
> 
> Ouch, that's evil.  Maybe the solution for that is to look at the 
> message and the quotation individually?  But that can be metagamed 
> too.

This is still making the spammers work a lot harder, though, so it's 
not really a bad thing.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthony@interlink.com.au  Tue Nov 12 02:41:33 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 13:41:33 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <9891913C5BFE87429D71E37F08210CB9183A08@zeus.sfhq.friskit.com> 
Message-ID: <200211120241.gAC2fYW10142@localhost.localdomain>


>>> "Piers Haken" wrote
> I see the same thing on a few messages in my corpus. I believe it's
> something weird to do with the way outlook splits out the MIME headers.
> 
> Attached is a dump of the exception.

Here's a chunk of the message that's causing the problem:
[snip]
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="next_part_of_message"
Return-Path: replies@oracleeblast.com
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message
of_message
ge

--next_part_of_message
Content-Type: text/html

[snip]

This is utter bollocks :) The question is whether it's Oracle
that's bollocksed it up, or Outbreak.

Not a lot that could/should be done here - I guess in _theory_
email could do something where it tries to parse each multipart
bit individually, and return the bits that work, but this seems
like it's way too much work. 

I'm curious why the plugin doesn't fall back to raw message text in
this case? And does outbreak display this message correctly?
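
A minimal sketch of the fallback being asked about, not the plugin's actual code. The function name `text_for_scoring` and the injectable `parse` argument are hypothetical; the exception classes come from the standard `email.errors` module. Recent versions of the email package tend to record defects rather than raise on input like this, so whether the `except` branch is ever taken depends on the version and policy in use.

```python
import email
import email.errors


def text_for_scoring(raw, parse=email.message_from_string):
    """Return (mode, text): parsed body text if possible, else the raw text."""
    try:
        msg = parse(raw)
    except (email.errors.MessageError, email.errors.MessageDefect):
        # The parser gave up; score the unparsed text instead of nothing.
        return "raw", raw
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue
        payload = part.get_payload()
        if isinstance(payload, str):
            parts.append(payload)
    return "parsed", "\n".join(parts)
```

Scoring raw text loses header-specific tokens, but it means a folder scan does not stop at the first message the parser rejects.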

Anthony


From popiel@wolfskeep.com  Tue Nov 12 02:47:57 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Mon, 11 Nov 2002 18:47:57 -0800
Subject: [Spambayes] Introducing myself 
In-Reply-To: Message from Robert Woodhead <trebor@animeigo.com> 
	<a05200f3fb9f60c3ccb9e@[192.168.1.103]> 
References: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>
	<a05200f3fb9f60c3ccb9e@[192.168.1.103]> 
Message-ID: <20021112024757.B199AF58B@cashew.wolfskeep.com>

In message:  <a05200f3fb9f60c3ccb9e@[192.168.1.103]>
             Robert Woodhead <trebor@animeigo.com> writes:
>
>My hunch, based on things I've done in the past, is that as the total 
>volume of mail increases, the rate of increase in the number of 
>unique tokens will approach a limit (that being, the number of 
>distinct individual words in the language, though foreign unicode 
>gibberish will have an effect).  When I was doing single word 
>analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 
>300,000 distinct tokens (including the aforementioned gibberish).

Rob Hooft recently (yesterday, that is) did a nice analysis and
graph of database growth based on message count.  He found it
scaled almost linearly with the sqrt of the number of messages...
but he only went up to a total of about 22000 messages, which is
likely only about a fifth of a gig.
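
For anyone wanting to repeat that kind of measurement on their own corpus, here is a minimal sketch. It is not Rob Hooft's script; the function name `vocabulary_growth` and the whitespace tokenizer are assumptions for illustration.

```python
def vocabulary_growth(messages, tokenize=str.split):
    """Return the number of distinct tokens seen after each message."""
    seen = set()
    sizes = []
    for text in messages:
        seen.update(tokenize(text))
        sizes.append(len(seen))
    return sizes
```

Plotting the returned sizes against the square root of the message index would show whether the roughly linear relationship holds for a given mailbox.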

>It will be interesting to see the results of some data reduction on 
>the accuracy of the recogniser.  My WAG is that even some serious 
>hashing (down to, say, 20 bit tokens) won't have much effect on 
>accuracy because most of the collisions will be between low 
>frequency, insignificant tokens.

Tim Peters did some hashing experiments back on 3 Nov; he posted these
results:

OK, doing a 10-fold cross-validation run across 2000 random ham and 2000
random spam, but the same random sets for "before" and "after":

filename:      before        crm
ham:spam:   2000:2000  2000:2000
fp total:           1       1604
fp %:            0.05      80.20
fn total:           0          0
fn %:            0.00       0.00
unsure t:          20          0
unsure %:        0.50       0.00
real cost:     $14.00  $16040.00
best cost:      $2.00    $228.00
h mean:          0.55      53.54
h sdev:          4.50       5.30
s mean:         99.91      71.40
s sdev:          1.64       6.84
mean diff:      99.36      17.86
k:              16.18       1.47

Granted, he was doing more complex word combinations with this, too,
and a different combining technique, but it really doesn't look
promising.

- Alex

From piersh@friskit.com  Tue Nov 12 03:21:25 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 11 Nov 2002 19:21:25 -0800
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
Message-ID: <9891913C5BFE87429D71E37F08210CB929750C@zeus.sfhq.friskit.com>

Yup, Outlook displays it properly. I have a feeling that it's Oracle's
mess, but that Outlook just ignores the invalid MIME-part headers --
maybe spambayes can do the same. Maybe if someone else has received this
message from Oracle they can shed some more light on this.

The problem is multiplied by the fact that outlook includes the
MIME-part headers and boundaries with the regular headers, but separates
the body parts and attachments. I don't think there's any way to get the
original, unseparated message from the API.

The Outlook UI shows the headers as:

<oracle-headers>
Microsoft Mail Internet Headers Version 2.0
Received: from inet-mail7.oracle.com ([209.246.10.171]) by
zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453);
	 Sat, 13 Apr 2002 03:19:01 -0700
Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com
[148.87.9.11])
	by inet-mail7.oracle.com (Switch-2.2.1/Switch-2.2.0) with ESMTP
id g3DA8GV30065
	for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700
Date: Sat, 13 Apr 2002 03:08:16 -0700
Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com>
Subject: Oracle University iSeminars
From: Oracle Corporation<replies@oracleeblast.com>
To: PIERSH@FRISKIT.COM
Reply-To: replies@oracleeblast.com
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="next_part_of_message"
Return-Path: replies@oracleeblast.com
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message
of_message
ge

--next_part_of_message
Content-Type: text/html

</oracle-headers>

Piers.

> -----Original Message-----
> From: Anthony Baxter [mailto:anthony@interlink.com.au]
> Sent: Monday, November 11, 2002 6:42 PM
> To: Piers Haken
> Cc: Tim Peters; David Leftley; spambayes@python.org
> Subject: Re: [Spambayes] Re: Outlook plugin plus Exchange
>
>
> >>> "Piers Haken" wrote
> > I see the same thing on a few messages in my corpus. I believe it's
> > something weird to do with the way outlook splits out the MIME
> > headers.
> >
> > Attached is a dump of the exception.
>
> Here's a chunk of the message that's causing the problem:
> [snip]
> Content-Transfer-Encoding: 8bit
> MIME-Version: 1.0
> Content-Type: multipart/alternative;
>     boundary="next_part_of_message"
> Return-Path: replies@oracleeblast.com
> X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC) FILETIME=[A1D8F920:01C1E2D4]
>
> --next_part_of_message
> of_message
> ge
>
> --next_part_of_message
> Content-Type: text/html
>
> [snip]
>
> This is utter bollocks :) The question is whether it's Oracle
> that's bollocksed it up, or Outbreak.
>
> Not a lot that could/should be done here - I guess in
> _theory_ email could do something where it tries to parse
> each multipart bit individually, and return the bits that
> work, but this seems like it's way too much work.
>
> I'm curious why the plugin doesn't fall back to raw message
> text in this case? And does outbreak display this message correctly?
>
> Anthony
>

From mhammond@skippinet.com.au  Tue Nov 12 04:44:11 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 15:44:11 +1100
Subject: [Spambayes] Some more experiences with the Outlook plugin
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619933@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOEKLHKAA.mhammond@skippinet.com.au>

> I've now had the Outlook plugin running for about a week, and I'm
> starting to get a feel for using it. The following is my "user
> interface" experience. It's a slightly unrealistic combination of
> "what I actually did" and "what I realised afterwards I should have
> done", but it is what I would use as notes telling a new user how
> to set the system up, and as such it picks up on a few interesting
> issues:
>
> 1. To start with, configure the plugin to define one "Spam" folder and
>    one "Unsure" folder, and define all other folders as "Ham". [1]

Tim gives a great explanation of why this is not really possible - some
people simply have too much ham, while even for others, the relative ratios
are important.

> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!). Worst case recently was a
>   message which scored as solid ham. I trained on it as "Spam", and
>   rescored it. It still scored 5 - solid ham. My immediate reaction was
>   "But I just *told* you it's spam!". I know that isn't how the classifier
>   works, but even so it was unsettling. FWIW, I attach the spam clues for
>   this one (I don't know if they make any sense in isolation, but it can't
>   hurt...)

This too was my experience.  For a while, I did training over a huge ham
corpus, while my spam corpus is still less than 1000 messages.  I had around
15:1 ham:spam.  I too trained new ham and spam, and was disappointed to see
the score remain almost identical.

Re-training on just my inbox yields far far better results - roughly 3:1
ham:spam.

Tim's idea of:
> In the list you gave below, there are very few hapaxes (I recognize
> them from the probabilities; I should probably add code to the client
> to display the raw counts too):

certainly would be useful.

Without the maths background, I find it interesting to ignorantly speculate
on these ratios.  Tim's analysis:
> '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1,
> and that makes the accidental appearance of a ham word in spam much
> more powerful than the systematic appearance of a spam word in spam.

makes me wonder why the classifier can't exploit the ham:spam ratio to give
weighted results.  Or from another POV, what would happen if we artificially
boosted the ratio by training on each spam multiple times?
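
To make that speculation concrete, here is a minimal sketch of a per-word estimate of the general kind discussed on this list: each count is normalised by the size of its own corpus, then pulled toward a neutral prior according to how much evidence there is. The function name and the constants `s=0.45` and `x=0.5` are assumptions chosen for illustration, not values confirmed from the classifier source.

```python
def word_spamprob(ham_count, spam_count, nham, nspam, s=0.45, x=0.5):
    """Estimate P(spam | word) from counts and corpus sizes (a sketch).

    s is the assumed strength of the prior and x the assumed probability
    for a word never seen before.
    """
    ham_ratio = ham_count / nham if nham else 0.0
    spam_ratio = spam_count / nspam if nspam else 0.0
    if ham_ratio + spam_ratio == 0:
        return x
    p = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count
    # With few sightings (small n) the estimate stays near the prior x.
    return (s * x + n * p) / (s + n)
```

Under these assumptions, training every spam twice doubles both `spam_count` and `nspam`, so the normalised ratio does not move; what changes is `n`, which lets spam-only words drift closer to 1. The same formula also shows the corpus ratio at work: a word seen once in each corpus looks much spammier at 15:1 ham:spam than at 3:1.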

I speculate due to my experience with these large ratios, and the fact that
*every* one of these mails came through my Inbox.  Many messages are from
python.org's mailman - thus, the *true* ratio of ham:spam through my mail
account is much higher than the ham:spam ratio left once the mailing list
traffic is removed.  Even though the total spam is the same, the system will
score better or worse depending on the amount of ham I throw at it.  It
isn't intuitive to me why this need be so.

Mark.


From mhammond@skippinet.com.au  Tue Nov 12 04:48:17 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 15:48:17 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <200211120241.gAC2fYW10142@localhost.localdomain>
Message-ID: <LCEPIIGDJPKCOIHOBJEPCEKMHKAA.mhammond@skippinet.com.au>

[Anthony]
> This is utter bollocks :) The question is whether it's Oracle
> that's bollocksed it up, or Outbreak.
>
> Not a lot that could/should be done here - I guess in _theory_
> email could do something where it tries to parse each multipart
> bit individually, and return the bits that work, but this seems
> like it's way too much work.

I believe the email package should give some consideration to the real world
here.  While creating well-formed messages is clearly mandatory, it is very
frustrating when something exists in the real world, is clearly invalid, but
everything else in the world has no trouble with it.

eg, HTML parsing, when your parser fails on pages that every browser
displays perfectly.  I didn't create the page, but I can see it, and want to
parse it.

> I'm curious why the plugin doesn't fall back to raw message text in
> this case?

And ditto for every other application in the world that may try and use the
email package on such an invalid message?  While I accept that we will fix
the plugin to handle this case, it does seem a shame to not be able to get
*anything* out of the email package when your mail client itself is quite
happy with the message.  Eg, how much smarts do we move back into the
plugin?  Do we try and recover any headers at all? etc.

I am *not* trying to say "outlook is broken, so the email package should
handle it" - but simply something along the lines of "if most mailers could
handle it, we should too".  Outlook *is* broken, and I certainly don't want
the email package to worm around all our problems - but I'm not convinced
the problem above (or indeed most header related errors from this package)
are outlook specific.

Mark.


From jeremy@alum.mit.edu  Tue Nov 12 04:51:00 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 11 Nov 2002 23:51:00 -0500
Subject: [Spambayes] Introducing myself 
In-Reply-To: <200211120233.gAC2XXJ10069@localhost.localdomain>
References: <a05200f3fb9f60c3ccb9e@[192.168.1.103]>
	<200211120233.gAC2XXJ10069@localhost.localdomain>
Message-ID: <15824.34996.357483.745111@slothrop.zope.com>

>>>>> "AB" == Anthony Baxter <anthony@interlink.com.au> writes:

  >> Ouch, that's evil.  Maybe the solution for that is to look at the
  >> message and the quotation individually?  But that can be
  >> metagamed too.

  AB> This is still making the spammers work a lot harder, though, so
  AB> it's not really a bad thing.

I'd wager that Tim is working much harder than any of the spammers.

Jeremy


From mhammond@skippinet.com.au  Tue Nov 12 05:10:50 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 16:10:50 +1100
Subject: [Spambayes] Exchange integration
In-Reply-To: <3DCF67F7.16091.91EB9C8@localhost>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEKNHKAA.mhammond@skippinet.com.au>

> At first, I thought I'd use the Event service, but in 5.5 it's
> async and MS even says "don't use this to filter all your messages".

It appears the best way is to hook the message spooler, as per
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/mapi/html/_mapi1book_using_message_filtering_to_manage_messages.asp

and I believe that this can be configured to run on the client or an
exchange server.

> So it looks like I need to create some kind of MAPI hook or
> preprocessor or mailbox assistant.. I'm not sure which.
>
> Anyone know? And, can I do this all in Python via COM or do I
> need some "real" C to hook in?

Python's MAPI support doesn't extend to this yet, but I would be happy to
help make it so.

eeek - except I also find in Q224362: SAMPLE: Hook.exe MAPI Spooler Hook
Provider Sample (C++)
(http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q224362)
"""
Other Notes
Note that Hook Providers, including this one, will not work when using the
Microsoft Exchange Transport Provider. This is a result of Exchange's
tightly-coupled store and transport (that is, they bypass the MAPI spooler).

If you use Exchange's POP/SMTP/IMAP abilities, the spooler hook will
function just fine
"""

KB article Q190413
(http://support.microsoft.com/default.aspx?scid=kb;EN-US;Q190413) discusses
this a little more in the context of public exchange folders, but still
leaves me somewhat confused.  The documentation for IMAPIAdviseSink
certainly implies it can be used for all kinds of new mail notifications.
We may be forced to "suck it and see" :(

Mark.


From tim.one@comcast.net  Tue Nov 12 05:55:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 00:55:19 -0500
Subject: [Spambayes] Some more experiences with the Outlook plugin
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOEKLHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCEENDCJAB.tim.one@comcast.net>

[Tim]
>> In the list you gave below, there are very few hapaxes (I recognize
>> them from the probabilities; I should probably add code to the client
>> to display the raw counts too):

[MarkH]
> certainly would be useful.

That's been checked in now.


> Without the maths background, I find it interesting to ignorantly
> speculate on these ratios.  Tim's analysis:

>> '(and' is nearly "33 times closer" to 0 than '"remove"' is to 1,
>> and that makes the accidental appearance of a ham word in spam much
>> more powerful than the systematic appearance of a spam word in spam.

> makes me wonder why the classifier can't exploit the ham:spam
> ratio to give weighted results.

I think it's already doing the best it can here.  It's like I've met a
thousand Americans and 2 Australians, so from all I've *seen* I have to
conclude you're all beer-swilling, Ducati-riding, chain-smoking pigs.  But
that's really not enough evidence for me to *marry* an Australian, just
enough to think highly of 'em <wink>.

> Or from another POV, what would happen if we artificially
> boosted the ratio by training on each spam multiple times?

Nobody knows.  The "by-counting" spamprob estimate wouldn't change at all:
that's already computed by ratios instead of by absolute counts.  If a word
appears in 3 of 4 spam, it gets exactly the same by-counting estimate as a
word that appears in 15,000,000 of 20,000,000 spam.  The difference would be
solely in how much the Bayesian adjustment pushed the by-counting estimate
towards 0.5:  the greater the total number of msgs a word has been seen in,
the more willing the Bayesian adjustment is to leave the by-counting
estimate alone.

Much the same effect *could* be gotten via reducing option
unknown_word_strength instead.  That also makes the Bayesian adjustment more
willing to take the by-counting estimate at face value.
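
To make the arithmetic above concrete, here is a minimal sketch.  It assumes
the adjustment has the form (s*x + n*p) / (s + n), where p is the by-counting
estimate, n is the number of msgs the word was seen in, x = 0.5 is the prior,
and s plays the role of unknown_word_strength.  The function names and the
value 0.45 are assumptions for illustration, not a copy of the classifier.

```python
def by_counting(spam_count, nspam, ham_count, nham):
    """Estimate P(spam | word) from ratios, not absolute counts."""
    spam_ratio = spam_count / nspam
    ham_ratio = ham_count / nham
    return spam_ratio / (spam_ratio + ham_ratio)


def adjusted(p, n, strength=0.45, prior=0.5):
    """Pull the by-counting estimate p toward the prior.

    n is the total number of msgs the word was seen in; strength stands in
    for unknown_word_strength (0.45 is an assumed value).
    """
    return (strength * prior + n * p) / (strength + n)


# Same ratios, wildly different amounts of evidence.
small = by_counting(3, 4, 1, 4)
large = by_counting(15_000_000, 20_000_000, 5_000_000, 20_000_000)

print(small, large)                         # identical by-counting estimates
print(adjusted(small, n=3 + 1))             # pulled noticeably toward 0.5
print(adjusted(large, n=20_000_000))        # left essentially alone
print(adjusted(small, n=4, strength=0.1))   # lower strength: less pull
```

Training on each spam several times would raise n for spam words without
changing p, which is why the effect resembles lowering the strength.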

Most of the people who helped pick a good default value for
unknown_word_strength didn't have a strong imbalance in ham:spam.  Maybe you
need a lower value, but I expect it's much better for such people *not* to
train on so much ham.  Training on small random samples, plus mistakes and
unsures, may well be a better approach.

If you've been following the latest experiments, it turns out you can get
very good results with a tiny fraction of the msgs people *have* been
training on.  My personal classifier right now has been trained on only
about 100 msgs total, close to 1:1 ham:spam.  This has weaknesses too, but
not nearly as bad as I guessed in advance (it doesn't seem *any* more prone
to making flat-out mistakes, but the Unsures are hilarious <wink>).

> I speculate due to my experience with these large ratios, and the
> fact that *every* one of these mails came through my Inbox.  Many
> messages are from python.org's mailman - thus, the *true* ratio of
> ham:spam through my mail account is much higher than the ham:spam
> ratio left once the mailing list traffic is removed.  Even though
> the total spam is the same, the system will score better or worse
> depending on the amount of ham I throw at it.  It isn't intuitive
> to me why this need be so.

Only because if it has a lot more ham than spam, it has much more reason to
be confident about hamprobs than spamprobs.  I suppose the Bayesian
adjustment could be fiddled so that it didn't "believe" it *could* be more
confident about either class than is justified by the class for which it has
the least amount of evidence.  I'm not exactly sure of the details, but it's
intuitively clear to me so will be obvious when I wake up <wink>.  That would
prevent the strange result in the example, but:

1. Training on the spam again still wouldn't do you much good, because
   if the ratio was 18:1 before training, it would still be close to
   18:1 after training, so it still wouldn't have much reason to
   "believe" the new spamprobs.

and

2. It would make most of the ham you trained on essentially a waste
   of time and space:  by construction, it wouldn't believe the ham
   stats any more than it believed the spam stats.

We know a lot more at this point about how the system behaves if you don't
have a strong imbalance.


From tim.one@comcast.net  Tue Nov 12 06:25:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 01:25:02 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <9891913C5BFE87429D71E37F08210CB929750C@zeus.sfhq.friskit.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAENGCJAB.tim.one@comcast.net>

[Piers Haken]
> Yup, Outlook displays it properly.

Meaning it shows you the HTML part, as rendered HTML, I bet.

> I have a feeling that it's oracle's mess,

Not from what you showed below.  It's not hard to find the end of the
headers!  The first blank line ends them.  That Outlook is showing you stuff
beyond that in its view of the headers says it didn't suck out the headers
properly to begin with.

> but that outlook just ignores the invalid MIME-part headers

By this point Outlook isn't looking *at all* at the part that's damaged (and
probably by it).  It's just sucking out the PR_BODY_HTML property from the
msg and rendering it, and the value of that property contains no MIME armor
at all, just HTML stuff.

> -- maybe spambayes can do the same.

I keep telling people never to call email.message_from_string() directly,
but they don't listen <wink>.  The tokenizer's way of getting an email
message from a string would have at least recovered the message body in this
case, but would have lost the headers entirely (they're crap -- what can you
do?).
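
Roughly, the kind of fallback meant here can be sketched as below.  This is
a hypothetical helper written against the current email package, not the
tokenizer's actual code; the `parse` argument exists only so the fallback
path can be exercised with a parser that raises.

```python
import email
import email.errors
import email.message
import re

_BLANK_LINE = re.compile(r"\r?\n\r?\n")


def message_from_text(text, parse=email.message_from_string):
    """Parse text into a Message, falling back to a headerless message.

    Hypothetical helper: if the parser gives up, the headers are presumed
    damaged, so they are discarded and only the body is kept.
    """
    try:
        return parse(text)
    except email.errors.MessageParseError:
        # Drop everything up to the first blank line; keep the rest
        # as an opaque payload with no headers at all.
        parts = _BLANK_LINE.split(text, maxsplit=1)
        body = parts[1] if len(parts) == 2 else text
        msg = email.message.Message()
        msg.set_payload(body)
        return msg
```

The price is exactly the one described above: the body survives, the headers
do not.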

> The problem is multiplied by the fact that outlook includes the MIME-
> part headers and boundaries with the regular headers,

The Outlook client actually deletes those from the headers, because:

> but separates the body parts and attachments. I don't think there's
> any way to get the original, unseparated message from the API.

That's right, there isn't.  Outlook's basic structure appears to predate
MIME catching on, and the MIME support very much appears hacked in after it
was too late for a change in worldview.  It's a mess that way, if you want
to (as we do) get MIME back out.  The Outlook client right now "loses" all
attachments, and even loses the msg body if the msg has been digitally
signed (because it turns out Outlook does Yet Another Entirely Different
Thing for signed msgs, leaving the two "normal" body properties empty and
stuffing the body *plus* the signature into Yet Another property).

> The Outlook UI shows the headers as:

By this do you mean View -> Options -> Internet headers?

<oracle-headers>
Microsoft Mail Internet Headers Version 2.0
Received: from inet-mail7.oracle.com ([209.246.10.171]) by
zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453);
         Sat, 13 Apr 2002 03:19:01 -0700
Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com
[148.87.9.11])
        by inet-mail7.oracle.com (Switch-2.2.1/Switch-2.2.0) with ESMTP id
g3DA8GV30065
        for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700
Date: Sat, 13 Apr 2002 03:08:16 -0700
Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com>
Subject: Oracle University iSeminars
From: Oracle Corporation<replies@oracleeblast.com>
To: PIERSH@FRISKIT.COM
Reply-To: replies@oracleeblast.com
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="next_part_of_message"
Return-Path: replies@oracleeblast.com
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC)
FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message
of_message
ge

--next_part_of_message
Content-Type: text/html

</oracle-headers>

There's no way blank lines can be part of the headers, so I don't believe
Oracle screwed this up.  They really are blank, too, as the traceback you
sent earlier showed this at the tail end of the headers:

\r\n
--next_part_of_message\r\n
of_message\r\n
ge\r\n
\r\n
--next_part_of_message\r\n
Content-Type: text/html\r\n
\r\n
\n

and *our* code put in the lone oddball \n after the end of what Outlook told
us were the original headers.  If that's common damage, I can worm around
it.
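
A sketch of that kind of worm-around, assuming the damage always takes the
form shown above (real headers, a blank line, then stray MIME boundary
lines).  The helper name is made up for illustration.

```python
import re

# First blank line (possibly containing only spaces/tabs), CRLF or LF.
_FIRST_BLANK = re.compile(r"\r?\n[ \t]*\r?\n")


def trim_outlook_headers(blob):
    """Return only the text before the first blank line of a header blob.

    Hypothetical helper: anything after the first blank line (stray MIME
    boundaries, part headers) cannot be a message header, so drop it.
    """
    match = _FIRST_BLANK.search(blob)
    return blob if match is None else blob[:match.start()]


blob = (
    "Subject: Oracle University iSeminars\r\n"
    "Return-Path: replies@oracleeblast.com\r\n"
    "\r\n"
    "--next_part_of_message\r\n"
    "of_message\r\n"
    "ge\r\n"
    "\r\n"
    "--next_part_of_message\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "\n"
)
print(trim_outlook_headers(blob))
```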


From tim.one@comcast.net  Tue Nov 12 06:33:59 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 01:33:59 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPCEKMHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCOENGCJAB.tim.one@comcast.net>

[Mark Hammond]
> I believe the email package should give some consideration to the
> real world here.

It tries to, starting in 2.2.2:

    http://www.python.org/doc/current/lib/node383.html

The Parser class defaults to non-strict now, but as the docs say

    this doesn't mean MessageParseErrors are never raised; some ill-
    formatted messages just can't be parsed

I'm sure Barry would be willing to entertain this specific case as a bug
report.  In theory, he reads this list, so should be shamed enough to do
that himself <wink>.

> While creating well-formed messages is clearly mandatory,
> it is very frustrating when something exists in the real world, is
> clearly invalid, but everything else in the world has no trouble
> with it.

In this specific case, it looks like Outlook created damaged headers itself,
and Outlook doesn't care because *it* never *looks* at the headers again.
It already sucked out the HTML and stored it in a property, and that's the
only part it looks at again; the Subject and From etc are also tucked away
in other properties.  So as far as Outlook is concerned,
PR_TRANSPORT_MESSAGE_HEADERS is passive trash.

At least that's my guess <wink>.

> ...
> happy with the message.  Eg, how much smarts do we move back into the
> plugin?  Do we try and recover any headers at all? etc.

In this case I will; the form of the damage is clear and easily wormed
around, provided you know what you're looking for in advance.


From mhammond@skippinet.com.au  Tue Nov 12 06:44:30 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 17:44:30 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAENGCJAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOELBHKAA.mhammond@skippinet.com.au>

Something that confuses me completely here is:

* Outlook shows headers with blank lines, appearing to royally screw things
up.
* Our Outlook client simply appends the body(s) to the headers as a simple
string.
* We pass this re-constituted string back into the email package, and it too
seems to screw up the header parsing!

ie, Outlook shows the headers as:

"""
...
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC)
FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message
of_message
...
"""

And the traceback from the email package shows:

"C:\Python22\spam\spambayes\email\Parser.py", line 105, in _parseheaders
    raise Errors.HeaderParseError(
HeaderParseError: Not a header, not a continuation: ``of_message''

Which seems very strange to me.  Why is the email package complaining about
the "of_message" line, rather than itself stopping header parsing after that
blank?  (Recall that the email package does not see the "ContentType:"
header, as we remove that before sending it in.)

I assume I am simply missing how messages are parsed.

Mark.


From piersh@friskit.com  Tue Nov 12 07:02:51 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 11 Nov 2002 23:02:51 -0800
Subject: [Spambayes] Re: Outlook plugin plus Exchange
Message-ID: <9891913C5BFE87429D71E37F08210CB929750D@zeus.sfhq.friskit.com>

> -----Original Message-----
> From: Tim Peters [mailto:tim.one@comcast.net]
> Sent: Monday, November 11, 2002 10:25 PM
> To: Piers Haken
> Cc: David Leftley; spambayes@python.org
> Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange
>
>
> [Piers Haken]
> > Yup, Outlook displays it properly.
>
> Meaning it shows you the HTML part, as rendered HTML, I bet.

Yup.

> > I have a feeling that it's oracle's mess,
>
> Not from what you showed below.  It's not hard to find the
> end of the headers!  The first blank line ends them.  That
> Outlook is showing you stuff beyond that in its view of the
> headers says it didn't suck out the headers properly to begin with.

I'm not sure that's the case. Outlook _always_ shows the MIME headers
below the SMTP headers in its 'internet headers' UI. For example, here's
the 'headers' from another message which does render correctly and that
spambayes does parse correctly:

<example>
Microsoft Mail Internet Headers Version 2.0
Received: from sccrmhc02.attbi.com ([204.127.202.62]) by
zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.5329);
	 Mon, 11 Nov 2002 11:22:27 -0800
Received: from Computer ([12.236.244.49]) by sccrmhc02.attbi.com
          (InterMail vM.4.01.03.27 201-229-121-127-20010626) with SMTP
          id <20021111191007.KEOD5251.sccrmhc02.attbi.com@Computer>;
          Mon, 11 Nov 2002 19:10:07 +0000
From: "Rebecca Whitworth" <lesanctuaire@earthlink.net>
To: "Piers Haken" <piersh@friskit.com>
Cc: "Traci and Stephen Green" <tracigreen50@yahoo.com>
Subject: the green's car
Date: Mon, 11 Nov 2002 11:15:54 -0800
Message-ID: <LMBBIHONPNPJLKCOKALNIEDIDHAA.lesanctuaire@earthlink.net>
MIME-Version: 1.0
Content-Type: multipart/related;
	boundary="----=_NextPart_000_002F_01C28973.B386B770"
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Return-Path: lesanctuaire@earthlink.net
X-OriginalArrivalTime: 11 Nov 2002 19:22:27.0328 (UTC)
FILETIME=[ABBFA800:01C289B7]

------=_NextPart_000_002F_01C28973.B386B770
Content-Type: multipart/alternative;
	boundary="----=_NextPart_001_0030_01C28973.B386B770"

------=_NextPart_001_0030_01C28973.B386B770
Content-Type: text/plain;
	charset="iso-8859-1"
Content-Transfer-Encoding: 8bit

------=_NextPart_001_0030_01C28973.B386B770
Content-Type: text/html;
	charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


------=_NextPart_001_0030_01C28973.B386B770--
------=_NextPart_000_002F_01C28973.B386B770
Content-Type: image/jpeg;
	name="image001.jpg"
Content-Transfer-Encoding: base64
Content-ID: <image001.jpg@01C28973.B30C7E60>


------=_NextPart_000_002F_01C28973.B386B770--
</example>

As you can see it's just showing everything but the contents of the MIME
parts. I don't think there's any suggestion that these are _just_ the
SMTP headers, but the outlook plugin is treating them as such. Maybe the
outlook plugin should trim the non-SMTP parts from these 'headers'
before passing them to the classifier??


> > but that outlook just ignores the invalid MIME-part headers
>
> By this point Outlook isn't looking *at all* at the part
> that's damaged (and probably by it).  It's just sucking out
> the PR_BODY_HTML property from the msg and rendering it, and
> the value of that property contains no MIME armor at all,
> just HTML stuff.
>
> > -- maybe spambayes can do the same.
>
> I keep telling people never to call
> email.message_from_string() directly, but they don't listen
> <wink>.  The tokenizer's way of getting an email message from
> a string would have at least recovered the message body in
> this case, but would have lost the headers entirely (they're
> crap -- what can you do?).
>
> > The problem is multiplied by the fact that outlook includes
> > the MIME-part headers and boundaries with the regular headers,
>
> The Outlook client actually deletes those from the headers, because:
>
> > but separates the body parts and attachments. I don't think there's
> > any way to get the original, unseparated message from the API.
>
> That's right, there isn't.  Outlook's basic structure appears
> to predate MIME catching on, and the MIME support very much
> appears hacked in after it was too late for a change in
> worldview.  It's a mess that way, if you want to (as we do)
> get MIME back out.  The Outlook client right now "loses" all
> attachments, and even loses the msg body if the msg has been
> digitally signed (because it turns out Outlook does Yet
> Another Entirely Different Thing for signed msgs, leaving the
> two "normal" body properties empty and stuffing the body
> *plus* the signature into Yet Another property).

Yeah, it's a mess, but I don't think that the classifier should assume
that the message has SMTP headers at all, since many other MTAs exist
(Exchange, Notes, etc.).  Outlook wasn't designed with MIME in mind,
since Exchange doesn't use MIME.

> > The Outlook UI shows the headers as:
>=20
> By this do you mean View -> Options -> Internet headers?

Yup.

<snip/>
From tim.one@comcast.net  Tue Nov 12 06:53:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 01:53:33 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <B761L97EAUQMKLJQPHGCB0ECXSDA.3dd0617a@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCOENICJAB.tim.one@comcast.net>

[Tim Stone]
> Gotcha.  You dudes are on top of things... ;)

It's more that they're on top of us, and won't get off <wink>.

> Wanna do some ocr stuff on referenced jpgs and gifs?  ;;;)  I
> know I know...  bad idea for any of a thousand reasons...

It's a fine idea, if this kind of stuff becomes a problem we can't address
in a cheaper way.  So far, though, this kind of stuff hasn't had any luck
fooling this system.  For example, "jpg" and "gif" appearing in URL
components have high spamprobs, and if the msg just consists of pointing at
images, those high-spamprob gif and jpg tokens become a major part of the
msg's total token count, and kill it.  Assuming the headers didn't kill it
already.


From tim.one@comcast.net  Tue Nov 12 07:12:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 02:12:27 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOELBHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCGENKCJAB.tim.one@comcast.net>

[Mark Hammond]
> Something that confuses me completely here is:
>
> * Outlook shows headers with blank lines, appearing to royally
> screw things up.

Yes.

> * Our Outlook client simply appends the body(s) to the headers as a
> simple string.

Ditto.

> * We pass this re-constituted string back into the email package,
> and it too seems to screw up the header parsing!

Ditto again.  You're on a roll, Mark <wink>.

> ie, Outlook shows the headers as:
>
> """
> ...
> X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC)
> FILETIME=[A1D8F920:01C1E2D4]
>
> --next_part_of_message
> of_message
> ...
> """
>
> And the traceback from the email package shows:
>
> "C:\Python22\spam\spambayes\email\Parser.py", line 105, in _parseheaders
>     raise Errors.HeaderParseError(
> HeaderParseError: Not a header, not a continuation: ``of_message''

This won't make sense to you just yet <wink>, but look at the full traceback
instead:

Traceback (most recent call last):
  File "C:\Python22\spam\spambayes\Outlook2000\train.py", line 67, in
train_folder
    if train_message(message, isspam, mgr):
  File "C:\Python22\spam\spambayes\Outlook2000\train.py", line 36, in
train_message
    stream = msg.GetEmailPackageObject()
  File "C:\Python22\spam\spambayes\Outlook2000\msgstore.py", line 431, in
GetEmailPackageObject
    msg = email.message_from_string(text)
  File "C:\Python22\spam\spambayes\email\__init__.py", line 39, in
message_from_string
    return Parser(_class, strict=strict).parsestr(s)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 48, in parse
    self._parsebody(root, fp)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 206, in _parsebody
    msgobj = self.parsestr(part)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 52, in parsestr
    return self.parse(StringIO(text), headersonly=headersonly)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 46, in parse
    self._parseheaders(root, fp)
  File "C:\Python22\spam\spambayes\email\Parser.py", line 105, in
_parseheaders
    raise Errors.HeaderParseError(
HeaderParseError: Not a header, not a continuation: ``of_message''

It's descending *into* the body when the error occurs, and at that point
it's really talking about the MIME-section headers, not the message headers,
starting with

> --next_part_of_message
> of_message

as a distinct section.

> Which seems very strange to me.  Why is the email package
> complaining about the "of_message" line, rather than itself stopping
> header parsing after that blank?

My guess is that it *did* stop after the first blank line, so far as the
*message* headers were concerned.  At this point it's looking at the headers
in the individual MIME sections.  I realize this still doesn't make sense to
you, but it will very soon <wink>:

> (Recall that the email package does not see the "ContentType:"
> header, as we remove that before sending it in.)

That's what confused me at first too, but it isn't true here:  we don't
remove the Content-Type header until *after* email.message_from_string()
returns a message.  We never got that far in this case.

> I assume I am simply missing how messages are parsed.

Maybe, but it's irrelevant <wink>.  By the time I'm stripping the MIME
headers in the Outlook client, it's too late to do any good.  I don't know
how to do better, though (with minor effort) -- it's really a job for Barry.
We've been saved so far because the email parser *is* lax by default, and
doesn't complain about missing MIME armor.  It does complain about MIME
armor that makes no sense, though, and I've never seen that happen in any of
my email.  If we managed to get Content-Type out of the Outlook headers
before calling message_from_string, there's no problem with this msg (I
tried that -- it works -- but I removed Content-Type by hand with an editor,
which isn't terribly scalable <wink>).
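
For what it's worth, the by-hand edit could be automated along these lines.
This is only a sketch: the helper name is made up, and it assumes it is
handed just the top-level header block, since it removes every Content-Type
header (and its folded continuation lines) that it finds.

```python
import email
import re

# A Content-Type header line, any folded continuation lines (which start
# with a space or tab), and the line ending that terminates the header.
_CONTENT_TYPE = re.compile(
    r"^content-type:[^\r\n]*(?:\r?\n[ \t][^\r\n]*)*(?:\r?\n|\Z)",
    re.IGNORECASE | re.MULTILINE,
)


def drop_content_type(headers):
    """Return the header block with any Content-Type header removed."""
    return _CONTENT_TYPE.sub("", headers)


headers = (
    "MIME-Version: 1.0\r\n"
    "Content-Type: multipart/alternative;\r\n"
    '    boundary="next_part_of_message"\r\n'
    "Return-Path: replies@oracleeblast.com\r\n"
)
body = "<html>iSeminars</html>\r\n"

# With no Content-Type left, the parser treats the body as plain payload.
msg = email.message_from_string(drop_content_type(headers) + "\r\n" + body)
print(msg.is_multipart(), msg.keys())
```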


From anthony@interlink.com.au  Tue Nov 12 07:13:56 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 18:13:56 +1100
Subject: [Spambayes] A couple of small tokenizer experiments. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCELFCJAB.tim.one@comcast.net> 
Message-ID: <200211120713.gAC7Du512404@localhost.localdomain>

>>> Tim Peters wrote
> We don't tokenize To: now because it gives good results for bad reasons on
> mixed-source corpora.  It would be good to have an option to tokenize it.
> It appears that your code also tokenized Cc:; also fine.  I would rather see
> the code added to the loop currently cracking "from" lines:
> 
>         for field in ('from',):
> 
> so that we tokenize all address thingies in a uniform way.  The option would
> control the list of field names looped over there (default just from:,
> optionally also to: and cc:).
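
The loop described in the quote above might look roughly like this.  It is a
sketch only: the function name and the "field:name:..." / "field:addr:..."
token shapes are illustrative assumptions, not necessarily what the
checked-in tokenizer emits.

```python
import email
from email.utils import getaddresses


def tokenize_addresses(msg, address_headers=("from",)):
    """Yield tokens for each address in the configured header fields.

    address_headers mirrors the option discussed above; the token format
    is an assumption made for illustration.
    """
    for field in address_headers:
        for name, addr in getaddresses(msg.get_all(field, [])):
            if name:
                yield "%s:name:%s" % (field, name.lower())
            # Split user and domain so each can act as a separate clue.
            for piece in addr.lower().split("@"):
                if piece:
                    yield "%s:addr:%s" % (field, piece)


msg = email.message_from_string(
    "From: Update <update@localhost.net>\n"
    "To: spambayes@python.org\n"
    "\n"
    "body\n"
)
print(list(tokenize_addresses(msg)))
print(list(tokenize_addresses(msg, ("from", "to", "cc"))))
```

Every address field goes through the same code path, so turning on to: and
cc: is just a matter of lengthening the tuple.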

I've added this now. For me, tokenising just the 'from' line
with the new 'address_headers' option gives (vs the old code):

(all tests with 4 sets of 1200H/400S)

filename:   old_from   new_from
ham:spam:  4800:1600  4800:1600
fp total:        1       1
fp %:         0.02    0.02
fn total:       12      11
fn %:         0.75    0.69
unsure t:       86      88
unsure %:     1.34    1.38
real cost:  $39.20  $38.60
best cost:  $31.80  $32.40
h mean:       0.36    0.36
h sdev:       4.04    4.05
s mean:      98.25   98.25
s sdev:       8.93    8.99
mean diff:   97.89   97.89
k:            7.55    7.51

The old code's best cost was:
-> achieved at ham & spam cutoffs 0.24 & 0.99
->     fp 0; fn 3; unsure ham 26; unsure spam 118
->     fp rate 0%; fn rate 0.188%; unsure rate 2.25%

The new code's best cost was:
-> largest ham & spam cutoffs 0.26 & 0.99
->     fp 0; fn 4; unsure ham 24; unsure spam 118
->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%

The one additional fn was a spam that was dragged from 0.35 to
0.21 because it came from 'update@localhost.net' - the 'update'
was a strong spam clue.
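
For anyone reading the "real cost" and "best cost" rows: the figures above
are consistent with charging $10 per fp, $1 per fn and $0.20 per unsure.
Those weights are inferred from the numbers in this message rather than
quoted from the test driver's options, so treat them as an assumption.

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0,
               unsure_weight=0.20):
    """Dollar cost of a run under the (assumed) per-error weights."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


rows = [
    ("old_from real", 1, 12, 86),
    ("new_from real", 1, 11, 88),
    ("old_from best", 0, 3, 26 + 118),   # unsure ham + unsure spam
    ("new_from best", 0, 4, 24 + 118),
]
for label, fp, fn, unsure in rows:
    print("%-14s $%.2f" % (label, total_cost(fp, fn, unsure)))
```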

Where it gets more interesting is when I also tokenize to and cc:

filename:   new_from  new_fromtocc
ham:spam:  4800:1600  4800:1600
fp total:        1       1
fp %:         0.02    0.02
fn total:        4       5
fn %:         0.25    0.31
unsure t:      121     104
unsure %:     1.89    1.62
real cost:  $38.20  $35.80
best cost:  $32.40  $28.00
h mean:       0.36    0.31
h sdev:       4.05    3.80
s mean:      98.25   98.42
s sdev:       8.99    8.77
mean diff:   97.89   98.11
k:            7.51    7.81


We go from:
-> largest ham & spam cutoffs 0.26 & 0.99
->     fp 0; fn 4; unsure ham 24; unsure spam 118
->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%

to
-> largest ham & spam cutoffs 0.22 & 0.99
->     fp 0; fn 3; unsure ham 25; unsure spam 100
->     fp rate 0%; fn rate 0.188%; unsure rate 1.95%

That's a total of 142->125 unsures. I'll accept that :)


Just to make sure, ran with a different seed.
filename:  new_from2  new_fromtocc2
ham:spam:  4800:1600  4800:1600
fp total:        0       0
fp %:         0.00    0.00
fn total:        6       6
fn %:         0.38    0.38
unsure t:      110      97
unsure %:     1.72    1.52
real cost:  $28.00  $25.40
best cost:  $23.00  $19.20
h mean:       0.45    0.39
h sdev:       4.72    4.48
s mean:      98.44   98.56
s sdev:       8.82    8.62
mean diff:   97.99   98.17
k:            7.24    7.49

went from:
-> largest ham & spam cutoffs 0.28 & 0.94
->     fp 0; fn 6; unsure ham 23; unsure spam 62
->     fp rate 0%; fn rate 0.375%; unsure rate 1.33%
to
-> largest ham & spam cutoffs 0.24 & 0.93
->     fp 0; fn 4; unsure ham 25; unsure spam 51
->     fp rate 0%; fn rate 0.25%; unsure rate 1.19%


toemail:python.org and toemail:zope.org both show up in 
my 'best discriminators' list as _very_ strong ham clues
(not surprising, given the mailing lists I'm on). My old/uncommon
email addresses generally show up as strong spam clues
(eg prob('toemail:arb') = 0.999356)

Next, I tried it against a chunk of my horrible corpus - 4 (out of
10) sets of 1200H/400S (out of 3500H/1800S in each set)

filename:  info_from  info_fromtocc
ham:spam:  4800:1600  4800:1600
fp total:        6       7
fp %:         0.12    0.15
fn total:        4       4
fn %:         0.25    0.25
unsure t:      208     179
unsure %:     3.25    2.80
real cost: $105.60 $109.80
best cost:  $78.00  $66.40
h mean:       3.05    2.63
h sdev:      10.88   10.12
s mean:      99.17   99.12
s sdev:       6.65    6.99
mean diff:   96.12   96.49
k:            5.48    5.64

That's 
-> achieved at ham & spam cutoffs 0.62 & 0.99
->     fp 5; fn 11; unsure ham 44; unsure spam 41
->     fp rate 0.104%; fn rate 0.688%; unsure rate 1.33%

going to
-> achieved at ham & spam cutoffs 0.62 & 0.99
->     fp 4; fn 12; unsure ham 36; unsure spam 36
->     fp rate 0.0833%; fn rate 0.75%; unsure rate 1.12%

Anyway, the option's checked in and there, so go play. I'll run a full
test of the horror corpus overnight...

Anthony

From tim.one@comcast.net  Tue Nov 12 07:31:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 02:31:49 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <9891913C5BFE87429D71E37F08210CB929750D@zeus.sfhq.friskit.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOENLCJAB.tim.one@comcast.net>

[Piers Haken]
> I'm not sure that's the case. Outlook _always_ shows the MIME headers
> below the SMTP headers in its 'internet headers' UI.

My Outlook never does.  Really!  Never.  I've trained on thousands of HTML
spam here, and surely would have noticed this problem if it had ever popped
up.

So, there's some difference between our Outlooks.  Which version are you
using?  I'm using Outlook 2000 SR-1, build 9.0.0.4201, Internet Mail Only
configuration, and get email solely thru remote POP3 accounts.  I'm
*guessing* you've got yours configured for Corporate/Workgroup, which is
said to change lots of stuff in undocumented ways.

> For example, here's the 'headers' from another message which does render
> correctly and that spambayes does parse correctly:

<example>
Microsoft Mail Internet Headers Version 2.0
Received: from sccrmhc02.attbi.com ([204.127.202.62]) by
zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.5329);
         Mon, 11 Nov 2002 11:22:27 -0800
Received: from Computer ([12.236.244.49]) by sccrmhc02.attbi.com
          (InterMail vM.4.01.03.27 201-229-121-127-20010626) with SMTP
          id <20021111191007.KEOD5251.sccrmhc02.attbi.com@Computer>;
          Mon, 11 Nov 2002 19:10:07 +0000
From: "Rebecca Whitworth" <lesanctuaire@earthlink.net>
To: "Piers Haken" <piersh@friskit.com>
Cc: "Traci and Stephen Green" <tracigreen50@yahoo.com>
Subject: the green's car
Date: Mon, 11 Nov 2002 11:15:54 -0800
Message-ID: <LMBBIHONPNPJLKCOKALNIEDIDHAA.lesanctuaire@earthlink.net>
MIME-Version: 1.0
Content-Type: multipart/related;
        boundary="----=_NextPart_000_002F_01C28973.B386B770"
X-Priority: 3 (Normal)
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook IMO, Build 9.0.2416 (9.0.2910.0)
Importance: Normal
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1106
Return-Path: lesanctuaire@earthlink.net
X-OriginalArrivalTime: 11 Nov 2002 19:22:27.0328 (UTC)
        FILETIME=[ABBFA800:01C289B7]

------=_NextPart_000_002F_01C28973.B386B770
Content-Type: multipart/alternative;
        boundary="----=_NextPart_001_0030_01C28973.B386B770"

------=_NextPart_001_0030_01C28973.B386B770
Content-Type: text/plain;
        charset="iso-8859-1"
Content-Transfer-Encoding: 8bit

------=_NextPart_001_0030_01C28973.B386B770
Content-Type: text/html;
        charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


------=_NextPart_001_0030_01C28973.B386B770--
------=_NextPart_000_002F_01C28973.B386B770
Content-Type: image/jpeg;
        name="image001.jpg"
Content-Transfer-Encoding: base64
Content-ID: <image001.jpg@01C28973.B30C7E60>


------=_NextPart_000_002F_01C28973.B386B770--
</example>

> As you can see it's just showing everything but the contents of
> the MIME parts.

Yes, I see.  I've never seen anything like that before, though.

> I don't think there's any suggestion that these are _just_ the SMTP
> headers, but the outlook plugin is treating them as such.

Sure, because that's what the people who wrote the client have *seen* in
their Outlooks.  It's not like what MS does here is documented <wink>.

> Maybe the outlook plugin should trim the non-SMTP parts from these
> 'headers' before passing them to the classifier??

Looks like we don't have any choice about that now.
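
A rough sketch of what such trimming could look like, assuming the real headers always end at the first blank line of the blob and that the "Microsoft Mail Internet Headers" banner, when present, is the first line. The function name is hypothetical and this is not the plugin's code:

```python
def trim_to_smtp_headers(raw):
    """Keep only the text before the first blank line of a header blob."""
    lines = raw.replace("\r\n", "\n").split("\n")
    # Drop the banner line, which is not a header at all.
    if lines and lines[0].startswith("Microsoft Mail Internet Headers"):
        lines = lines[1:]
    kept = []
    for line in lines:
        if not line.strip():
            break  # first blank line ends the real headers
        kept.append(line)
    return "\n".join(kept) + "\n"


blob = (
    "Microsoft Mail Internet Headers Version 2.0\n"
    "From: a@example.com\n"
    "Subject: test\n"
    "Content-Type: multipart/alternative;\n"
    "\tboundary=\"b1\"\n"
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "--b1--\n"
)
print(trim_to_smtp_headers(blob))
```

Cutting at the first blank line drops the per-part MIME headers along with the boundaries; whether losing them matters to the classifier is a separate question.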

> Yeah, it's a mess, but I don't think that the classifier should assume
> that the message has SMTP headers at all, since many other MTA's exist
> (exchange, notes, etc...)

We don't.  The value of the MAPI PR_TRANSPORT_MESSAGE_HEADERS property is
magnificently ill-defined, to the point of complete uselessness, so this is
purely poke-and-hope programming.  The only other case we've *seen* before
this is the one where PR_TRANSPORT_MESSAGE_HEADERS has no value, in which
case there's code to try to *synthesize* some bare-bones headers out of the
PR_SUBJECT, PR_DISPLAY_NAME, PR_DISPLAY_TO, and PR_DISPLAY_CC properties.
Now some other case has popped up -- so it goes.
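
To make the synthesize-some-headers fallback concrete, here is a sketch that works on a plain dict keyed by property name instead of a real MAPI object. Which property feeds which header (in particular PR_DISPLAY_NAME as the From line) is my assumption for illustration, not a description of the plugin's code:

```python
def synthesize_headers(props):
    """Build bare-bones headers from a dict of MAPI-style property values."""
    # Assumed mapping; the real plugin may pair these up differently.
    mapping = [
        ("From", "PR_DISPLAY_NAME"),
        ("To", "PR_DISPLAY_TO"),
        ("Cc", "PR_DISPLAY_CC"),
        ("Subject", "PR_SUBJECT"),
    ]
    lines = []
    for header, prop in mapping:
        value = props.get(prop)
        if value:
            # Collapse whitespace so one property cannot inject extra headers.
            lines.append("%s: %s" % (header, " ".join(value.split())))
    # A blank line terminates the header block.
    return "\n".join(lines) + "\n\n"


print(synthesize_headers({"PR_SUBJECT": "the green's car",
                          "PR_DISPLAY_TO": "Piers Haken"}))
```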

> Outlook wasn't designed with MIME in mind ...

I had figured that one out already <wink>.


From anthony@interlink.com.au  Tue Nov 12 07:37:32 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 18:37:32 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOENGCJAB.tim.one@comcast.net> 
Message-ID: <200211120737.gAC7bXX12617@localhost.localdomain>


>>> Tim Peters wrote
> [Mark Hammond]
> > I believe the email package should give some consideration to the
> > real world here.
> 
> It tries to, starting in 2.2.2:
> 
>     http://www.python.org/doc/current/lib/node383.html
> 
> The Parser class defaults to non-strict now, but as the docs say
> 
>     this doesn't mean MessageParseErrors are never raised; some ill-
>     formatted messages just can't be parsed
> 
> I'm sure Barry would be willing to entertain this specific case as a bug
> report.  In theory, he reads this list, so should be shamed enough to do
> that himself <wink>.

Yah. The non-strict mode was initially my fault, because I wanted to
be able to parse bad MIME. In this case, you're hitting a broken MIME
subsection. That 'of_message' is nothing like a header at all - if
the broken MIME subsection is supposed to be parsed, there should be a
newline between the boundary and the subsection.

The section of code in question has this comment:

    # Normal, non-continuation header.  BAW: this should check to make
    # sure it's a legal header, e.g. doesn't contain spaces.  Also, we
    # should expose the header matching algorithm in the API, and
    # allow for a non-strict parsing mode (that ignores the line
    # instead of raising the exception).

Here's an (untested :) patch. Depending on how you want to handle these 
sorts of errors, uncomment either the 'break' or the 'continue' line.


--- Parser.py   23 Sep 2002 13:18:55 -0000      1.1.1.1
+++ Parser.py   12 Nov 2002 07:34:40 -0000
@@ -98,9 +98,15 @@
                 if self._strict:
                     raise Errors.HeaderParseError(
                         "Not a header, not a continuation: ``%s''"%line)
-                elif lineno == 1 and line.startswith('--'):
-                    # allow through duplicate boundary tags.
-                    continue
+                elif lineno == 1:
+                    if line.startswith('--'):
+                        # allow through duplicate boundary tags.
+                        continue
+                    else: 
+                        # hack hack hack. We saw a non header. Either:
+                        #continue # to ignore it silently.
+                        # or
+                        break # to treat the rest of the headers as body
                 else:
                     raise Errors.HeaderParseError(
                         "Not a header, not a continuation: ``%s''"%line)

I'm not comfortable that this should go into the core distribution of the
email package - but the above comment about exposing the header matching
API is a good one. I'll think about how to do this.
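
As a starting point for that, here is a sketch of what a standalone header matcher might look like. It is not part of the email package; the function name and the four labels are invented, and the pattern is simply "printable ASCII without spaces or colons, then a colon":

```python
import re

# Field name: printable ASCII except space and colon, followed by a colon.
HEADER_RE = re.compile(r"^[!-9;-~]+:")


def classify_header_line(line):
    """Return 'header', 'continuation', 'blank' or 'junk' for one line."""
    if not line.strip():
        return "blank"
    if line[0] in " \t":
        return "continuation"
    if HEADER_RE.match(line):
        return "header"
    return "junk"


for sample in ["Subject: hi", "\tboundary=\"b1\"", "", "--next_part_of_message"]:
    print(repr(sample), "->", classify_header_line(sample))
```

A non-strict parser could then decide per label whether to skip a 'junk' line or stop reading headers, which is the same choice as the continue/break pair in the patch.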

Anthony

From piersh@friskit.com  Tue Nov 12 07:51:48 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 11 Nov 2002 23:51:48 -0800
Subject: [Spambayes] Re: Outlook plugin plus Exchange
Message-ID: <9891913C5BFE87429D71E37F08210CB929750E@zeus.sfhq.friskit.com>

> -----Original Message-----
> From: Tim Peters [mailto:tim.one@comcast.net]
> Sent: Monday, November 11, 2002 11:32 PM
> To: Piers Haken
> Cc: David Leftley; spambayes@python.org
> Subject: RE: [Spambayes] Re: Outlook plugin plus Exchange
>
>
> [Piers Haken]
> > I'm not sure that's the case. Outlook _always_ shows the MIME headers
> > below the SMTP headers in its 'internet headers' UI.
>
> My Outlook never does.  Really!  Never.  I've trained on
> thousands of HTML spam here, and surely would have noticed
> this problem if it had ever popped up.
>
> So, there's some difference between our Outlooks.  Which
> version are you using?  I'm using Outlook 2000 SR-1, build
> 9.0.0.4201, Internet Mail Only configuration, and get email
> solely thru remote POP3 accounts.  I'm
> *guessing* you've got yours configured for
> Corporate/Workgroup, which is said to change lots of stuff in
> undocumented ways.

Yeah, all my internet email comes via SMTP to my exchange server. It
might be the corporate/workgroup setting that's making the difference, or
it could be the exchange SMTP MTA.

Ugh.

Piers.
From piersh@friskit.com  Tue Nov 12 07:55:57 2002
From: piersh@friskit.com (Piers Haken)
Date: Mon, 11 Nov 2002 23:55:57 -0800
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
Message-ID: <9891913C5BFE87429D71E37F08210CB9183A0C@zeus.sfhq.friskit.com>

Yeah, the problem is twofold: not only are the MIME headers broken, but
even if they weren't the content of that MIME header would be empty
since the body parts are appended (by the outlook plugin) after the MIME
headers:

Msgstory.py, line ~400:
        return "%s\n%s\n%s" % (headers, html, body)

Piers.

> -----Original Message-----
> From: Anthony Baxter [mailto:anthony@interlink.com.au]
> Sent: Monday, November 11, 2002 11:38 PM
> To: Tim Peters
> Cc: Mark Hammond; Piers Haken; Barry A. Warsaw; spambayes@python.org
> Subject: Re: [Spambayes] Re: Outlook plugin plus Exchange
>
>
>
> >>> Tim Peters wrote
> > [Mark Hammond]
> > > I believe the email package should give some consideration to the
> > > real world here.
> >
> > It tries to, starting in 2.2.2:
> >
> >     http://www.python.org/doc/current/lib/node383.html
> >
> > The Parser class defaults to non-strict now, but as the docs say
> >
> >     this doesn't mean MessageParseErrors are never raised; some ill-
> >     formatted messages just can't be parsed
> >
> > I'm sure Barry would be willing to entertain this specific case as a
> > bug report.  In theory, he reads this list, so should be shamed enough
> > to do that himself <wink>.
>
> Yah. The non-strict mode was initially my fault, because I
> wanted to be able to parse bad MIME. In this case, you're
> hitting a broken MIME subsection. That 'of_message' is
> nothing like a header at all - if the broken MIME subsection
> is supposed to be parsed, there should be a newline between
> the boundary and the subsection.
>
> The section of code in question has this comment:
>
>     # Normal, non-continuation header.  BAW: this should check to make
>     # sure it's a legal header, e.g. doesn't contain spaces.  Also, we
>     # should expose the header matching algorithm in the API, and
>     # allow for a non-strict parsing mode (that ignores the line
>     # instead of raising the exception).
>
> Here's an (untested :) patch. Depending on how you want to handle these
> sorts of errors, uncomment either the 'break' or the 'continue' line.
>
>
> --- Parser.py   23 Sep 2002 13:18:55 -0000      1.1.1.1
> +++ Parser.py   12 Nov 2002 07:34:40 -0000
> @@ -98,9 +98,15 @@
>                  if self._strict:
>                      raise Errors.HeaderParseError(
>                          "Not a header, not a continuation: ``%s''"%line)
> -                elif lineno == 1 and line.startswith('--'):
> -                    # allow through duplicate boundary tags.
> -                    continue
> +                elif lineno == 1:
> +                    if line.startswith('--'):
> +                        # allow through duplicate boundary tags.
> +                        continue
> +                    else:
> +                        # hack hack hack. We saw a non header. Either:
> +                        #continue # to ignore it silently.
> +                        # or
> +                        break # to treat the rest of the headers as body
>                  else:
>                      raise Errors.HeaderParseError(
>                          "Not a header, not a continuation: ``%s''"%line)
>
> I'm not comfortable that this should go into the core
> distribution of the email package - but the above comment
> about exposing the header matching API is a good one. I'll
> think about how to do this.
>
> Anthony
>
From anthony@interlink.com.au  Tue Nov 12 07:49:59 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 18:49:59 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <9891913C5BFE87429D71E37F08210CB9183A0C@zeus.sfhq.friskit.com> 
Message-ID: <200211120749.gAC7nxn12707@localhost.localdomain>


>>> "Piers Haken" wrote
> 
> Yeah, the problem is twofold: not only are the MIME headers broken, but
> even if they weren't the content of that MIME header would be empty
> since the body parts are appended (by the outlook plugin) after the MIME
> headers:
> 
> Msgstory.py, line ~400:
>         return "%s\n%s\n%s" % (headers, html, body)

No idea on that one - it's in the Outlook plugin code - I'm not going 
near that one.

Mark-touched-it-last-he-gets-to-fix-it,

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From anthony@interlink.com.au  Tue Nov 12 07:52:19 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 18:52:19 +1100
Subject: [Spambayes] A couple of small tokenizer experiments. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCELFCJAB.tim.one@comcast.net> 
Message-ID: <200211120752.gAC7qJv12783@localhost.localdomain>

>>> Tim Peters wrote
> > First experiment was to make the URL tokenizer look for the string
> > 'mailman' in the URL. If it was found, simply push the clue "url:
> > Mailman URL" onto the clue-pile. This was an attempt to remove the
> Can you try this again replacing "break" with "continue"?  I can't believe
> you intended break here -- it means that the first time we see a Mailman URL
> in a msg, we stop looking for embedded URLs period.  Spam could easily
> exploit that.

--- tokenizer.py        12 Nov 2002 06:21:38 -0000      1.66
+++ tokenizer.py        12 Nov 2002 07:23:30 -0000
@@ -944,6 +944,11 @@
         new_text.append(text[i : start])
         new_text.append(' ')
 
+        if guts.find('mailman') != -1:
+            pushclue("url: Mailman URL")
+            i = end
+            continue
+
         pushclue("proto:" + proto)
         # Lose the trailing punctuation for casual embedding, like:
         #     The code is at http://mystuff.org/here?  Didn't resolve.
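
To show why "continue" rather than "break" matters here, a standalone sketch of a URL-scanning loop follows. It is not the tokenizer.py code: the regex, the function name and the 'url:' piece tokens are simplifications made up for the example, and only the "url: Mailman URL" and "proto:" clue strings come from the patch above.

```python
import re

# Simplified URL pattern, good enough for the illustration only.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)


def url_clues(text):
    """Collect clues for every URL; Mailman URLs collapse to a single clue."""
    clues = []
    for match in URL_RE.finditer(text):
        proto, guts = match.group(1).lower(), match.group(2)
        if "mailman" in guts.lower():
            clues.append("url: Mailman URL")
            continue  # skip the detailed tokens, but keep scanning later URLs
        clues.append("proto:" + proto)
        for piece in guts.split("/"):
            if piece:
                clues.append("url:" + piece.lower())
    return clues


text = ("list info at http://mail.python.org/mailman/listinfo/spambayes "
        "and also http://spam.example.com/buy")
print(url_clues(text))
```

With "break" in place of "continue", the second URL in the sample would never be looked at, which is the hole Tim describes.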


filename:  new_fromtocc2  new_mailman2
ham:spam:  4800:1600      4800:1600
fp total:        0       0
fp %:         0.00    0.00
fn total:        6       5
fn %:         0.38    0.31
unsure t:       97      95
unsure %:     1.52    1.48
real cost:  $25.40  $24.00
best cost:  $19.20  $18.20
h mean:       0.39    0.42
h sdev:       4.48    4.59
s mean:      98.56   98.68
s sdev:       8.62    8.17
mean diff:   98.17   98.26
k:            7.49    7.70

before:
-> largest ham & spam cutoffs 0.24 & 0.93
->     fp 0; fn 4; unsure ham 25; unsure spam 51
->     fp rate 0%; fn rate 0.25%; unsure rate 1.19%

after:
-> largest ham & spam cutoffs 0.24 & 0.94
->     fp 0; fn 3; unsure ham 27; unsure spam 49
->     fp rate 0%; fn rate 0.188%; unsure rate 1.19%

It replaces a chunk of closely correlated ham clues, which has the
expected result of pushing both ham and spam up slightly. This (for
me) rescues one fn at the expense of a couple of extra unsure hams.

This looks like a YMMV one. It's (for me) a marginal win. 

Anthony

From Paul.Moore@atosorigin.com  Tue Nov 12 09:39:27 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 12 Nov 2002 09:39:27 -0000
Subject: [Spambayes] Re: Outlook plugin plus Exchange
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBC@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> The whole problem I see with this is that µ$0pht could and most
> likely will screw all these machinations up with the next release of
> Outlook or Exchange... They have this great history of not caring
> if their api changes, or system behavior changes, are backward
> compatible. If we're having this level of difficulty now, get
> ready... :(

But if we stick to a pretty trivial "On startup" hook which scans all
new mail, along with a "New mail arrived" hook which filters mail as
it arrives, then (1) we're covered, and (2) we aren't doing anything
sufficiently complex that it's *likely* to get broken. (Not that MS
can't break anything - Outlook.NET probably won't even have a COM
addin interface...)

Paul.

From tim@fourstonesExpressions.com  Tue Nov 12 09:45:26 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 12 Nov 2002 03:45:26 -0600
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DBC@UKDCX001.uk.int.atosorigin.com>
Message-ID: <QGE1TRO1WC8SQ1ZC8LI74PJMI52JHSR.3dd0cdb6@riven>

11/12/2002 3:39:27 AM, "Moore, Paul" <Paul.Moore@atosorigin.com> wrote:

>From: Tim Stone - Four Stones Expressions
>> The whole problem I see with this is that µ$0pht could and most
>> likely will screw all these machinations up with the next release of
>> Outlook or Exchange... They have this great history of not caring
>> if their api changes, or system behavior changes, are backward
>> compatible. If we're having this level of difficulty now, get
>> ready... :(
>
>But if we stick to a pretty trivial "On startup" hook which scans all
>new mail, along with a "New mail arrived" hook which filters mail as
>it arrives, then (1) we're covered, and (2) we aren't doing anything
>sufficiently complex that it's *likely* to get broken. (Not that MS
>can't break anything - Outlook.NET probably won't even have a COM
>addin interface...)

Yup.... 'xactly what I was thinkin.  We'll have to maintain at least two 
versions of the plugin for some time to come, if not ad-infinitum.

- TimS

>
>Paul.
>
>
- Tim
www.fourstonesExpressions.com 


From lists@webcrunchers.com  Tue Nov 12 10:39:39 2002
From: lists@webcrunchers.com (John D.)
Date: Tue, 12 Nov 2002 02:39:39 -0800
Subject: [Spambayes] How do I get the latest CVS?
Message-ID: <v0311070db9f68a7f883c@[192.168.0.2]>

I FINALLY set up a system I can use to start using some of the SpamBayes work, but (sigh) I don't think I have access.

I'm trying to get the SpamBayes project from SourceForge. But I need a password to get into it. I tried Anonymous, but that didn't work.

I only need read access to it.

CrunchBox# cvs -q get -P spambayes
The authenticity of host 'cvs.spambayes.sourceforge.net (216.136.171.202)' can't be established.
DSA key fingerprint is 02:ab:7c:aa:49:ed:0b:a8:50:13:10:c2:3e:92:0f:42.
Are you sure you want to continue connecting (yes/no)? yes\
Warning: Permanently added 'cvs.spambayes.sourceforge.net,216.136.171.202' (DSA) to the list of known hosts.
anonymous@cvs.spambayes.sourceforge.net's password: 
Permission denied, please try again.

How do I get it?

John


From mhammond@skippinet.com.au  Tue Nov 12 10:53:55 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 21:53:55 +1100
Subject: [Spambayes] How do I get the latest CVS?
In-Reply-To: <v0311070db9f68a7f883c@[192.168.0.2]>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOELJHKAA.mhammond@skippinet.com.au>

> I'm trying to get the SpamBayes project from SourceForge.   But I
> need a password to get into it.   I tried Anonymous,  but that
> didn't work.

As per the CVS instructions (http://sourceforge.net/cvs/?group_id=61702) the
anonymous password is empty - just press the enter key.

Mark.


From mhammond@skippinet.com.au  Tue Nov 12 11:00:06 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 22:00:06 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <QGE1TRO1WC8SQ1ZC8LI74PJMI52JHSR.3dd0cdb6@riven>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKELKHKAA.mhammond@skippinet.com.au>

[Tim Stone quoting Paul Moore]
> >sufficiently complex that it's *likely* to get broken. (Not that MS
> >can't break anything - Outlook.NET probably won't even have a COM
> >addin interface...)
>
> Yup.... 'xactly what I was thinkin.  We'll have to maintain at least two
> versions of the plugin for some time to come, if not ad-infinitum.

This is unlikely for some time, I believe.  MS don't piss off the people
required to generate their revenue (i.e., corporates etc.), and as a rule go to
huge pains to make software backward compatible.  Windows is Windows for
this reason, not because anyone wants it this way <wink>.  We may end up
with optimizations for later versions, but that is a different question.

Mark.


From anthony@interlink.com.au  Tue Nov 12 11:01:56 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 22:01:56 +1100
Subject: [Spambayes] How do I get the latest CVS? 
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOELJHKAA.mhammond@skippinet.com.au> 
Message-ID: <200211121101.gACB1uh02523@localhost.localdomain>


>>> "Mark Hammond" wrote
> > I'm trying to get the SpamBayes project from SourceForge.   But I
> > need a password to get into it.   I tried Anonymous,  but that
> > didn't work.
> 
> As per the CVS instructions (http://sourceforge.net/cvs/?group_id=61702) the
> anonymous password is empty - just press the enter key.

You should also be using pserver, not ext. Finally, you should use
ssh protocol v1, not v2, for SF, if you are a developer.

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From mhammond@skippinet.com.au  Tue Nov 12 11:18:32 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 12 Nov 2002 22:18:32 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <200211120749.gAC7nxn12707@localhost.localdomain>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKELMHKAA.mhammond@skippinet.com.au>

[Anthony]
> >>> "Piers Haken" wrote
> >
> > Yeah, the problem is twofold: not only are the MIME headers broken, but
> > even if they weren't the content of that MIME header would be empty
> > since the body parts are appended (by the outlook plugin) after the MIME
> > headers:
> >
> > Msgstory.py, line ~400:
> >         return "%s\n%s\n%s" % (headers, html, body)
>
> No idea on that one - it's in the Outlook plugin code - I'm not going
> near that one.
>
> Mark-touched-it-last-he-gets-to-fix-it,

Oh, if only it were that easy <wink>.  The good news is that however this is
fixed, it will also lend itself to a fix for the multipart/signed message
abomination we are lumped with and I am partially stalled on.

And I hereby retract any aspersions I cast upon the wonderful email package
<0.1 wink>

Mark.


From anthony@interlink.com.au  Tue Nov 12 11:28:33 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 12 Nov 2002 22:28:33 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPKELMHKAA.mhammond@skippinet.com.au> 
Message-ID: <200211121128.gACBSXR02787@localhost.localdomain>


>>> "Mark Hammond" wrote
> Oh, if only it were that easy <wink>.  The good news is that however this is
> fixed, it will also lend itself to a fix for the multipart/signed message
> abomination we are lumped with and I am partially stalled on.

Not familiar with that particular one - details? 

> And I hereby retract any aspersions I cast upon the wonderful email package
> <0.1 wink>

There's at least one "known to be broken" mail message out there - 
stuff from some mailer called Entourage. multipart/alternative
nested inside a multipart/mixed, and both have the same boundary 
tag. Making this work with the current email package is extremely
non-trivial.

Go on, guess the vendor. Go on, I dare you. 


Anthony.

From guido@python.org  Tue Nov 12 12:55:17 2002
From: guido@python.org (Guido van Rossum)
Date: Tue, 12 Nov 2002 07:55:17 -0500
Subject: [Spambayes] Impersonation
Message-ID: <200211121255.gACCtHb22270@pcp02138704pcs.reston01.va.comcast.net>

More and more spam is impersonating real people rather than making up
phoney AOL addresses.  I just received a bounce quoting the following
spam:

> Message-ID: <000059425b35$0000481d$00007669@evertythingmail.net>
> From: guido@python.org
> Reply-To: Bob123@hotmail.com
> To: clevelandindians@flash.net
> Subject: 1hi29895
> Date: Tue, 12 Nov 2002 15:42:40 -0500
> MIME-Version: 1.0
> Content-Type: text/plain;
> 	charset="iso-8859-1"
> 
> g45, 
> 
> Do you own a Business?
> 
>  <http://www.best-cyber-deals.com/1/> Do you need a Web Site?  Let Us Design
> it.
> 
> We can Design a Web Site that's perfect for you.
> 
> feedback@cdsymas.com h6 

Surely this is copied from successful viruses.  My wife startled me
this weekend by telling me she couldn't run the virus protection
program I had sent her.  Of course, I had done no such thing.  It was
a Klez virus variant that disguises itself as a virus protection tool.
The computer of a mutual friend must have been infected. :-(

This shows how careful we have to be with whitelists...

--Guido van Rossum (home page: http://www.python.org/~guido/)

From msergeant@startechgroup.co.uk  Tue Nov 12 13:02:57 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Tue, 12 Nov 2002 13:02:57 +0000
Subject: [Spambayes] Impersonation
References: <200211121255.gACCtHb22270@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3DD0FC01.6010206@startechgroup.co.uk>

Guido van Rossum said the following on 12/11/02 12:55:
> More and more spam is impersonating real people rather than making up
> phoney AOL addresses.  I just received a bounce quoting the following
> spam:
> 
> 
>>Message-ID: <000059425b35$0000481d$00007669@evertythingmail.net>
>>From: guido@python.org
>>Reply-To: Bob123@hotmail.com
>>To: clevelandindians@flash.net
>>Subject: 1hi29895
>>Date: Tue, 12 Nov 2002 15:42:40 -0500
>>MIME-Version: 1.0
>>Content-Type: text/plain;
>>	charset="iso-8859-1"
>>
>>g45, 
>>
>>Do you own a Business?
>>
>> <http://www.best-cyber-deals.com/1/> Do you need a Web Site?  Let Us Design
>>it.
>>
>>We can Design a Web Site that's perfect for you.
>>
>>feedback@cdsymas.com h6 
> 
> Surely this is copied from successful viruses. 

Spam was doing this first, fwiw.

Matt.


From bkc@murkworks.com  Tue Nov 12 13:46:42 2002
From: bkc@murkworks.com (Brad Clements)
Date: Tue, 12 Nov 2002 08:46:42 -0500
Subject: [Spambayes] Introducing myself
In-Reply-To: <a05200f3fb9f60c3ccb9e@[192.168.1.103]>
References: <LNBBLJKPBEHFEDALKOLCKELPCJAB.tim.one@comcast.net>
Message-ID: <3DD0BF07.5980.E5AE4C2@localhost>

On 11 Nov 2002 at 21:16, Robert Woodhead wrote:

> My hunch, based on things I've done in the past, is that as the total 
> volume of mail increases, the rate of increase in the number of 

> analysis on a quarter-gig of ham and spam I was seeing, IIRC, about 
> 300,000 distinct tokens (including the aforementioned gibberish).

My training/testing set of ... 13,000 messages resulted in pickles with 320,000 words.


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From Paul.Moore@atosorigin.com  Tue Nov 12 15:09:09 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 12 Nov 2002 15:09:09 -0000
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> I believe the email package should give some consideration to the
> real world here.  While creating well-formed messages is clearly
> mandatory, it is very frustrating when something exists in the
> real world, is clearly invalid, but everything else in the world
> has no trouble with it.

Was the problem with duff headers, or invalid MIME sections in the
body? If the latter, there is an option in the email package to not
parse the body - instead of email.message_from_string(data), you can
use email.Parser.Parser().parsestr(data, 1). IIRC, this treats the
body as a single uninterpreted "payload", rather than as structured
MIME parts.

If the problem's with the headers, this won't help, though...
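
For anyone who wants to try the headers-only route, here is a small sketch. It uses the current Python 3 spelling (email.parser.Parser with the headersonly keyword) rather than the email.Parser module path mentioned above, and the sample message is made up. The 2.2-era package may behave differently in detail.

```python
from email.parser import Parser

# Made-up sample: a multipart message with one text part.
raw = (
    "From: a@example.com\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative; boundary=\"b1\"\n"
    "\n"
    "--b1\n"
    "Content-Type: text/plain\n"
    "\n"
    "hello\n"
    "--b1--\n"
)

# headersonly=True parses the headers and leaves the body as one string.
msg = Parser().parsestr(raw, headersonly=True)
print(msg.get_content_type())
print(msg.is_multipart())
print(type(msg.get_payload()).__name__)
```

The declared type still says multipart, but the body is never split into parts, so broken MIME sections inside it cannot trip the parser.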

Paul.

From Paul.Moore@atosorigin.com  Tue Nov 12 15:26:11 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 12 Nov 2002 15:26:11 -0000
Subject: [Spambayes] Re: Outlook plugin plus Exchange
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DBF@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
>> As you can see it's just showing everything but the contents of
>> the MIME parts.
>
> Yes, I see.  I've never seen anything like that before, though.

I see that too (on an Exchange server, ie Corporate/Workgroup
configuration). It looks like it's an Exchange-specific thing.

Paul.

From tim.one@comcast.net  Wed Nov 13 01:04:31 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 20:04:31 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPKELMHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEEKCKAB.tim.one@comcast.net>

[Mark Hammond]
> Oh, if only it were that easy <wink>.  The good news is that
> however this is fixed, it will also lend itself to a fix for the
> multipart/signed message abomination we are lumped with and I am
> partially stalled on.

Good news -- I managed to fix it without making the multipart/signed
silliness even one iota easier.  OK, "fix" is a strong claim.  I cut off the
head, bolted it to my dashboard, and buried the torso in a different state.

> And I hereby retract any aspersions I cast upon the wonderful
> email package <0.1 wink>

It should do better on this one -- patches accepted <wink>.


From tim.one@comcast.net  Wed Nov 13 01:29:51 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 20:29:51 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEENCKAB.tim.one@comcast.net>

[Moore, Paul]
> Was the problem with duff headers, or invalid MIME sections in the
> body?

The latter.

> If the latter, there is an option in the email package to not
> parse the body - instead of email.message_from_string(data), you can
> use email.Parser.Parser().parsestr(data, 1). IIRC, this treats the
> body as a single uninterpreted "payload", rather than as structured
> MIME parts.

I'm not sure it helps in this case, and I don't understand what it's doing.
Feeding the original msg (reconstructed from the diagnostic output) into
parsestr(data, True) yields a Message object m with a pretty bizarre string
representation:

>>> print m.as_string()
X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from inet-mail7.oracle.com ([209.246.10.171]) by
        zeus.sfhq.friskit.com with Microsoft SMTPSVC(5.0.2195.4453);
        Sat, 13 Apr 2002 03:19:01 -0700
Received: from blaster-smtp.oracle.com (eblast01.oracleeblast.com
        [148.87.9.11])g3DA8GV30065
        for PIERSH@FRISKIT.COM; Sat, 13 Apr 2002 03:08:16 -0700
Date: Sat, 13 Apr 2002 03:08:16 -0700
Message-Id: <200204131008.g3DA8GV30065@inet-mail7.oracle.com>
Subject: Oracle University iSeminars
From: Oracle Corporation<replies@oracleeblast.com>
To: PIERSH@FRISKIT.COM
Reply-To: replies@oracleeblast.com
Content-Transfer-Encoding: 8bit
MIME-Version: 1.0
Content-Type: multipart/alternative;
    boundary="next_part_of_message"
Return-Path: replies@oracleeblast.com
X-OriginalArrivalTime: 13 Apr 2002 10:19:01.0938 (UTC)
        FILETIME=[A1D8F920:01C1E2D4]

--next_part_of_message


--next_part_of_message--

>>>

The mystery there is why those MIME boundaries show up after "the real"
headers.  They're not reflected in the header count:

>>> len(m)
14
>>>

and they're not payload, preamble, or epilogue either:

>>> `m.get_payload()`
'None'
>>> m.preamble
>>> m.epilogue
>>>

The type may or may not be expected:

>>> m.get_type()
'multipart/alternative'
>>>

The reason I think it *may* be unexpected is that I thought it was a Message
invariant that the type is multipart if and only if the payload is a list.

m doesn't think it's multipart despite its type:

>>> m.is_multipart()
0
>>>

but in *that* case the docs say the payload is a string (which None is not).

Barry, is this a sane Message object, or an insane one?
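
For anyone who wants to poke at the same invariants, here is a minimal
sketch.  It assumes a current Python 3 email package rather than the
2.2-era one used above, so the answers it prints may differ from the session
shown, and the sample text is a cut-down, made-up stand-in for the Oracle
msg:

```python
from email.parser import Parser

# Hypothetical, shortened stand-in for the problem message.
raw = (
    "Subject: Oracle University iSeminars\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/alternative;\n"
    '    boundary="next_part_of_message"\n'
    "\n"
    "--next_part_of_message\n"
    "\n"
    "--next_part_of_message--\n"
)

# headersonly=True asks the parser to leave the body uninterpreted.
m = Parser().parsestr(raw, headersonly=True)

print(len(m))                 # number of header fields
print(m.get_content_type())   # declared type, taken from the headers
print(m.is_multipart())       # true only when the payload is a list
print(type(m.get_payload()))  # how the body ended up being stored
```

The last two lines are the ones that test the "multipart iff the payload is
a list" invariant.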


From tim.one@comcast.net  Wed Nov 13 02:09:28 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 12 Nov 2002 21:09:28 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
In-Reply-To: <200211121128.gACBSXR02787@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEFACKAB.tim.one@comcast.net>

[Mark Hammond]
> Oh, if only it were that easy <wink>.  The good news is that
> however this is fixed, it will also lend itself to a fix for the
> multipart/signed message abomination we are lumped with and I am
> partially stalled on.

[Anthony Baxter]
> Not familiar with that particular one - details?

You don't want to know -- it's an Outlook thing.  Outlook doesn't speak MIME
natively.  Incoming msgs are broken apart and sprayed into any number of
"properties", which latter have nothing to do with MIME.  Usually.  The
plain text part of a msg is usually stored in one property, and the HTML
part in another.  But in the case of a multipart/signed msg both those
"normal body parts" are empty, and the msg body is hiding in yet another
property that's usually otherwise empty.  But it's not just the msg body
hiding there then, it's also some disconnected MIME armor which also
includes the signature part.  Then it starts to get messy <wink>.


From tim@fourstonesExpressions.com  Wed Nov 13 02:13:14 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 12 Nov 2002 20:13:14 -0600
Subject: [Spambayes] Corpus modules
Message-ID: <WQID93TQWRTSICROUQ5ZYWKGUT42873W.3dd1b53a@riven>

I've been working with Richie Hindle to create modules that are useful for 
managing corpora for his pop3proxy.  There is a Corpus class, a Message class, 
and a MessageFactory class, with subclasses that add persistence into a file 
system as text or gzip files in subdirectories.  There's a Trainer class that 
observes Corpus instances and untrains/trains a bayes database as messages are 
moved between them.  (Corpus is defined simply as a collection of messages).  
I've also got a BayesHelper class, that adds persistence to a Bayes object, 
that is imported from classifier (Bayes) or hammie(PersistentBayes).

Assuming that I can get these things checked in sometime soon, they may be 
useful outside of the pop3proxy.  I see some overlap with Messages and the 
msgs.py module.  Also, the BayesHelper thing really doesn't belong in the 
Corpus.py module.

So there's the context of my question(s) ;)  Now for the questions.

Hammie has interesting PersistentBayes and DB_Dict classes, with some helper 
functions for bayes object creation.  It seems to me that a more cogent class 
hierarchy is called for, with Bayes being the abstract class, PersistentBayes 
being an abstract subclass, and subclasses of that for particular persistence 
mechanisms, like PickleBayes, ZODBBayes, DBDictBayes, etc. etc.
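
To make the proposed hierarchy concrete, here is a rough sketch.  It is not
the existing classifier or hammie code: the attribute names (wordinfo, nham,
nspam) and the store()/load() methods are assumptions for illustration only.

```python
import pickle
from abc import ABC, abstractmethod


class Bayes:
    """Stand-in for the in-memory classifier; it only holds counts here."""

    def __init__(self):
        self.wordinfo = {}   # hypothetical: word -> (spamcount, hamcount)
        self.nham = 0
        self.nspam = 0


class PersistentBayes(Bayes, ABC):
    """Abstract layer: says *that* state persists, not *how*."""

    @abstractmethod
    def store(self):
        """Write the current state to the backing store."""

    @abstractmethod
    def load(self):
        """Replace the current state with what the backing store holds."""


class PickleBayes(PersistentBayes):
    """One concrete mechanism; a ZODB or dbm variant would sit beside it."""

    def __init__(self, path):
        super().__init__()
        self.path = path

    def store(self):
        with open(self.path, "wb") as f:
            pickle.dump((self.wordinfo, self.nham, self.nspam), f)

    def load(self):
        with open(self.path, "rb") as f:
            self.wordinfo, self.nham, self.nspam = pickle.load(f)
```

Callers would then depend only on the PersistentBayes interface, and the
choice of storage becomes a one-line change where the object is created.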

It doesn't make a lot of sense to me to have the Bayes class in classifier and 
the PersistentBayes class in hammie...  It would seem much more consistent to 
me to have a Bayes.py module, with all the bayes database classes.  There 
might be a lot of momentum behind the hammie.py module, perhaps too much to 
change directions now, but hammie doesn't tell me much about what this module 
is really for, and when I look in it, I don't see much coherence either.

The current scheme that I have in Corpus is to have a trainer object that 
knows about its Bayes object, and trains it in response to observed message 
movement events.  This is mainly a hack.  It would be better for these bayes 
objects to be able to be the Corpus observers, and forget about this 
artificial Trainer object.
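
A sketch of what "the bayes object is the observer" could look like.  All
names here (Corpus.add_observer, on_add, on_remove, watch) are hypothetical
and the classifier is a toy that only counts words:

```python
class Corpus:
    """A collection of messages that announces additions and removals."""

    def __init__(self):
        self.msgs = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, key, text):
        self.msgs[key] = text
        for obs in self.observers:
            obs.on_add(self, text)

    def remove_message(self, key):
        text = self.msgs.pop(key)
        for obs in self.observers:
            obs.on_remove(self, text)


class ObservingBayes:
    """Toy classifier that trains and untrains itself from corpus events."""

    def __init__(self):
        self.labels = {}   # corpus -> True if it holds spam
        self.counts = {}   # word -> [spamcount, hamcount]

    def watch(self, corpus, is_spam):
        self.labels[corpus] = is_spam
        corpus.add_observer(self)

    def on_add(self, corpus, text):
        self._adjust(text, self.labels[corpus], +1)

    def on_remove(self, corpus, text):
        self._adjust(text, self.labels[corpus], -1)

    def _adjust(self, text, is_spam, delta):
        for word in set(text.split()):
            rec = self.counts.setdefault(word, [0, 0])
            rec[0 if is_spam else 1] += delta


def move_message(src, dst, key):
    """Moving a message untrains it from one corpus and trains it on the other."""
    text = src.msgs[key]
    src.remove_message(key)
    dst.add_message(key, text)
```

With this shape there is no separate Trainer: moving a message from the ham
corpus to the spam corpus is enough to flip its training.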

Right now, my Message objects are fairly dumb.  They simply wrap entire 
messages, which are used for training.  It seems as if the training methods on 
Bayes create objects from msgs.py which have a lot more smarts in them, like 
'gimme the headers', 'gimme the body', 'gimme a wordstream', etc.  However, my 
Message objects have some attributes that are specifically useful for the 
pop3proxy handling of incoming pop3 mail, specifically persistence.  Should 
these two classes be merged, could the msgs.py objects become more useful for 
the pop3proxy, or could my Message class become more broadly useful?  It seems 
that the current msgs classes are useful for test training, and for deep 
within the bowels of the training algorithms, but would not be too useful for 
the pop3proxy...

So I guess in summary, I propose that we create a Bayes.py module with guts 
from the current classifier and hammie modules, and make a Message class 
that's broadly useful, both for corpus management and for training...  It's my 
itch, so I'm willing to scratch it, but what do the rest of you think?

Musings of a latecomer to the party...

- TimS
www.fourstonesExpressions.com 


From lists@webcrunchers.com  Wed Nov 13 05:29:36 2002
From: lists@webcrunchers.com (John D.)
Date: Tue, 12 Nov 2002 21:29:36 -0800
Subject: [Spambayes] CVS Access....
Message-ID: <v0311073ab9f792ecb5e5@[192.168.0.2]>

Still trying to get the CVS Library...   I use the following commands....

	# export CVSROOT=anonymous@cvs.spambayes.sourceforge.net:/cvsroot/
	# cd /usr
	# cvs -q get -P spambayes

It then prompts me for a password.   I try "anonymous" for password, but it still won't allow me access.

If these are the wrong options for the "cvs" command,  would someone please enlighten me with the right ones?

I assume the "cvs" command gets it,  and puts it into the "/usr/spambayes" directory.    If I'm wrong,  I would like to know how.

John


From barry@wooz.org  Wed Nov 13 05:29:53 2002
From: barry@wooz.org (Barry A. Warsaw)
Date: Wed, 13 Nov 2002 00:29:53 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
References: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com>
	<LNBBLJKPBEHFEDALKOLCGEENCKAB.tim.one@comcast.net>
Message-ID: <15825.58193.998340.448311@gargle.gargle.HOWL>


>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

    TP> but in *that* case the docs say the payload is a string (which
    TP> None is not).

    TP> Barry, is this a sane Message object, or an insane one?

It's insane, but it may not entirely be your fault <wink>.  I think
Parser.parse() should call root.set_payload('') when headersonly is
True.  That ensures the Message invariant will hold.

I'll add a test and check a patch into the Python 2.3 cvs.  I'll also
update the docs to make it clear that the file pointer is left at the
first body line when parsing only the headers.  (I'm not sure what to
do with any firstbodyline that might get passed from _parseheaders()
though).

-Barry

From tim.one@comcast.net  Wed Nov 13 05:57:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 00:57:50 -0500
Subject: [Spambayes] CVS Access....
In-Reply-To: <v0311073ab9f792ecb5e5@[192.168.0.2]>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEFLCKAB.tim.one@comcast.net>

[John D.]
> Still trying to get the CVS Library...   I use these following
> commands....
>
> 	# export CVSROOT=anonymous@cvs.spambayes.sourceforge.net:/cvsroot/
> 	# cd /usr
> 	# cvs -q get -P spambayes
>
> It then prompts me for a password.   I try "anonymous" for
> password, but it still won't allow me access.

John, you got two replies earlier today.  Have you seen them?

http://mail.python.org/pipermail/spambayes/2002-November/001807.html

http://mail.python.org/pipermail/spambayes/2002-November/001809.html

They're just repeating the instructions at

    http://sourceforge.net/cvs/?group_id=61702

for "Anonymous CVS Access".  The instructions work, but you have to do what
they say <wink>.  As they say, there is no password, so don't give one.
Just hit Enter at the prompt, without typing anything.  But also use the cvs
commands in the instructions -- there's no need to guess about anything here
(except that the "modulename" variable in the instructions is indeed
spambayes, as you've guessed already).


From barry@python.org  Wed Nov 13 05:44:52 2002
From: barry@python.org (Barry A. Warsaw)
Date: Wed, 13 Nov 2002 00:44:52 -0500
Subject: [Spambayes] Re: Outlook plugin plus Exchange
References: <16E1010E4581B049ABC51D4975CEDB885E2DBE@UKDCX001.uk.int.atosorigin.com>
	<LNBBLJKPBEHFEDALKOLCGEENCKAB.tim.one@comcast.net>
	<15825.58193.998340.448311@gargle.gargle.HOWL>
Message-ID: <15825.59092.274040.453726@gargle.gargle.HOWL>


>>>>> "BAW" == Barry A Warsaw <barry@wooz.org> writes:

    BAW> It's insane, but it may not entirely be your fault <wink>.  I
    BAW> think Parser.parse() should call root.set_payload('') when
    BAW> headersonly is True.  That ensures the Message invariant will
    BAW> hold.

Hmm, that wasn't as easy as I thought.  I'll sleep on it.
-Barry

From tim.one@comcast.net  Wed Nov 13 06:44:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 01:44:02 -0500
Subject: [Spambayes] A couple of small tokenizer experiments.
In-Reply-To: <200211120713.gAC7Du512404@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEFOCKAB.tim.one@comcast.net>

[Anthony Baxter, tokenizing mail-address headers]
> I've added this now. For me, tokenising just the 'from' line
> with the new 'address_headers' option gives (vs the old code):
>
> (all tests with 4 sets of 1200H/400S)
>
> filename:  old_from
>                    new_from
> ham:spam:  4800:1600
>                    4800:1600
> fp total:        1       1
> fp %:         0.02    0.02
> fn total:       12      11
> fn %:         0.75    0.69
> unsure t:       86      88
> unsure %:     1.34    1.38
> real cost:  $39.20  $38.60
> best cost:  $31.80  $32.40
> h mean:       0.36    0.36
> h sdev:       4.04    4.05
> s mean:      98.25   98.25
> s sdev:       8.93    8.99
> mean diff:   97.89   97.89
> k:            7.55    7.51
>
> The old code's best cost was:
> -> achieved at ham & spam cutoffs 0.24 & 0.99
> ->     fp 0; fn 3; unsure ham 26; unsure spam 118
> ->     fp rate 0%; fn rate 0.188%; unsure rate 2.25%
>
> The new code's best cost was:
> -> largest ham & spam cutoffs 0.26 & 0.99
> ->     fp 0; fn 4; unsure ham 24; unsure spam 118
> ->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%
>
> The one additional fn was a spam that was dragged from 0.35 to
> 0.21 because it came from 'update@localhost.net' - the 'update'
> was a strong spam clue.

Well, regardless of reason, the best cost got worse, and it did on my c.l.py
test too, but also by a trivial amount.  I fiddled the tokenization of this
field until it did better again, so please make sure I didn't screw you too
badly <wink>.
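
For anyone checking the arithmetic in the quoted table: the "real cost" and
"best cost" lines are consistent with charging $10 per fp, $1 per fn and
$0.20 per unsure.  Those weights are inferred from the numbers above, so
treat them as an assumption rather than a statement of what the test driver
is configured to do.

```python
def real_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Total cost in dollars under the assumed per-error weights."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


print("%.2f" % real_cost(1, 12, 86))        # old_from column -> 39.20
print("%.2f" % real_cost(1, 11, 88))        # new_from column -> 38.60
print("%.2f" % real_cost(0, 3, 26 + 118))   # old code's best cost -> 31.80
print("%.2f" % real_cost(0, 4, 24 + 118))   # new code's best cost -> 32.40
```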

Something that helped:  it now generates log-count "no real name" metatokens
too for address headers without real-name parts.

        'from:no real name:2**0' 0.933186

became one of the 40 most-frequent discriminators in my c.l.py data then,
and is a strong spam clue.  The good news is that it raised my
lowest-scoring spam from near 0.20 to over 0.27, so at ham_cutoff=0.20
(which I'm using on the c.l.py test), I have no spam close to being called
ham anymore.  The bad news is that it gave me another FP, but it's one of
those useless msgs I don't care about (a two-word "confirm 12345" msg from a
first-time poster sent to a wrong address, using a free email acct that
inserted advertising at the bottom of the msg -- it's always been on the
edge).
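
Roughly, the idea looks like the sketch below.  This is not the checked-in
tokenizer: the function name and the name:/addr: token prefixes are made up
for illustration, and only the shape of the "no real name" metatoken is
taken from the example above.

```python
import math
from email.utils import getaddresses


def address_header_tokens(field, header_values):
    """Yield tokens for one address header field, e.g. field='from'."""
    no_name_count = 0
    for name, addr in getaddresses(header_values):
        if name:
            for word in name.lower().split():
                yield "%s:name:%s" % (field, word)
        else:
            no_name_count += 1
        if addr:
            yield "%s:addr:%s" % (field, addr.lower())
    if no_name_count:
        # Log-count bucket: 1 -> 2**0, 2 -> 2**1, 4 -> 2**2, and so on.
        bucket = round(math.log2(no_name_count))
        yield "%s:no real name:2**%d" % (field, bucket)


print(list(address_header_tokens("from", ["update@localhost.net"])))
```

Bucketing by powers of two keeps the number of distinct metatokens small
even for headers that list dozens of bare addresses.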

> Where it gets more interesting is when I also tokenize to and cc:

I would hope so <wink>.

> filename:  new_from
>                    new_fromtocc
> ham:spam:  4800:1600
>                    4800:1600
> fp total:        1       1
> fp %:         0.02    0.02
> fn total:        4       5
> fn %:         0.25    0.31
> unsure t:      121     104
> unsure %:     1.89    1.62
> real cost:  $38.20  $35.80
> best cost:  $32.40  $28.00
> h mean:       0.36    0.31
> h sdev:       4.05    3.80
> s mean:      98.25   98.42
> s sdev:       8.99    8.77
> mean diff:   97.89   98.11
> k:            7.51    7.81
>
>
> We go from:
> -> largest ham & spam cutoffs 0.26 & 0.99
> ->     fp 0; fn 4; unsure ham 24; unsure spam 118
> ->     fp rate 0%; fn rate 0.25%; unsure rate 2.22%
>
> to
> -> largest ham & spam cutoffs 0.22 & 0.99
> ->     fp 0; fn 3; unsure ham 25; unsure spam 100
> ->     fp rate 0%; fn rate 0.188%; unsure rate 1.95%
>
> That's a total of 142->125 unsures. I'll accept that :)

Yup, it's a small win.  I can't use it on my c.l.py test, but should be able to
on the general python.org corpus (plus, of course, my own email).

> Just to make sure, ran with a different seed.

... [and another small win] ...

BTW, you should make sure the seeds aren't close together.  For example,
using seed 123 one time, and 124 the next, will give a lot of msg overlap.

> toemail:python.org and toemail:zope.org both show up in
> my 'best discriminators' list as _very_ strong ham clues
> (not surprising, given the mailing lists I'm on).

Well, that's also going to make the spam that slips thru that much harder to
catch.  Of course, after Greg deploys this system, there won't be any more
spam slipping thru <wink>.

> My old/uncommon email addresses generally show up as strong strong
> spam clues (eg prob('toemail:arb') = 0.999356)

Cool!


From mhammond@skippinet.com.au  Wed Nov 13 06:49:44 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 13 Nov 2002 17:49:44 +1100
Subject: [Spambayes] Some more experiences with the Outlook plugin
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEFKCKAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEPIHKAA.mhammond@skippinet.com.au>

> [Paul Moore]
> > * Following on from this, I also see Tim's behaviour of surprising
> >   unsure cases (or worse, false negatives!). Worst case recently was a
> >   message which scored as solid ham. I trained on it as "Spam", and
> >   rescored it. It still scored 5 - solid ham.
>
> [Mark Hammond]
> > This too was my experience.  For a while, I did training over a huge
> > ham corpus, and spam is still less than 1000 messages.  I had around
> > 15:1 ham:spam.  I too trained new ham and spam, and was disappointed
> > to see the score remain almost identical.

> Almost identical or exactly identical?

Almost exactly identical <wink>.  I can't recall for sure, and wasn't
actually playing with bayes - just sorting through mail before trying to do
something productive.  I'll get back to playing with this stuff, but only
after I get back to the client itself <frown>.

> still heavily hapax-driven teensy classifier, the auto-rescore feature of
> the Outlook client never seemed to change my scores either, and for a
> hapax-driven classifier that's bizarre.  It turns out that was because it
> actually didn't change scores:  the probabilities didn't get updated after
> training on the reclassified msg, so "the new score" was in fact exactly
> equal to "the old score".  I just checked in a fix for that (unique to the
> Outlook client).

Hrm - I could have sworn I saw the scores change in quite a few cases.  But
as I said, this is hardly a controlled environment.  You should see my desk
<wink>.

And to compound things, I am seeing messages I don't understand from my
"delete as spam/recover from spam" functions - I suspect they are broken as
I see "already trained as spam" when, eg, training a new unsure as spam.
Quickly eyeballed the code and it looks OK - haven't debugged yet.

> So it would be good to retain the old database for concurrent scoring
> purposes until the new one is ready to use, or it would be good to delay
> scoring msgs until training is complete.  I've refrained from "doing
> something" about this because it seems like it would be easy to do after
> some mechanism is in place for scanning for unrated msgs at startup (i.e.,
> folder events could be disabled for the duration of from-scratch training,
> then re-enabled after, and the scan-for-unrated machinery kicked
> into action
> again).

Well, threads wouldn't be a bad infrastructure to use <wink>.  Extended MAPI
is documented as being thread-safe (which, of course, may just result in
serialization).  I understand that we still have the same issue with a full
re-train, so I only mention this to ask if now is also a good time to
implement our own locks or threading strategy.  In the very least, it
couldn't hurt to spin off the pickle loading, and anywhere else people
complain we hurt (eg, future bulk deletes or moves, etc).  A simple queue
would even suffice - not much needs to be synchronous at the moment, if
anything (assuming asynch usually means "almost synch" <wink>).  Asynch
message filtering won't fly in the lower-level message hooking functions we
are eyeballing though :(
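
By "a simple queue" I mean something like the sketch below: one background
thread that pulls jobs off a queue so the UI thread never blocks on them.
It is written against a current Python 3 standard library, the names are
made up, and nothing here touches MAPI, so take it as the shape only.

```python
import queue
import threading


def start_worker():
    """Start one background thread that runs queued callables in order."""
    tasks = queue.Queue()

    def run():
        while True:
            job = tasks.get()
            try:
                if job is None:      # sentinel: shut the worker down
                    return
                job()
            finally:
                tasks.task_done()

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return tasks, worker


results = []
tasks, worker = start_worker()
tasks.put(lambda: results.append("pickle loaded"))   # stands in for slow work
tasks.put(None)
tasks.join()      # wait until everything queued so far has been handled
worker.join()
print(results)
```

Anything that must stay synchronous (such as filtering inside a message
hook) would simply not go through the queue.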

Outlook itself certainly does plenty in the background (and currently shows
14 threads for me).  Eg, the unread message counts in the folder view and
"Outlook Shortcuts" panes can take a few seconds before they show up (during
which time Outlook is running just fine)

Mark.


From tim.one@comcast.net  Wed Nov 13 05:45:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 00:45:30 -0500
Subject: [Spambayes] Some more experiences with the Outlook plugin
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOEKLHKAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEFKCKAB.tim.one@comcast.net>

[Paul Moore]
> * Following on from this, I also see Tim's behaviour of surprising
>   unsure cases (or worse, false negatives!). Worst case recently was a
>   message which scored as solid ham. I trained on it as "Spam", and
>   rescored it. It still scored 5 - solid ham.

[Mark Hammond]
> This too was my experience.  For a while, I did training over a huge
> ham corpus, and spam is still less than 1000 messages.  I had around
> 15:1 ham:spam.  I too trained new ham and spam, and was disappointed
> to see the score remain almost identical.

Almost identical or exactly identical?  I wasn't looking over your
shoulders, so it's hard to guess <wink>.  I've been noticing that, in my
still heavily hapax-driven teensy classifier, the auto-rescore feature of
the Outlook client never seemed to change my scores either, and for a
hapax-driven classifier that's bizarre.  It turns out that was because it
actually didn't change scores:  the probabilities didn't get updated after
training on the reclassified msg, so "the new score" was in fact exactly
equal to "the old score".  I just checked in a fix for that (unique to the
Outlook client).

BTW, another buglet here looks harder to fix:  if you do a retrain from
scratch in the client, all email that comes in *while* training is in
progress gets scored at exactly 50.  That's because the database being built
isn't useful until it's done being built, but is used for scoring during the
rebuild process.  It won't blow up, but every word has unknown_word_prob
before .update_probabilities() gets called at the end.

So it would be good to retain the old database for concurrent scoring
purposes until the new one is ready to use, or it would be good to delay
scoring msgs until training is complete.  I've refrained from "doing
something" about this because it seems like it would be easy to do after
some mechanism is in place for scanning for unrated msgs at startup (i.e.,
folder events could be disabled for the duration of from-scratch training,
then re-enabled after, and the scan-for-unrated machinery kicked into action
again).
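
The first alternative (keep the old database for scoring until the new one
is ready) amounts to building the replacement off to the side and swapping
it in at the end.  A sketch with a toy classifier follows; the class names
are hypothetical and the scoring rule is deliberately silly, the point is
only the swap:

```python
class ToyClassifier:
    """Scores 50 until update_probabilities() has run, like a half-built db."""

    def __init__(self):
        self.spam_words = set()
        self.ready = False

    def learn(self, text, is_spam):
        if is_spam:
            self.spam_words.update(text.split())

    def update_probabilities(self):
        self.ready = True

    def score(self, text):
        if not self.ready:
            return 50
        return 100 if self.spam_words & set(text.split()) else 0


class ClassifierHolder:
    """Scoring always goes through .active; retraining never touches it."""

    def __init__(self, classifier):
        self.active = classifier

    def score(self, text):
        return self.active.score(text)

    def retrain_from_scratch(self, training_data):
        fresh = ToyClassifier()
        for text, is_spam in training_data:
            fresh.learn(text, is_spam)   # mail arriving now is still scored by the old db
        fresh.update_probabilities()
        self.active = fresh              # swap only once the new db is usable
```

The cost is holding two databases in memory for the duration of the
retrain.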


From rob@hooft.net  Wed Nov 13 07:51:14 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Wed, 13 Nov 2002 08:51:14 +0100
Subject: [Spambayes] Outlook plugin - training
References: <LNBBLJKPBEHFEDALKOLCMEGKCJAB.tim.one@comcast.net>
Message-ID: <3DD20472.5080103@hooft.net>

Tim Peters wrote:
> Now for another extreme:  after 10 startup msgs, the system trains itself on
> its own decisions, except that:
> 
> 1. Unsures are correctly classified by the user.
> 2. False negatives are correctly classified by the user.
> 
> But false positives are trained on *as spam*, assuming the user never looks
> at their spam folder.  That takes a long time to run, because
> update_probabilities() is called after every msg.  After 2,100 msgs,
> 
>  2100 trained:1181H+919S wrds:59659 fp:0 fn:0 unsure:26
> 
> and the unsures are growing very slowly now (at 1400 msgs there were 25
> unsures).

Now THIS is the way I'd like to go! I think this is approximately the 
minimum effort we can expect from lazy users (like myself). Sometimes, a 
fp might actually be corrected by the user at some point, but testing it 
the way you did should be giving the minimal possible performance of a 
minimal-impact system that would not require much training to begin with.

There is one catch: what if the first 10 messages are all ham or all 
spam? Shouldn't we require at least a few of each?

How would this work to start on a mailing list? I guess we could deliver
spambayes with 5 "representative recent spam" (or a URL where they can 
be found). The mailing list would moderate the first few messages to the 
list, and then the filter will kick in. If a message is "spam", it can 
be returned to the sender, saying that the message has been judged 
inappropriate by the filter based on wording. "ham" can be posted 
without moderator approval. And all "unsure" messages are held for 
approval. The approval interface could have a separate "Spam" 
classification, but that is not really necessary: anything 
"inappropriate" can go in the spam corpus. For "fn"s, the archives 
should have the options to delete a message as spam.
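
In code, the policy I have in mind is roughly the following.  The function
names, the action labels, the min_each value and the spam cutoff of 0.90 are
all placeholders of mine; only ham_cutoff=0.20 echoes a number that has been
mentioned on this list.

```python
def ready_for_self_training(nham, nspam, min_each=3):
    """Only let the filter train on its own decisions once it has seen both kinds."""
    return nham >= min_each and nspam >= min_each


def moderation_action(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a 0..1 score to what the mailing list does with the message."""
    if score < ham_cutoff:
        return "post"     # ham: goes out without moderator approval
    if score >= spam_cutoff:
        return "bounce"   # spam: returned to the sender with an explanation
    return "hold"         # unsure: held for the moderator


print(ready_for_self_training(nham=10, nspam=0))
print(moderation_action(0.05), moderation_action(0.50), moderation_action(0.99))
```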

For now my MUA is so badly integrated that I have yet to train a second 
time....

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From francois.granger@free.fr  Wed Nov 13 13:13:16 2002
From: francois.granger@free.fr (François Granger)
Date: Wed, 13 Nov 2002 14:13:16 +0100
Subject: [Spambayes] Corpus modules
In-Reply-To: <WQID93TQWRTSICROUQ5ZYWKGUT42873W.3dd1b53a@riven>
Message-ID: <B9F80E7C.5C779%francois.granger@free.fr>

on 13/11/02 3:13, Tim Stone - Four Stones Expressions at
tim@fourstonesExpressions.com wrote:

> Hammie has interesting PersistentBayes and DB_Dict classes, with some helper
> functions for bayes object creation.  It seems to me that a more cogent class
> hierarchy is called for, with Bayes being the abstract class, PersistentBayes
> being an abstract subclass, and subclasses of that for particular persistence
> mechanisms, like PickleBayes, ZODBBayes, DBDictBayes, etc. etc.

I was thinking of hacking the DB mechanism to split the load between two
databases (using anydbm) to reduce access to each one and to make them more
accessible from outside. The scoring module needs only the second one. The
training module would update both. I suspected that a major redesign was
underway. Here is the proposed split:
{'word': ['ltime',     # when this record was last modified
          'spamcount', # of spams in which this word appears
          'hamcount',  # of hams in which this word appears
         ]
}
{'word': ['atime',     # when this record was last used by scoring(*)
          'killcount', # of times this made it to spamprob()'s nbest
          'spamprob',  # prob(spam | msg contains this word)
          ]
}

A 'dirty' flag could be added to the first so that a batch update of the
second would recalculate only the dirty records.
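
A small sketch of the idea, using plain dicts in place of the two dbm files.
The class and method names are mine, and the spamprob formula is a
simplified ratio used as a stand-in, not the classifier's real computation:

```python
import time


class SplitStore:
    def __init__(self):
        self.counts = {}   # first db: word -> ltime, spamcount, hamcount, dirty
        self.probs = {}    # second db: word -> atime, killcount, spamprob
        self.nspam = 0
        self.nham = 0

    def train(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        now = time.time()
        for word in set(words):
            rec = self.counts.setdefault(
                word, {"ltime": now, "spamcount": 0, "hamcount": 0, "dirty": True})
            rec["spamcount" if is_spam else "hamcount"] += 1
            rec["ltime"] = now
            rec["dirty"] = True

    def batch_update(self):
        """Recompute spamprob for dirty words only; return how many changed."""
        updated = 0
        for word, rec in self.counts.items():
            if not rec["dirty"]:
                continue
            s = rec["spamcount"] / self.nspam if self.nspam else 0.0
            h = rec["hamcount"] / self.nham if self.nham else 0.0
            prob = self.probs.setdefault(
                word, {"atime": None, "killcount": 0, "spamprob": 0.5})
            prob["spamprob"] = s / (s + h) if (s + h) else 0.5
            rec["dirty"] = False
            updated += 1
        return updated
```

Note that the totals nspam and nham also enter the ratio, so a per-word flag
alone does not catch every record that has gone stale after training.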

-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From anthony@interlink.com.au  Wed Nov 13 13:24:57 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 14 Nov 2002 00:24:57 +1100
Subject: [Spambayes] A couple of small tokenizer experiments. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEFOCKAB.tim.one@comcast.net> 
Message-ID: <200211131324.gADDOvu16292@localhost.localdomain>


>>> Tim Peters wrote
> Well, regardless of reason, the best cost got worse, and it did on my c.l.py
> test too, but also by a trivial amount.  I fiddled the tokenization of this
> field until it did better again, so please make sure I didn't screw you too
> badly <wink>.

Seems fine.

In this case, the trivial amount worse was kinda necessary (imho) to 
allow us to get a whole lot of other cheap wins.

> Something that helped:  it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
>         'from:no real name:2**0' 0.933186

I'll give this a go, see how it helps me.

> BTW, you should make sure the seeds aren't close together.  For example,
> using seed 123 one time, and 124 the next, will give a lot of msg overlap.

I think I tend to use 12345 and 23456 - should be far enough apart.

> > toemail:python.org and toemail:zope.org both show up in
> > my 'best discriminators' list as _very_ strong ham clues
> > (not surprising, given the mailing lists I'm on).
> 
> Well, that's also going to make the spam that slips thru that much harder to
> catch.  Of course, after Greg deploys this system, there won't be any more
> spam slipping thru <wink>.

That's the theory, yes. Of course, if Greg doesn't deploy this, then all
the sophisticated new techniques that spammers will be forced to try will
leave poor old spamassassin terribly confused, and the amount of spam 
getting through it will fix the solution for us :)

Anthony

From anthony@interlink.com.au  Wed Nov 13 13:26:52 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 14 Nov 2002 00:26:52 +1100
Subject: [Spambayes] Re: Outlook plugin plus Exchange 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFACKAB.tim.one@comcast.net> 
Message-ID: <200211131326.gADDQqQ16335@localhost.localdomain>


>>> Tim Peters wrote
> You don't want to know -- it's an Outlook thing.  Outlook doesn't speak MIME
> natively. [snip tale of horror]  Then it starts to get messy <wink>.

And you run this mailer why, again? For fun? Or some sort of masochistic
desire to see email messages mangled beyond all belief?

if-I-want-my-email-mangled-I'll-run-Lotus-Notes-thanks,
Anthony


From popiel@wolfskeep.com  Wed Nov 13 16:59:50 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 13 Nov 2002 08:59:50 -0800
Subject: [Spambayes] Corpus modules 
In-Reply-To: Message from François Granger
	<francois.granger@free.fr> <B9F80E7C.5C779%francois.granger@free.fr> 
References: <B9F80E7C.5C779%francois.granger@free.fr> 
Message-ID: <20021113165950.477C9F53E@cashew.wolfskeep.com>

In message:  <B9F80E7C.5C779%francois.granger@free.fr>
             <francois.granger@free.fr> writes:
>
>I was thinking of hacking the DB mechanism to split the load between two
>databases (using anydbm) to reduce access to each one and to make them more
>accessible from outside. The scoring module needs only the second one. The
>training module would update both. I suspected that a major redesign was
>underway. Here is the proposed split:
>{'word': ['ltime',     # when this record was last modified
>          'spamcount', # of spams in which this word appears
>          'hamcount',  # of hams in which this word appears
>         ]
>}
>{'word': ['atime',     # when this record was last used by scoring(*)
>          'killcount', # of times this made it to spamprob()'s nbest
>          'spamprob',  # prob(spam | msg contains this word)
>          ]
>}
>
>A 'dirty' flag could be added to the first so that a batch update of the
>second would recalculate only the dirty records.

I am in the process of doing a very similar split, although
I've (for my private stuff) made a few simplifications:

1. I don't keep track of modification and access times.
   Nothing references them, and I'm more in favor of the
   aging methods which keep the actual wordlists for
   messages around until the message as a whole is slated
   for untraining.

2. I don't keep track of killcounts.  Again, nothing
   references them, and I really don't care which clues
   are being used a lot.

Also, when a training (or untraining) event occurs, I
completely trash the second database.  This is warranted
in most cases, since the number of spam and/or ham has
changed, and thus (almost) all the spamprobs are invalidated.
This saves us from needing a dirty flag.

As I score messages, I fetch spamprobs from the second
database, and if they aren't there, I compute them based
on the first database.  (If the words aren't in the first
database either, then just use the unknown word probability
and don't bother storing in the second database.)

Initial tests show a 4% speed hit on large batch training
and testing.  On the other hand, it speeds up the 'score
one, train one' runs immensely.
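
In outline, the scheme looks like the sketch below.  It is not my actual
patch: dicts stand in for the two databases, the names are invented for
this note, and the spamprob formula is a simplified ratio rather than the
real computation.

```python
class LazyProbCache:
    """Counts live in the first store; spamprobs are cached in the second."""

    def __init__(self, unknown_word_prob=0.5):
        self.counts = {}   # first database: word -> [spamcount, hamcount]
        self.probs = {}    # second database: word -> cached spamprob
        self.nspam = 0
        self.nham = 0
        self.unknown_word_prob = unknown_word_prob

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            rec = self.counts.setdefault(word, [0, 0])
            rec[0 if is_spam else 1] += 1
        # The totals changed, so every cached spamprob is suspect: drop them all.
        self.probs.clear()

    def spamprob(self, word):
        if word in self.probs:
            return self.probs[word]
        rec = self.counts.get(word)
        if rec is None:
            # Unknown word: answer without storing anything.
            return self.unknown_word_prob
        spamratio = rec[0] / self.nspam if self.nspam else 0.0
        hamratio = rec[1] / self.nham if self.nham else 0.0
        prob = spamratio / (spamratio + hamratio)
        self.probs[word] = prob
        return prob
```

Clearing the cache is one cheap operation, and only the words that actually
show up in later messages ever get recomputed.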

I've got a few bugs yet, and it's rather intrusive...
which is why I haven't checked it in.

- Alex

From piersh@friskit.com  Wed Nov 13 17:50:47 2002
From: piersh@friskit.com (Piers Haken)
Date: Wed, 13 Nov 2002 09:50:47 -0800
Subject: [Spambayes] Corpus modules 
Message-ID: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com>

> -----Original Message-----
> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com]
>
> Also, when a training (or untraining) event occurs, I
> completely trash the second database.  This is warranted in
> most cases, since the number of spam and/or ham has changed,
> and thus (almost) all the spamprobs are invalidated. This
> saves us from needing a dirty flag.

Ouch, isn't this overly expensive for retraining a single message?

Piers.
From popiel@wolfskeep.com  Wed Nov 13 17:46:24 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 13 Nov 2002 09:46:24 -0800
Subject: [Spambayes] Corpus modules 
In-Reply-To: Message from "Piers Haken" <piersh@friskit.com> 
	<9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> 
References: <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com> 
Message-ID: <20021113174625.055ACF53E@cashew.wolfskeep.com>

In message:  <9891913C5BFE87429D71E37F08210CB9183A0D@zeus.sfhq.friskit.com>
             "Piers Haken" <piersh@friskit.com> writes:
>
>> -----Original Message-----
>> From: T. Alexander Popiel [mailto:popiel@wolfskeep.com]
>>
>> Also, when a training (or untraining) event occurs, I
>> completely trash the second database.  This is warranted in
>> most cases, since the number of spam and/or ham has changed,
>> and thus (almost) all the spamprobs are invalidated. This
>> saves us from needing a dirty flag.
>
>Ouch, isn't this overly expensive for retraining a single message?

No, not really.  That's the whole point; throwing away the entire
database is a lot cheaper than touching every record individually,
which is what update_probabilities does.  I then compute the
spamprobs on demand, instead of doing all of them regardless of
whether they're used.

If you don't throw away the old spamprobs in some form when you
(re)train a message, then you're getting invalid results from
the scoring mechanism.  The mechanism I outlined achieves
correctness in the face of dynamically changing training data
with less than a 5% speed penalty, worst case.

- Alex

From tim@fourstonesExpressions.com  Wed Nov 13 23:35:06 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 13 Nov 2002 17:35:06 -0600
Subject: [Spambayes] Bayes Training
Message-ID: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven>

It occurs to me that perhaps *outgoing* mail might be a source of ham 
training.  With the presence of the smtp proxy, we *could* train the database 
on mail that a user sends, presuming that mail that looks like mail that a 
person sends is unlikely to be spam...

- Tim
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Wed Nov 13 23:50:59 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 13 Nov 2002 15:50:59 -0800
Subject: [Spambayes] Bayes Training 
In-Reply-To: Message from Tim Stone - Four Stones Expressions
	<tim@fourstonesExpressions.com> 
	<952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> 
References: <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven> 
Message-ID: <20021113235059.C4B55F53E@cashew.wolfskeep.com>

In message:  <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven>
             <tim@fourstonesExpressions.com> writes:
>It occurs to me that perhaps *outgoing* mail might be a source of ham 
>training.  With the presence of the smtp proxy, we *could* train the database 
>on mail that a user sends, presuming that mail that looks like mail that a 
>person sends is unlikely to be spam...

Not so good, if we're parsing From addresses... one common spammer
tactic is to make the mail appear to be coming from yourself.
Training on a lot of data coming from the user would eliminate
that as a spam clue...

In any case, given the ham:spam ratios recently bandied about,
I don't think there's really a problem finding sufficient ham
from other sources. ;-)

- Alex

From tim@fourstonesExpressions.com  Thu Nov 14 00:06:33 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 13 Nov 2002 18:06:33 -0600
Subject: [Spambayes] Bayes Training 
In-Reply-To: <20021113235059.C4B55F53E@cashew.wolfskeep.com>
Message-ID: <95YLG95KGXVLKI1TC0EDFBS4XVSJH.3dd2e909@riven>

11/13/2002 5:50:59 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <952VONFAZVMLSQCBSMHCB97C995ROUP.3dd2e1aa@riven>
>             <tim@fourstonesExpressions.com> writes:
>>It occurs to me that perhaps *outgoing* mail might be a source of ham 
>>training.  With the presence of the smtp proxy, we *could* train the database
>>on mail that a user sends, presuming that mail that looks like mail that a 
>>person sends is unlikely to be spam...
>
>Not so good, if we're parsing From addresses... one common spammer
>tactic is to make the mail appear to be coming from yourself.
>Training on a lot of data coming from the user would eliminate
>that as a spam clue...
>

Yeah, parsing on from: would be a problem, but the smtpproxy could easily 
strip the from header out, or all the headers for that matter, before sending 
it for training.  It seems very likely to me that the words I use in my mail 
are those that I would tend to want my database to weigh in the favor of 
ham...
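
A rough sketch of what stripping headers before training could look
like, using only the standard email package. The function name and the
"keep" parameter are made up for this example, and nothing here is
taken from an existing smtp proxy; it simply drops every header (or
keeps a chosen few) and hands back the body text.

```python
from email import message_from_string


def strip_headers(raw_message, keep=()):
    """Return the message text with all headers dropped except those in keep."""
    msg = message_from_string(raw_message)
    kept = [f"{name}: {msg[name]}" for name in keep if name in msg]
    # walk() visits every part; containers are skipped, leaf payloads kept.
    parts = [part.get_payload() for part in msg.walk()
             if not part.is_multipart()]
    return "\n".join(kept + parts)
```

Called with no "keep" argument this discards From along with everything
else, so a forged From address matching the user's own would not be
trained as ham.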

>In any case, given the ham:spam ratios recently bandied about,
>I don't think there's really a problem finding sufficient ham
>from other sources. ;-)

I'm not completely convinced that the ham:spam ratios we're discussing are 
reflective of the average email user.  I think people commonly experience 1:15 
or 1:20 ratios... perhaps even more... we've been discussing much lower ratios 
if I recall correctly...

- TimS
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Thu Nov 14 01:23:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 13 Nov 2002 20:23:27 -0500
Subject: [Spambayes] Outlook users should update
Message-ID: <LNBBLJKPBEHFEDALKOLCKEMACKAB.tim.one@comcast.net>

I just checked in what should fix the last of a few related training bugs in
the Outlook client.  Incremental training (Train Now without selecting
"Rebuild entire database") is much faster (msgs that have already been
trained on are skipped over at light speed now).  The default options have
changed to exploit Anthony Baxter's new tokenization code for From, To, Cc,
Sender, and Reply-to headers (Bad Idea if you're using mixed-source data,
but, I think, if you're using Outlook you've got single-source data pretty
much by definition).

You don't have to retrain your database from scratch, but I recommend that
you do.


From anthony@interlink.com.au  Thu Nov 14 02:25:16 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 14 Nov 2002 13:25:16 +1100
Subject: [Spambayes] A couple of small tokenizer experiments. 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEFOCKAB.tim.one@comcast.net> 
Message-ID: <200211140225.gAE2PGk25367@localhost.localdomain>


>>> Tim Peters wrote
> Something that helped:  it now generates log-count "no real name" metatokens
> too for address headers without real-name parts.
> 
>         'from:no real name:2**0' 0.933186

I saw
        'from:no real name:2**0' 0.683287
        'reply-to:no real name:2**0' 0.873138

in the horror corpus.

> Yup, it's a small win.  I can't use it my c.l.py test, but should be able to
> on the general python.org corpus (plus, of course, my own email).

On the nasty corpus,

filename:  shout_from     
                   shout_fromccetc
ham:spam:  5000:2500      
                   5000:2500
fp total:       10       7
fp %:         0.20    0.14
fn total:        5       5
fn %:         0.20    0.20
unsure t:      297     257
unsure %:     3.96    3.43
real cost: $164.40 $126.40
best cost:  $99.80  $76.60
h mean:       4.12    3.53
h sdev:      12.63   11.53
s mean:      99.49   99.47
s sdev:       5.33    5.46
mean diff:   95.37   95.94
k:            5.31    5.65

Goes from:
-> best cost for all runs: $99.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at ham & spam cutoffs 0.71 & 0.99
->     fp 7; fn 15; unsure ham 37; unsure spam 37
->     fp rate 0.14%; fn rate 0.6%; unsure rate 0.987%

to:
-> best cost for all runs: $76.60
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.69 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
-> largest ham & spam cutoffs 0.7 & 0.99
->     fp 5; fn 14; unsure ham 29; unsure spam 34
->     fp rate 0.1%; fn rate 0.56%; unsure rate 0.84%
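
The "best cost" figures follow directly from the per-error costs listed
above. As a quick check, here is the arithmetic written out; the
function name is invented for this example, and the numbers are the
ones in the two reports.

```python
def total_cost(fp, fn, unsure, fp_cost=10.00, fn_cost=1.00, unsure_cost=0.20):
    """Weighted cost of a run: each fp, fn and unsure has a fixed price."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost


# Unsure counts are unsure ham plus unsure spam from each report.
before = total_cost(fp=7, fn=15, unsure=37 + 37)
after = total_cost(fp=5, fn=14, unsure=29 + 34)
print(f"${before:.2f} -> ${after:.2f}")
```

The same formula reproduces the "real cost" row of the table: 10 fp,
5 fn and 297 unsure give $164.40, and 7 fp, 5 fn and 257 unsure give
$126.40.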


From IMarvinTPA@bigfoot.com  Thu Nov 14 02:46:58 2002
From: IMarvinTPA@bigfoot.com (IMarvinTPA)
Date: Wed, 13 Nov 2002 21:46:58 -0500
Subject: [Spambayes] Spam DB
Message-ID: <001c01c28b88$19c2b8c0$767ba8c0@Destruction>

Hi,
I was looking over the project and read that you haven't found many spam
databases.  I have nearly 2100 messages in my spam folder in Outlook
Express.  (I know I can't use OE with SpamBayes.)  If you have any interest
in it, I can provide it to you.  These messages are from 10/99 to today.
The general rule I used to put mail into this folder (and I review it) is my
e-mail address not being in either To or CC.

Thanks,
Andy Bay
aka IMarvinTPA
ICQ:1432002
My Homepage:  http://imarvintpa.selfhost.com/
INTP http://www.keirsey.com/
Your personality Andy Bay, is comprised of Evokateur, Sage, and Evokateur
styles. (http://www.ansir.com/)
"Wedgies rush in where angels fear to smurf."


From stuart@bmsi.com  Thu Nov 14 02:30:56 2002
From: stuart@bmsi.com (Stuart D. Gathman)
Date: Wed, 13 Nov 2002 21:30:56 -0500 (EST)
Subject: [Spambayes] Milter wrinkles
Message-ID: <Pine.LNX.4.44.0211132120500.2521-100000@gathman.bmsi.com>

I am looking for ways to integrate bayesian filtering of some kind with 
the Python Milter:  http://www.bmsi.com/python/milter.html

First, there is the difficulty of statistics being preferably user
specific.  Is this a show stopper for this kind of filtering at the milter
level?  How could the system get feedback from the users?  Is this simply
an inappropriate thing to do at this level?

Second, a milter would like to hang up on spammers as soon as possible.  
This is why a blacklist of spam domains is valuable - although it only 
stops a small percentage, they are stopped immediately before many 
resources are used.

I had the thought that the bayesian analysis could be applied to the 
headers only.  Then, email with very spammy headers could be rejected 
without bothering with the body.  I'll have to experiment with how 
effective this is.
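
As a starting point for that experiment, something like the following
might do. It is only a sketch: the function names, the 0.99 cutoff and
the idea of passing the scorer in as a callable are all assumptions for
illustration, and no milter API is used here. It scores the header
block alone and says whether the message could be rejected before the
body is examined.

```python
from email import message_from_string


def header_text(raw_message):
    """Return just the header block, one 'Name: value' per line."""
    msg = message_from_string(raw_message)
    return "\n".join(f"{name}: {value}" for name, value in msg.items())


def early_verdict(raw_message, score_headers, reject_cutoff=0.99):
    """Reject on very spammy headers; otherwise go on to look at the body."""
    score = score_headers(header_text(raw_message))
    return "reject" if score >= reject_cutoff else "continue"
```

Here score_headers stands in for whatever classifier is trained on
headers only; how effective such a classifier is remains the open
question.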

-- 
	      Stuart D. Gathman <stuart@bmsi.com>
Business Management Systems Inc.  Phone: 703 591-0911 Fax: 703 591-6154
"Confutatis maledictis, flamis acribus addictis" - background song for
a Microsoft sponsored "Where do you want to go from here?" commercial.


From anthony@interlink.com.au  Thu Nov 14 03:19:56 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Thu, 14 Nov 2002 14:19:56 +1100
Subject: [Spambayes] Milter wrinkles 
In-Reply-To: <Pine.LNX.4.44.0211132120500.2521-100000@gathman.bmsi.com> 
Message-ID: <200211140319.gAE3JvF25731@localhost.localdomain>


>>> "Stuart D. Gathman" wrote
> First, there is the difficulty of statistics being preferably user
> specific.  Is this a show stopper for this kind of filtering at the milter
> level?  How could the system get feedback from the users?  Is this simply
> an inappropriate thing to do at this level?

It depends on how closely coupled your users' interests are. You will
need to do ham training on representative emails from all users - otherwise
you could end up, say, with one of your users being interested in playing
the bass guitar, and suddenly all of your users would be getting spam from
those steenking bass guitar manufacturers. I guess it would be possible to
have separate databases for each user, and use that when you get the
RCPT TO: header. 

> Second, a milter would like to hang up on spammers as soon as possible.  
> This is why a blacklist of spam domains is valuable - although it only 
> stops a small percentage, they are stopped immediately before many 
> resources are used.

No-one's really done that much work on this yet. I think GregW has the
python.org mailer set up so it grabs the entire message, checks it with
spamassassin, and if it's completely spammy, it produces an error and
drops the message. Greg can probably fill in more details here.

> I had the thought that the bayesian analysis could be applied to the 
> headers only.  Then, email with very spammy headers could be rejected 
> without bothering with the body.  I'll have to experiment with how 
> effective this is.

There's been some work in this area, but not an enormous amount. If you're
letting them get past the DATA SMTP command, you may as well pull down the
entire message rather than just the headers.

Anthony


From richie@entrian.com  Thu Nov 14 09:45:59 2002
From: richie@entrian.com (richie@entrian.com)
Date: Thu, 14 Nov 2002 09:45:59 +0000
Subject: [Spambayes] Spam DB
In-Reply-To: <001c01c28b88$19c2b8c0$767ba8c0@Destruction>
Message-ID: <E18CGZT-0002Sf-0U@anchor-post-35.mail.demon.net>

Hi Andy,

> (I know I can't use OE with SpamBayes.)

You can't use the Outlook pieces, but you can use the POP3 proxy.  You
configure OE to collect mail from localhost rather than your POP3 server,
configure the proxy to connect to your POP3 server (see the [pop3proxy]
section of Options.py for what to put into your bayescustomize.ini) and
it then adds an X-Hammie-Disposition: Yes|No|Unsure header to each email.
OE can then be set up to filter on that header.
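
For mail clients that can't filter on arbitrary headers, the same
decision can be made by a small script. This is only a sketch: the
folder names are invented, and it assumes the header value starts with
Yes, No or Unsure (anything after a semicolon is ignored).

```python
from email import message_from_string


def folder_for(raw_message, header="X-Hammie-Disposition"):
    """Pick a destination folder from the proxy's disposition header."""
    msg = message_from_string(raw_message)
    value = msg.get(header) or ""
    disposition = value.split(";")[0].strip().lower()
    # Hypothetical folder names; unmarked or "No" mail stays in the inbox.
    return {"yes": "Spam", "unsure": "Unsure"}.get(disposition, "Inbox")
```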

You can train it via a web interface (http://localhost:8880/ by default)
or on the command line using hammie.  The web interface only allows you
to train it one message at a time at the moment, so you're probably best
using hammie.

Sorry there isn't any better documentation, but it's still early days!

-- 
Richie Hindle
richie@entrian.com


From Alexander@Leidinger.net  Thu Nov 14 11:25:11 2002
From: Alexander@Leidinger.net (Alexander Leidinger)
Date: Thu, 14 Nov 2002 12:25:11 +0100
Subject: [Spambayes] Milter wrinkles
In-Reply-To: <200211140319.gAE3JvF25731@localhost.localdomain>
References: <Pine.LNX.4.44.0211132120500.2521-100000@gathman.bmsi.com>
	<200211140319.gAE3JvF25731@localhost.localdomain>
Message-ID: <20021114122511.5b5e3f15.Alexander@Leidinger.net>

On Thu, 14 Nov 2002 14:19:56 +1100
Anthony Baxter <anthony@interlink.com.au> wrote:

> 
> >>> "Stuart D. Gathman" wrote
> > First, there is the difficulty of statistics being preferably user
> > specific.  Is this a show stopper for this kind of filtering at the milter
> > level?  How could the system get feedback from the users?  Is this simply
> > an inappropriate thing to do at this level?
> 
> It depends on how closely coupled your users' interests are. You will
> need to do ham training on representative emails from all users - otherwise
> you could end up, say, with one of your users being interested in playing
> the bass guitar, and suddenly all of your users would be getting spam from
> those steenking bass guitar manufacturers. I guess it would be possible to
> have separate databases for each user, and use that when you get the
> RCPT TO: header. 

While we are at it:
 - global database (may not exist for a given setup)
 - per domain database (--"--)
 - per user database (--"--)
 - per "+-feature" database (--"--)... but I don't really think this item
   would be useful

So root could add some default spams (e.g. no porn spam please), and
also add some per domain defaults (no asian mails for one domain, but
not for the other).

With these features a per user database may not be needed in some cases.
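
To make the layering concrete, here is one way the lookup order could
be worked out from the RCPT TO address. It is a sketch under
assumptions: the function name, the database naming scheme and the
literal key "global" are all invented for the example, and any level
may be missing for a given setup.

```python
def database_chain(recipient, available):
    """List the databases to consult for a recipient, most specific first."""
    local, _, domain = recipient.lower().partition("@")
    user, _, detail = local.partition("+")
    candidates = []
    if detail:
        candidates.append(f"{user}+{detail}@{domain}")   # per "+-feature"
    candidates += [f"{user}@{domain}", domain, "global"]  # user, domain, global
    # Skip any level that does not exist in this installation.
    return [name for name in candidates if name in available]
```

How the scores from the different levels would be combined is a
separate question that this sketch does not address.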

Bye,
Alexander.

-- 
                           Reboot America.

http://www.Leidinger.net                       Alexander @ Leidinger.net
  GPG fingerprint = C518 BC70 E67F 143F BE91  3365 79E2 9C60 B006 3FE7

From fgranger@teleprosoft.com  Thu Nov 14 09:47:06 2002
From: fgranger@teleprosoft.com (François Granger)
Date: Thu, 14 Nov 2002 10:47:06 +0100
Subject: [Spambayes] Mail with problem
Message-ID: <B9F92FAA.5C8B7%fgranger@teleprosoft.com>

The enclosed file contains a mail which, when received or trained through
pop3proxy, gives me the following error:

(MacOS 9.1, 24 MB memory for Python 2.2.1)

When receiving:

Loading database... Done.
BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel
<__main__.ServerLineReader connected at 0x6c9f9f0>
(exceptions.RuntimeError:maximum recursion limit exceeded [HD:Python
2.2.1:Lib:asyncore.py|poll|95] [HD:Python
2.2.1:Lib:asyncore.py|handle_read_event|392] [HD:Python
2.2.1:Lib:asynchat.py|handle_read|130]
[HD:Dev:spambayes:pop3proxy.py|found_terminator|192]
[HD:Dev:spambayes:pop3proxy.py|onServerLine|260]
[HD:Dev:spambayes:pop3proxy.py|onResponse|315]
[HD:Dev:spambayes:pop3proxy.py|onTransaction|413]
[HD:Dev:spambayes:pop3proxy.py|onRetr|460]
[HD:Dev:spambayes:classifier.py|chi2_spamprob|234]
[HD:Dev:spambayes:classifier.py|_getclues|459]
[HD:Dev:spambayes:sets.py|__init__|374]
[HD:Dev:spambayes:sets.py|_update|333]
[HD:Dev:spambayes:tokenizer.py|tokenize|1008]
[HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])


When training on it:

Loading database... Done.
BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel <__main__.UserInterface
connected at 0x6b93910> (exceptions.RuntimeError:maximum recursion limit
exceeded [HD:Python 2.2.1:Lib:asyncore.py|poll|95] [HD:Python
2.2.1:Lib:asyncore.py|handle_read_event|392] [HD:Python
2.2.1:Lib:asynchat.py|handle_read|112]
[HD:Dev:spambayes:pop3proxy.py|found_terminator|670]
[HD:Dev:spambayes:pop3proxy.py|onRequest|695]
[HD:Dev:spambayes:pop3proxy.py|onTrain|786]
[HD:Dev:spambayes:classifier.py|learn|296]
[HD:Dev:spambayes:classifier.py|_add_msg|411]
[HD:Dev:spambayes:sets.py|__init__|374]
[HD:Dev:spambayes:sets.py|_update|333]
[HD:Dev:spambayes:tokenizer.py|tokenize|1008]
[HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])


Salutations,
Francois Granger
-- 
fgranger@teleprosoft.com - <http://www.teleprosoft.com>
tel: +33 1 41 88 48 00 - Fax: + 33 1 41 88 48 48

-------------- next part --------------
[Binary attachment removed: zip archive containing art.txt]

From Paul.Moore@atosorigin.com  Thu Nov 14 15:51:49 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 14 Nov 2002 15:51:49 -0000
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>

Here's a first cut at filtering any new unread messages when Outlook
starts up (important for Exchange or IMAP users). It works, at least
for my limited testing, but there are a couple of points I think
need looking at. So far, I've tested it manually - the crunch comes
tomorrow morning, when I'll have received my overnight dose of spam
:-)

First, the msg.unread property doesn't seem to be set right - the
diagnostic output shows my code having filtered all of my inbox, not
just the unread messages. This, if true, is obviously a bug. (On the
other hand, it may be a bug in my code - I don't understand that stuff
too well...) I can't see a simple way of checking this.

Maybe I should only filter unread, unscored messages - the attached
code doesn't do this, as I would have to wait for new mail to arrive
if I were skipping scored messages. I'll add that once the basic thing
is working.

Second, an efficiency point - I'm going through the whole inbox via
GetMessageGenerator(). In my case, my inbox has 365 messages, with
only 4 unread. I was going to use the MAPI Find/FindNext methods,
but the msgstore code doesn't expose them. Would it be worth having
a method for just scanning unread messages (it could be used in the
filter dialog, too)?

Anyway, any comments would be appreciated.
Paul.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: addin.patch
Type: application/octet-stream
Size: 1832 bytes
Desc: addin.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021114/f19b3b41/addin.exe
From tim.one@comcast.net  Thu Nov 14 16:12:23 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 11:12:23 -0500
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEEDPDPAA.tim.one@comcast.net>

[Moore, Paul]
> Here's a first cut at filtering any new unread messages when Outlook
> starts up (important for Exchange or IMAP users).

Note that Mark Hammond already checked in something toward the same end.

> ...
> First, the msg.unread property doesn't seem to be set right - the
> diagnostic output shows my code having filtered all of my inbox, not
> just the unread messages. This, if true, is obviously a bug. (On the
> other hand, it may be a bug in my code - I don't understand that stuff
> too well...) I can't see a simple way of checking this.

Me neither.

> Maybe I should only filter unread, unscored messages - the attached
> code doesn't do this, as I would have to wait for new mail to arrive
> if I were skipping scored messages. I'll add that once the basic thing
> is working.
>
> Second, an efficiency point - I'm going through the whole inbox via
> GetMessageGenerator(). In my case, my inbox has 365 messages, with
> only 4 unread.

Mark added code to display time consumed by the startup scan.  On my lean
Work mailbox, it's like so:

Processing 0 missed spam in folder 'Inbox' took 1.86141ms
Processing 0 missed spam in folder 'Zope' took 6.21308ms
Processing 0 missed spam in folder 'Bayes' took 1.91225ms
Processing 0 missed spam in folder 'Checkins' took 0.590019ms

The Zope folder there had over 400 pending msgs, and 6ms to note >400
previously scored msgs is nothing.

> I was going to use the MAPI Find/FindNext methods, but the msgstore code
> doesn't expose them. Would it be worth having a method for just scanning
> unread messages (it could be used in the filter dialog, too)?

See the new GetNewUnscoredMessageGenerator() (in msgstore.py).  Mark uses a
method now that sucks out 70 unread and unscored msgs per MAPI call.
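
The general shape of that approach, leaving MAPI out of it entirely,
is a filtered generator that hands back messages in fixed-size batches.
The sketch below is not the msgstore.py code: the function name and the
dict-based message records are stand-ins invented for the example, and
only the batch size of 70 comes from the description above.

```python
from itertools import islice


def new_unscored(messages, batch_size=70):
    """Yield lists of up to batch_size messages that are unread and unscored."""
    pending = (m for m in messages
               if m["unread"] and m.get("score") is None)
    while True:
        batch = list(islice(pending, batch_size))
        if not batch:
            return
        yield batch
```

In the real add-in the filtering happens inside the message store, so
messages that are already read or scored never have to be fetched at
all; this version only illustrates the batching.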


From Paul.Moore@atosorigin.com  Thu Nov 14 16:31:54 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 14 Nov 2002 16:31:54 -0000
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DCD@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
> [Moore, Paul]
>> Here's a first cut at filtering any new unread messages when Outlook
>> starts up (important for Exchange or IMAP users).
>
> Note that Mark Hammond already checked in something toward the same end.

Drat. Must have been since last night, since I did a CVS checkout then.
(checks) yes it was.

Oh, well, it was a learning exercise.

Paul.

From sjoerd@acm.org  Thu Nov 14 16:37:06 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Thu, 14 Nov 2002 17:37:06 +0100
Subject: [Spambayes] updating email package
Message-ID: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>

Does anybody mind if I update the email package with the current
version from the Python CVS?

I noticed that I received emails that can't be properly parsed by the
version that's in spambayes (it raises an exception, resulting in the
fallback behavior) but that can be parsed by the version in the Python
CVS.

-- Sjoerd Mullender <sjoerd@acm.org>

From tim.one@comcast.net  Thu Nov 14 16:40:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 11:40:42 -0500
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB885E2DCD@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>

>> Note that Mark Hammond already checked in something toward the same end.

[Moore, Paul]
> Drat. Must have been since last night, since I did a CVS checkout then.
> (checks) yes it was.

Well, Mark lives in Australia (if you can call that living <wink>), so "last
night" means something perverse to him.

Anyone developing code for this project should subscribe to the checkins
mailing list too:

    http://mail.python.org/mailman/listinfo/spambayes-checkins

Then you'll be the first on your continent to get breaking news.


From tim.one@comcast.net  Thu Nov 14 16:47:35 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 11:47:35 -0500
Subject: [Spambayes] updating email package
In-Reply-To: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
Message-ID: <BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net>

[Sjoerd Mullender]
> Does anybody mind if I update the email package with the current
> version from the Python CVS?

Yes, that's a big -1.  Barry intends to delete the email pkg from this
project.  It doesn't belong here.  People working w/ current CVS will then
get the latest without effort.  People using older versions of Python will
need to hammer out a scheme for installing the standalone email pkg, at

    http://mimelib.sf.net/

> I noticed that I received emails that can't be properly parsed by the
> version that's in spambayes (it raises an exception, resulting in the
> fallback behavior) but that can be parsed by the version in the Python
> CVS.

I believe it <wink>.


From skip@pobox.com  Thu Nov 14 16:49:28 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 14 Nov 2002 10:49:28 -0600
Subject: [Spambayes] read-only DBDict in hammie?
Message-ID: <15827.54296.548473.905264@montanaro.dyndns.org>

I'd like to share the anydbm file between several accounts on my machine.
Before I fiddle hammie.py so it opens the file in read-only mode, is there
any reason when classifying (not training) it actually needs to update the
file?  There's a __del__ method in PersistentBayes which does this:

    def __del__(self):
        #super.__del__(self)
        self.save_state()

    def save_state(self):
        self.wordinfo[self.statekey] = (self.nham, self.nspam)

When classifying there's no reason that nham or nspam would change, right?
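
One possible shape for a read-only mode, sketched with a plain dict
standing in for the anydbm file. The class name, the "readonly" flag
and the explicit close() method are assumptions for this example, not
the existing PersistentBayes or DBDict code; the idea is just that a
read-only instance never writes the saved state back.

```python
class PersistentCounts:
    """Hypothetical stand-in for a classifier backed by a dict-like store."""

    STATEKEY = "saved state"

    def __init__(self, store, readonly=False):
        self.store = store
        self.readonly = readonly
        self.nham, self.nspam = store.get(self.STATEKEY, (0, 0))

    def learn(self, is_spam):
        if self.readonly:
            raise RuntimeError("store opened read-only; training is disabled")
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def save_state(self):
        # Classifying never changes nham or nspam, so a read-only
        # instance has nothing to write back.
        if self.readonly:
            return
        self.store[self.STATEKEY] = (self.nham, self.nspam)

    def close(self):
        self.save_state()
```

An explicit close() is used here instead of __del__ so that the write
(or the lack of one) happens at a predictable point.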

Skip


From jeremy@alum.mit.edu  Thu Nov 14 16:56:59 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Thu, 14 Nov 2002 11:56:59 -0500
Subject: [Spambayes] updating email package
In-Reply-To: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
Message-ID: <15827.54747.182021.325839@slothrop.zope.com>

It would be *much* better to remove the email package from
spambayes and let users install a copy of whatever version they want
in their python site-packages.  I am using CVS python to run my spam
filter, and I have to manually delete the email package every time I
do a CVS update.  

(Well, technically, I deleted it once and replaced it with an empty
directory named email.  So I get a bunch of complaints every time I do
a CVS update about the directory being in the way.)

I can't think of any good reason to include an email package with
spambayes.  It only leads to complexity for people who have a version
of email installed on their pythonpath.

Jeremy


From barry@python.org  Thu Nov 14 16:59:19 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 14 Nov 2002 11:59:19 -0500
Subject: [Spambayes] updating email package
References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
Message-ID: <15827.54887.396770.384358@gargle.gargle.HOWL>


>>>>> "SM" == Sjoerd Mullender <sjoerd@acm.org> writes:

    SM> Does anybody mind if I update the email package with the
    SM> current version from the Python CVS?

You might want the version from mimelib.sf.net, which is at an
officially stable release of 2.4.3.  I'm working on a 2.5 release,
which while it passes all the tests, still needs some tweaking before
it's ready to go out.

-Barry

From barry@python.org  Thu Nov 14 17:07:26 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 14 Nov 2002 12:07:26 -0500
Subject: [Spambayes] updating email package
References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
	<BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net>
Message-ID: <15827.55374.656571.993448@gargle.gargle.HOWL>


>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:

    TP> Yes, that's a big -1.  Barry intends to delete the email pkg
    TP> from this project.

I'm deleting it now.
-Barry

From tim.one@comcast.net  Thu Nov 14 17:16:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 12:16:47 -0500
Subject: [Spambayes] read-only DBDict in hammie?
In-Reply-To: <15827.54296.548473.905264@montanaro.dyndns.org>
Message-ID: <BIEJKCLHCIOIHAGOKOLHIEEIDPAA.tim.one@comcast.net>

[Skip Montanaro]
> I'd like to share the anydbm file between several accounts on my machine.
> Before I fiddle hammie.py so it opens the file in read-only mode, is there
> any reason when classifying (not training) it actually needs to update the
> file?

Scoring currently updates .killcount and .atime members in WordInfo records.
If you're not using them for anything, you don't care.

> ...
> When classifying there's no reason that nham or nspam would change, right?

Correct.


From tim.one@comcast.net  Thu Nov 14 17:25:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 12:25:53 -0500
Subject: [Spambayes] Milter wrinkles
In-Reply-To: <200211140319.gAE3JvF25731@localhost.localdomain>
Message-ID: <BIEJKCLHCIOIHAGOKOLHGEEJDPAA.tim.one@comcast.net>

[Anthony Baxter]
> ...
> No-one's really done that much work on this yet. I think GregW has the
> python.org mailer set up so it grabs the entire message, checks it with
> spamassassin, and if it's completely spammy, it produces an error and
> drops the message. Greg can probably fill in more details here.

He's probably tired of that by now <wink>.  python.org does lots of stuff,
including rejecting msgs based on some smoking-gun header scanning.  In
particular, email is rejected at once if the headers indicate it uses a
character set that's out of favor, or passed through a country that's out of
favor.  If it was email to a tech mailing list, he probably could (but
doesn't) reject email just for having multipart/* or text/html type.


From tim@fourstonesExpressions.com  Thu Nov 14 17:46:57 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 14 Nov 2002 11:46:57 -0600
Subject: [Spambayes] read-only DBDict in hammie?
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHIEEIDPAA.tim.one@comcast.net>
Message-ID: <IHLF72MJQLOMMH76B6BJ8GEQGDKF.3dd3e191@riven>

11/14/2002 11:16:47 AM, Tim Peters <tim.one@comcast.net> wrote:

>[Skip Montanaro]
>> I'd like to share the anydbm file between several accounts on my machine.
>> Before I fiddle hammie.py so it opens the file in read-only mode, is there
>> any reason when classifying (not training) it actually needs to update the
>> file?

I'm using the DBDict class in hammie for doing training with the pop3proxy.  
Can we make a read-only option, rather than making it always open for read?

On a related note, should DBDict actually have its own module, rather than be 
part of hammie?

- TimS
>
>Scoring currently updates .killcount and .atime members in WordInfo records.
>If you're not using them for anything, you don't care.
>
>> ...
>> When classifying there's no reason that nham or nspam would change, right?
>
>Correct.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From grobinson@transpose.com  Thu Nov 14 18:50:39 2002
From: grobinson@transpose.com (Gary Robinson)
Date: Thu, 14 Nov 2002 13:50:39 -0500
Subject: [Spambayes] I thought this was an interesting spam article
Message-ID: <B9F95AAF.18FA1%grobinson@transpose.com>


http://maccentral.macworld.com/news/0211/14.spam.php

--Gary


-- 
Gary Robinson
CEO
Transpose, LLC
grobinson@transpose.com
207-942-3463
http://www.emergentmusic.com
http://radio.weblogs.com/0101454


From tim@fourstonesExpressions.com  Thu Nov 14 19:08:00 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 14 Nov 2002 13:08:00 -0600
Subject: Fwd: Re: [Spambayes] I thought this was an interesting spam article
Message-ID: <52NK72MIYVNIIFID3XGD4YJFWUUOF0OK.3dd3f490@riven>

The article originated at Computerworld, and their publication of the article 
at 
http://computerworld.com/softwaretopics/software/groupware/story/0,10801,75737
,00.html has several very interesting sidebars, particularly the one named 
"The Other Side"

-TimS

11/14/2002 12:50:39 PM, Gary Robinson <grobinson@transpose.com> wrote:

>
>http://maccentral.macworld.com/news/0211/14.spam.php
>
>--Gary
>
>
>-- 
>Gary Robinson
>CEO
>Transpose, LLC
>grobinson@transpose.com
>207-942-3463
>http://www.emergentmusic.com
>http://radio.weblogs.com/0101454
>
>
>
- Tim
www.fourstonesExpressions.com 

-------- End of forwarded message --------
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Thu Nov 14 19:24:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 14:24:58 -0500
Subject: [Spambayes] Mail with problem
In-Reply-To: <B9F92FAA.5C8B7%fgranger@teleprosoft.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHOEFMDPAA.tim.one@comcast.net>

[Francois Granger]
> The enclosed file contains a mail which, when received or trained through
> pop3proxy, gives me the following error:
>
> (MacOS 9.1 24 Mo memory for Python 2.2.1)
> ...
> [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])

Looks like the regular expression engine runs out of (C) stack space while
trying to find HTML tags to strip.  I don't know enough about Macs to
suggest something specific, but in general you have to do whatever it takes
to convince the OS to give the program more stack space to work with.

Short of that, reducing the instances of "2048" in html_re in tokenizer.py
should make the problem go away, but since C stack space limits are
platform-specific, it's impossible to say how small "is safe" for you
without simply trying it over and over until the error goes away.
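
A minimal sketch of the tuning described above, assuming a much simplified
tag pattern (the real html_re in tokenizer.py is more involved, and
make_tag_re, strip_tags and limit are hypothetical names). The bound on the
minimal match is a parameter, so it can be lowered step by step until the
error goes away:

```python
import re

def make_tag_re(limit=2048):
    # Hypothetical stand-in for html_re: '<', then at most `limit`
    # characters matched minimally ('?'), then the closing '>'.
    return re.compile(r"<.{0,%d}?>" % limit, re.DOTALL)

def strip_tags(text, limit=2048):
    # Replace each tag-like run with a single space.
    return make_tag_re(limit).sub(" ", text)

# Both tags are replaced by a space.
print(repr(strip_tags("<b>hi</b> there")))
# A "tag" longer than the bound is left alone rather than matched.
print(repr(strip_tags("<" + "x" * 50 + ">", limit=10)))
```

A smaller limit means fewer characters per minimal match, at the cost of
leaving very long tags in the text.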


From tim@fourstonesExpressions.com  Thu Nov 14 18:21:54 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 14 Nov 2002 12:21:54 -0600
Subject: [Spambayes] Mozilla.org using bayesian spam filtering
Message-ID: <OIVPVRUO83UPB8WV1YUQMKA7VPYSCPO.3dd3e9c2@riven>

Anybody know anything about this?  Doesn't look like our technology...

http://www.mozilla.org/mailnews/spam.html

- Tim
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Thu Nov 14 19:39:32 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 14 Nov 2002 11:39:32 -0800
Subject: [Spambayes] updating email package 
In-Reply-To: Message from barry@python.org (Barry A. Warsaw) 
	<15827.55374.656571.993448@gargle.gargle.HOWL> 
References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
	<BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net>
	<15827.55374.656571.993448@gargle.gargle.HOWL> 
Message-ID: <20021114193932.7E180F58A@cashew.wolfskeep.com>

In message:  <15827.55374.656571.993448@gargle.gargle.HOWL>
             barry@python.org (Barry A. Warsaw) writes:
>
>>>>>> "TP" == Tim Peters <tim.one@comcast.net> writes:
>
>    TP> Yes, that's a big -1.  Barry intends to delete the email pkg
>    TP> from this project.
>
>I'm deleting it now.

Would you mind putting some basic instructions about manual
installation of the email package on the spambayes website
for the python-package-management-impaired among us?

(I tried looking at the mimelib.sf.net website, but it doesn't
explain how to get the package into the search path...)

- Alex

From tim@fourstonesExpressions.com  Thu Nov 14 19:48:36 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 14 Nov 2002 13:48:36 -0600
Subject: [Spambayes] Mail with problem
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHOEFMDPAA.tim.one@comcast.net>
Message-ID: <LGDA09YU5YQKHFCOJQLPNHGHG2VB7B9.3dd3fe14@riven>

Depending on what kind of regex engine python has (NFA or DFA) and on how the 
html parsing regex is implemented relative to its engine, it can take an 
enormous amount of memory.  For example, with an NFA and a regex that uses 
alternation in certain ways, the stack can grow exponentially.

We may want to take a hard look at tokenizer's html parsing regex.  I looked 
at it briefly yesterday, but didn't pay much attention.

Tim, do you know if the python regex is NFA or DFA?  If it's NFA, is there a 
DFA engine we can plug in?

- TimS
11/14/2002 1:24:58 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Francois Granger]
>> The enclosed file contains a mail which, when received or trained through
>> pop3proxy, gives me the following error:
>>
>> (MacOS 9.1 24 Mo memory for Python 2.2.1)
>> ...
>> [HD:Dev:spambayes:tokenizer.py|tokenize_body|1254])
>
>Looks like the regular expression engine runs out of (C) stack space while
>trying to find HTML tags to strip.  I don't know enough about Macs to
>suggest something specific, but in general you have to do whatever it takes
>to convince the OS to give the program more stack space to work with.
>
>Short of that, reducing the instances of "2048" in html_re in tokenizer.py
>should make the problem go away, but since C stack space limits are
>platform-specific, it's impossible to say how small "is safe" for you
>without simply trying it over and over until the error goes away.
>
>
- Tim
www.fourstonesExpressions.com 


From barry@python.org  Thu Nov 14 19:58:56 2002
From: barry@python.org (Barry A. Warsaw)
Date: Thu, 14 Nov 2002 14:58:56 -0500
Subject: [Spambayes] updating email package 
References: <20021114163712.2E6BE74C3B@indus.ins.cwi.nl>
	<BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net>
	<15827.55374.656571.993448@gargle.gargle.HOWL>
	<20021114193932.7E180F58A@cashew.wolfskeep.com>
Message-ID: <15828.128.561764.174214@gargle.gargle.HOWL>


>>>>> "TAP" == T Alexander Popiel <popiel@wolfskeep.com> writes:

    TAP> Would you mind putting some basic instructions about manual
    TAP> installation of the email package on the spambayes website
    TAP> for the python-package-management-impaired among us?

I made the change to the developer.ht file but couldn't push out the
.html files, probably due to the i'm-in-too-many-sf-groups permissions
problem.  Can someone else regen and push them out?

    TAP> (I tried looking at the mimelib.sf.net website, but it
    TAP> doesn't explain how to get the package into the search
    TAP> path...)

You should be able to just unpack the tarball, and then follow the
directions in the README file.

-Barry

From tim.one@comcast.net  Thu Nov 14 20:01:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 15:01:50 -0500
Subject: [Spambayes] Mail with problem
In-Reply-To: <LGDA09YU5YQKHFCOJQLPNHGHG2VB7B9.3dd3fe14@riven>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEEGBDPAA.tim.one@comcast.net>

[Tim Stone]
> Depending on what kind of regex engine python has (NFA or DFA)
> and on how the html parsing regex is implemented relative to its
> engine, it can take an enormous amount of memory.

That isn't the problem here.  There are no "runaway" regexps.  The problem
is that minimal matching is implemented via recursive call, one level per
character matched.  That's a long-standing problem.  All minimal matches in
the tokenizer regexps are bounded, but it's impossible to guess an upper
limit that's "safe" across every half-a-brain C implementation in existence.

> For example, with an NFA and a regex that uses alternation in certain
> ways, the stack can grow exponentially.

Yes, but that's not the case here.

> We may want to take a hard look at tokenizer's html parsing
> regex.  I looked  at it briefly yesterday, but didn't pay much attention.
>
> Tim, do you know if the python regex is NFA or DFA?

NFA.

> If it's NFA, is there a DFA engine we can plug in?

No.  With great pain, the regexp in question could be rewritten to avoid
minimal matches.  I'd rather the OP convince his OS to let him use some of
the dozens of megabytes sitting idle on his machine <wink>.
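
As a rough illustration of what avoiding a minimal match can look like in the
simplest case (this is not the tokenizer's actual regexp; both patterns and
the bound of 64 are assumptions made for the sketch): a bounded minimal match
up to '>' can be written as a greedy match over a negated character class,
and the two find the same spans.

```python
import re

# Minimal-match form: shortest run of any characters up to the first '>'.
MINIMAL = re.compile(r"<.{0,64}?>", re.DOTALL)
# Greedy form without '?': a run of non-'>' characters, then '>'.
NEGATED = re.compile(r"<[^>]{0,64}>")

def spans(pattern, text):
    # Start/end offsets of every match, for comparing the two patterns.
    return [m.span() for m in pattern.finditer(text)]

sample = '<a href="x">link</a> and a <very\nlong tag> plus 1 < 2'
print(spans(MINIMAL, sample) == spans(NEGATED, sample))
```

Real HTML-stripping patterns have more cases than this, which is presumably
where the pain comes in.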


From tim.one@comcast.net  Thu Nov 14 20:33:47 2002
From: tim.one@comcast.net (Tim Peters)
Date: Thu, 14 Nov 2002 15:33:47 -0500
Subject: [Spambayes] Mail with problem
In-Reply-To: <2Z1V1W06C8B9FCUWQ62XYZYPNOKHE.3dd403e4@riven>
Message-ID: <BIEJKCLHCIOIHAGOKOLHOEGMDPAA.tim.one@comcast.net>

[Tim Stone]
> Point taken.  Makes me wonder, though, if we might not have a
> problem like this when this starts getting used by regular folks, like
> with the proxy...

The OP attached the email in question to his msg.  It tokenized fine on my
box at the time (Win2K), and if I don't hear about it causing problems on
other boxes either, then I'll assume it's Just Another Glitch specific to
Mac OS 9.

> I suppose the reason we're not using python's html parser is
> performance...?

Flyswatters versus dynamite mostly.  We're not doing anything with HTML
except throwing it away.  Half-assed regexps can do a fine job of this, are
very robust against ill-formed HTML too, against damaged email that intended
to call itself text/html but forgot to, etc.  If we need to do fancier
things with HTML, then a real parser becomes correspondingly more
attractive.
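
For contrast, a minimal sketch of the "real parser" route using the standard
library's html.parser module (the TextExtractor class and html_to_text
function are made up for this illustration; nothing like this is in the
tokenizer):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text outside tags.
        self.chunks.append(data)

def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)

print(html_to_text("<p>Hello <b>world</b></p>"))
```

This is more machinery than a regexp substitution, which only pays off once
the tag structure itself is being used for something.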


From skip@pobox.com  Thu Nov 14 22:04:29 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 14 Nov 2002 16:04:29 -0600
Subject: [Spambayes] read-only DBDict in hammie?
In-Reply-To: <IHLF72MJQLOMMH76B6BJ8GEQGDKF.3dd3e191@riven>
References: <BIEJKCLHCIOIHAGOKOLHIEEIDPAA.tim.one@comcast.net>
        <IHLF72MJQLOMMH76B6BJ8GEQGDKF.3dd3e191@riven>
Message-ID: <15828.7661.200538.997623@montanaro.dyndns.org>


    >>> I'd like to share the anydbm file between several accounts on my
    >>> machine.  Before I fiddle hammie.py so it opens the file in
    >>> read-only mode, is there any reason when classifying (not training)
    >>> it actually needs to update the file?

    Tim> I'm using the DBDict class in hammie for doing training with the
    Tim> pop3proxy.  Can we make a read-only option, rather than making it
    Tim> always open for read?

Yes, that was my intent.  When you run hammie with the -g or -s flags it's
opened for writing, but opened for reading otherwise.  There is a new mode
argument to the DBDict constructor.  I suppose I should have defaulted it to
'c'.
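
A minimal sketch of the idea, assuming a DBDict-like wrapper around the
standard dbm module (DBDictSketch is a hypothetical stand-in, not the class in
hammie.py, and the explicit read-only check is an addition for illustration):

```python
import dbm
import os
import pickle
import tempfile

class DBDictSketch:
    """Hypothetical stand-in for DBDict: pickled values in a dbm file."""

    def __init__(self, path, mode="c"):
        # mode follows the dbm flags: 'r' read-only, 'w' read-write,
        # 'c' read-write, creating the file if it is missing.
        self.mode = mode
        self.hash = dbm.open(path, mode)

    def __getitem__(self, key):
        return pickle.loads(self.hash[key.encode("utf-8")])

    def __setitem__(self, key, val):
        if self.mode == "r":
            raise TypeError("database was opened read-only")
        self.hash[key.encode("utf-8")] = pickle.dumps(val)

    def close(self):
        self.hash.close()

path = os.path.join(tempfile.mkdtemp(), "hammie_sketch")

db = DBDictSketch(path, "c")   # training: opened for writing
db["viagra"] = (17, 0)
db.close()

ro = DBDictSketch(path, "r")   # classifying: opened read-only
print(ro["viagra"])
ro.close()
```

Defaulting mode to 'c' keeps existing callers working while letting
classification-only callers ask for 'r'.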

Skip

From mhammond@skippinet.com.au  Thu Nov 14 22:20:25 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 15 Nov 2002 09:20:25 +1100
Subject: [Spambayes] Mozilla.org using bayesian spam filtering
In-Reply-To: <OIVPVRUO83UPB8WV1YUQMKA7VPYSCPO.3dd3e9c2@riven>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEGNHLAA.mhammond@skippinet.com.au>

> Anybody know anything about this?  Doesn't look like our technology...
>
> http://www.mozilla.org/mailnews/spam.html
>

It's not - but I bet quite a few of our worthless Aussie dollars that
someone with Mozilla and Python knowledge, starting now with PyXPCOM and
SpamBayes, would have a system much much better, and "finished" way before
theirs <0.1 wink>.

Still-vainly-wishing-mozilla-and-pyxpcom-take-over-the-world ly,

Mark.


From neale@woozle.org  Fri Nov 15 07:00:44 2002
From: neale@woozle.org (Neale Pickett)
Date: 14 Nov 2002 23:00:44 -0800
Subject: [Spambayes] Optimization to DBDict (Was: read-only DBDict in hammie?)
In-Reply-To: <IHLF72MJQLOMMH76B6BJ8GEQGDKF.3dd3e191@riven>
References: <IHLF72MJQLOMMH76B6BJ8GEQGDKF.3dd3e191@riven>
Message-ID: <w53smy39v1f.fsf@woozle.org>

I have sitting here on my hard drive some changes to DBDict which make
for much smaller databases by introducing an optimization for WordInfo
classes (getting rid of Administrative Pickle Bloat).  However, if I
submit this, everyone's hammie database will slowly be rewritten to the
new format, so I want to solicit feedback first.  Here are the two new
methods:

    def __getitem__(self, key):
        v = self.hash[key]
        if v[0] == 'W':
            val = pickle.loads(v[1:])
            # We could be sneaky, like pickle.Unpickler.load_inst,
            # but I think that's overly confusing.
            obj = classifier.WordInfo(0)
            obj.__setstate__(val)
            return obj
        else:
            return pickle.loads(v)

    def __setitem__(self, key, val):
        if isinstance(val, classifier.WordInfo):
            val = val.__getstate__()
            v = 'W' + pickle.dumps(val, 1)
        else:
            v = pickle.dumps(val, 1)
        self.hash[key] = v

Note that this makes the assumption that if a "W" pickle type is ever
added to Python's pickler, it won't be pickled in a DBDict.  Otherwise,
you're in for trouble.  If someone knows of a better way to do this,
please step forward before I submit it and hammie starts to rewrite
everyone's database.
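
For readers who want to try the idea in isolation, here is a self-contained
sketch of the same tagging scheme as two free functions (the WordInfo class
below is a hypothetical stand-in with made-up fields, not classifier.WordInfo,
and encode/decode are illustrative names). It uses pickle protocol 2, whose
output always starts with the byte 0x80, so the leading "W" tag cannot be
mistaken for the start of an ordinary pickle in this sketch:

```python
import pickle

class WordInfo:
    """Hypothetical stand-in for classifier.WordInfo."""

    def __init__(self, atime, spamcount=0, hamcount=0, killcount=0):
        self.atime = atime
        self.spamcount = spamcount
        self.hamcount = hamcount
        self.killcount = killcount

    def __getstate__(self):
        return (self.atime, self.spamcount, self.hamcount, self.killcount)

    def __setstate__(self, state):
        (self.atime, self.spamcount, self.hamcount, self.killcount) = state

def encode(val):
    # Store only the state tuple for WordInfo, tagged with a leading b"W",
    # so the class and module names are not repeated in every record.
    if isinstance(val, WordInfo):
        return b"W" + pickle.dumps(val.__getstate__(), 2)
    return pickle.dumps(val, 2)

def decode(raw):
    if raw[:1] == b"W":
        obj = WordInfo(0)
        obj.__setstate__(pickle.loads(raw[1:]))
        return obj
    return pickle.loads(raw)

restored = decode(encode(WordInfo(5, spamcount=3, hamcount=1)))
print(restored.__getstate__())
```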


So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> On a related note, should DBDict actually have its own module, rather
> than be part of hammie?

You know what's funny is after I wrote DBDict I discovered python's
shelve module, which does the same thing.  I should probably rewrite
DBDict to wrap the shelve class, but shelve is so minimal, maybe shelve
should be rewritten to incorporate the DBDict class <0.2 wink> (gah, I'm
doing the wink thing now).
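
For reference, a minimal sketch of what shelve gives you out of the box (the
file name and the stored values are made up for the example):

```python
import os
import shelve
import tempfile

path = os.path.join(tempfile.mkdtemp(), "shelf_sketch")

# shelve pickles values on assignment and unpickles them on lookup.
with shelve.open(path, flag="c") as shelf:
    shelf["nham"] = 120
    shelf["word:free"] = (3, 41)   # made-up (hamcount, spamcount) pair

# Reopen read-only and fetch a value back.
with shelve.open(path, flag="r") as shelf:
    counts = shelf["word:free"]

print(counts)
```

What shelve does not do is special-case particular value types, which is the
part the optimization above would add.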


From sjoerd@acm.org  Fri Nov 15 08:40:21 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Fri, 15 Nov 2002 09:40:21 +0100
Subject: [Spambayes] updating email package
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net> 
References: <BIEJKCLHCIOIHAGOKOLHAEEHDPAA.tim.one@comcast.net> 
Message-ID: <20021115084026.5635C74C3B@indus.ins.cwi.nl>

On Thu, Nov 14 2002 Tim Peters wrote:

> [Sjoerd Mullender]
> > Does anybody mind if I update the email package with the current
> > version from the Python CVS?
> 
> Yes, that's a big -1.  Barry intends to delete the email pkg from this
> project.

That's fine with me too.

-- Sjoerd Mullender <sjoerd@acm.org>

From Paul.Moore@atosorigin.com  Fri Nov 15 09:23:44 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Fri, 15 Nov 2002 09:23:44 -0000
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993B@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]

> [Moore, Paul]
>> Here's a first cut at filtering any new unread messages when Outlook
>> starts up (important for Exchange or IMAP users).
>
> Note that Mark Hammond already checked in something toward the same end.

I got Mark's changes last night (or in the morning, if you're
Australian...) Tried them today, but got some strange results. The
first time I started Outlook, nothing happened, but that's because
I'd lost my filter definitions (probably my fault, I think I know why
this happened). Then, when I fixed that and restarted, I got the trace
output saying that "0 missed messages" had been handled.

Looking at the code for GetNewUnscoredMessageGenerator, it seems to
scan the folder for messages with Unread = True and no "Spam" field.
There were 40 or so unread messages, so I can only assume that they
had a Spam field.

I don't know enough about MAPI to diagnose this much further - I do,
however, have the "Spam" field displayed in my Inbox - could that be
enough to cause the field to be automatically created? On the other
hand, manually filtering with the "unread only" and "only messages not
already filtered" checkboxes set on, *did* filter the messages. But
that uses a different method at the moment, checking message.unread
and message.GetField(mgr.config.field_score_name). So it would filter
a message with a defined, but empty, "Spam" property.

I think that GetNewUnscoredMessageGenerator needs to allow for
messages with an existing but empty "Spam" field. I *think* this means
you want to extend the restriction from

    AND
        PROPERTY = UNREAD (UNREAD, True)
        NOT EXISTS "Spam"

to

    AND
        PROPERTY = UNREAD (UNREAD, True)
        OR
            NOT EXISTS "Spam"
            PROPERTY = "Spam" ("Spam", PT_NULL)

(excuse weird pseudo-code description).
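
The intended logic, modelled in plain Python rather than as a MAPI
restriction (messages are represented here as ordinary dicts, and
needs_scoring is a hypothetical name; this says nothing about how the real
restriction has to be built):

```python
def needs_scoring(msg, field="Spam"):
    # Unread AND (field missing OR field present but empty).
    unread = msg.get("Unread", False)
    unscored = field not in msg or msg[field] is None
    return unread and unscored

inbox = [
    {"Subject": "a", "Unread": True},                # no Spam field
    {"Subject": "b", "Unread": True, "Spam": None},  # empty Spam field
    {"Subject": "c", "Unread": True, "Spam": 0.97},  # already scored
    {"Subject": "d", "Unread": False},               # already read
]
print([m["Subject"] for m in inbox if needs_scoring(m)])
```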

But I know Mark will be better able to make the appropriate fix :-)

Paul.

From francois.granger@free.fr  Fri Nov 15 09:53:36 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Fri, 15 Nov 2002 10:53:36 +0100
Subject: [Spambayes] Another software in the field
Message-ID: <a05100305b9fa744298a5@[192.168.1.11]>

Discovered today:

http://cristal.inria.fr/~xleroy/software.html
(page in english)

-- 
Email is a means of communication. People should ask themselves about the
political implications of the choices (or non-choices) of their tools and
technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From just@letterror.com  Fri Nov 15 09:57:53 2002
From: just@letterror.com (Just van Rossum)
Date: Fri, 15 Nov 2002 10:57:53 +0100
Subject: [Spambayes] Another software in the field
In-Reply-To: <a05100305b9fa744298a5@[192.168.1.11]>
Message-ID: <r01050400-1022-CEE9892BF88011D6AC0E003065D5E7E4@[10.0.0.23]>

François Granger wrote:

> http://cristal.inria.fr/~xleroy/software.html

I don't think we have to fear much from procmail-only solutions...

How close are "we" to an alpha (or even beta ;-) release? I think spambayes
could get some great publicity, but we need to be quick. The topic is very hot.

Just

From anthony@interlink.com.au  Fri Nov 15 11:00:02 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 15 Nov 2002 22:00:02 +1100
Subject: [Spambayes] Another software in the field 
In-Reply-To: <a05100305b9fa744298a5@[192.168.1.11]> 
Message-ID: <200211151100.gAFB02h16080@localhost.localdomain>


>>> François Granger wrote
> Discovered today:
> 
> http://cristal.inria.fr/~xleroy/software.html
> (page in english)
> 
http://spambayes.sourceforge.net/related.html

Got that one already. :)

It's another implementation of the Graham algorithm.

-- 
Anthony Baxter     <anthony@interlink.com.au>
It's never too late to have a happy childhood.


From anthony@interlink.com.au  Fri Nov 15 11:01:44 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Fri, 15 Nov 2002 22:01:44 +1100
Subject: [Spambayes] Another software in the field 
In-Reply-To: <r01050400-1022-CEE9892BF88011D6AC0E003065D5E7E4@[10.0.0.23]> 
Message-ID: <200211151101.gAFB1iX16108@localhost.localdomain>


>>> Just van Rossum wrote
> How close are "we" to an alpha (or even beta ;-) release? I think spambayes
> could get some great publicity, but we need to be quick. The topic is very
> hot.

One thing that surprises me is that there's a seemingly endless list of
projects all implementing Graham's approach exactly as he originally
described it - almost no-one else is doing the basic testing and
research that this sort of approach would seem to cry out for.


From msergeant@startechgroup.co.uk  Fri Nov 15 11:05:49 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Fri, 15 Nov 2002 11:05:49 +0000
Subject: [Spambayes] Another software in the field
References: <200211151101.gAFB1iX16108@localhost.localdomain>
Message-ID: <3DD4D50D.3090307@startechgroup.co.uk>

Anthony Baxter said the following on 15/11/02 11:01:
>>>>Just van Rossum wrote
>>
>>How close are "we" to an alpha (or even beta ;-) release? I think spambayes
>>could get some great publicity, but we need to be quick. The topic is very
>>hot.
> 
> 
> One thing that surprises me is that there's a seemingly endless list of 
> projects all implementing Graham's approach exactly as he originally
> described it - almost no-one else is doing the basic testing and
> research that this sort of approach would seem to cry out for.

Why would anyone else want to, when you guys are doing such an amazing 
job of it? ;-)

Matt.


From skip@pobox.com  Fri Nov 15 11:11:36 2002
From: skip@pobox.com (Skip Montanaro)
Date: Fri, 15 Nov 2002 05:11:36 -0600
Subject: [Spambayes] Another software in the field
In-Reply-To: <r01050400-1022-CEE9892BF88011D6AC0E003065D5E7E4@[10.0.0.23]>
References: <a05100305b9fa744298a5@[192.168.1.11]>
        <r01050400-1022-CEE9892BF88011D6AC0E003065D5E7E4@[10.0.0.23]>
Message-ID: <15828.54888.833862.811209@montanaro.dyndns.org>


    >> http://cristal.inria.fr/~xleroy/software.html

    Just> I don't think we have to fear much from procmail-only solutions...

There are a few of us who are quite happy with procmail-only solutions.
Some of us even use Macs. ;-)

Skip


From just@letterror.com  Fri Nov 15 11:14:22 2002
From: just@letterror.com (Just van Rossum)
Date: Fri, 15 Nov 2002 12:14:22 +0100
Subject: [Spambayes] Another software in the field
In-Reply-To: <15828.54888.833862.811209@montanaro.dyndns.org>
Message-ID: <r01050400-1022-669766F4F88B11D6AC0E003065D5E7E4@[10.0.0.23]>

Skip Montanaro wrote:

>     Just> I don't think we have to fear much from procmail-only solutions...
> 
> There are a few of us who are quite happy with procmail-only solutions.
> Some of us even use Macs. ;-)

And I'm completely happy with a pop3-only solution... The great thing about
spambayes is that we're _both_ happy ;-)

Just

From mwh@python.net  Fri Nov 15 11:44:41 2002
From: mwh@python.net (Michael Hudson)
Date: 15 Nov 2002 11:44:41 +0000
Subject: [Spambayes] Re: Another software in the field
References: <200211151101.gAFB1iX16108@localhost.localdomain>
	<3DD4D50D.3090307@startechgroup.co.uk>
Message-ID: <2mheejgiqe.fsf@starship.python.net>

Matt Sergeant <msergeant@startechgroup.co.uk> writes:

> Anthony Baxter said the following on 15/11/02 11:01:
> >>>>Just van Rossum wrote
> >>
> >>How close are "we" to an alpha (or even beta ;-) release? I think spambayes
> >>could get some great publicity, but we need to be quick. The topic is very
> >>hot.
> > 
> > 
> > One thing that surprises me is that there's a seemingly endless list of 
> > projects all implementing Graham's approach exactly as he originally
> > described it - almost no-one else is doing the basic testing and
> > research that this sort of approach would seem to cry out for.
> 
> Why would anyone else want to, when you guys are doing such an amazing 
> job of it? ;-)

But isn't that the point?  The people are implementing Graham's
original algorithm, not the soupa-doupa wizzy one that lives in
spambayes...

Cheers,
M.

-- 
  I don't remember any dirty green trousers.
                                             -- Ian Jackson, ucam.chat


From mhammond@skippinet.com.au  Fri Nov 15 12:14:47 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Fri, 15 Nov 2002 23:14:47 +1100
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861993B@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEJGHLAA.mhammond@skippinet.com.au>

> Looking at the code for GetNewUnscoredMessageGenerator, it seems to
> scan the folder for messages with Unread = True and no "Spam" field.
> There were 40 or so unread messages, so I can only assume that they
> had a Spam field.

See if you can convince the Outlook2000/sandbox/dump_props.py script to
produce anything useful.  It should tell you if such a field exists (and
plenty of other things!)

> I don't know enough about MAPI to diagnose this much further - I do,
> however, have the "Spam" field displayed in my Inbox - could that be
> enough to cause the field to be automatically created?

It shouldn't.  Indeed, having a blank score would indicate there is no such
field on the message.  Getting the field automatically created in the folder
is a different problem we are yet to solve, but that is different.

> But I know Mark will be better able to make the appropriate fix :-)

That may be right, but I would prefer to see some output from dump_props
first for you.  delete_outlook_field.py can also be useful - it can be used
to remove the Spam field from all messages in a folder.  This is indeed how
I did most of my testing.

Eg, I run:

F:\...>delete_outlook_field.py -d --no-outlook Spam
Processing folder Inbox
Deleting field Spam
Deleted 1257 field instances via MAPI
Could not find property to delete in the folder

Then when I restart outlook, I see:

Processing 16 missed spam in folder 'Inbox' took 349.376ms

As I do indeed have 16 unread mail in my inbox :(  These 16 messages are the
only ones now showing the Spam field.  Further, I have indeed seen this work
for me, for real - this was to scratch a personal itch for when Outlook
crashes due to a buggy PGP plugin a client insists I use :(

I will also forward the test script I used to come up with the fastest
technique I could for the usual case of zero missed messages.

Mark.


From msergeant@startechgroup.co.uk  Fri Nov 15 12:38:48 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Fri, 15 Nov 2002 12:38:48 +0000
Subject: [Spambayes] Re: Another software in the field
References: <200211151101.gAFB1iX16108@localhost.localdomain>
	<3DD4D50D.3090307@startechgroup.co.uk> <2mheejgiqe.fsf@starship.python.net>
Message-ID: <3DD4EAD8.7090703@startechgroup.co.uk>

Michael Hudson said the following on 15/11/02 11:44:
> But isn't that the point?  The people are implementing Graham's
> original algorithm, not the soupa-doupa wizzy one that lives in
> spambayes...

Well some of them are. SpamAssassin implemented Gary's algorithm, and 
I'm working on the chi-squared one (I think it's working, but my DB is 
trained with count == word count, not mail count).
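
To illustrate the difference between the two counting schemes (a sketch only;
the whitespace tokenizer and the function names are made up, and neither
project's code looks like this): counting by word adds one per occurrence,
while counting by mail adds one per message that contains the token.

```python
from collections import Counter

def train_by_word_count(messages):
    counts = Counter()
    for text in messages:
        counts.update(text.split())        # every occurrence counts
    return counts

def train_by_mail_count(messages):
    counts = Counter()
    for text in messages:
        counts.update(set(text.split()))   # at most once per message
    return counts

spam = ["free free free money", "free offer"]
print(train_by_word_count(spam)["free"], train_by_mail_count(spam)["free"])
```

A database built the first way would have to be retrained before its counts
could be read as "number of messages containing this token".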

Matt.


From anthony@interlink.com.au  Fri Nov 15 14:39:41 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Sat, 16 Nov 2002 01:39:41 +1100
Subject: [Spambayes] Another software in the field 
In-Reply-To: <3DD4D50D.3090307@startechgroup.co.uk> 
Message-ID: <200211151439.gAFEdgP17487@localhost.localdomain>


>>> Matt Sergeant wrote
> Why would anyone else want to, when you guys are doing such an amazing 
> job of it? ;-)

Ooo. Flattery :)

I just hope that the various projects that are implementing the straight
out Graham algorithm are planning to replace it with the more optimal code
once Tim's finished perfecting it for us all.


From msergeant@startechgroup.co.uk  Fri Nov 15 14:44:45 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Fri, 15 Nov 2002 14:44:45 +0000
Subject: [Spambayes] Another software in the field
References: <200211151439.gAFEdgP17487@localhost.localdomain>
Message-ID: <3DD5085D.2070408@startechgroup.co.uk>

Anthony Baxter said the following on 15/11/02 14:39:
>>>>Matt Sergeant wrote
>>
>>Why would anyone else want to, when you guys are doing such an amazing 
>>job of it? ;-)
> 
> 
> Ooo. Flattery :)
> 
> I just hope that the various projects that are implementing the straight
> out Graham algorithm are planning to replace it with the more optimal code
> once Tim's finished perfecting it for us all.

If I discover that chi-squared is indeed better than PG's method (I'd 
have to rebuild a live database, which is understandably a pain), then I 
definitely will be, and I'll be giving the code to SpamAssassin too. I 
suspect a lot of these projects that sprang up after PG's article will 
fall by the wayside as developers realise they don't really want to 
maintain them. But that's just evolution :)


From Paul.Moore@atosorigin.com  Fri Nov 15 14:54:07 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Fri, 15 Nov 2002 14:54:07 -0000
Subject: [Spambayes] [Outlook addin] Filtering unread messages on startup
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DD3@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
>> Looking at the code for GetNewUnscoredMessageGenerator, it seems
>> to scan the folder for messages with Unread = True and no "Spam"
>> field. There were 40 or so unread messages, so I can only assume
>> that they had a Spam field.
>
>See if you can convince the Outlook2000/sandbox/dump_props.py script
>to produce anything useful. It should tell you if such a field exists
>(and plenty of other things!)

The script doesn't work as it stands (COM error "Unknown Trust
Provider", which I suspect is due to our Active Directory setup
somehow) but I hacked it to just look at the Inbox.

I did this, and got a colleague to send me a mail (while I did *not*
have Outlook running). dump_props shows no Spam property.

Started Outlook, it wasn't filtered, and there was still no Spam
property.

Filtered manually, and the Spam property appeared as expected...

For what it's worth, I attach the output of dump_props.py from
before I started Exchange, after I started but did nothing else,
and after I manually filtered the Inbox. I also include the trace
output from the addin after the startup filtering happened.

Paul.


-------------- next part --------------
A non-text attachment was scrubbed...
Name: AfterManualFilter
Type: application/octet-stream
Size: 12406 bytes
Desc: AfterManualFilter
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/AfterManualFilter.exe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: BeforeExchangeStarted
Type: application/octet-stream
Size: 11632 bytes
Desc: BeforeExchangeStarted
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/BeforeExchangeStarted.exe
-------------- next part --------------
A non-text attachment was scrubbed...
Name: AfterExchangeStarted
Type: application/octet-stream
Size: 11632 bytes
Desc: AfterExchangeStarted
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/6e43caaf/AfterExchangeStarted.exe
From jm@jmason.org  Fri Nov 15 12:45:15 2002
From: jm@jmason.org (Justin Mason)
Date: Fri, 15 Nov 2002 12:45:15 +0000
Subject: [Spambayes] Another software in the field 
In-Reply-To: Message from Matt Sergeant <msergeant@startechgroup.co.uk> 
   of "Fri, 15 Nov 2002 11:05:49 GMT." <3DD4D50D.3090307@startechgroup.co.uk> 
Message-ID: <20021115124520.3E85D16F17@jmason.org>


Matt Sergeant said:
> > One thing that surprises me is that there's a seemingly endless list of 
> > projects all implementing Graham's approach exactly as he originally
> > described it - almost no-one else is doing the basic testing and
> > research that this sort of approach would seem to cry out for.
> Why would anyone else want to, when you guys are doing such an amazing 
> job of it? ;-)

Hi all,

Well, I've just started for SpamAssassin -- I'm gradually reinventing the
wheel I think.  For example, I've just found that including hapaxes
improves the middle ground very well, which I think is something you guys
did a long time ago ;)

But here's one thing I've noticed which might be useful for you guys.
In SpamAssassin recently, we've been meditating on Message-Ids;
particularly Outlook-format ones, like:

	<002901c28c22$3e8cb260$0201a8c0@gorm>

now, I've figured out this is composed of

	<???? TIMESTAMP $ ???????? $ SENDERID @ hostname>

TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which
we can validate in SpamAssassin using perl code. not a runner for
spambayes, unfortunately.

However, SENDERID is a constant value which never changes for an Outlook
or Exchange installation, as far as I can see -- so you want to make sure
your tokenizer will parse message-ids, and will return that as one
token.  It will gain valuable probabilities for those tricky spammers
who are getting good at sending legit-looking text and headers ;)
No matter what hostnames they use, unless they reinstall Outlook (as far
as I know) that should not change.
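
A minimal sketch of that tokenizer idea in Python, assuming the field layout described above. The function name and the token prefix are made up for illustration and are not part of the spambayes tokenizer:

```python
import re

# Layout assumed from the description above:
#   <???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
DOLLAR_ID = re.compile(
    r'^<([0-9a-f]{4})([0-9a-f]{8})\$([0-9a-f]{8})\$([0-9a-f]{8})@([^>\s]+)>$'
)

def senderid_tokens(message_id):
    """Return a list holding one SENDERID token, or [] if the id does not fit."""
    m = DOLLAR_ID.match(message_id.strip())
    if m is None:
        return []
    # Group 4 is the field that reportedly stays constant per installation.
    return ['message-id-senderid:' + m.group(4)]
```

For the example id above this yields the single token 'message-id-senderid:0201a8c0', whatever hostname follows the @.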

Quick question BTW -- I've been trying to keep our bayes-testing stats
close to yours, so we can compare portably.  But there's one thing I've
run into.  As far as I can see, in your 10-fold cross-validation suite,
you train using 1 fold and test against 9 -- whereas the published lit (or
at least Ion's papers) seems to suggest that 10FCV works better trained
against 9 and tested against 1.  Is there a reason you chose this?

PS: about time I posted here, I've been lurking and reading for weeks ;)

--j.

From neale@woozle.org  Fri Nov 15 16:25:08 2002
From: neale@woozle.org (Neale Pickett)
Date: 15 Nov 2002 08:25:08 -0800
Subject: [Spambayes] Another software in the field
In-Reply-To: <a05100305b9fa744298a5@[192.168.1.11]>
References: <a05100305b9fa744298a5@[192.168.1.11]>
Message-ID: <w53fzu2ajh7.fsf@woozle.org>

So then, François Granger <francois.granger@free.fr> is all like:

> Discovered today:
> 
> http://cristal.inria.fr/~xleroy/software.html
> (page in english)

That's SpamOracle.  It's been around a while--it's what I began using
immediately after writing my own pythonic spam filter and immediately
before signing on with the spambayes effort :)

What's cool about SpamOracle (aside from it being written in OCaml) is
that it uses a lexical analyzer to parse up email.  It's really fast!
At least, fast for parsing messages.  The original X-Hammie-Disposition
header was a blatant rip from the header SpamOracle uses, too.

Shameless,

Neale

From jm@jmason.org  Fri Nov 15 15:36:25 2002
From: jm@jmason.org (Justin Mason)
Date: Fri, 15 Nov 2002 15:36:25 +0000
Subject: [Spambayes] Another software in the field 
In-Reply-To: Message from Anthony Baxter <anthony@interlink.com.au> 
	<200211151439.gAFEdgP17487@localhost.localdomain> 
Message-ID: <20021115153630.15A5016F17@jmason.org>


Anthony Baxter said:
> I just hope that the various projects that are implementing the straight
> out Graham algorithm are planning to replace it with the more optimal code
> once Tim's finished perfecting it for us all.

Actually, I reckon most of the projects will be happy to say "works for
me", as PG did himself in the first place, at least until someone comes
along and points out spambayes' much higher efficiency rating in a public
forum like /. ;)

--j.

From popiel@wolfskeep.com  Fri Nov 15 16:50:10 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 15 Nov 2002 08:50:10 -0800
Subject: [Spambayes] Re: Another software in the field 
In-Reply-To: Message from Michael Hudson <mwh@python.net> 
   of "15 Nov 2002 11:44:41 GMT." <2mheejgiqe.fsf@starship.python.net> 
References: <200211151101.gAFB1iX16108@localhost.localdomain>
	<3DD4D50D.3090307@startechgroup.co.uk>  <2mheejgiqe.fsf@starship.python.net> 
Message-ID: <20021115165010.DBCC7F54C@cashew.wolfskeep.com>

In message:  <2mheejgiqe.fsf@starship.python.net>
             Michael Hudson <mwh@python.net> writes:
>Matt Sergeant <msergeant@startechgroup.co.uk> writes:
>
>> Anthony Baxter said the following on 15/11/02 11:01:
>> > 
>> > One thing that surprises me is that there's a seemingly endless list of 
>> > projects all implementing Graham's approach exactly as he originally
>> > described it - almost no-one else is doing the basic testing and
>> > research that this sort of approach would seem to cry out for.
>> 
>> Why would anyone else want to, when you guys are doing such an amazing 
>> job of it? ;-)
>
>But isn't that the point?  The people are implementing Graham's
>original algorithm, not the soupa-doupa wizzy one that lives in
>spambayes...

None of us have written a nice, concise, and easily understood
_English_ description of the algorithm we're actually using.
Further, that nonexistent concise description hasn't been
slashdotted.  Until that happens, spambayes will be a footnote.

Personally, I think the classifier algorithm is mature enough
now to support such a description.  (Yes, I know Gary has written
essays on part of it, but they're not comprehensive, and I don't
think anything has been written about our use of chi-square.)
The classifier _implementation_ is still fluctuating a fair
amount as people consider splitting the database, removing
access counts, etc.

The tokenizer algorithm we've got is both not mature enough to
support such a description, and also not amenable to concise
description.  On the other hand, I suspect that in a screed
about the classifier, we could get away with a very brief
description of the basics (words counted once per message,
split-on-whitespace, ignore most headers, etc.) and leave the
fine tuning as a reference to the tokenizer implementation.

Unfortunately, I'm not a good technical writer.  If I were,
I'd try my hand at writing up a description of the classifier,
and then post it somewhere with a lot of bandwidth.  As it is,
I can only whine about it not existing.

- Alex

From tim.one@comcast.net  Fri Nov 15 17:21:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 12:21:19 -0500
Subject: [Spambayes] Another software in the field
In-Reply-To: <20021115124520.3E85D16F17@jmason.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net>

[Justin Mason]
> Well, I've just started for SpamAssassin -- I'm gradually reinventing
> the wheel I think.  For example, I've just found that including hapaxes
> improves the middle ground very well, which I think is something you
> guys did a long time ago ;)

Ya, ignoring hapaxes is a form of bias, and we eventually found that all
forms of bias hurt.

> But here's one thing I've noticed which might be useful for you guys.
> In SpamAssassin recently, we've been meditating on Message-Ids;
> particularly Outlook-format ones, like:
>
> 	<002901c28c22$3e8cb260$0201a8c0@gorm>

Hmm.  I use Outlook 2000, and my last post had:

 Message-id: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>

OTOH, a recent one from Paul Moore had:

 Message-id:
 <16E1010E4581B049ABC51D4975CEDB885E2DCA@UKDCX001.uk.int.atosorigin.com>

and from Mark Hammond:

 Message-id: <LCEPIIGDJPKCOIHOBJEPEEJGHLAA.mhammond@skippinet.com.au>

and from Sean True:

 Message-id: <MJEHLHJKGINLONDMMKNEKELIHFAA.seant@iname.com>

These are all (I believe) Outlook users.  No $ in sight!  I believe Paul is
alone in this group in using an Exchange server instead of straight SMTP.

> now, I've figured out this is composed of
>
> 	<???? TIMESTAMP $ ???????? $ SENDERID @ hostname>
>
> TIMESTAMP is the top 4 bytes of the FILETIME struct on windows, which
> we can validate in SpamAssassin using perl code.

What does "validate" mean in this context?

> not a runner for spambayes, unfortunately.

Post the Perl code and I bet it will be easy to do in Python too.  I'm not
sure what you mean otherwise; for example, a FILETIME is conceptually a
64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
most-significant 4 bytes of that int, or the first 4 bytes in storage order
(which happen to be the least-significant 4 bytes of the big int).

> However, SENDERID is a constant value which never changes for an
> Outlook or Exchange installation, as far as I can see -- so you want
> to make sure your tokenizer will parse message-ids, and will return
> that as one token.
>
> It will gain valuable probabilities for those tricky spammers
> who are getting good at sending legit-looking text and headers ;)
> No matter what hostnames they use, unless they reinstall Outlook
> (as far as I know) that should not change.

That would indeed be a great clue!

> Quick question BTW -- I've been trying to keep our bayes-testing stats
> close to yours, so we can compare portably.  But there's one thing I've
> run into.  As far as I can see, in your 10-fold cross-validation suite,
> you train using 1 fold and test against 9

That's backwards, although it's tricky:  for speed, timcv.py:

+ Train on sets 2-10.

+ Predicts against set 1.
+ Incrementally trains set 1 (leaving the classifier trained on 1-10).

+ Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
+ Predicts against set 2.
+ Incrementally trains set 2 (leaving 1-10 trained again).

+ Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
+ Predicts against set 3.
+ Incrementally trains set 3 (leaving 1-10 trained again).

and so on.  This has huge performance benefits, in both instruction count
and cache locality, versus running timcv.py with option
build_each_classifier_from_scratch enabled.
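
The train/untrain schedule above can be sketched as follows. This is not the timcv.py code; FoldTracker is a hypothetical stand-in that only records which sets are currently trained, so the bookkeeping can be checked:

```python
class FoldTracker:
    """Stand-in classifier: remembers only which sets it is trained on."""

    def __init__(self):
        self.trained = set()

    def learn(self, fold_id):
        self.trained.add(fold_id)

    def unlearn(self, fold_id):
        self.trained.remove(fold_id)


def incremental_cv(n_folds, classifier):
    """Yield (held_out_set, sets_trained_on) for each of the n_folds runs."""
    folds = list(range(1, n_folds + 1))
    # Train on sets 2..N once, up front.
    for fold in folds[1:]:
        classifier.learn(fold)
    for i, fold in enumerate(folds):
        if i > 0:
            # Untrain the set about to be predicted against.
            classifier.unlearn(fold)
        yield fold, frozenset(classifier.trained)
        # Put it back, leaving all N sets trained.
        classifier.learn(fold)
```

With 10 sets this does 9 + 10 set-trainings and 9 set-untrainings, where rebuilding every classifier from scratch would do 10 * 9 = 90 set-trainings.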

> -- whereas the published lit (or at least Ion's papers) seems to
> suggest that 10FCV works better trained against 9 and tested against 1.

Right.

> Is there a reason you chose this?

I was looking for a new hobby after I stopped beating my wife <wink>.
timtest.py is an NxN grid driver, running N**2-N tests each training on 1
and predicting against N-1.  That's a good way to get lots of hard test runs
if you have lots of data.  timcv.py is vanilla cross-validation, running N
tests each training on N-1 and predicting against 1.  README.txt  and
TESTING.txt say more about all this.
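
To make the run counts concrete, here is a small sketch that enumerates the runs each driver makes, on one reading of the description above (the grid driver pairs each single training set with each other set in turn). The function names are illustrative and do not come from timtest.py or timcv.py:

```python
def grid_runs(n):
    """(train_set, test_set) pairs: train on 1 set, predict against another."""
    return [(train, test)
            for train in range(1, n + 1)
            for test in range(1, n + 1)
            if test != train]


def cv_runs(n):
    """(train_sets, test_set) pairs: train on N-1 sets, predict against 1."""
    all_sets = range(1, n + 1)
    return [([s for s in all_sets if s != held_out], held_out)
            for held_out in all_sets]
```

For N = 10, grid_runs gives 10**2 - 10 = 90 runs and cv_runs gives 10.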

> PS: about time I posted here, I've been lurking and reading for weeks ;)

Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
pattern you've seen?  I'm keen to pursue that.


From tim.one@comcast.net  Fri Nov 15 17:29:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 12:29:55 -0500
Subject: [Spambayes] Another software in the field
In-Reply-To: <200211151101.gAFB1iX16108@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIPCLAB.tim.one@comcast.net>

[Anthony Baxter]
> One thing that surprises me is that there's a seemingly endless list of
> projects all implementing Graham's approach exactly as he originally
> described it - almost no-one else is doing the basic testing and
> research that this sort of approach would seem to cry out for.

Well, coding is fun, and the basic approach works so well at once that
there's instant gratification.  Sound statistical testing is (as the people
here who've played along will surely testify!) tedious, time-consuming,
trap-laden, and ego-deflating (it's not really *fun* when the data tells you
your pet idea sucked, at least not before you've developed a healthy taste
for humiliation <wink>).


From jm@jmason.org  Fri Nov 15 18:06:24 2002
From: jm@jmason.org (Justin Mason)
Date: Fri, 15 Nov 2002 18:06:24 +0000
Subject: [Spambayes] Another software in the field 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net> 
Message-ID: <20021115180629.996C316F16@jmason.org>


Tim Peters said:

> Hmm.  I use Outlook 2000, and my last post had:
>  Message-id: <BIEJKCLHCIOIHAGOKOLHEEEFDPAA.tim.one@comcast.net>
...
> These are all (I believe) Outlook users.  No $ in sight!  I believe Paul is
> alone in this group in using an Exchange server instead of straight SMTP.

Hmm, we thought they were Exchange-format ids; looks like O2K now uses
that format.  (thinks) maybe it's just Outlook Express does the $ id
format -- but the important point is that it's frequently spoofed in spam
(about 29% of my spam load, for example).  So it becomes a great spam
indicator.  In fact, as Outlook users migrate *away* from that format,
it gets better ;)

BTW the O2K format IDs have not been spoofed yet, as far as I can see,
so they would be a good ham sign, if the tokenizer could recognise them.
as far as I know they always match /^<[A-Z]{28}\.\S+\@\S+>$/ .
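
The same pattern as a Python sketch, in case it is useful for a tokenizer experiment. The helper name is made up, and whether the pattern holds for all O2K ids is only as reliable as the "as far as I know" above:

```python
import re

# Python spelling of /^<[A-Z]{28}\.\S+\@\S+>$/
O2K_ID = re.compile(r'^<[A-Z]{28}\.\S+@\S+>$')

def looks_like_o2k_id(message_id):
    """True if the Message-Id has the 28-capital-letter O2K shape."""
    return O2K_ID.match(message_id.strip()) is not None
```

The ids Tim quoted for himself and Mark Hammond match; Paul Moore's Exchange id and the $-format example do not.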

> What does "validate" mean in this context?

compute what the value *should* be and compare.

> Post the Perl code and I bet it will be easy to do in Python too.  I'm not
> sure what you mean otherwise; for example, a FILETIME is conceptually a
> 64-bit integer, and by "top 4 bytes" it's unclear to me whether you mean the
> most-significant 4 bytes of that int, or the first 4 bytes in storage order
> (which happen to be the least-significant 4 bytes of the big int).

most significant.   Perl code is at the end of the mail...

> [ten-pass]
> That's backwards, although it's tricky:  for speed, timcv.py:
> + Train on sets 2-10.
> + Predicts against set 1.
> + Incrementally trains set 1 (leaving the classifier trained on 1-10).
> + Incrementally *untrains* set 2 (leaving 1 + 3-10 trained).
> + Predicts against set 2.
> + Incrementally trains set 2 (leaving 1-10 trained again).
> + Incrementally untrains set 3 (leaving 1-2 + 4-10 trained).
> + Predicts against set 3.
> + Incrementally trains set 3 (leaving 1-10 trained again).
> and so on.  This has huge performance benefits, in both instruction count
> and cache locality, versus running timcv.py with option
> build_each_classifier_from_scratch enabled.

OK -- I must have misread it.  so timcv.py *is* training on 9 sets each
time.  good.

> I was looking for a new hobby after I stopped beating my wife <wink>.
> timtest.py is an NxN grid driver, running N**2-N tests each training on 1
> and predicting against N-1.  That's a good way to get lots of hard test runs
> if you have lots of data.  timcv.py is vanilla cross-validation, running N
> tests each training on N-1 and predicting against 1.  README.txt  and
> TESTING.txt say more about all this.

bloody hell, timtest.py must take years to run ;)  sounds interesting.
BTW I hadn't read TESTING.txt (for some reason) -- I like the bigrams
story.

> Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
> pattern you've seen?  I'm keen to pursue that.

yep, see above ;)

BTW here's the perl code.  it's cut and pasted from
current Mail::SpamAssassin::EvalTests, so it won't run as-is, but
it should be pretty easy to grok...


  # valid Outlookish Message-Ids contain the top word of the system time
  # when the message was sent!
  # We can verify this, by decoding the Date header, extracting
  # the time token from the Message-Id, and comparing them.
  #
  sub check_outlook_timestamp_token {
    my ($self) = @_;
    local ($_);

    my $id = $self->get ('Message-Id');
    return 0 unless ($id =~ /^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}\@/);

    my $timetoken = hex($1);

    # convert UNIX time_t to Windows FILETIME.  From MSDN:
    #
    #     LONGLONG ll = Int32x32To64(t, 10000000) + 116444736000000000;
    #     pft->dwLowDateTime = (DWORD) ll;
    #     pft->dwHighDateTime = ll >>32;
    #
    # IOW, ((tt * a) + b) / c = id .
    # Now to avoid using any kind of LONGLONG data type, we do this:
    #     => tt * (a/c) + (b/c) = id
    #     let x = (a/c) = 0.0023283064365387
    #     let y = (b/c) = 27111902.8329849
    #
    my $x = 0.0023283064365387;
    my $y = 27111902.8329849;

    # quite generous, but we just want to be in the right ballpark, so we
    # can handle mostly-correct values OK, but catch random strings.
    my $fudge = 200;

    $_ = $self->get ('Date');
    $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
    my $expected = int (($_ * $x) + $y);
    my $diff = $timetoken - $expected;
    dbg("time token found: $timetoken expected (from Date): $expected: $diff");
    if (abs ($diff) < $fudge) { return 0; }

    # also try last date in Received header, Date could have been rewritten
    $_ = $self->get ('Received');
    /(\s.?\d+ \S\S\S \d+ \d+:\d+:\d+ \S+).*?$/;
    dbg("last date in Received: $1");
    $_ = $self->_parse_rfc822_date($_); $_ ||= 0;
    $expected = int (($_ * $x) + $y);
    $diff = $timetoken - $expected;
    dbg("time token found: $timetoken expected (from Received): $expected: $diff");
    if (abs ($diff) < $fudge) { return 0; }

    return 1;
  }

  # parse an RFC822 date into a time_t
  sub _parse_rfc822_date {
    my ($self, $date) = @_;
    local ($_);
    my ($yyyy, $mmm, $dd, $hh, $mm, $ss, $mon, $tzoff);

    # make it a bit easier to match
    $_ = " $date "; s/, */ /gs; s/\s+/ /gs;

    # now match it in parts.  Date part first:
    if (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{4}) / /i) {
      $dd = $1; $mon = $2; $yyyy = $3;
    } elsif (s/ (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) +(\d+) \d+:\d+:\d+ (\d{4}) / /i) {
      $dd = $2; $mon = $1; $yyyy = $3;
    } elsif (s/ (\d+) (Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec) (\d{2,3}) / /i) {
      $dd = $1; $mon = $2; $yyyy = $3;
    } else {
      dbg ("time cannot be parsed: $date");
      return undef;
    }

    # handle two and three digit dates as specified by RFC 2822
    if (defined $yyyy) {
      if (length($yyyy) == 2 && $yyyy < 50) {
	$yyyy += 2000;
      }
      elsif (length($yyyy) != 4) {
	# three digit years and two digit years with values between 50 and 99
	$yyyy += 1900;
      }
    }

    # hh:mm:ss
    if (s/ ([\d\s]\d):(\d\d)(:(\d\d))? / /) {
      $hh = $1; $mm = $2; $ss = $4 || 0;
    }

    # numeric timezones
    if (s/ ([-+]\d{4}) / /) {
      $tzoff = $1;
    }
    # all other timezones are considered equivalent to "-0000"
    $tzoff ||= '-0000';

    if (!defined $mmm && defined $mon) {
      my @months = qw(jan feb mar apr may jun jul aug sep oct nov dec);
      $mon = lc($mon);
      my $i; for ($i = 0; $i < 12; $i++) {
	if ($mon eq $months[$i]) { $mmm = $i+1; last; }
      }
    }

    $hh ||= 0; $mm ||= 0; $ss ||= 0; $dd ||= 0; $mmm ||= 0; $yyyy ||= 0;

    my $time;
    eval {		# could croak
      $time = timegm ($ss, $mm, $hh, $dd, $mmm-1, $yyyy);
    };

    if ($@) {
      dbg ("time cannot be parsed: $date, $yyyy-$mmm-$dd $hh:$mm:$ss");
      return undef;
    }

    if ($tzoff =~ /([-+])(\d\d)(\d\d)$/)	# convert to seconds difference
    {
      $tzoff = (($2 * 60) + $3) * 60;
      if ($1 eq '-') {
	$time += $tzoff;
      } else {
	$time -= $tzoff;
      }
    }

    return $time;
  }
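
A rough Python rendering of the core of the check above, for comparison. It is a sketch, not a port: it only looks at the Date header (no Received fallback), it uses the standard library's date parser instead of _parse_rfc822_date, it does the FILETIME arithmetic in integers since Python has no LONGLONG problem, and it returns True for "plausible" rather than 0/1. The function names are made up:

```python
import re
from email.utils import parsedate_tz, mktime_tz

# Same id pattern as the Perl; group 1 is the 8-hex-digit time token.
OUTLOOK_ID = re.compile(
    r'^<[0-9a-f]{4}([0-9a-f]{8})\$[0-9a-f]{8}\$[0-9a-f]{8}@'
)
# 100ns ticks between 1601-01-01 and 1970-01-01 (the MSDN constant).
EPOCH_DELTA = 116444736000000000
FUDGE = 200


def expected_time_token(unix_time):
    """Most-significant 32 bits of the FILETIME for a UNIX time_t."""
    filetime = int(unix_time) * 10000000 + EPOCH_DELTA
    return filetime >> 32


def timestamp_token_plausible(message_id, date_header, fudge=FUDGE):
    """True/False if the token can be checked, None if it cannot."""
    m = OUTLOOK_ID.match(message_id.strip())
    if m is None:
        return None          # not a $-format id
    parsed = parsedate_tz(date_header)
    if parsed is None:
        return None          # unparseable Date header
    token = int(m.group(1), 16)
    return abs(token - expected_time_token(mktime_tz(parsed))) < fudge
```

One unit of the token is 2**32 ticks of 100ns, about 429.5 seconds, so a fudge of 200 allows roughly a day either way.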


From popiel@wolfskeep.com  Fri Nov 15 18:20:39 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 15 Nov 2002 10:20:39 -0800
Subject: [Spambayes] Another software in the field 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net> 
Message-ID: <20021115182039.BD3F3F54C@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>Poor man -- I'm glad you uncloaked!  Did the Outlook Message-Ids fit a
>pattern you've seen?  I'm keen to pursue that.

If you're keen on message ids, then one idea I've had (with no
time to implement, alas) is to compare the message id domain with
the sequence in the received headers, to detect when message ids
are generated late in the delivery sequence.  In more detail:

Most received headers these days are of the (rfc 821 dictated) form:

  Received: from ([^ ]*).* by ([^ ]*).*;(.*)

where \1 is the prior MTA, \2 is the current MTA, and \3 is the
time of transfer.  Reading all the received headers, you can get
a chain of MTAs as the delivery sequence... as an example:

  Received: from mail.python.org (mail.python.org [12.155.117.29])
          by cashew.wolfskeep.com (Postfix) with ESMTP id 97FAFF54C
          for <popiel@wolfskeep.com>; Fri, 15 Nov 2002 09:44:19 -0800 (PST)
  Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
          by mail.python.org with esmtp (Exim 4.05)
          id 18CkXd-00065D-01; Fri, 15 Nov 2002 12:46:05 -0500
  Received: from smtp.comcast.net ([24.153.64.2])
          by mail.python.org with esmtp (Exim 4.05)
          id 18CkAN-0007r1-00
          for spambayes@python.org; Fri, 15 Nov 2002 12:22:03 -0500
  Received: from cj569191b (pcp736393pcs.reston01.va.comcast.net
          [68.48.241.201]) by mtaout03.icomcast.net
          (iPlanet Messaging Server 5.1 HotFix 1.5 (built Sep 23 2002))
          spambayes@python.org; Fri, 15 Nov 2002 12:21:16 -0500 (EST)

yields the sequence:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net -> mail.python.org
  -> localhost.localdomain -> mail.python.org -> mail.python.org ->
  cashew.wolfskeep.com

Remove references to localhost.localdomain or localhost, then compress
identical neighbors to yield:

  cj569191b -> mtaout03.icomcast.net -> smtp.comcast.net -> mail.python.org
  -> cashew.wolfskeep.com

Now, look at the message id:

  Message-id: <LNBBLJKPBEHFEDALKOLCOEILCLAB.tim.one@comcast.net>

Extracting just the domain name from that, we get:

  comcast.net

Now, compare the domain from the message id to the domains in the
received list, yielding the number of hierarchy levels matched:

  0 -> 1 -> 2 -> 0 -> 0

Find the first occurrence of the best match, and generate a token:

  message-id-generation:skipped 2

If the received parser were a little smarter about parsing iPlanet
received lines, it would have "pcp736393pcs.reston01.va.comcast.net"
instead of "cj569191b" as the first element in the sequence, and
the match list would have been 2 -> 1 -> 2 -> 0 -> 0, yielding:

  message-id-generation:skipped 0

I suspect that high skipped numbers would be a strong spam indicator,
showing where message ids were omitted in the sent mail and/or received
headers naively forged to prevent backtracking.

Unfortunately, I haven't had time to implement and test this...
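
For anyone who wants to try it, the steps above might be sketched in Python like this. It has only been checked against the worked example in this message, not against real mail; the function names are made up, and the "nomatch" token for a chain with no matching host is an assumption that the description does not cover:

```python
import re

# The Received pattern from above, applied to an unfolded header value.
RECEIVED = re.compile(r'from ([^ ]*).* by ([^ ]*).*;(.*)')
LOCAL = {'localhost', 'localhost.localdomain'}


def delivery_chain(received_values):
    """received_values: Received header values, newest first (message order)."""
    hops = []
    for value in reversed(received_values):
        m = RECEIVED.match(' '.join(value.split()))   # unfold whitespace
        if m:
            hops.extend([m.group(1), m.group(2)])
    chain = []
    for host in hops:
        host = host.lower()
        if host in LOCAL:
            continue
        if not chain or chain[-1] != host:             # compress neighbors
            chain.append(host)
    return chain


def levels_matched(host, domain):
    """Count matching hierarchy levels, compared from the right."""
    count = 0
    for a, b in zip(reversed(host.split('.')), reversed(domain.split('.'))):
        if a != b:
            break
        count += 1
    return count


def generation_token(received_values, message_id):
    domain = message_id.strip().strip('<>').rpartition('@')[2].lower()
    scores = [levels_matched(h, domain) for h in delivery_chain(received_values)]
    if not scores or max(scores) == 0:
        return 'message-id-generation:nomatch'         # assumed fallback
    return 'message-id-generation:skipped %d' % scores.index(max(scores))
```

Fed the four Received headers and the Message-id from the example, this produces the same five-host chain and the token 'message-id-generation:skipped 2'.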

- Alex

From tinacoruth@concentric.net  Fri Nov 15 01:19:42 2002
From: tinacoruth@concentric.net (polaner)
Date: Fri, 15 Nov 2002 01:19:42 +0000
Subject: [Spambayes] 
 Bullet proof bulk email friendly hosting & cheap mass email campaigns.
Message-ID: <1897911063vsdped|hvCs|wkrq1ruj@prodigy.com>

We are the marketing specialists www.host4bulk.com that provide cheap 
bullet proof bulk email friendly hosting for your website ($400 for 
one month of bullet proof hosting) and cheap bulk email campaigns ($200 
for 1 million emails sent)
As you may already know, many web hosting companies have Terms of Service 
(TOS) or Acceptable Use Policies (AUP) against the delivery of emails 
advertising or promoting your web site. If your web site host receives 
complaints or discovers that your web site has been advertised in email 
broadcasts, they may disconnect your account and shut down your web site. 
Our mission is to solve your problem and provide you with bulk email 
friendly hosting. You don't have to worry about your website being 
closed again. Adult and gambling sites welcomed. No set up fee.
 
You may advertise your website by using your own resources or using 3rd 
party's service. However we can do all the advertising for your business. 
You just sit, relax and see how your income grows constantly. We guarantee 
the lowest prices on the web for our web hosting and bulk email campaigns. 
We only ask $200 us dollars for 1 million emails sent with your ad. 
We don't use duplicate emails. Our email base is up to date and it is 
updated weekly. Our current email data base contains over 50.000.000 
emails sorted by various parameters to meet your specific needs. No 
competitors may offer this price. The lowest price you can find on 
the net is well over $500 for 1 million
 
Don't make the mistake of bulk emailing directly to your website without 
bulletproof web hosting. Your web host will close your account and shut 
your site down in no time! No matter how long you have been with them, 
how much you are paying them,  or how beautiful your site is. There are 
companies charging thousands for bulletproof web hosting and they can't 
keep you up and running  like we can. If you host with us, your site 
will NOT BE SHUT DOWN due to complaints! Bulk email campaign together 
with bullet proof hosting will bring your business to success. Just 
imagine how many people will learn about your business or product at a 
really low price. Bulk email is considered to be the most effective way 
to advertise on the net. It is hundreds times effective than banner, 
solo ad and other campaigns. Once people use our service they always 
come back for more. We can always provide websites that use bulk email 
campaigns with our new reliable way to accept credit cards on the net 
without the need to open merchant account. You can start accepting credit 
card payments in second. It is totally free.
 
Visit our website at http://www.host4bulk.com for more information and 
to order your bulk email hosting or/and email campaign. 


From tim.one@comcast.net  Fri Nov 15 19:23:13 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 14:23:13 -0500
Subject: [Spambayes] Just for fun
Message-ID: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>

There's a msg to this list being held for moderator approval.  I'm going to
let it thru.  The subject is

    Bullet proof bulk email friendly hosting & cheap mass email

This should be a good test of whether your classifier thinks *everything*
sent to this list is ham <oink>.


From jeremy@alum.mit.edu  Fri Nov 15 19:29:01 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Fri, 15 Nov 2002 14:29:01 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>
Message-ID: <15829.19197.286799.636365@slothrop.zope.com>

Here's how that message scored for me:
    Score: 0.998364543478

    Clues
    -----
    *H* 0.003146604587
    *S* 0.999875691542

How dull.  It's too hard to find good spam these days,
where good spam is defined as clever enough to get through a decent
spam filter.

Jeremy


From python-spambayes@discworld.dyndns.org  Fri Nov 15 19:34:16 2002
From: python-spambayes@discworld.dyndns.org (Charles Cazabon)
Date: Fri, 15 Nov 2002 13:34:16 -0600
Subject: [Spambayes] Just for fun
In-Reply-To: <15829.19197.286799.636365@slothrop.zope.com>;
	from jeremy@alum.mit.edu on Fri, Nov 15, 2002 at 02:29:01PM -0500
References: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>
	<15829.19197.286799.636365@slothrop.zope.com>
Message-ID: <20021115133416.A32235@discworld.dyndns.org>

Jeremy Hylton <jeremy@alum.mit.edu> wrote:
> 
> How dull.  It's too hard to find good spam these days,
> where good spam is defined as clever enough to get through a decent
> spam filter.

Remember that this project /is/ the first instance of a decent spam filter :),
so we can hardly blame the spammers for being a little behind.

Charles
-- 
-----------------------------------------------------------------------
Charles Cazabon                 <python-spambayes@discworld.dyndns.org>
GPL'ed software available at:     http://www.qcc.ca/~charlesc/software/
-----------------------------------------------------------------------

From tim.one@comcast.net  Fri Nov 15 20:20:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 15:20:55 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <15829.19197.286799.636365@slothrop.zope.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIELECLAB.tim.one@comcast.net>

[Jeremy]
> Here's how that message scored for me:
>     Score: 0.998364543478
>
>     Clues
>     -----
>     *H* 0.003146604587
>     *S* 0.999875691542

Good!  On my tiny still-hapax-driven purely-mistake-based at-home classifier
(which is up to 79 each of ham and spam trained on) it fared much worse:

Spam Score: 0.303213

word                                spamprob         #ham  #spam
'*H*'                               0.496939            -      -
'*S*'                               0.103364            -      -
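
For readers wondering how the final score relates to the *H* and *S* clues: both scores quoted in this thread are consistent with taking (S - H + 1) / 2. The sketch below just checks that arithmetic; the formula is inferred from the numbers shown here, and the function name is made up:

```python
def combined_score(S, H):
    """Combine the *S* and *H* clue values into one score in [0, 1]."""
    return (S - H + 1.0) / 2.0

# Jeremy's message: *H* 0.003146604587, *S* 0.999875691542
jeremy = combined_score(0.999875691542, 0.003146604587)
# The score shown just above: *H* 0.496939, *S* 0.103364
tim = combined_score(0.103364, 0.496939)
```

The two results agree with the reported 0.998364543478 and 0.303213 to the precision shown.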

The spambayes/python.org clues were strong (reflecting how many mailing-list
ham had to be redeemed from Unsure status over the last 2 weeks, and spam FN
leaking thru python.org):

'spambayes'                         0.0918367           2      0
'email name:spambayes'              0.0918367           2      0
'url:spambayes'                     0.0918367           2      0
'subject:Spambayes'                 0.0918367           2      0
'sender:addr:spambayes-bounces'     0.0918367           2      0
'url:mailman'                       0.114051           16      2
'url:python'                        0.120635           15      2
'sender:addr:python.org'            0.12996            20      3
'url:org'                           0.145463           23      4
'email addr:python.org'             0.145904           12      2
'url:listinfo'                      0.170558           19      4
'to:addr:python.org'                0.178246           18      4

OTOH, the strongest spam words were stronger, but overall the hapaxes "in
the middle" favored ham:

ham hapaxes

'solo'                              0.155172            1      0
'x-mailer:microsoft outlook express 5.50.4807.1700' 0.155172            1      0
'sorted'                            0.155172            1      0
'proof'                             0.155172            1      0
'parameters'                        0.155172            1      0
'however'                           0.155172            1      0
'host'                              0.155172            1      0
'considered'                        0.155172            1      0
'ask'                               0.155172            1      0
'account.'                          0.155172            1      0

high-spamprob words

'advertise'                         0.908163            0      2
'price.'                            0.908163            0      2
'lowest'                            0.908163            0      2
'base'                              0.908163            0      2
'$500'                              0.908163            0      2
'bulk'                              0.934783            0      3
'service.'                          0.934783            0      3
'effective'                         0.934783            0      3
'subject: & '                       0.934783            0      3
'free.'                             0.958716            0      5
'income'                            0.965116            0      6
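
The one-sided rows above (words seen only in ham, or only in spam) can be
reproduced by hand.  A minimal sketch, assuming a Robinson-style adjustment
f(w) = (s*x + n*p) / (s + n) with s = 0.45 and x = 0.5; those two values are
not stated in this message and are an assumption, as is the function name:

```python
def adjusted_spamprob(n, p, s=0.45, x=0.5):
    """Shrink a raw probability p, backed by n sightings, toward the prior x.

    s (prior strength) and x (prior probability) are assumed values.
    """
    return (s * x + n * p) / (s + n)


# A word seen only in ham has raw p = 0.0; only in spam, raw p = 1.0.
for label, n, p in [("1 ham, 0 spam", 1, 0.0),
                    ("2 ham, 0 spam", 2, 0.0),
                    ("0 ham, 2 spam", 2, 1.0),
                    ("0 ham, 6 spam", 6, 1.0)]:
    print(f"{label}: {adjusted_spamprob(n, p):.6f}")
```

Under those assumptions the four cases agree with the 'solo', 'spambayes',
'advertise' and 'income' rows.  Rows with both ham and spam counts also
depend on the total number of trained messages, which isn't given here.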

When I restored my "non-insane training" at-home classifier from 2 weeks
ago, it was nailed as spam.  The hapax-driven guy hasn't changed character
for me since the first few days:  rarely makes outright mistakes, but the
Unsures remain surprising.

> How dull.  It's too hard to find good spam these days, where good
> spam is defined as clever enough to get through a decent spam filter.

One of the articles recently referenced here quoted a self-identified
spammer who said he can make $1000 in a day by sending out 400,000 spam, and
getting (just) 30 people to sign up for the porn sites he's advertising.  I
don't think he cares if spam filters can catch his stuff with 100% recall:
his customers won't run spam filters, because they *want* porn spam.  That's
where we get rich, since this system can easily be trained to accept porn
spam, but block human growth hormone spam.  OTOH, no matter which way you
cut all that, there's no incentive for porn spammers to change behavior.  To
the contrary, a system like this in wide use would help them reach their
market more effectively.

so-let's-apply-for-porn-funding-ly y'rs  - tim


From rob@hooft.net  Fri Nov 15 20:30:40 2002
From: rob@hooft.net (Rob Hooft)
Date: Fri, 15 Nov 2002 21:30:40 +0100
Subject: [Spambayes] Just for fun
References: <LNBBLJKPBEHFEDALKOLCIELECLAB.tim.one@comcast.net>
Message-ID: <3DD55970.4030500@hooft.net>

Tim Peters wrote:
> [Jeremy]
> 
>>Here's how that message scored for me:
>>    Score: 0.998364543478
>>
>>    Clues
>>    -----
>>    *H* 0.003146604587
>>    *S* 0.999875691542
> 
> 
> Good!  On my tiny still-hapax-driven purely-mistake-based at-home classifier
> (which is up 79 each of ham and spam trained on) it fared much worse:
> 
> Spam Score: 0.303213
> 
> word                                spamprob         #ham  #spam
> '*H*'                               0.496939            -      -
> '*S*'                               0.103364            -      -

So that makes 80 in your classifier....

For my one-time-only private-email few-weeks-old too-lazy-to-retrain 
classifier, this message did 0.88 due to the spambayes ham clues....

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From dereks@itsite.com  Fri Nov 15 17:35:52 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Fri, 15 Nov 2002 12:35:52 -0500 (EST)
Subject: [Spambayes] Just for fun
In-Reply-To: <20021115133416.A32235@discworld.dyndns.org>
Message-ID: <Pine.LNX.4.33L2.0211151155540.4913-100000@dev.itsite.com>

> > How dull.  It's too hard to find good spam these days,
> > where good spam is defined as clever enough to get through a decent
> > spam filter.

	Listening to this one would think spam is a problem of the past!

> Remember that this project /is/ the first instance of a decent spam filter :),
> so we can hardly blame the spammers for being a little behind.

	Let's not forget that SpamBayes only works for individuals or
workgroups who have the same definition of "ham".  It doesn't help much
in enterprise-level settings with tens of thousands of users, since the
ham of such a large and varied group of people would dilute the definition
of spam too much to be useful.

	I bet that playing the numbers game one could "show" that the
helpdesk and maintenance costs of supporting a Python installation plus a
per-person ham training procedure would be more expensive (for a Uni or
Mega-Corp.) than just living with spam.  (Pure conjecture on my part, but
it is easily imagined.)

	There's another Python-based spam filter that might work better
for SMTP server-wide deployment, called "Active Spam Killer", or ASK.

http://www.paganini.net/ask/

	Its shtick is that it maintains a whitelist of people who may
email you.  When an email from a new sender comes in, it holds the email
for you, sends the person a simple confirmation message (to which they
simply hit Reply;Send), and then that person is added to your whitelist
and their original message is sent to you (and they are never ASKed
again).  There's also some very practical regex stuff, some migration
tools, and an ignorelist and blacklist (for situations like
http://www.psychoexgirlfriend.com/).
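
The hold-and-confirm flow described above could be sketched roughly like
this.  It is only an illustration of the idea, not ASK's actual code: the
class name, method names and return values are all hypothetical.

```python
class ConfirmGate:
    """Toy whitelist gate: hold mail from unknown senders until they confirm."""

    def __init__(self):
        self.whitelist = set()
        self.blacklist = set()
        self.held = {}  # sender -> list of messages waiting for confirmation

    def receive(self, sender, message):
        """Return (action, messages_to_deliver) for one incoming message."""
        sender = sender.lower()
        if sender in self.blacklist:
            return "drop", []
        if sender in self.whitelist:
            return "deliver", [message]
        first_time = sender not in self.held
        self.held.setdefault(sender, []).append(message)
        # Only the first message from a new sender triggers a challenge.
        return ("challenge" if first_time else "hold"), []

    def confirm(self, sender):
        """Sender replied to the challenge: whitelist them, release held mail."""
        sender = sender.lower()
        self.whitelist.add(sender)
        return self.held.pop(sender, [])
```

A sender who never replies simply stays in the held queue, which is where
the cost of a false positive ends up in this kind of scheme.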

	It's currently targeted at individuals but if one thinks of this
as an "E-mail Firewall", where only users who actually reply to messages
are allowed to send messages to your company or campus, then this might
work out well.  Whether or not that's a desirable campus policy is up for
debate, but I know that I want it for my small company.

	I bring it up on this list to (a) remind you guys that stopping
spam at the server is far more efficient than stopping it at each user's
INBOX, and (b) because I wanted to show a completely different spam
filtering technology that doesn't depend on content at all.

	I would love to see a Python-based product that integrates well
with Postfix/etc. and lets me pick and choose enterprise-wide spam
filtering methods in whatever order I want.  Pure "ASK"-like behaviour
could be configured for large enterprise installations, but small
workgroups would only "ASK" the senders for a confirmation if their first
email looked too much like a spam (according to SpamBayes).  Add in a
Python port of SpamAssassin's methods and I think then you'd have a
serious tool for stopping spam at the server.

	In short, I don't think filtering at the per-user level is a
solution to the spam problem, because it only saves the time of the
individual to identify and delete the spam.  It does not save the cost of
delivery, which is mostly on the receiving infrastructure.


--Derek


From tim.one@comcast.net  Fri Nov 15 20:57:22 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 15:57:22 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <3DD55970.4030500@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMELMCLAB.tim.one@comcast.net>

[Rob Hooft]
> So that makes 80 in your classifier....

I left this one out of the training -- the only reason it made it to the
Spambayes list is because I approved it.  SpamAssassin knew darned well it
was spam, but SpamAssassin has been castrated for msgs sent to this list; it
was held for approval just because this list has a members-only posting
policy.  BTW, that's the only "outside" spam I've seen mailed to this list
so far!  On the radio yesterday, a news story reported that the US Federal
Trade Commission did a test, wherein they posted some email addresses on the
web, then timed how long it took for them to receive their first spam:  8
minutes.  Surely they can speed that up <wink>.

> For my one-time-only private-email few-weeks-old too-lazy-to-retrain
> classifier, this message did 0.88 due to the spambayes ham clues....

Good!  I get hundreds of Mailman emails every day thru python.org, so the
python.org + Mailman clues are (I expect) much stronger in my little
database.  It *still* didn't get near my ham_cutoff, though, which is mildly
encouraging (granting that this training strategy is insane).
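
For readers new to the cutoffs, here is a minimal sketch of the three-way
split that ham_cutoff and spam_cutoff imply.  The values 0.20 and 0.90 and
the exact handling of the boundaries are assumptions for illustration, not
taken from this message.

```python
def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to 'ham', 'unsure' or 'spam'.

    The cutoff values and boundary handling are assumed for illustration.
    """
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


# Scores quoted in this thread for the same message.
for score in (0.998364543478, 0.88, 0.303213):
    print(score, classify(score))
```

With those assumed cutoffs, 0.303213 sits in the Unsure band rather than
being passed as ham.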


From tim.one@comcast.net  Fri Nov 15 21:22:02 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 16:22:02 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <Pine.LNX.4.33L2.0211151155540.4913-100000@dev.itsite.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKELPCLAB.tim.one@comcast.net>

[Derek Simkowiak]
> Let's not forget that SpamBayes only works for individuals or
> workgroups who have the same definition of "ham".

"workgroups" is too small.  The general python.org tests show that it's also
very effective for tech mailing lists, even collections of tech mailing
lists, and even very high-volume collections of tech mailing lists.

> It doesn't help much in enterprise-level settings with tens of
> thousands of users,

The python.org tech mailing lists serve tens of thousands.

> since the ham of such a large and varied group of people would dilute
> the definition of spam too much to be useful.

That's always been my belief, but it hasn't been tested properly here so
remains speculation.  Matt Sergeant earlier reported clearly worse error
rates from his more-classic Bayesian classifier when expanded to "many"
users, but they weren't so high as to suggest uselessness.  A lot depends on
what you *do* with suspected spam.  If it's bounced back to the user, it
seems about the same as bouncing back a whitelist nag.

> I bet that playing the numbers game one could "show" that the
> helpdesk and maintenance costs of supporting a Python installation
> plus a per-person ham training procedure would be more expensive
> (for a Uni or Mega-Corp.) than just living with spam.  (Pure
> conjecture on my part, but it is easily imagined.)

So is the converse <wink>.

> There's another Python-based spam filter that might work better
> for SMTP server-wide deployment, called "Active Spam Killer", or ASK.
>
> http://www.paganini.net/ask/
>
> 	Its shtick is that it maintains a whitelist of people who may
> email you.  When an email from a new sender comes in, it holds the email
> for you, sends the person a simple confirmation message (to which they
> simply hit Reply;Send),

A deployment of *this* system could do the same, yes?  Challenge-response is
applicable to any system with a reject rule.

> and then that person is added to your whitelist and their original
> message is sent to you (and they are never ASKed again).

Think about this for "the enterprise".  I doubt my employer would go for
this:  sales leads are sacred, and *anything* you do to make it harder for a
potential customer to contact you is Major Sin.  For this reason, no spam is
blocked by my employer.  Suspected spam merely gets a tilde prepended to the
Subject line.  Customer contacts *were* bounced when they did try to block
spam.  Bothering a customer with a whitelist nag is an approximation to
asking that customer to do business elsewhere.  For example, I can say that
if one of my sisters got a whitelist nag asking them to reply, and it had a
bunch of "funny numbers" in the subject line, they'd be afraid even to
*read* it -- they'd delete it at once, fearing it was a virus or address
harvester ("it looks funny, and it didn't come from Timmy, so I bet some
hacker intercepted my email to xyz.com and is trying to trick me into
replying").

> There's also some very practical regex stuff, some migration
> tools, and an ignorelist and blacklist (for situations like
> http://www.psychoexgirlfriend.com/).
>
> 	It's currently targeted at individuals but if one thinks of this
> as an "E-mail Firewall", where only users who actually reply to
> messages are allowed to send messages to your company or campus, then
> this might work out well.  Whether or not that's a desirable campus
> policy is up for debate, but I know that I want it for my small
> company.

Until you learn that a potential customer did business with Zope Corp
instead because we didn't nag them <wink>.

> 	I bring it up on this list to (a) remind you guys that stopping
> spam at the server is far more efficient than stopping it at each user's
> INBOX,

If the server were 100% reliable in determining what is and isn't spam, that
would certainly be true.  The costs associated with false positives are
often judged to be very high, though, and asking the sender for confirmation
is a cost too.  For tech mailing lists I think it's an easily borne cost,
but not for companies doing business with the public.

> and (b) because I wanted to show a completely different spam
> filtering technology that doesn't depend on content at all.

As whitelist gimmicks go, it appears to be a very good one.  Whether
whitelists are appropriate depends on the intended use.

> 	I would love to see a Python-based product that integrates well
> with Postfix/etc. and lets me pick and choose enterprise-wide spam
> filtering methods in whatever order I want.  Pure "ASK"-like behaviour
> could be configured for large enterprise installations, but small
> workgroups would only "ASK" the senders for a confirmation if their
> first email looked too much like a spam (according to SpamBayes).  Add
> in a Python port of SpamAssassin's methods and I think then you'd have
> a serious tool for stopping spam at the server.

Regardless of scheme, I urge running tests and measuring error rates, else
it's just whistling in the dark.  "More technology" without such guidance is
more likely to hurt than help (unless you think a typically overburdened
admin is going to understand the interactions among half a dozen distinct
systems perfectly out of the box).

> 	In short, I don't think filtering at the per-user level is a
> solution to the spam problem, because it only saves the time of the
> individual to identify and delete the spam.

From my POV, that *is* my "spam problem".

> It does not save the cost of delivery, which is mostly on the
> receiving infrastructure.

Now that you mention it, funding would be helpful <wink>.


From piersh@friskit.com  Fri Nov 15 21:52:35 2002
From: piersh@friskit.com (Piers Haken)
Date: Fri, 15 Nov 2002 13:52:35 -0800
Subject: [Spambayes] fix for Outlook 'Spam' field
Message-ID: <9891913C5BFE87429D71E37F08210CB9297513@zeus.sfhq.friskit.com>

First off: I've never been able to get the 'Spam' field in Outlook to
work well on my system. It may have something to do with the fact that I'm
using Exchange, but I always found that some messages had a rating, some
didn't, and invariably the number in the field didn't match the 'show
spam clues' number.

So here's a patch. It does a couple of things:

1) firstly it changes the Class of the 'Spam' field to olPercent, which
I believe is much more appropriate than olCombination. The problem with
olCombination is that you have to manually change the field type in
outlook in order to get anything to show up. With olPercent, the column
shows up with a nice '%' sign which makes it more obvious what the
number actually means.

2) secondly it adds a checkbox 'Update spam scores' to the training
dialog. Checking this box causes the trainer to update the spam field
for ALL messages in your training folders (in a second pass, if
necessary). This means that ALL messages in your inbox have an entry in
that field, not just those that arrived since you installed the plugin.
This was a huge win for me since it allowed me to sort by the spam field
and throw away about 20 spams from my inbox that I had missed during my
initial manual pruning.


The only issue here is that in order for this to work right, you'll have
to manually delete your existing spam fields, restart outlook and then
'rescore'.

Also, Mark, you never committed that patch I sent you that fixed the
CompareIDs bug in the FolderSelector dialog. Was there a problem with
it?

Piers.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spam_field.patch
Type: application/octet-stream
Size: 16150 bytes
Desc: spam_field.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/70cf95b9/spam_field.exe
From neale@woozle.org  Fri Nov 15 21:40:55 2002
From: neale@woozle.org (Neale Pickett)
Date: 15 Nov 2002 13:40:55 -0800
Subject: [Spambayes] Just for fun
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCCEKKCLAB.tim.one@comcast.net>
Message-ID: <w53r8dm8qag.fsf@woozle.org>

So then, Tim Peters <tim.one@comcast.net> is all like:

> There's a msg to this list being held for moderator approval.  I'm going to
> let it thru.  The subject is
> 
>     Bullet proof bulk email friendly hosting & cheap mass email
> 
> This should be a good test of whether your classifier thinks *everything*
> sent to this list is ham <oink>.

Uhhh, has anyone else noticed that mail to this list has SpamAssassin
headers in it?  I took SA out of my mail path months ago but just
noticed the headers when checking out that spam's headers.

Mightn't SA's score in the message headers bias the results of a later
SpamBayes run?

From rob@hooft.net  Fri Nov 15 21:53:47 2002
From: rob@hooft.net (Rob Hooft)
Date: Fri, 15 Nov 2002 22:53:47 +0100
Subject: [Spambayes] Better optimization loop
Message-ID: <3DD56CEB.7050406@hooft.net>

I've been playing a bit more with the weakloop concept. As Tim reported 
earlier, there is no chance that the "weak" training can be optimized 
this way. There are just too many binary choices in the training, 
resulting in a very bad optimization field.

The "train automatically" mode that Tim proposed and that is much more 
stable runs way too slowly to work as a step in an optimization.

So: I'm back at timcv.py. I removed weakloop.py from the CVS, and added 
a new 'simplexloop.py' that takes a single option: '-c commandline'. The 
command line will then be repeatedly executed with different 
bayescustomize.ini values, optimizing the cost that is reported as the 
third word of the last line of the output.

Obviously, I needed to change the output of timcv.py to report the 
flexcost, and that I did by introducing a generic CostCounter class 
which is in its own module.
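
A minimal sketch of the "third word of the last line" convention described
above, for anyone wiring their own command into the loop.  This is not the
code in simplexloop.py; the function names are hypothetical, and treating
"last line" as "last non-empty line" is an assumption.

```python
import shlex
import subprocess


def cost_from_output(text):
    """Return the third whitespace-separated word of the last non-empty line."""
    lines = [line for line in text.splitlines() if line.strip()]
    return float(lines[-1].split()[2])


def run_and_score(commandline):
    """Run one trial command and pull the cost out of its stdout."""
    result = subprocess.run(shlex.split(commandline),
                            capture_output=True, text=True, check=True)
    return cost_from_output(result.stdout)


# Hypothetical output; the real timcv.py report will look different.
sample = "lots of per-fold statistics\nflex cost: 53535.36\n"
print(cost_from_output(sample))
```

An optimizer only needs run_and_score() to be a function of the current
option values, so whatever writes bayescustomize.ini has to run before it.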

I am currently running:

   python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \
      --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out

But I'm so curious about other people's results that I've already 
committed this before letting it run to completion. During the small 
test runs I did make, I learned that even this cost function has very 
sharp edges. I think this is caused by very often occurring wordprobs 
that are either used or not used by a small step in 'min_prob_strength' 
or one of the other parameters. I think this is harmless if the training 
sets are large enough.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim.one@comcast.net  Fri Nov 15 22:04:49 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 17:04:49 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <w53r8dm8qag.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEMKCLAB.tim.one@comcast.net>

[Neale Pickett]
> Uhhh, has anyone else noticed that mail to this list has SpamAssassin
> headers in it?

Yes; all email going thru python.org does.

> I took SA out of my mail path months ago but just noticed the headers
> when checking out that spam's headers.
>
> Mightn't SA's score in the message headers bias the results of a later
> SpamBayes run?

We ignore most header lines by default; unless you've done something to
change that, the classifier is blind to SA's headers (because the tokenizer
doesn't look at them).

If you enable count_all_header_lines, or add SA's keywords to the
safe_header_lines list, then the fact that they exist will be recorded (but
nothing about their content).

If you enable basic_header_tokenize and don't exclude SA's headers via the
basic_header_skip regexp list, then SA's headers will be fully tokenized.

Those are the only things you could do to make SA's headers visible, short
of changing the tokenizer source code.  By default, count_all_header_lines
is false, SA's keywords are not in safe_header_lines, and
basic_header_tokenize is false.
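
The three cases above can be summarized as a small decision function.  This
is only a restatement of the rules in this message, not tokenizer code: the
function name, the return strings and the use of re.match against a
lower-cased header name are all assumptions.

```python
import re


def header_visibility(name,
                      count_all_header_lines=False,
                      safe_header_lines=(),
                      basic_header_tokenize=False,
                      basic_header_skip=()):
    """Say how much of a header line the classifier would get to see."""
    name = name.lower()
    skipped = any(re.match(pattern, name) for pattern in basic_header_skip)
    if basic_header_tokenize and not skipped:
        return "fully tokenized"
    if count_all_header_lines or name in safe_header_lines:
        return "presence only"
    return "invisible"


# With everything left at the defaults described above:
print(header_visibility("X-Spam-Status"))
```

So a SpamAssassin header only influences scoring if one of those options
has been switched on.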


From noreply@sourceforge.net  Fri Nov 15 21:47:48 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Fri, 15 Nov 2002 13:47:48 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-639122 ] hammie: ignore emails older than n days
Message-ID: <E18CoJY-0004CN-00@sc8-sf-web1.sourceforge.net>

Patches item #639122, was opened at 2002-11-15 15:47
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639122&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jason Hildebrand (jdhildeb)
Assigned to: Nobody/Anonymous (nobody)
Summary: hammie: ignore emails older than n days

Initial Comment:
Since your documentation stresses the importance of
training using only relatively recent emails, I thought
a good way to do this would be to have hammie do it for me.

So I added a new configuration option:

[Hammie]
# when training, hammie will ignore messages older than this
# number of days.
# i.e. set to 365 to ignore messages older than one year.
# Set to 0 to disable any filtering by date.
ignore_old_messages: 0

The patch also modifies Hammie to output the number of
messages it read/ignored for each mail file it processes.

This option might also prove useful for doing
incremental training (i.e. set up cron to train once a
week, and set ignore_old_messages to 7).



From jason@peaceworks.ca  Fri Nov 15 22:17:28 2002
From: jason@peaceworks.ca (Jason Hildebrand)
Date: 15 Nov 2002 16:17:28 -0600
Subject: [Spambayes] introduction + date filtering for hammie
Message-ID: <1037398648.10211.29.camel@trotzdem.raglan.org>

Hi all,

I'm new to this list.  I played with content-based spam-filtering a few
years ago in perl, and after coming across Gary Robinson's article (and
Graham's) was excited enough to implement both of these approaches in
python.

I was interested in using an approach with multiple metrics, which would
include bayesian calculations as well as other ad-hoc measurements (i.e.
the percentage of sentences ending with exclamation marks).  

I took these inputs and fed them into a back-propagating neural network
(BPNN) using a python module I found on the web.  My hope was that the
neural network would find the optimum "weights" to use for combining the
multiple inputs into a single output, and would also determine the
optimum cutoff-point between ham/spam, so that no "tweaking" would be
required.  

My initial tests (training on 100-500 emails) showed the neural network
approach (using Robinson as one of the metrics) was somewhat better than
either Graham or Robinson without the BPNN.  However, when I
started training on larger corpuses (I've been collecting spam since
1998), its accuracy degraded.  I did some more reading on the
limitations of BPNNs (namely overtraining), and this result made sense.

So now I've ended up here.  :)

I'm still getting up-to-speed on the spambayes code.  So far, I have one
improvement to offer:

Since your documentation stresses the importance of training using only
relatively recent emails, I thought a good way to do this would be to
have hammie filter out old messages for me.  So I added a new
configuration option: 

[Hammie] 
# when training, hammie will ignore messages older than this number of
# days. 
# i.e. set to 365 to ignore messages older than one year. 
# Set to 0 to disable any filtering by date. 
ignore_old_messages: 0 

I also modified Hammie to output the number of messages it read/ignored
for each mail file it processes. 

This option might also prove useful for doing incremental training (i.e.
set up cron to train once a week, and set ignore_old_messages to 7).
Caveat: this won't catch spams whose dates are deliberately set in the
past, such as January 1, 1970 (I've seen a few).
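
A minimal sketch of the age test such an option needs.  This is not the
uploaded patch; the function name and the decision to keep messages with a
missing or unparseable Date header are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def is_too_old(msg, ignore_old_messages, now=None):
    """True if msg's Date header is more than ignore_old_messages days old."""
    if ignore_old_messages <= 0:
        return False  # 0 disables filtering by date
    now = now or datetime.now(timezone.utc)
    try:
        sent = parsedate_to_datetime(msg["Date"])
    except (TypeError, ValueError):
        return False  # missing or unparseable Date: keep the message
    if sent.tzinfo is None:
        sent = sent.replace(tzinfo=timezone.utc)  # treat naive dates as UTC
    return now - sent > timedelta(days=ignore_old_messages)


msg = message_from_string("Date: Fri, 15 Nov 2002 16:17:28 -0600\n\nbody\n")
print(is_too_old(msg, 7, now=datetime(2002, 11, 30, tzinfo=timezone.utc)))
```

A spam with a forged 1970 date counts as old under this test, so it would be
skipped during training, which is the caveat mentioned above.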

I've uploaded the patch to the sourceforge project page; hopefully
someone has time to take a look at it.

-- 
Jason D. Hildebrand
jason@peaceworks.ca


From francois.granger@free.fr  Fri Nov 15 23:32:52 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Sat, 16 Nov 2002 00:32:52 +0100
Subject: [Spambayes] Just for fun
In-Reply-To: <3DD55970.4030500@hooft.net>
References: <LNBBLJKPBEHFEDALKOLCIELECLAB.tim.one@comcast.net>
 <3DD55970.4030500@hooft.net>
Message-ID: <a05100303b9fb3436d2ce@[192.168.1.11]>

At 21:30 +0100 15/11/02, in message Re: [Spambayes] Just for fun, Rob 
Hooft wrote:
>Tim Peters wrote:
>>[Jeremy]
>>
>>>Here's how that message scored for me:
>>>    Score: 0.998364543478
>>>
>>>    Clues
>>>    -----
>>>    *H* 0.003146604587
>>>    *S* 0.999875691542
>>
>>
>>Good!  On my tiny still-hapax-driven purely-mistake-based at-home classifier
>>(which is up 79 each of ham and spam trained on) it fared much worse:
>>
>>Spam Score: 0.303213
>>
>>word                                spamprob         #ham  #spam
>>'*H*'                               0.496939            -      -
>>'*S*'                               0.103364            -      -
>
>So that makes 80 in your classifier....
>
>For my one-time-only private-email few-weeks-old too-lazy-to-retrain 
>classifier, this message did 0.88 due to the spambayes ham clues....

In my database, trained very badly with only some incoming ham and spam, it did:

Spam probability: 0.65072156

Clues:

*H*	0.33760104
*S*	0.63904417
content-type:text/plain	0.21812081
x-mailer:none	0.34220907
noheader:date	0.84482759
noheader:to	0.84482759

-- 
Email is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From lists@morpheus.demon.co.uk  Fri Nov 15 22:41:22 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 15 Nov 2002 22:41:22 +0000
Subject: [Spambayes] Training on individual messages
Message-ID: <n2m-g.isyy5uct.fsf@morpheus.demon.co.uk>

I'm looking at a Gnus interface to Spambayes (I'm at home now, so I've
got rid of Outlook for the weekend :-))

The main issue is training, and in particular individual-message
training. I've added an option to hammie to train on a single message,
read from stdin. This allows me to implement a "this is ham/spam"
action without needing a temporary file.
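
A minimal sketch of the stdin idea, assuming only the standard email
package.  The learn callback stands in for whatever training entry point
hammie ends up exposing; it and the function name are hypothetical.

```python
import email
import io


def train_one(stream, is_spam, learn):
    """Parse one message from a file-like object and hand it to learn()."""
    msg = email.message_from_file(stream)
    learn(msg, is_spam)
    return msg


# Stand-in for the real trainer: just record what it was asked to learn.
seen = []
raw = io.StringIO("Subject: cheap bulk hosting\n\nBuy now.\n")
train_one(raw, True, lambda msg, is_spam: seen.append((msg["Subject"], is_spam)))
print(seen)
```

From an editor hook the stream would simply be sys.stdin, so the message
never has to be written to a temporary file.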

I wonder, though - is this the right thing to do? Should Hammie be
growing more and more options (at the back of my mind is the
possibility of an "unlearn" option, needed if a message gets
misclassified) or should these sorts of things be split out into
separate utilities?

There have been some messages recently about some form of "Corpus" class
- is that going to address any of this?

Paul.
-- 
This signature intentionally left blank

From tim.one@comcast.net  Sat Nov 16 00:48:53 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 19:48:53 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: <a05100303b9fb3436d2ce@[192.168.1.11]>
Message-ID: <LNBBLJKPBEHFEDALKOLCCENOCLAB.tim.one@comcast.net>

[François Granger]
> In my database, trained very badly with only some incoming ham and
> spam, it did:
>
> Spam probability: 0.65072156
>
> Clues:
>
> *H*	0.33760104
> *S*	0.63904417
> content-type:text/plain	0.21812081
> x-mailer:none	0.34220907
> noheader:date	0.84482759
> noheader:to	0.84482759

Something went wrong there.  That's what you'd get for an entirely empty
file.  The email pkg makes up text/plain by default, and the other three are
"negative clues" generated when we *don't* see a thing in the headers.  This
msg did have a To header, and did have a Date header ... it also had an
X-Mailer header.


From tim.one@comcast.net  Sat Nov 16 04:14:44 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 23:14:44 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
Message-ID: <LNBBLJKPBEHFEDALKOLCEEOMCLAB.tim.one@comcast.net>

Robert Woodhead mentioned an idea for using both unigrams and bigrams that
might help, with a twist to avoid generating highly correlated clues.

Gary Robinson was independently thinking along the same lines, and offline
sketched a more fleshed-out similar scheme for doing this with unigrams,
bigrams and trigrams.

I implemented the latter but in a somewhat "purer" form.  A patch for
classifier.py is attached.

Now I don't have any data that can show improvements, so whether this might
help beats me.  It wasn't a disaster for me, which is saying something,
since previous ideas along these lines were clearly steps backward (as
measured by error rates).

So I need someone who's *not* getting great results now to try it (Anthony?
Skip?).  Big caution:  this is a memory hog.  I don't have enough RAM to run
my full c.l.py test, or even half of it.  Here's from a small-subset 10-fold
CV run:

filename:   before     tri
ham:spam:  3000:3000
                   3000:3000
fp total:        0       0
fp %:         0.00    0.00
fn total:        0       0
fn %:         0.00    0.00
unsure t:       26      42
unsure %:     0.43    0.70
real cost:   $5.20   $8.40
best cost:   $0.00   $0.00
h mean:       0.37    0.50
h sdev:       3.07    3.77
s mean:      99.92   99.87
s sdev:       1.49    2.06
mean diff:   99.55   99.37
k:           21.83   17.04

Judging from the error rates, it's got nothing going for it or against it.
Why it *might* help:  while "Python" is a very strong ham word in my tests,
"Python Video" is a porn vendor, and this scheme should reliably know the
difference.  Etc.  My data isn't hard enough for it to matter.
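
To make the "Python" versus "Python Video" point concrete, here is a plain
sketch of emitting unigrams, bigrams and trigrams from a token stream.  It
is not the attached patch and deliberately leaves out the twist for
avoiding correlated clues; the function name is hypothetical.

```python
def ngrams(tokens, max_n=3):
    """Yield every run of 1..max_n consecutive tokens, joined by spaces."""
    tokens = list(tokens)
    for start in range(len(tokens)):
        for n in range(1, max_n + 1):
            if start + n <= len(tokens):
                yield " ".join(tokens[start:start + n])


print(list(ngrams(["python", "video", "now"])))
```

Each message of k tokens produces close to 3k clues this way, which is one
way to see why the database grows so much faster than with unigrams alone.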

If this really helps someone, then a number of things follow:  cut it off
with bigrams instead; boost it to 4-grams instead; if more than bigrams are
needed for it to help, buy into some hashing scheme to make the database
burden finite again.

As I saw before with pure bigrams, conference announcements once again move
into high-scoring territory, but not nearly so bad.  For example, the OSCON
2000 announcement got penalized for

prob('electronic mail') = 0.969799
prob('and companies') = 0.973373
prob('last name:') = 0.973373
prob('the completion') = 0.973373
prob('individuals who') = 0.976644
prob('cutting edge') = 0.978469
prob('fax the') = 0.978469
prob('target audience') = 0.978469
prob('the subject line') = 0.980893
prob('send all') = 0.981928
prob('not accepted.') = 0.991493
prob('with marketing') = 0.992611
prob('your email') = 0.996391
prob('will receive') = 0.997

but also got helped by

prob('note that the') = 0.0145631
prob('the call') = 0.0167286
prob('the tutorial') = 0.0167286
prob('problems that') = 0.0302013
prob('the open source') = 0.0348837
prob('tutorial and') = 0.0412844
prob('sent via') = 0.0608351
prob('other open') = 0.0652174
prob('proposals for') = 0.0652174
prob('text with') = 0.0652174
prob('the convention') = 0.0652174
prob('with open') = 0.0652174
prob('and open') = 0.0918367
prob('convention the') = 0.0918367
prob('for programmers,') = 0.0918367
prob('itself and the') = 0.0918367
prob('open source software') = 0.0918367
prob('source software') = 0.0918367
prob('that leads') = 0.0918367
prob('wide variety') = 0.0918367

In the end, it was highly ambiguous, with

prob = 0.500000084396
prob('*H*') = 1
prob('*S*') = 1
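
The final score in these clue listings looks like a simple function of the
two pseudo-clues: score = (S - H + 1) / 2.  That formula isn't spelled out
in this thread, so treat it as an assumption; it does reproduce the scores
quoted earlier.

```python
def combine(h, s):
    """Map the *H* and *S* pseudo-clues to a single score in [0, 1]."""
    return (s - h + 1.0) / 2.0


# (H, S) pairs quoted in this thread.
for who, h, s in [("Jeremy", 0.003146604587, 0.999875691542),
                  ("Tim, at-home classifier", 0.496939, 0.103364),
                  ("Francois", 0.33760104, 0.63904417),
                  ("OSCON announcement", 1.0, 1.0)]:
    print(f"{who}: {combine(h, s):.6f}")
```

When both H and S are close to 1, strong evidence on both sides cancels and
the score lands near 0.5, which is why the OSCON announcement comes out as
highly ambiguous rather than as ham or spam.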


From tim.one@comcast.net  Sat Nov 16 04:48:14 2002
From: tim.one@comcast.net (Tim Peters)
Date: Fri, 15 Nov 2002 23:48:14 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEOMCLAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEONCLAB.tim.one@comcast.net>

[Tim]
> ...
> I implemented the latter but in a somewhat "purer" form.  A patch for
> classifier.py is attached.

Sorry about that -- it's attached to this.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tri.patch
Type: application/octet-stream
Size: 3880 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021115/25c41b69/tri.exe
From rob@hooft.net  Sat Nov 16 05:34:29 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 16 Nov 2002 06:34:29 +0100
Subject: [Spambayes] Better optimization loop
References: <3DD56CEB.7050406@hooft.net>
Message-ID: <3DD5D8E5.1020708@hooft.net>

Rob Hooft wrote:
> 
> I am currently running:
> 
>   python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \
>      --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out
> 

....and it crashed at:

x=0.5698 p=0.0592 s=0.5238 sc=0.922 hc=0.031 65023.78
x=0.5899 p=0.0498 s=0.5485 sc=0.925 hc=-0.000 64856.12
x=0.6177 p=0.0300 s=0.5853 sc=0.934 hc=-0.065 60950.20
x=0.6002 p=0.0488 s=0.5679 sc=0.930 hc=-0.054 61612.61
x=0.6336 p=0.0320 s=0.5783 sc=0.939 hc=-0.107 58698.02
x=0.6799 p=0.0088 s=0.6141 sc=0.953 hc=-0.212 53820.35
x=0.6485 p=0.0253 s=0.6067 sc=0.953 hc=-0.164 56007.05
x=0.6802 p=0.0158 s=0.6332 sc=0.955 hc=-0.219 53535.36
x=0.7354 p=-0.0059 s=0.6879 sc=0.971 hc=-0.344 48832.47
x=0.7339 p=-0.0318 s=0.7062 sc=0.974 hc=-0.358 48426.74
x=0.8114 p=-0.0849 s=0.8000 sc=1.000 hc=-0.548 43010.31
x=0.7970 p=-0.0595 s=0.7497 sc=0.995 hc=-0.479 44800.35
x=0.8512 p=-0.0765 s=0.7981 sc=1.014 hc=-0.634 40904.48
x=0.9680 p=-0.1297 s=0.9045 sc=1.055 hc=-0.919 33847.97
x=0.9481 p=-0.1338 s=0.8958 sc=1.037 hc=-0.837 35574.58

i.e. it is reducing the cost by pulling ham_cutoff and spam_cutoff 
infinitely far apart, but was stopped by a negative log....

I will think of a better flex cost, and start over...

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sat Nov 16 05:46:19 2002
From: rob@hooft.net (Rob Hooft)
Date: Sat, 16 Nov 2002 06:46:19 +0100
Subject: [Spambayes] Better optimization loop
References: <3DD56CEB.7050406@hooft.net> <3DD5D8E5.1020708@hooft.net>
Message-ID: <3DD5DBAB.4050101@hooft.net>

Rob Hooft wrote:
> Rob Hooft wrote:
> 
>>
>> I am currently running:
>>
>>   python2.3 simplexloop.py -c 'python2.3 timcv.py -n 10 \
>>      --spam-keep=600 --ham-keep=600 -s 12345' > simplexloop.out
>>
> 
> ....and it crashed at:
> 
> x=0.9680 p=-0.1297 s=0.9045 sc=1.055 hc=-0.919 33847.97
> x=0.9481 p=-0.1338 s=0.8958 sc=1.037 hc=-0.837 35574.58

It was caused by an infinitely silly bug in the costcount that made it 
into a proper function to maximize instead of minimize. I shouldn't do 
things in the middle of the night (nor at 6:30 in the morning, but I'm 
trying to fix the pain I may have caused...)

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From niltsiar@neo.rr.com  Sat Nov 16 06:11:24 2002
From: niltsiar@neo.rr.com (Todd Mokros)
Date: 16 Nov 2002 01:11:24 -0500
Subject: [Spambayes] small vulnerability patch
Message-ID: <1037427084.31134.17.camel@localhost>

here's a small patch to fix a small header vulnerability.  If a piece of
spam spoofs the header added by hammie, then procmail recipes could
match on the spoofed header.  This deletes the hammie header before
filtering.


--- ../../cvs-tracking/spambayes/hammie.py      2002-11-14 17:00:15.000000000 -0500
+++ hammie.py   2002-11-16 00:44:50.000000000 -0500
@@ -272,6 +272,8 @@
         """
 
         msg = mboxutils.get_message(msg)
+        if msg.has_key(header):
+            del msg[header]
         prob, clues = self._scoremsg(msg, True)
         if prob < ham_cutoff:
             disp = options.header_ham_string
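
The same idea can be sketched in Python 3 with the standard email
package.  This is only an illustration, not the hammie.py code: the
header name X-Hammie-Disposition and the function name retag are
assumptions (hammie takes the real header name from its options).

```python
import email

# Assumed header name for illustration; hammie reads the real one from its options.
HEADER = "X-Hammie-Disposition"

def retag(raw_message, disposition, header=HEADER):
    """Drop any pre-existing (possibly spoofed) header, then add our own."""
    msg = email.message_from_string(raw_message)
    if header in msg:        # modern spelling of msg.has_key(header)
        del msg[header]      # removes every occurrence of that header
    msg[header] = disposition
    return msg

# A spam that tries to pass itself off as already judged ham.
spoofed = (
    "Subject: cheap pills\n"
    "X-Hammie-Disposition: No; 0.00\n"
    "\n"
    "buy now\n"
)
tagged = retag(spoofed, "Yes; 0.99")
print(tagged.get_all(HEADER))
```

Only the freshly added value survives, so a procmail recipe matching on
the header cannot be fooled by a copy the sender put there.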


-- 
Todd Mokros <niltsiar@neo.rr.com>

From tim.one@comcast.net  Sat Nov 16 06:36:04 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 16 Nov 2002 01:36:04 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEOMCLAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEPECLAB.tim.one@comcast.net>

[Tim]
> ...
> Skip?).  Big caution:  this is a memory hog.  I don't have enough
> RAM to run my full c.l.py test, or even half of it.

So the new patch attached plays hash games to slash it.  Changing MASK to
boost it may help; it's set for 256K max hash codes as-is.

On my full c.l.py test (which has over 330K distinct words, so squashing
into 256K hash codes necessarily conflates many words):

filename:       cv     tri
ham:spam:  20000:14000  20000:14000
fp total:        3       0
fp %:         0.01    0.00
fn total:        0       1
fn %:         0.00    0.01
unsure t:      103     586
unsure %:     0.30    1.72
real cost:  $50.60 $118.20
best cost:  $21.40  $32.40
h mean:       0.24    1.69
h sdev:       2.76    5.70
s mean:      99.93   99.68
s sdev:       1.59    3.37
mean diff:   99.69   97.99
k:           22.92   10.80

The Unsure rate zoomed.  I'm not sure why.  The lowest-scoring spam was
absurd, a giant multi-level marketing spam written in German:

prob = 0.0580526384697
prob('*H*') = 1
prob('*S*') = 0.116105
prob('haben sie schon') = 0.00185261
prob('gegeben finanziell') = 0.00405771
prob('... ich habe') = 0.00413223
prob('die power') = 0.00418173
prob('skip:d 10 wurde mir') = 0.00464396
prob('und adresse die') = 0.00530035
prob('skip:a 10 passierte') = 0.00570342
prob('#6".') = 0.0065312
prob('ein produkt,') = 0.00715421
prob('weiteren schwung') = 0.00764007
prob('sie bei') = 0.00790861
prob('zealand ich') = 0.00872423
prob('beste') = 0.00884086
prob('sich ein fenster') = 0.00920245
prob('100 bestellungen (oder') = 0.00959488

etc.  Of course it's never seen most of those phrases at all in ham, but
hash codes don't know that.

The full quote of the Nigerian-scam spam fell from off-the-charts spam to
middling Unsure.  Again hash collisions must account for it:

Data/Ham/Set5/74506.txt
prob = 0.580354361406
prob('*H*') = 0.839291
prob('*S*') = 1
prob('report the existence') = 0.00238221
...
prob('identified the amount') = 0.00455005
prob('country. please note') = 0.00693374
prob('numbers your reply') = 0.00715421
prob('process. because the') = 0.00959488
...
prob('duties, have') = 0.0328367
prob('foreign partner.') = 0.0503757
prob('solicit your strict') = 0.0724398
prob('which chairman') = 0.0757576
prob('skip:w 10 "abass kabiru"') = 0.0812396
...
prob('25% for') = 0.0907928
prob('subject::  subject: ') = 0.0937339
prob('would use') = 0.102003
prob('for skip:m 10 intend') = 0.103881
prob('complex,') = 0.107769
prob('matter trust') = 0.108386
prob('more details this') = 0.123444
prob('subject: ( subject:)') = 0.127565
prob('present authorities, they') = 0.12886
...
prob('housing federal secretariat') = 0.207914

So like previous gimmicks using hash codes, the mistakes are unfathomable to
human eyes, although you're unlikely to see any unless you've got a lot of
training and testing data (in which case wild mistakes become more certain
the more you've got).  When the hash space is too small (as it surely was in
this test), what *would* have been mild-prob hapaxes get associated with
strong-probability phrases by accident.

Aha!  "On average" you can expect those accidents to cancel out, but
chi-combining tends to Unsure in the presence of cancellation.  I bet that
explains the bulk of the Unsure rate boost.  Sometimes the accidents will
pile up in one direction or the other, though, likely accounting for the
examples above (especially the German example, where the hash code of
virtually every phrase is an accident).
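
A rough sketch of why cancellation lands near 0.5 under chi-combining.
The clue listings above agree with score = (S - H + 1) / 2; for the
German spam, (0.116105 - 1 + 1) / 2 = 0.058.  The function names and the
clue values below are illustrative assumptions, and the sketch assumes
every clue probability is strictly between 0 and 1.

```python
import math

def chi2Q(x2, v):
    """Return P(chisq >= x2) with v degrees of freedom; v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_combine(probs):
    """Return (score, H, S) for a list of per-clue spam probabilities."""
    n = len(probs)
    if n == 0:
        return 0.5, 0.0, 0.0   # no evidence; the H and S values here are arbitrary
    # S grows toward 1 when many clues are strongly spammy,
    # H grows toward 1 when many clues are strongly hammy.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0, H, S

ham_clues = [0.01] * 20     # made-up strong ham clues
spam_clues = [0.99] * 20    # made-up strong spam clues

print(chi_combine(ham_clues))               # score near 0
print(chi_combine(spam_clues))              # score near 1
print(chi_combine(ham_clues + spam_clues))  # H and S both near 1, score near 0.5
```

When strong clues pile up on both sides, H and S both saturate near 1
and the score sits in the middle, which is the Unsure region.
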
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tri2.patch
Type: application/octet-stream
Size: 5256 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021116/dd2c93b0/tri2.exe
From noreply@sourceforge.net  Sat Nov 16 12:32:50 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sat, 16 Nov 2002 04:32:50 -0800
Subject: [Spambayes] [ spambayes-Patches-639310 ] fix for outlook 'spam' field
Message-ID: <E18D282-0005EN-00@sc8-sf-web1.sourceforge.net>

Patches item #639310, was opened at 2002-11-16 04:32
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Piers Haken (piersh)
Assigned to: Nobody/Anonymous (nobody)
Summary: fix for outlook 'spam' field

Initial Comment:
1) firstly it changes the Class of the 'Spam' field to 
olPercent, which I believe is much more appropriate than 
olCombination. The problem with olCombination is that 
you have to manually change the field type in outlook in 
order to get anything to show up. With olPercent, the 
column shows up with a nice '%' sign which makes it 
more obvious what the number actually means.

2) secondly it adds a checkbox 'Update spam scores' to 
the training dialog. Checking this box causes the trainer 
to update the spam field for ALL messages in your 
training folders (in a second pass, if necessary). This 
means that ALL messages in your inbox have an entry 
in that field, not just those that arrived since you 
installed the plugin. This was a huge win for me since it 
allowed me to sort by the spam field and throw away 
about 20 spams from my inbox that I had missed during 
my initial manual pruning.


The only issue here is that in order for this to work right, 
you'll have to manually delete your existing spam fields, 
restart outlook and then 'rescore'.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702

From noreply@sourceforge.net  Sat Nov 16 12:35:19 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sat, 16 Nov 2002 04:35:19 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-639312 ] fix for outlook CompareEntryIDs bug
Message-ID: <E18D2AR-0005Gl-00@sc8-sf-web1.sourceforge.net>

Patches item #639312, was opened at 2002-11-16 04:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Piers Haken (piersh)
Assigned to: Nobody/Anonymous (nobody)
Summary: fix for outlook CompareEntryIDs bug

Initial Comment:
This patch reenables the CompareEntryIDs for 
comparing folder IDs. It passes both the MAPI Session 
and the Outlook Session into the dialog, one for retrieving 
the exchange-compatible IDs and the other for 
comparing them.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

From richie@entrian.com  Sat Nov 16 18:05:32 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 16 Nov 2002 18:05:32 +0000
Subject: [Spambayes] Training on individual messages
In-Reply-To: <n2m-g.isyy5uct.fsf@morpheus.demon.co.uk>
References: <n2m-g.isyy5uct.fsf@morpheus.demon.co.uk>
Message-ID: <d877b4e28ed4aa7b.dlg@entrian.com>

Hi Paul,

> I wonder, though - is this the right thing to do? Should Hammie be
> growing more and more options (at the back of my mind is the
> possibility of an "unlearn" option, needed if a message gets
> misclassified) or should these sorts of things be split out into
> separate utilities?

They should be in a shared module IMHO, and you're right about this:

> There's been some messages recently about some form of "Corpus" class
> - is that going to address any of this?

Yes - Tim Stone's Corpus class, which he's just committed, encapsulates 
a corpus of emails, and lets you set up automatic training when 
adding/removing/moving messages.  So for instance, you create a Spam 
corpus, attach a Trainer object to it, and call addMessage - that adds 
the message to the corpus, and trains on that message as Spam.  Removing 
the message untrains it.  pop3proxy.py is now using this for a web-based 
training interface, which I'm hoping to commit in the next couple of 
days.
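
A minimal sketch of the pattern described above: a corpus notifies an
attached trainer when messages come and go.  Every class and method name
here (SketchCorpus, SketchTrainer, SketchClassifier, add_message, ...) is
hypothetical and chosen for illustration; this is not the API of the
committed Corpus class.

```python
from collections import Counter

class SketchClassifier:
    """Toy stand-in: counts (token, is_spam) pairs instead of doing real math."""
    def __init__(self):
        self.counts = Counter()

    def learn(self, tokens, is_spam):
        for tok in tokens:
            self.counts[(tok, is_spam)] += 1

    def unlearn(self, tokens, is_spam):
        for tok in tokens:
            self.counts[(tok, is_spam)] -= 1
            if self.counts[(tok, is_spam)] <= 0:
                del self.counts[(tok, is_spam)]

class SketchTrainer:
    """Observer that trains or untrains as messages enter or leave a corpus."""
    def __init__(self, classifier, is_spam):
        self.classifier = classifier
        self.is_spam = is_spam

    def on_add(self, text):
        self.classifier.learn(text.split(), self.is_spam)

    def on_remove(self, text):
        self.classifier.unlearn(text.split(), self.is_spam)

class SketchCorpus:
    """Holds messages by key and notifies attached observers of changes."""
    def __init__(self):
        self.messages = {}
        self.observers = []

    def add_observer(self, observer):
        self.observers.append(observer)

    def add_message(self, key, text):
        self.messages[key] = text
        for obs in self.observers:
            obs.on_add(text)

    def remove_message(self, key):
        text = self.messages.pop(key)
        for obs in self.observers:
            obs.on_remove(text)

classifier = SketchClassifier()
spam_corpus = SketchCorpus()
spam_corpus.add_observer(SketchTrainer(classifier, is_spam=True))

spam_corpus.add_message("msg1", "win money now")   # trains as spam
trained = dict(classifier.counts)
spam_corpus.remove_message("msg1")                 # untrains again
```

The point of the design is that callers only move messages between
corpora; training stays in sync as a side effect.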

-- 
Richie Hindle
richie@entrian.com

From skip@pobox.com  Sat Nov 16 21:15:53 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 16 Nov 2002 15:15:53 -0600
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEOMCLAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCEEOMCLAB.tim.one@comcast.net>
Message-ID: <15830.46473.661765.562628@montanaro.dyndns.org>

    Tim> So I need someone who's *not* getting great results now to try it
    Tim> (Anthony?  Skip?).  Big caution: this is a memory hog.  I don't
    Tim> have enough RAM to run my full c.l.py test, or even half of it.
    Tim> Here's from a small-subset 10-fold CV run:

Here's some data for you (thank goodness for RAM!).  "base" is CVS (no
mods).  "tri" is CVS plus your first (no hash tricks) patch.  The "tri" run
consumed about 51 minutes of CPU on my Powerbook and pretty much ran in
memory the entire time.

    filename:     base     tri
    ham:spam:  10439:6134  10439:6134
    fp total:       24      25
    fp %:         0.23    0.24
    fn total:       71      55
    fn %:         1.16    0.90
    unsure t:      284     315
    unsure %:     1.71    1.90
    real cost: $367.80 $368.00
    best cost: $312.40 $324.40
    h mean:       0.68    0.71
    h sdev:       6.56    6.73
    s mean:      97.55   97.51
    s sdev:      12.70   12.35
    mean diff:   96.87   96.80
    k:            5.03    5.07

I haven't looked at any raw output.  Let me know if you want the raw
numbers.

Skip

From tim.one@comcast.net  Sat Nov 16 22:17:40 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 16 Nov 2002 17:17:40 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <15830.46473.661765.562628@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCIECGCMAB.tim.one@comcast.net>

I ran my fat c.l.py test w/ the hash space clamped at 256K buckets.  That
was clearly a bad idea for that test, since there are about 330K unique
unigrams in that corpus (let alone bigrams and trigrams).
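
A small sketch of why 256K buckets must conflate words in a 330K-word
vocabulary.  The hash function is an assumption: zlib.crc32 is used only
because it is deterministic across runs, and the scrubbed patch may well
have hashed differently.  The vocabulary is synthetic.

```python
import zlib

BITS = 18                  # 2**18 = 256K buckets, as in the first hashed run
MASK = (1 << BITS) - 1

def bucket(token, mask=MASK):
    """Map a token to a bucket number; crc32 is an assumed, deterministic hash."""
    return zlib.crc32(token.encode("utf-8")) & mask

def conflation(tokens, mask=MASK):
    """Return (number of distinct tokens, number of buckets they occupy)."""
    distinct = set(tokens)
    occupied = {bucket(tok, mask) for tok in distinct}
    return len(distinct), len(occupied)

# Synthetic vocabulary of roughly the size mentioned for the c.l.py corpus.
vocab = ["word%d" % i for i in range(330_000)]
n_tokens, n_buckets = conflation(vocab)

# At least n_tokens - 2**BITS tokens must share a bucket with another token.
print(n_tokens, n_buckets, n_tokens - n_buckets)
```

Raising BITS to 19, 20 or 21 corresponds to the 512K, 1M and 2M bucket
runs; collisions become rarer but never vanish, since the classifier sees
only the bucket number and cannot tell a new phrase from a trained one.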

cv below is the current all-default result on that test data, excepting for

[Tokenizer]
replace_nonascii_chars: True
record_header_absence: True

The # of unsures is lower than I reported before:  by staring at the
unsures, I found 10 entirely empty (0 bytes) files in my spam corpus.  Those
got replaced with random spam from the reservoir (the empty msgs had scored
as unsure).

All other runs here are on the same data.

tri19 is the hashed trigram gimmick with the hash space boosted to 512K (19
bits of hash code).  Contrary to expectations, the Unsure rate actually
increased over the run with 256K buckets.  But it still appeared to be due
to unlucky hash collisions.

So tri20 boosted the # of hash buckets to a million.  That still didn't
help.

At that point I switched body tokenization strategy:  I've long speculated
that split-on-whitespace helped us over alphanumeric-run tokenization
because s-o-w captures a *little* contextual information from the
punctuation, and because it generates highly correlated clues in a way that
*helps* (like "Python" and "Python?" count as distinct words).  But if we're
getting context and helpful correlation from bigrams and trigrams too, it
seems plausible that the punctuation context gets in the way.  So tri20a is
with a million hash buckets, but tokenizing via re.findall with

    [\w$\-\x80-\xff]+

instead of s-o-w.  Alas, overall its "best cost" was even worse than
tri19's.  s-o-w still rules.
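
To make the difference concrete, here is a sketch of the two body
tokenizations side by side.  The sample sentence and function names are
made up; note too that in Python 3, \w on str is Unicode-aware, which
differs from the byte-string behavior of the time.

```python
import re

# The character class quoted above: word chars, '$', '-', and high-bit chars.
ALNUM_RUN = re.compile(r"[\w$\-\x80-\xff]+")

def split_on_whitespace(text):
    """s-o-w: punctuation stays glued to the neighboring word."""
    return text.split()

def alnum_runs(text):
    """Regex tokenization: punctuation outside the class is dropped."""
    return ALNUM_RUN.findall(text)

sample = "Is Python? Yes, Python. Win $500 now!"

print(split_on_whitespace(sample))
# ['Is', 'Python?', 'Yes,', 'Python.', 'Win', '$500', 'now!']
print(alnum_runs(sample))
# ['Is', 'Python', 'Yes', 'Python', 'Win', '$500', 'now']
```

Under s-o-w, "Python?" and "Python." are distinct tokens carrying a bit
of punctuation context; the regex folds both into "Python".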

So tri21 went back to s-o-w, but boosted the # of hash buckets to 2 million.
This finally started moving "in the right direction" again, but still loses
to the original unhashed "exact" unigram scheme.

Since I probably have more than a million unique unigrams + bigrams +
trigrams (viewed as text strings) in this data, 2 million hash buckets is
certainly *not* excessive.  I expect it would do better with a lot more.
But, even with the hash trickery, at 2M buckets I'm again pushing the limit
of my RAM on the fat test (which trains on more than 30,000 msgs per run).

So pushing this more would require a different database structure.  So far
the results aren't good enough to make me keen to pursue it.

filename:       cv   tri19   tri20  tri20a   tri21
ham:spam:  20000:14000  20000:14000  20000:14000  20000:14000  20000:14000
fp total:        3       0       0       0       0
fp %:         0.01    0.00    0.00    0.00    0.00
fn total:        0       7       8       3       2
fn %:         0.00    0.05    0.06    0.02    0.01
unsure t:       91     926    1128    1133     854
unsure %:     0.27    2.72    3.32    3.33    2.51
real cost:  $48.20 $192.20 $233.60 $229.60 $172.80
best cost:  $17.80  $36.60  $39.60  $51.00  $38.60
h mean:       0.24    0.30    0.20    0.27    0.38
h sdev:       2.73    2.25    1.93    2.38    3.00
s mean:      99.95   97.44   96.70   96.94   97.89
s sdev:       1.40   10.17   11.61   10.95    9.12
mean diff:   99.71   97.14   96.50   96.67   97.51
k:           24.14    7.82    7.13    7.25    8.05

The FN under all hashed schemes are mostly long spam in foreign languages,
and *which* of those are judged ham varies across runs (changing the # of
hash buckets, and/or the tokenization strategy, changes the set of
accidental hash collisions).  Because they're long they generate lots of
hash codes; because they're foreign languages, the hash codes hit accidental
matches; do that often enough and you're bound to get something that looks
like solid ham.  In tri21, the lowest-scoring FN was at 0.01, and happened
to be a long spam in what looks like Polish.  Non-hashing schemes are immune
to this (brand new words are ignored, and the header clues dominate the
score, which is usually enough to nail it as spam).

The increase in Unsures appears to be almost entirely due to spam.  Here's
the ham score distro (in tri21) near 50:

47.0     0
47.5     0
48.0     1 *
48.5     2 *
49.0     0
49.5     1 *
50.0     1 *

and no ham scored higher than that.  The spam score distro near 50:

47.0     2 *
47.5     3 *
48.0     2 *
48.5     3 *
49.0     7 *
49.5   104 *
50.0   343 **
50.5    40 *
51.0    29 *
51.5    20 *
52.0     7 *
52.5    18 *
53.0    18 *

I don't know why that is (well, yes, it's a huge increase in "cancellation
disease" in spams, but I don't know *why* there's a huge cancellation
disease increase for spam but not for ham).

The quote of the Nigerian scam spam was the highest-scoring ham, scoring
exactly 0.5, with H=1 and S=1.  The H=1 appeared mostly due to extremely
strong ham clues in the headers, the strongest being:

prob('header:Subject:1 noheader:received noheader:x-abuse-info') = 4.88234e-005

Unfortunately, it's impossible to say whether that's "real" or was just a
hash accident.  It's pretty clear that this "ham clue" was an accident:

prob('7597133 federal ministry') = 0.0505618

and this even more so:

prob('housing (fmwh) nigeria.') = 0.0238095

The chance of this crap decreases as the # of hash buckets increases, but
increases the more training data you've got too.

better-the-devil-you-know?-ly y'rs  - tim


From tim.one@comcast.net  Sat Nov 16 23:51:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 16 Nov 2002 18:51:20 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIECGCMAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCECLCMAB.tim.one@comcast.net>

One more in this round; bi21 is 2M hash buckets but limited to unigrams and
bigrams (no trigrams):

filename:       cv   tri21    bi21
ham:spam:  20000:14000  20000:14000  20000:14000
fp total:        3       0       0
fp %:         0.01    0.00    0.00
fn total:        0       2       0
fn %:         0.00    0.01    0.00
unsure t:       91     854     300
unsure %:     0.27    2.51    0.88
real cost:  $48.20 $172.80  $60.00
best cost:  $17.80  $38.60  $23.20
h mean:       0.24    0.38    0.25
h sdev:       2.73    3.00    2.61
s mean:      99.95   97.89   99.41
s sdev:       1.40    9.12    4.67
mean diff:   99.71   97.51   99.16
k:           24.14    8.05   13.62

My guess is that it's more likely bigrams benefited from suffering fewer
unfortunate hash collisions than that they're actually generating better raw
info.

The "missing test" here is exact bigrams (no hash convolutions).  I'll try
that later; may not have enough RAM for that, but should.


From vanhorn@whidbey.com  Sun Nov 17 00:22:47 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Sat, 16 Nov 2002 16:22:47 -0800
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
References: <LNBBLJKPBEHFEDALKOLCCECLCMAB.tim.one@comcast.net>
Message-ID: <3DD6E157.2CF9450F@whidbey.com>

I've been meaning to ask, what do "real cost" and "best cost" actually mean?
I've seen you guys "spend" several million dollars while testing, and if it
"costs" that much to test for spam in this way, I'm going to have a heck of a
time marking it up and selling it to customers!

Van

Tim Peters wrote:

> One more in this round; bi21 is 2M hash buckets but limited to unigrams and
> bigrams (no trigrams):
>
> filename:       cv   tri21    bi21
> ham:spam:  20000:14000  20000:14000  20000:14000
> fp total:        3       0       0
> fp %:         0.01    0.00    0.00
> fn total:        0       2       0
> fn %:         0.00    0.01    0.00
> unsure t:       91     854     300
> unsure %:     0.27    2.51    0.88
> real cost:  $48.20 $172.80  $60.00
> best cost:  $17.80  $38.60  $23.20
> h mean:       0.24    0.38    0.25
> h sdev:       2.73    3.00    2.61
> s mean:      99.95   97.89   99.41
> s sdev:       1.40    9.12    4.67
> mean diff:   99.71   97.51   99.16
> k:           24.14    8.05   13.62
>
> My guess is that it's more likely bigrams benefited from suffering fewer
> unfortunate hash collisions than that they're actually generating better raw
> info.
>
> The "missing test" here is exact bigrams (no hash convolutions).  I'll try
> that later; may not have enough RAM for that, but should.
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org
> http://mail.python.org/mailman/listinfo/spambayes

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From tim.one@comcast.net  Sun Nov 17 00:44:16 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 16 Nov 2002 19:44:16 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <3DD6E157.2CF9450F@whidbey.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKECPCMAB.tim.one@comcast.net>

[G. Armour Van Horn]
> I've been meaning to ask, what do "real cost" and "best cost"
> actually mean?

In Options.py:

# After the display of a ham+spam histogram pair, you can get a listing
# of all the cutoff values (coinciding with histogram bucket boundaries)
# that minimize
#
#      best_cutoff_fp_weight * (# false positives) +
#      best_cutoff_fn_weight * (# false negatives) +
#      best_cutoff_unsure_weight * (# unsure msgs)
#
# This displays two cutoffs:  hamc and spamc, where
#
#     0.0 <= hamc <= spamc <= 1.0
#
# The idea is that if something scores < hamc, it's called ham; if
# something scores >= spamc, it's called spam; and everything else is
# called 'I'm not sure' -- the middle ground.
#
# Note:  You may wish to increase nbuckets, to give this scheme more
# cutoff values to analyze.
compute_best_cutoffs_from_histograms: True
best_cutoff_fp_weight:     10.00
best_cutoff_fn_weight:      1.00
best_cutoff_unsure_weight:  0.20

So by default, an FP is charged $10, an FN $1, and an unsure $0.20.  The
best cost is the lowest cost you could possibly have gotten by choosing ham
and spam cutoffs with perfect knowledge of how things would turn out.  The
real cost is how things actually turned out, using the ham and spam cutoffs
you supplied in advance.
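
As a quick check, the "real cost" rows in the tables posted in this
thread can be reproduced from the fp, fn and unsure totals with the
default weights.  The function name is made up for this sketch.

```python
def total_cost(fp, fn, unsure,
               fp_weight=10.00, fn_weight=1.00, unsure_weight=0.20):
    """Weighted cost of a run, using the default best_cutoff_* weights."""
    return fp_weight * fp + fn_weight * fn + unsure_weight * unsure

# (fp total, fn total, unsure t) taken from the tables in this thread.
print("$%.2f" % total_cost(3, 0, 91))     # cv    -> $48.20
print("$%.2f" % total_cost(0, 2, 854))    # tri21 -> $172.80
print("$%.2f" % total_cost(0, 0, 300))    # bi21  -> $60.00
```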

> I've seen you guys "spend" several million dollars while testing,
> and if it "costs" that much to test for spam in this way, I'm going
> to have a heck of a time marking it up and selling it to customers!

Relax; they're Canadian dollars <wink>.


From mhammond@skippinet.com.au  Sun Nov 17 03:38:10 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sun, 17 Nov 2002 14:38:10 +1100
Subject: [Spambayes] Too much information
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEBCHMAA.mhammond@skippinet.com.au>

>From the spam that Tim let through - it came in at 0.8 - not too bad given
the strong ham indications from mailman.

But now the project is telling me things I really don't want to know - like
I am being undervalued.  From this spam's hints:

word                                spamprob
'$200'                              0.0918367
...
'$500'                              0.977856

Obviously the spammers are aiming too high - hit me with an offer of $200,
and I seem to like it - but offer me $500, and I turn my nose up.

Ripped-off-or-what? ly,

Mark.


From neale@woozle.org  Sun Nov 17 03:49:26 2002
From: neale@woozle.org (Neale Pickett)
Date: 16 Nov 2002 19:49:26 -0800
Subject: [Spambayes] hammie's dbm file has changed
Message-ID: <w53isyw7t4p.fsf@woozle.org>

I just want to make sure everyone is aware that hammie.py's dbm file
format has changed now.  I sent a message out about it two days ago and
didn't get any responses, so it's in now.

From skip@pobox.com  Sun Nov 17 03:50:24 2002
From: skip@pobox.com (Skip Montanaro)
Date: Sat, 16 Nov 2002 21:50:24 -0600
Subject: [Spambayes] Re: [Spambayes-checkins] 
 spambayes hammiefilter.py,NONE,1.1 README.txt,1.42,1.43
 hammie.py,1.38,1.39 mboxutils.py,1.6,1.7
In-Reply-To: <E18DGKV-0003cB-00@usw-pr-cvs1.sourceforge.net>
References: <E18DGKV-0003cB-00@usw-pr-cvs1.sourceforge.net>
Message-ID: <15831.4608.910147.312797@montanaro.dyndns.org>


    Neale> * hammie.py can now take messages on stdin, but it's ugly.  If
    Neale>   you want to do this, you should look at hammiefilter.py

I'm not sure I get this.  I use hammie.py as a filter from my procmailrc
file already.  What new feature did you add?  The ability to train on a
message on stdin?

Skip

From neale@woozle.org  Sun Nov 17 03:55:39 2002
From: neale@woozle.org (Neale Pickett)
Date: 16 Nov 2002 19:55:39 -0800
Subject: [Spambayes] A kinder, gentler hammie
Message-ID: <w53el9k7suc.fsf@woozle.org>

I've checked in hammiefilter.py, which I've been using for a few days
now.  The idea is to make the user interface (that means command-line
options) as lightweight as possible.  Now the setup for a procmail-based
solution is even easier:

  $ hammiefilter.py -n
  Created new database in hammie.db

Then you add a procmail recipe like this:

  :0 fw 
  | hammiefilter.py

And in your MUA, pipe a message to it with the -s option for spam, or
the -g option for ham.


I think I'd like to have hammiefilter check for a "~/.hammierc" file
in addition to a bayescustomize.ini file, and also set the
persistent_storage_file option to "~/.hammie.db" unless it's
overridden.  With those two changes, we'd have something supremely easy
to drop into your mail setup.  Almost as easy as SpamAssassin (except,
of course, that you don't have to keep retraining SpamAssassin).


Neale

From tim.one@comcast.net  Sun Nov 17 05:19:55 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 00:19:55 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCECLCMAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEDKCMAB.tim.one@comcast.net>

[Tim]
> ...
> The "missing test" here is exact bigrams (no hash convolutions).  I'll
> try that later; may not have enough RAM for that, but should.

I didn't, but by cutting 6,000 ham out of my test data managed to complete
the test in < 3 hours with lots of disk thrashing.  cv is baseline, bi21 the
bigram gimmick w/ 2**21 hash buckets, and bix the exact bigram run:

filename:       cv    bi21     bix
ham:spam:  20000:14000  20000:14000  14000:14000
fp total:        3       0       0
fp %:         0.01    0.00    0.00
fn total:        0       0       0
fn %:         0.00    0.00    0.00
unsure t:       91     300      98
unsure %:     0.27    0.88    0.35
real cost:  $48.20  $60.00  $19.60
best cost:  $17.80  $23.20   $6.80
h mean:       0.24    0.25    0.25
h sdev:       2.73    2.61    2.82
s mean:      99.95   99.41   99.91
s sdev:       1.40    4.67    1.94
mean diff:   99.71   99.16   99.66
k:           24.14   13.62   20.94

There are simply no surprises in the bix output; under the covers it's
beautiful:

-> <stat> Ham scores for all runs: 14000 items; mean 0.25; sdev 2.82
-> <stat> min 0; median 3.88578e-014; max 69.8223
-> <stat> percentiles: 5% 0; 25% 0; 75% 7.29026e-009; 95% 0.00817516

3 ham scored between 50 and 50.5; 1 ham scored 69.8; all other ham scored
below 50.

-> <stat> Spam scores for all runs: 14000 items; mean 99.91; sdev 1.94
-> <stat> min 30.4227; median 100; max 100
-> <stat> percentiles: 5% 100; 25% 100; 75% 100; 95% 100

2 spam scored between 49.5 and 50.5; 1 spam scored 30.4; all other spam
scored above 50.

The best-cost cutoffs reflect this sharp separation:

-> best cost for all runs: $6.80
-> per-fp cost $10.00; per-fn cost $1.00; per-unsure cost $0.20
-> achieved at 9 cutoff pairs
-> smallest ham & spam cutoffs 0.495 & 0.7
->     fp 0; fn 1; unsure ham 10; unsure spam 19
->     fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
-> largest ham & spam cutoffs 0.495 & 0.74
->     fp 0; fn 1; unsure ham 10; unsure spam 19
->     fp rate 0%; fn rate 0.00714%; unsure rate 0.104%
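
The $6.80 above is 1 fn at $1.00 plus 29 unsures at $0.20.  A brute-force
sketch of how such a "best cost" could be found follows; it is not the
test driver's actual code.  It assumes scores on a 0..1 scale and tries
every cutoff pair on the bucket boundaries; the function names and the
tiny score lists are made up.

```python
def classify_cost(ham, spam, hamc, spamc,
                  fp_w=10.0, fn_w=1.0, unsure_w=0.2):
    """Cost of one cutoff pair: < hamc is ham, >= spamc is spam, else unsure."""
    fp = sum(s >= spamc for s in ham)
    fn = sum(s < hamc for s in spam)
    unsure = sum(hamc <= s < spamc for s in ham + spam)
    return fp_w * fp + fn_w * fn + unsure_w * unsure

def best_cutoffs(ham, spam, nbuckets=200):
    """Return (lowest cost, list of (hamc, spamc) pairs achieving it)."""
    cuts = [i / nbuckets for i in range(nbuckets + 1)]
    best_cost, best_pairs = None, []
    for i, hamc in enumerate(cuts):
        for spamc in cuts[i:]:            # enforce hamc <= spamc
            # Round to cents so that equal costs compare equal.
            cost = round(classify_cost(ham, spam, hamc, spamc), 2)
            if best_cost is None or cost < best_cost:
                best_cost, best_pairs = cost, [(hamc, spamc)]
            elif cost == best_cost:
                best_pairs.append((hamc, spamc))
    return best_cost, best_pairs

# Made-up scores: one awkward ham at 0.62 and one awkward spam at 0.31.
ham_scores = [0.0, 0.01, 0.02, 0.62]
spam_scores = [1.0, 0.99, 0.98, 0.31]

cost, pairs = best_cutoffs(ham_scores, spam_scores, nbuckets=20)
print(cost, pairs[0], pairs[-1])
```

With these toy scores the cheapest outcome leaves both awkward messages
Unsure, and a whole range of cutoff pairs ties for that cost, which is
why the report lists the smallest and largest pairs.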


The highest-scoring ham is the only reason for the high suggested spam
cutoff (else .51 would have worked fine and eliminated almost all the
unsures), and was the 2nd-worst in the baseline test:  poor Vickie Mills
suffering from her obnoxious employer-generated sig.  Bigrams helped her,
but it's unclear why(!).  Her three strongest ham clues:

prob('noheader:return-path noheader:abuse-reports-to') = 0.000364471
prob('subject:Python') = 0.00116829
prob('header:From:1 header:MIME-Version:1') = 0.00129855

Now the order in which noheader metatokens get generated is an accident
inherited from the order in which a Set happens to enumerate its elements.
What I fear here is that I've stumbled into a too-good systematic difference
between c.l.py headers and BruceG's spam headers:  not in the *set* of
header lines they contain (I'm ignoring all headers that might matter), but
something much subtler having to do with how the set of all header lines in
existence may affect dict iteration order for the few headers I actually
look at.  Bigrams involving header lines are striking in the test output;
e.g., this one is a strong spam clue, and is almost as mysterious:

    prob('header:Subject:1 noheader:errors-to') = 0.96321

If we pursue this approach, this will take much thought.

The lowest-scoring spam was the one with a uuencoded text body we throw away
unlooked-at.  A header bigram boosted its spam score in a clearly helpful
way:

    prob('from:addr:hotmail.com from:no real name:2**0') = 0.985839

Indeed, something sent from hotmail without a real name reeks to heaven of
spam, much more so than hotmail alone or no-real-name alone.

OTOH, check out the 6 strongest ham clues for the same spam:

prob('header:X-Complaints-To:1') = 4.00324e-005
prob('header:From:1 header:Date:1') = 0.000279524
prob('header:Subject:1 noheader:received') = 0.000382997
prob('noheader:x-face noheader:return-path') = 0.00045433
prob('header:Organization:1 header:Message-ID:1') = 0.00137332
prob('noheader:in-reply-to noheader:reply-to') = 0.0110694

The X-Complaints-To header has always helped this guy a lot, but why the 5
new header-bigram combos here are hammish remains a mystery.

The memory burden of this run is also a mystery.  I played with mixing
unigrams and bigrams before, and recall the c.l.py test topping out at about
120MB.  This run was over 256MB (hence the massive swapping on my 256MB
box), and it wasn't even a full run.  A difference is that, in my previous
runs, the *tokenizer* generated the unigrams and bigrams, and only for the
body.  In this run the classifier generated them, and header tokens got into
the mix too.  I suppose header bigrams are large (as strings), and that
there are a lot of them -- heck, for lots of msgs, even just the text of the
headers I look at here is larger than the msg bodies.
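
To make the size effect concrete, here is a minimal sketch (modern Python 3,
not the attached classifier) of unigram-plus-bigram generation using the same
BOM/EOM sentinels.  The function name and the sample tokens are made up for
illustration.

```python
def unigrams_and_bigrams(words):
    """Yield each distinct unigram and adjacent-pair bigram exactly once."""
    seen = set()
    prev = "BOM"                      # sentinel marking beginning of message
    for cur in list(words) + ["EOM"]: # sentinel marking end of message
        for token in (prev, prev + " " + cur):
            if token not in seen:
                seen.add(token)
                yield token
        prev = cur

# Hypothetical token stream: two header metatokens, then two body words.
sample = ["header:Subject:1", "noheader:errors-to", "free", "money"]
tokens = list(unigrams_and_bigrams(sample))
for t in tokens:
    print(repr(t))
```

Four input tokens turn into ten database keys here, and each bigram key is
about as long as its two halves together, which fits the guess that header
bigrams account for much of the extra memory.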

So while this scheme may have real promise, mysteries and practical problems
abound.  I'm out of time for looking at this.  If someone wants to pursue
it, I'll attach the classifier I used.  Based on everything I've done here,
two suggestions:

1. Every time we've tried them, hash schemes have been unsatisfying,
   due to the human-incomprehensible mistakes they make.  You're
   unlikely to see mistakes on a small run, though -- this is a
   percentage game that *eventually* loses big.

2. I've got some evidence to believe that exact bigrams may help, but
   saw nothing in the exact trigram runs to suggest they buy
   anything over that.  Trigrams helped on-topic c.l.py conference
   announcements, but they also hurt them, and *that* class of
   problem msg is already solidly Unsure under the unigram scheme.
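
As an illustration of suggestion 1, here is a small sketch (modern Python 3)
of how folding tokens into a fixed number of hash buckets forces unrelated
tokens to share one set of counts.  zlib.crc32, the bucket count and the
token names are stand-ins chosen for the example; they are not the hash or
the sizes used in the runs described above.

```python
import zlib
from collections import defaultdict

def bucket_of(token, nbuckets):
    # crc32 is deterministic across runs, unlike Python 3's salted str hash.
    return zlib.crc32(token.encode("utf-8")) % nbuckets

def shared_buckets(tokens, nbuckets):
    """Map bucket index -> tokens, keeping only buckets holding 2+ tokens."""
    buckets = defaultdict(list)
    for tok in tokens:
        buckets[bucket_of(tok, nbuckets)].append(tok)
    return {b: toks for b, toks in buckets.items() if len(toks) > 1}

# 1000 made-up tokens squeezed into 256 buckets must collide somewhere.
tokens = ["word%d" % i for i in range(1000)]
clashes = shared_buckets(tokens, 256)
print(len(clashes), "buckets are shared by more than one token")
```

Any two tokens landing in the same bucket get the same spamprob, whether or
not they have anything to do with each other, which is where the
incomprehensible mistakes come from.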
-------------- next part --------------
# An implementation of a Bayes-like spam classifier.
#
# Paul Graham's original description:
#
#     http://www.paulgraham.com/spam.html
#
# A highly fiddled version of that can be retrieved from our CVS repository,
# via tag Last-Graham.  This made many demonstrated improvements in error
# rates over Paul's original description.
#
# This code implements Gary Robinson's suggestions, the core of which are
# well explained on his webpage:
#
#    http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
#
# This is theoretically cleaner, and in testing has performed at least as
# well as our highly tuned Graham scheme did, often slightly better, and
# sometimes much better.  It also has "a middle ground", which people like:
# the scores under Paul's scheme were almost always very near 0 or very near
# 1, whether or not the classification was correct.  The false positives
# and false negatives under Gary's basic scheme (use_gary_combining) generally
# score in a narrow range around the corpus's best spam_cutoff value.
# However, it doesn't appear possible to guess the best spam_cutoff value in
# advance, and it's touchy.
#
# The chi-combining scheme used by default here gets closer to the theoretical
# basis of Gary's combining scheme, and does give extreme scores, but also
# has a very useful middle ground (small # of msgs spread across a large range
# of scores, and good cutoff values aren't touchy).
#
# This implementation is due to Tim Peters et alia.

from __future__ import generators

import math
import time
from sets import Set

from Options import options
from chi2 import chi2Q

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0


LN2 = math.log(2)       # used frequently by chi-combining

PICKLE_VERSION = 1

class WordInfo(object):
    __slots__ = ('atime',     # when this record was last used by scoring(*)
                 'spamcount', # # of spams in which this word appears
                 'hamcount',  # # of hams in which this word appears
                 'killcount', # # of times this made it to spamprob()'s nbest
                 'spamprob',  # prob(spam | msg contains this word)
                )

    # Invariant:  For use in a classifier database, at least one of
    # spamcount and hamcount must be non-zero.
    #
    # (*)atime is the last access time, a UTC time.time() value.  It's the
    # most recent time this word was used by scoring (i.e., by spamprob(),
    # not by training via learn()); or, if the word has never been used by
    # scoring, the time the word record was created (i.e., by learn()).
    # One good criterion for identifying junk (word records that have no
    # value) is to delete words that haven't been used for a long time.
    # Perhaps they were typos, or unique identifiers, or relevant to a
    # once-hot topic or scam that's fallen out of favor.  Whatever, if
    # a word is no longer being used, it's just wasting space.

    def __init__(self, atime, spamprob=options.unknown_word_prob):
        self.atime = atime
        self.spamcount = self.hamcount = self.killcount = 0
        self.spamprob = spamprob

    def __repr__(self):
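        # Wrap the tuple in a 1-tuple so %r formats the tuple itself rather
        # than the quoted string that repr() would produce.
        return "WordInfo%r" % ((self.atime, self.spamcount,
                                self.hamcount, self.killcount,
                                self.spamprob),)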

    def __getstate__(self):
        return (self.atime, self.spamcount, self.hamcount, self.killcount,
                self.spamprob)

    def __setstate__(self, t):
        (self.atime, self.spamcount, self.hamcount, self.killcount,
         self.spamprob) = t

class Bayes:
    # Defining __slots__ here made Jeremy's life needlessly difficult when
    # trying to hook this all up to ZODB as a persistent object.  There's
    # no space benefit worth getting from slots in this class; slots were
    # used solely to help catch errors earlier, when this code was changing
    # rapidly.

    #__slots__ = ('wordinfo',  # map word to WordInfo record
    #             'nspam',     # number of spam messages learn() has seen
    #             'nham',      # number of non-spam messages learn() has seen
    #            )

    # allow a subclass to use a different class for WordInfo
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}
        self.nspam = self.nham = 0

    def __getstate__(self):
        return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham

    def __setstate__(self, t):
        if t[0] != PICKLE_VERSION:
            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
        self.wordinfo, self.nspam, self.nham = t[1:]

    # spamprob() implementations.  One of the following is aliased to
    # spamprob, depending on option settings.

    def gary_spamprob(self, wordstream, evidence=False):
        """Return best-guess probability that wordstream is spam.

        wordstream is an iterable object producing words.
        The return value is a float in [0.0, 1.0].

        If optional arg evidence is True, the return value is a pair
            probability, evidence
        where evidence is a list of (word, probability) pairs.
        """

        from math import frexp

        # This combination method is due to Gary Robinson; see
        # http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html

        # The real P = this P times 2**Pexp.  Likewise for Q.  We're
        # simulating unbounded dynamic float range by hand.  If this pans
        # out, *maybe* we should store logarithms in the database instead
        # and just add them here.  But I like keeping raw counts in the
        # database (they're easy to understand, manipulate and combine),
        # and there's no evidence that this simulation is a significant
        # expense.
        P = Q = 1.0
        Pexp = Qexp = 0
        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            P *= 1.0 - prob
            Q *= prob
            if P < 1e-200:  # move back into range
                P, e = frexp(P)
                Pexp += e
            if Q < 1e-200:  # move back into range
                Q, e = frexp(Q)
                Qexp += e

        P, e = frexp(P)
        Pexp += e
        Q, e = frexp(Q)
        Qexp += e

        num_clues = len(clues)
        if num_clues:
            #P = 1.0 - P**(1./num_clues)
            #Q = 1.0 - Q**(1./num_clues)
            #
            # (x*2**e)**n = x**n * 2**(e*n)
            n = 1.0 / num_clues
            P = 1.0 - P**n * 2.0**(Pexp * n)
            Q = 1.0 - Q**n * 2.0**(Qexp * n)

            # (P-Q)/(P+Q) is in -1 .. 1; scaling into 0 .. 1 gives
            # ((P-Q)/(P+Q)+1)/2 =
            # ((P-Q+P+Q)/(P+Q))/2 =
            # (2*P/(P+Q))/2 =
            # P/(P+Q)
            prob = P/(P+Q)
        else:
            prob = 0.5

        if evidence:
            clues = [(w, p) for p, w, r in clues]
            clues.sort(lambda a, b: cmp(a[1], b[1]))
            return prob, clues
        else:
            return prob

    if options.use_gary_combining:
        spamprob = gary_spamprob

    # Across vectors of length n, containing random uniformly-distributed
    # probabilities, -2*sum(ln(p_i)) follows the chi-squared distribution
    # with 2*n degrees of freedom.  This has been proven (in some
    # appropriate sense) to be the most sensitive possible test for
    # rejecting the hypothesis that a vector of probabilities is uniformly
    # distributed.  Gary Robinson's original scheme was monotonic *with*
    # this test, but skipped the details.  Turns out that getting closer
    # to the theoretical roots gives a much sharper classification, with
    # a very small (in # of msgs), but also very broad (in range of scores),
    # "middle ground", where most of the mistakes live.  In particular,
    # this scheme seems immune to all forms of "cancellation disease":  if
    # there are many strong ham *and* spam clues, this reliably scores
    # close to 0.5.  Most other schemes are extremely certain then -- and
    # often wrong.
    def chi2_spamprob(self, wordstream, evidence=False):
        """Return best-guess probability that wordstream is spam.

        wordstream is an iterable object producing words.
        The return value is a float in [0.0, 1.0].

        If optional arg evidence is True, the return value is a pair
            probability, evidence
        where evidence is a list of (word, probability) pairs.
        """

        from math import frexp, log as ln

        # We compute two chi-squared statistics, one for ham and one for
        # spam.  The sum-of-the-logs business is more sensitive to probs
        # near 0 than to probs near 1, so the spam measure uses 1-p (so
        # that high-spamprob words have greatest effect), and the ham
        # measure uses p directly (so that lo-spamprob words have greatest
        # effect).
        #
        # For optimization, sum-of-logs == log-of-product, and f.p.
        # multiplication is a lot cheaper than calling ln().  It's easy
        # to underflow to 0.0, though, so we simulate unbounded dynamic
        # range via frexp.  The real product H = this H * 2**Hexp, and
        # likewise the real product S = this S * 2**Sexp.
        H = S = 1.0
        Hexp = Sexp = 0

        clues = self._getclues(wordstream)
        for prob, word, record in clues:
            if record is not None:  # else wordinfo doesn't know about it
                record.killcount += 1
            S *= 1.0 - prob
            H *= prob
            if S < 1e-200:  # prevent underflow
                S, e = frexp(S)
                Sexp += e
            if H < 1e-200:  # prevent underflow
                H, e = frexp(H)
                Hexp += e

        # Compute the natural log of the product = sum of the logs:
        # ln(x * 2**i) = ln(x) + i * ln(2).
        S = ln(S) + Sexp * LN2
        H = ln(H) + Hexp * LN2

        n = len(clues)
        if n:
            S = 1.0 - chi2Q(-2.0 * S, 2*n)
            H = 1.0 - chi2Q(-2.0 * H, 2*n)

            # How to combine these into a single spam score?  We originally
            # used (S-H)/(S+H) scaled into [0., 1.], which equals S/(S+H).  A
            # systematic problem is that we could end up being near-certain
            # a thing was (for example) spam, even if S was small, provided
            # that H was much smaller.
            # Rob Hooft stared at these problems and invented the measure
            # we use now, the simpler S-H, scaled into [0., 1.].
            prob = (S-H + 1.0) / 2.0
        else:
            prob = 0.5

        if evidence:
            clues = [(w, p) for p, w, r in clues]
            clues.sort(lambda a, b: cmp(a[1], b[1]))
            clues.insert(0, ('*S*', S))
            clues.insert(0, ('*H*', H))
            return prob, clues
        else:
            return prob

    if options.use_chi_squared_combining:
        spamprob = chi2_spamprob

    def learn(self, wordstream, is_spam, update_probabilities=True):
        """Teach the classifier by example.

        wordstream is a word stream representing a message.  If is_spam is
        True, you're telling the classifier this message is definitely spam,
        else that it's definitely not spam.

        If optional arg update_probabilities is False (the default is True),
        don't update word probabilities.  Updating them is expensive, and if
        you're going to pass many messages to learn(), it's more efficient
        to pass False here and call update_probabilities() once when you're
        done -- or to call learn() with update_probabilities=True when
        passing the last new example.  The important thing is that the
        probabilities get updated before calling spamprob() again.
        """

        self._add_msg(wordstream, is_spam)
        if update_probabilities:
            self.update_probabilities()

    def unlearn(self, wordstream, is_spam, update_probabilities=True):
        """In case of pilot error, call unlearn ASAP after screwing up.

        Pass the same arguments you passed to learn().
        """

        self._remove_msg(wordstream, is_spam)
        if update_probabilities:
            self.update_probabilities()

    def update_probabilities(self):
        """Update the word probabilities in the spam database.

        This computes a new probability for every word in the database,
        so can be expensive.  learn() and unlearn() update the probabilities
        each time by default.  They have an optional argument that allows
        skipping this step when feeding in many messages; in that case
        you should call update_probabilities() after feeding the last
        message and before calling spamprob().
        """

        nham = float(self.nham or 1)
        nspam = float(self.nspam or 1)

        S = options.unknown_word_strength
        StimesX = S * options.unknown_word_prob

        for word, record in self.wordinfo.iteritems():
            # Compute prob(msg is spam | msg contains word).
            # This is the Graham calculation, but stripped of biases, and
            # stripped of clamping into 0.01 thru 0.99.  The Bayesian
            # adjustment following keeps them in a sane range, and one
            # that naturally grows the more evidence there is to back up
            # a probability.
            hamcount = record.hamcount
            assert hamcount <= nham
            hamratio = hamcount / nham

            spamcount = record.spamcount
            assert spamcount <= nspam
            spamratio = spamcount / nspam

            prob = spamratio / (hamratio + spamratio)

            # Now do Robinson's Bayesian adjustment.
            #
            #         s*x + n*p(w)
            # f(w) = --------------
            #           s + n
            #
            # I find this easier to reason about like so (equivalent when
            # s != 0):
            #
            #        x - p
            #  p +  -------
            #       1 + n/s
            #
            # IOW, it moves p a fraction of the distance from p to x, and
            # less so the larger n is, or the smaller s is.

            n = hamcount + spamcount
            prob = (StimesX + n * prob) / (S + n)

            if record.spamprob != prob:
                record.spamprob = prob
                # The next seemingly pointless line appears to be a hack
                # to allow a persistent db to realize the record has changed.
                self.wordinfo[word] = record

    def clearjunk(self, oldesttime):
        """Forget useless wordinfo records.  This can shrink the database size.

        A record for a word will be retained only if the word was accessed
        at or after oldesttime.
        """

        wordinfo = self.wordinfo
        tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
        for w in tonuke:
            del wordinfo[w]

    # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
    # n>1 times in a single message, training added n to the word's hamcount
    # or spamcount, but predicting scored words only once.  Tests showed
    # that adding only 1 in training, or scoring more than once when
    # predicting, hurt under the Graham scheme.
    # This isn't so under Robinson's scheme, though:  results improve
    # if training also counts a word only once.  The mean ham score decreases
    # significantly and consistently, ham score variance decreases likewise,
    # mean spam score decreases (but less than mean ham score, so the spread
    # increases), and spam score variance increases.
    # I (Tim) speculate that adding n times under the Graham scheme helped
    # because it acted against the various ham biases, giving frequently
    # repeated spam words (like "Viagra") a quick ramp-up in spamprob; else,
    # adding only once in training, a word like that was simply ignored until
    # it appeared in 5 distinct training hams.  Without the ham-favoring
    # biases, though, and never ignoring words, counting n times introduces
    # a subtle and unhelpful bias.
    # There does appear to be some useful info in how many times a word
    # appears in a msg, but distorting spamprob doesn't appear a correct way
    # to exploit it.
    def _add_msg(self, wordstream, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

        wordinfo = self.wordinfo
        wordinfoget = wordinfo.get
        now = time.time()
        for word in self._get_all_tokens(wordstream):
            record = wordinfoget(word)
            if record is None:
                record = self.WordInfoClass(now)

            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
            # Needed to tell a persistent DB that the content changed.
            wordinfo[word] = record

    def _remove_msg(self, wordstream, is_spam):
        if is_spam:
            if self.nspam <= 0:
                raise ValueError("spam count would go negative!")
            self.nspam -= 1
        else:
            if self.nham <= 0:
                raise ValueError("non-spam count would go negative!")
            self.nham -= 1

        wordinfo = self.wordinfo
        wordinfoget = wordinfo.get
        for word in self._get_all_tokens(wordstream):
            record = wordinfoget(word)
            if record is not None:
                if is_spam:
                    if record.spamcount > 0:
                        record.spamcount -= 1
                else:
                    if record.hamcount > 0:
                        record.hamcount -= 1
                if record.hamcount == 0 == record.spamcount:
                    del wordinfo[word]
                else:
                    # Needed to tell a persistent DB that the content changed.
                    wordinfo[word] = record

    def _getclues(self, wordstream):
        mindist = options.minimum_prob_strength
        unknown = options.unknown_word_prob

        rawclues = []
        pushclue = rawclues.append

        wordinfoget = self.wordinfo.get
        now = time.time()
        w1 = 'BOM'
        pos = 0
        for w2 in self._wrap_wordstream(wordstream):
            pos += 1
            first2 = w1 + " " + w2
            endpos = pos
            for word in w1, first2:
                endpos += 1
                record = wordinfoget(word)
                if record is None:
                    prob = unknown
                else:
                    record.atime = now
                    prob = record.spamprob
                distance = abs(prob - 0.5)
                if distance >= mindist:
                    pushclue((-distance, prob, word, record, pos, endpos))
            w1 = w2

        rawclues.sort()
        clues = []
        pushclue = clues.append
        atmost = options.max_discriminators
        wordseen = {}
        posseen = [0] * (pos + 4)
        for junk, prob, word, record, pos, endpos in rawclues:
            if word in wordseen:
                continue
            skip = 0
            for i in range(pos, endpos):
                if posseen[i]:
                    skip = 1
                    break
            if skip:
                continue
            pushclue((prob, word, record))
            wordseen[word] = 1
            for i in range(pos, endpos):
                posseen[i] = 1
            if len(clues) >= atmost:
                break
        # Return (prob, word, record).
        return clues

    def _wrap_wordstream(self, wordstream):
        for w in wordstream:
            yield w
        yield "EOM"

    def _get_all_tokens(self, wordstream):
        seen = {}
        w1 = 'BOM'
        for w2 in self._wrap_wordstream(wordstream):
            first2 = w1 + " " + w2
            for word in w1, first2:
                if word not in seen:
                    seen[word] = 1
                    yield word
            w1 = w2
From tim.one@comcast.net  Sun Nov 17 06:52:42 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 01:52:42 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEDKCMAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEDMCMAB.tim.one@comcast.net>

One more test result, from single-source general python.org email.  This is
a 2-fold (not 10-fold) cross validation run, so in each part it trained on
half the data and predicted against the other half.  org is baseline CVS,
orgbix is exact bigram:

filename:      org  orgbix
ham:spam:  5482:1896
                   5482:1896
fp total:        2       2
fp %:         0.04    0.04
fn total:       17      13
fn %:         0.90    0.69
unsure t:      206     235
unsure %:     2.79    3.19
real cost:  $78.20  $80.00
best cost:  $70.00  $65.40
h mean:       0.58    0.53
h sdev:       4.96    4.83
s mean:      95.74   95.28
s sdev:      14.73   14.61
mean diff:   95.16   94.75
k:            4.83    4.87

The FP were the same under both runs.  One is a one-word administrivia
request ("subscribe") buried in a veritable mountain of employer-generated
HTML disclaimers.  The other FP is a mystery (I remember corresponding about
it with GregW and we decided it was ham, but it wasn't obvious).

Both schemes have trouble with long, chatty spam; I've generally found this
a problem until training on thousands of spams.  Each FN under orgbix was
also an FN under unigrams.  Bigrams pushed four brief FN into Unsure
territory, but didn't nail them.
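
The "real cost" row can be reproduced by charging $10 per FP, $1 per FN and
$0.20 per Unsure.  Those weights are inferred from the figures in the table,
not read out of the test driver, and the function name is made up; treat this
Python 3 snippet as a sanity check only.

```python
def real_cost(fp, fn, unsure, fp_cost=10.0, fn_cost=1.0, unsure_cost=0.20):
    """Total dollar cost of a run under the assumed per-error weights."""
    return fp * fp_cost + fn * fn_cost + unsure * unsure_cost

org = real_cost(fp=2, fn=17, unsure=206)      # baseline CVS column
orgbix = real_cost(fp=2, fn=13, unsure=235)   # exact bigram column
print("org: $%.2f  orgbix: $%.2f" % (org, orgbix))
```

Under those weights the four FN that bigrams saved ($4.00) are outweighed by
the 29 extra Unsures ($5.80), which is why orgbix comes out $1.80 worse on
real cost despite the lower FN rate.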


From rob@hooft.net  Sun Nov 17 08:07:59 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 17 Nov 2002 09:07:59 +0100
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
References: <LNBBLJKPBEHFEDALKOLCIECGCMAB.tim.one@comcast.net>
Message-ID: <3DD74E5F.9060606@hooft.net>

Tim Peters wrote:
> I ran my fat c.l.py test w/ the hash space clamped at 256K buckets.  That
> was clearly a bad idea for that test, since there are about 330K unique
> unigrams in that corpus (let alone bigrams and trigrams).

I collected some unigram statistics yesterday: training hammie on my 
2x10 sets in the corpus one by one, and after each 1600ham+580spam set 
run a program that reports the would-be collisions using the python 32 
bit hash function:

Set1 : 109280
Set2 : 183560
Set3 : 227699 (2 clashes)
Set4 : 277253 (3 clashes)
Set5 : 329662 (5)
Set6 : 362847 (7)
Set7 : 394585 (12)
Set8 : 422898 (12)
Set9 : 448767 (16)
Set10: 481393 (22)

clash:  [('1156', 0.027), ('607.80', 0.142)]
clash:  [('url:2516', 0.838), ('>beautiful', 0.142)]
clash:  [("erhc's", 0.964), ('27.7-0.144', 0.142)]
clash:  [('19271', 0.0841), ('richtig', 0.722)]
clash:  [('geleefd.', 0.142), ('20:10:05', 0.142)]
clash:  [('#000000', 0.905), ('from:name:jean richelle', 0.142)]
clash:  [('*lunit,', 0.142), ('.2635', 0.084)]
clash:  [("aminggs'", 0.142), ('m"f\'^', 0.142)]
clash:  [('02-6203-3010', 0.838), ('arona,', 0.838)]
clash:  [('dislin.graf.', 0.142), ('(inquires', 0.905)]
clash:  [('/9?!o_(jz?\\`', 0.142), ('arnhemse', 0.084)]
clash:  [('1075,1079', 0.142), ('from:name:c31', 0.838)]
clash:  [('1096377', 0.142), ('url:baoding', 0.838)]
clash:  [('url:bible', 0.565), ('scis', 0.084)]
clash:  [('334.8', 0.0596), ('\xc0\xd6\xbd\xc0\xb4\xcf\xb4\xd9.*', 0.905)]
clash:  [('d8/apex', 0.142), ('3\xb8\xb89\xc3\xb5\xbf\xf8\xc0\xbb', 0.838)]
clash:  [('subject:!!!                          ', 0.905), 
('from:addr:lll2002', 0.838)]
clash:  [('constitutes', 0.848), ('roast)', 0.142)]
clash:  [('>madison,', 0.017), ('subject:dison', 0.059)]
clash:  [('(powerpc)', 0.142), ('url:table', 0.849)]
clash:  [('subject:Complaint', 0.142), ('-24.727', 0.142)]
clash:  [('om=-96.953', 0.142), ('line-with', 0.142)]

The experienced spambayeser can see that I didn't use the standard 
parameters; this is because I was running an optimization using simplexloop 
in the background at the same time.

Here, the number of hash collisions is still fairly low, but subtract 
bits, and see it explode.....
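
A rough birthday-problem estimate gives the scale.  With n distinct keys
hashed uniformly into 2**bits values, the expected number of colliding pairs
is about n*(n-1)/2**(bits+1).  This Python 3 sketch assumes a uniform hash,
which the real string hash only approximates, and the function name is made
up.

```python
def expected_colliding_pairs(n, bits):
    """Birthday approximation: expected number of key pairs sharing a hash."""
    return n * (n - 1) / 2.0 ** (bits + 1)

n = 481393  # distinct unigrams after Set10
for bits in (32, 28, 24, 20):
    print("%2d bits: about %.0f colliding pairs"
          % (bits, expected_colliding_pairs(n, bits)))
```

At 32 bits the estimate is about 27 pairs, the same order as the 22 clashes
counted above, and every bit taken away doubles it.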

Another thing I learned from this is that the number of distinct 
words in this test does not increase with the sqrt of the number of 
messages.

Here is clash.py:
-----
from hammie import DBDict
from Options import options

d=DBDict(options.persistent_storage_file,'r',('saved state',))

h={}

n=0
for k in d.iterkeys():
      n += 1
      #print k,type(d[k])
      hs=hash(k)
      if h.has_key(hs):
          h[hs].append((k,d[k].spamprob))
          print "clash: ",h[hs]
      else:
          h[hs]=[(k,d[k].spamprob)]

print n
-----

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob@hooft.net  Sun Nov 17 08:28:26 2002
From: rob@hooft.net (Rob Hooft)
Date: Sun, 17 Nov 2002 09:28:26 +0100
Subject: [Spambayes] Better optimization loop
References: <3DD56CEB.7050406@hooft.net> <3DD5D8E5.1020708@hooft.net>
	<3DD5DBAB.4050101@hooft.net>
Message-ID: <3DD7532A.8020906@hooft.net>

Further changes in the optimization (not yet checked in, but I assume 
everyone is running trigrams now...)

I realized that we already have a perfect way to optimize the ham and spam 
cutoff values in timcv, so I can remove these from the simplex 
optimization. To that end I added a "delayed" flexcost to the 
CostCounter module that can use the optimal cutoffs calculated at the 
end of timcv.py. That leaves only three variables to optimize 
using simplex.

I then ran one optimization on my complete (16000+5800) corpus. The 
result is that it is fighting very hard to remove fp's while introducing 
lots of unsure messages:

At the start:

-> <stat> all runs false positives: 15
-> <stat> all runs false negatives: 7
-> <stat> all runs unsure: 189
Standard Cost: $194.80
Flex Cost: $607.41
Delayed-Standard Cost: $98.80
Delayed-Flex Cost: $310.05
x=0.4990 p=0.1002 s=0.4537 310.05

And near the end:

-> <stat> all runs false positives: 5
-> <stat> all runs false negatives: 6
-> <stat> all runs unsure: 342
-> <stat> all runs false positive %: 0.03125
-> <stat> all runs false negative %: 0.103448275862
-> <stat> all runs unsure %: 1.56880733945
-> <stat> all runs cost: $124.40
Standard Cost: $124.40
Flex Cost: $589.16
Delayed-Standard Cost: $98.60
Delayed-Flex Cost: $212.28
x=0.3515 p=0.2861 s=0.2467 212.28

At this stage it actually managed to get the delayed standard cost lower 
by $0.20 (it has been higher than the starting value during much of the 
optimization). The Delayed-Flex cost is lowered by about 30%. But look 
at the hugely different parameters it had to use! Can someone else run 
with these parameters and confirm that this is an extreme that is only 
warranted by my particular corpses?

Please note that to get a delayed flex cost that is this much lower 
actually means that in the unsure area there is "50% more order" than 
before the optimization!

At some point Tim (was it you?) has reported that in other optimization 
techniques it has proven to be very bad to "focus" on the persistent and 
hopeless fp/fn messages. I fear this might bother me here.

I just started another optimization run, but lowered the cost of a fp 
from $10 to $2, and introduced another cost function that I called 
flex**2 cost because it changes the cost function for an unsure message 
from a linear function to a square function. Oops, two changes at the 
same time; but it takes such a long time to run....

More in 24 hours?

Regards,

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie@entrian.com  Sun Nov 17 14:37:52 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 17 Nov 2002 14:37:52 +0000
Subject: [Spambayes] A kinder, gentler hammie
In-Reply-To: <w53el9k7suc.fsf@woozle.org>
References: <w53el9k7suc.fsf@woozle.org>
Message-ID: <0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com>

Hi Neale,

> I think I'd like to have hammiefilter check for a "~/.hammierc" file
> in addition to a bayescustomize.ini file, and also set the
> persistent_storage_file option to "~/.hammie.db" unless it's
> overridden.

Be careful of other platforms - those filenames are meaningless on Windows
(though I guess people aren't running procmail on Windows, or are they?).
I wish I could be more constructive, but there's no equivalent of $HOME
that works on all versions of Windows (SHGetSpecialFolderPath may help, but
there's no interface to that in vanilla Python).

If you do change the default database location, please make sure you
announce it, and draw people's attention to the fact that pop3proxy uses
the same defaults!

-- 
Richie Hindle
richie@entrian.com


From lists@morpheus.demon.co.uk  Sun Nov 17 14:57:35 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 17 Nov 2002 14:57:35 +0000
Subject: [Spambayes] A kinder, gentler hammie
References: <w53el9k7suc.fsf@woozle.org>
Message-ID: <n2m-g.u1igqm5c.fsf@morpheus.demon.co.uk>

Neale Pickett <neale@woozle.org> writes:

> I've checked in hammiefilter.py, which I've been using for a few days
> now.  The idea is to make the user interface (that means command-line
> options) as lightweight as possible.  Now the setup for a procmail-based
> solution is even easier:

This duplicated something I was doing just today - adding an option to
Hammie to train on a single message. Your interface is much better
than mine, though.

> I think I'd like to have hammiefilter check for a "~/.hammierc" file
> in addition to a bayescustomize.ini file, and also set the
> persistent_storage_file option to "~/.hammie.db" unless it's
> overridden.  With those two changes, we'd have something supremely easy
> to drop into your mail setup.  Almost as easy as SpamAssassin (except,
> of course, that you don't have to keep retraining SpamAssassin).

On Windows (which I use) "~" isn't handled by the OS. Applications
which use it often set the HOME environment variable properly, though,
so it *can* make sense. To make this work involves passing filenames
through os.path.expanduser().

Would anyone object to a patch which added a call to this function
wherever it made sense? It should make no difference to non-Windows
systems. On Windows, for cases where HOME is set, it will do "the
right thing". When HOME is not set, ~ will change to mean C:\, which is
a change in behaviour, but I suspect not one that will cause problems.
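
A minimal sketch of the idea, written for present-day Python 3 rather than
the Python of this thread.  The helper name resolve_user_path is
hypothetical.  One caveat: Python 3.8 and later no longer consult HOME on
Windows (USERPROFILE is used instead), so the HOME behaviour described above
applies to older Pythons and to POSIX systems.

```python
import os

def resolve_user_path(path):
    """Expand a leading ~ to the user's home directory.

    Paths that do not start with ~ are returned unchanged, so it is safe to
    pass every configured filename through this helper.
    """
    return os.path.expanduser(path)

print(resolve_user_path("~/.hammie.db"))        # depends on the environment
print(resolve_user_path("bayescustomize.ini"))  # unchanged
```

Calling the helper once, at the point where an option value is turned into a
filename, keeps the expansion in one place instead of scattering
expanduser() calls through the code.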

I'd have to rely on others to check on platforms other than Windows,
as I have no access to any other OSes.

Comments?

Paul.

-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Sun Nov 17 16:50:56 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 17 Nov 2002 16:50:56 +0000
Subject: [Spambayes] popget.py - Gnus mail fetcher
Message-ID: <n2m-g.k7jcqgwf.fsf@morpheus.demon.co.uk>

I use Gnus on Windows for my mailreader at home, with Hamster (a local
mail/news server) as my POP3 mailbox. I could use pop3proxy for spam
detection, but it has a couple of problems:

1. As I already have a local POP server, I have to use a non-standard
   port
2. I have to remember to start the program, or package it up as a
   service, or otherwise automate the startup and hide the console
   window.

Rather than this, I wrote a small program (attached) to grab the
contents of a POP mailbox into a local file, scoring messages as it
goes. I can then put the following in my .gnus.el file to run my mail
through spambayes. It's equally usable on Unix, for people who have
reasons for not wanting to use pop3proxy there.

--- .gnus.el snippet ---
;; Popget program from spambayes setup
(setq popget "C:\\Data\\spambayes\\spambayes-test\\popget.bat")

;; Get mail via POP3
(setq mail-sources
      '((pop :server "localhost"
	     :user "XXXXXX"
	     :password "XXXXXX"
	     :program (concat popget " -u %u/%p -P %P -s %s -f %t"))))
------------------------

This works beautifully, and combined with XEmacs mail splitting, I
have a nice spam detection facility. With a little bit of work (not
hard) I can also use the new hammiefilter.py to add training
keystrokes, and I have quite a nice setup.

One other thing I'm going to add is a way to display the spam clues
for the current message in a buffer. When I'm done, I'll package up
the code in a form that can be included in the project as a sample
Gnus setup.

Paul.

-------------- next part --------------
#!/usr/bin/env python

# A program to get and classify the contents of a POP3 mailbox.

"""Usage: %(program)s [options]

Where:
    -h
        Show usage and exit
    -s SERVER
        The server from which to get the mail (default localhost)
    -P PORT
        The port to use (default 110)
    -u USER/PASS
        The username and password of the POP3 account
    -f FILE
        The file to save messages in (defaults to stdout)
    -k
        Keep messages on the POP server (default is to delete them)
    -p FILE
        use file as the persistent store.  loads data from this file if it
        exists, and saves data to this file at the end.
        Default: %(DEFAULTDB)s
    -d
        use the DBM store instead of cPickle.  The file is larger and
        creating it is slower, but checking against it is much faster,
        especially for large word databases. Default: %(USEDB)s
    -D
        the reverse of -d: use the cPickle instead of DBM
"""

import os
import sys
import getopt
import poplib
import socket
import hammie
from Options import options

try:
    True, False
except NameError:
    # Maintain compatibility with Python 2.2
    True, False = 1, 0

# For usage(); referenced by docstring above
program = sys.argv[0]
DEFAULTDB = options.persistent_storage_file
USEDB = options.persistent_use_database

class Config:
    def __init__(self):
        self.server = 'localhost'
        self.port = 110
        self.user = None
        self.password = None
        self.DB = DEFAULTDB
        self.use_db = options.persistent_use_database
        self.filename = "<stdout>"
        self.file = sys.stdout
        self.delete = True

    def createhammie(self):
        bayes = hammie.createbayes(self.DB, self.use_db)
        self.hammie = hammie.Hammie(bayes)

FROMLINE = "From popget@spambayes.org Mon Jan 31 00:00:00 2000"

def getmail(conf):
    pop = poplib.POP3(conf.server, conf.port)
    if conf.user:
        pop.user(conf.user)
    if conf.password:
        pop.pass_(conf.password)

    n, size = pop.stat()
    num = 1
    while num <= n:
        rsp, lines, size = pop.retr(num)
        msg = "\n".join(lines)
        print >>conf.file, FROMLINE
        print >>conf.file, conf.hammie.filter(msg)
        print >>conf.file
        if conf.delete:
            pop.dele(num)
        num += 1
    pop.quit()

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    """Main program - parse options and run."""
    try:
        opts, args = getopt.getopt(sys.argv[1:], "hs:P:u:f:kp:dD")
    except getopt.error, msg:
        usage(2, msg)

    if not opts:
        usage(2, "No options given")

    conf = Config()

    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-s':
            conf.server = arg
        elif opt == '-P':
            try:
                conf.port = int(arg)
            except ValueError:
                usage(2, "Port must be a number ('%s' given)" % arg)
        elif opt == '-u':
            try:
                conf.user, conf.password = arg.split("/",1)
            except ValueError:
                usage(2, "-u option is USER/PASS ('%s' given)" % arg)
        elif opt == '-k':
            conf.delete = False
        elif opt == '-f':
            # Need to expand ~
            conf.filename = os.path.expanduser(arg)
            conf.file = open(conf.filename, "w")
        elif opt == '-p':
            conf.DB = arg
        elif opt == '-d':
            conf.use_db = True
        elif opt == '-D':
            conf.use_db = False

    conf.createhammie()
    try:
        getmail(conf)
    except (poplib.error_proto, socket.error), e:
        print >> sys.stderr, "POP protocol error %s" % e
        sys.exit(1)

if __name__=="__main__":
    main()
-------------- next part --------------

-- 
This signature intentionally left blank
From tim.one@comcast.net  Sun Nov 17 19:38:20 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 14:38:20 -0500
Subject: [Spambayes] Testers needed with unbalanced spam::ham training data
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEFKCKAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net>

If you have a strong imbalance between the # of ham and # of spam in your
training data (or even if you don't but can spare the effort), please do a
before-and-after test, where after adds the new option:

[Classifier]
experimental_ham_spam_imbalance_adjustment: True

I expect this option to go away and become the default, but it needs testing
first before I'll do that.

My c.l.py test has minor imbalance, and enabling this option doesn't really
matter on it:

filename:       cv    imbal
ham:spam:  20000:14000
                   20000:14000
fp total:        3       3
fp %:         0.01    0.01
fn total:        0       0
fn %:         0.00    0.00
unsure t:       91      95
unsure %:     0.27    0.28
real cost:  $48.20  $49.00
best cost:  $17.80  $17.80
h mean:       0.24    0.25
h sdev:       2.73    2.79
s mean:      99.95   99.96
s sdev:       1.40    1.32
mean diff:   99.71   99.71
k:           24.14   24.26

Since I have more ham than spam, the effect of the option is to "believe"
the hamcounts less than it used to, so that spamprobs have a harder time
getting close to 0.  That in turn makes everything a little spammier than it
used to be, so all the effects on the statistics are expected:  ham and spam
means both go up a little, ham sdev increases a little because strong ham
words aren't as strong as they were, spam sdev decreases because strong spam
words are stronger than they were, and a few edgecase hams drifted into
Unsure territory because they're judged to be a little spammier than they
were.  A *possible* effect this data doesn't suffer is an increase in FP
rate, which would again be due to everything looking a little spammier (I'm
not being accurate here!  it's really due to everything looking less hammy,
but the distinction is too subtle to belabor <wink>).  Likewise some FN may
be redeemed (but weren't in this test, since it had no FN to begin with).
All these effects will be stronger the larger the imbalance in your
ham::spam ratio.
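
One way to picture "believing the hamcounts less" is the sketch below.
It is an illustration of the idea only, not a quotation of the
checked-in classifier.py: the function name, the smoothing constants s
and x, and the exact discounting rule are all assumptions. Counts from
the over-represented class are scaled down before they are weighed
against the prior, so a ham-only word ends up a little further from 0
when there is more ham than spam.

```python
def spamprob(hamcount, spamcount, nham, nspam, adjust=False, s=0.45, x=0.5):
    """Smoothed spam probability for one word (illustrative only).

    s is the strength given to the prior and x is the prior probability
    for a never-seen word; both values are assumptions. At least one of
    hamcount and spamcount must be nonzero.
    """
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    if adjust:
        # Discount evidence from whichever class has more training messages.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0
    n = hamcount * spam2ham + spamcount * ham2spam
    return (s * x + n * prob) / (s + n)


# A word seen in 10 hams and no spams, with the 20000:14000 split above.
plain = spamprob(10, 0, 20000, 14000)
adjusted = spamprob(10, 0, 20000, 14000, adjust=True)
print(plain, adjusted)
```

With the adjustment on, the second value is larger than the first: the
word is still strongly hammy, but less so than before.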


Oops!  Looks like SourceForge is down -- I haven't been able to check in the
changes yet.  Keep trying until they show up <wink>.


From tim.one@comcast.net  Sun Nov 17 19:50:56 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 14:50:56 -0500
Subject: [Spambayes] Testers needed with unbalanced spam::ham training
	data
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEGACMAB.tim.one@comcast.net>

[Tim]
> ...
> Oops!  Looks like SourceForge is down -- I haven't been able to
> check in the changes yet.  Keep trying until they show up <wink>.

Fudge -- that may be a long time from now.  From SF email last week:

    On 2002-11-17 (Sunday), project CVS services, project shell
    services, project web services (including all VHOSTs), and project
    database services will be offline for a period of up to twelve
    hours, starting at 10:00 Pacific (GMT-8).  Project web services
    will be restored first, but will be brought up initially with
    read-only access to project group directory space.  Static web
    content will be served correctly during this time period, but
    application-driven and database-dependent content and CGI scripts
    will not function correctly.  Issues encountered during this time
    period SHOULD NOT be reported to SourceForge.net; they are an
    expected side-effect of this outage.

The 12-hour outage is only 2 hours old at this time if my calculations are
correct, but math is tricky <wink>.  Without CVS access, I can't even give
you a patch here!  OK, if you're eager to test, unzip and drop in the
classifier.py and Options.py from the attached.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes.zip
Type: application/x-zip-compressed
Size: 19542 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021117/d9d688e6/spambayes.bin
From neale@woozle.org  Sun Nov 17 21:12:38 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 13:12:38 -0800
Subject: [Spambayes] A kinder, gentler hammie
In-Reply-To: <0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com>
References: <w53el9k7suc.fsf@woozle.org>
	<0c9ftu81m52pcuem5j1kdrafdjo5qdea47@4ax.com>
Message-ID: <w5365uv7veh.fsf@woozle.org>

So then, Richie Hindle <richie@entrian.com> is all like:

> If you do change the default database location, please make sure you
> announce it, and draw people's attention to the fact that pop3proxy uses
> the same defaults!

Hey Richie.  Sorry I wasn't clearer about this--I wouldn't want to
change the default, I'd just want hammiefilter to:

1. Read in the default
2. Set the database type
3. Try to read in bayescustomize.ini (which will probably fail)
4. Read in ~/.spambayesrc or something like it

So, no modifications to anything but hammiefilter.py.  There's too much
other stuff expecting it the way it currently works, I think.
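
The four steps above amount to layered configuration: start from the
defaults, then let each file that happens to exist override them. A
minimal sketch using the standard configparser module follows; the
function name load_options, the [Hammie] section name and the file
names are assumptions for illustration, not the actual Options.py
code.

```python
import configparser
import os
import tempfile


def load_options(defaults, paths):
    """Start from built-in defaults, then apply each readable file in order."""
    parser = configparser.ConfigParser()
    parser.read_dict(defaults)
    # Files that do not exist are skipped silently; later files win.
    parser.read(paths)
    return parser


defaults = {"Hammie": {"persistent_storage_file": "hammie.db"}}
with tempfile.TemporaryDirectory() as tmp:
    rc = os.path.join(tmp, ".spambayesrc")
    with open(rc, "w") as f:
        f.write("[Hammie]\npersistent_storage_file: ~/.hammie.db\n")
    # bayescustomize.ini is absent here, so only the rc file overrides.
    missing_ini = os.path.join(tmp, "bayescustomize.ini")
    opts = load_options(defaults, [missing_ini, rc])
    storage = opts.get("Hammie", "persistent_storage_file")

print(storage)
```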

Neale

From neale@woozle.org  Sun Nov 17 21:22:47 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 13:22:47 -0800
Subject: [Spambayes] A kinder, gentler hammie
In-Reply-To: <n2m-g.u1igqm5c.fsf@morpheus.demon.co.uk>
References: <w53el9k7suc.fsf@woozle.org>
	<n2m-g.u1igqm5c.fsf@morpheus.demon.co.uk>
Message-ID: <w531y5j7uxk.fsf@woozle.org>

So then, Paul Moore <lists@morpheus.demon.co.uk> is all like:

> On Windows (which I use) "~" isn't handled by the OS. Applications
> which use it often set the HOME environment variable properly, though,
> so it *can* make sense. To make this work involves passing filenames
> through os.path.expanduser().

Hey cool!  The audience for hammiefilter suddenly got larger.  How would
you use something like this on a Windows box?  Can some MTAs run
messages through an external filter one at a time?  I thought only we
Unix wonks were able to do that ;)

I suppose what we could do is try a number of pathnames for the ini
file, and use the first one that works.  If there's some reasonable way
to figure out what to use for a "home directory" on Windows or Mac,
without relying on non-standard environment variables, it would look
there.

Of course, it could just rely on the BAYESCUSTOMIZE environment variable
like it does now.  But then you'd have to wrap hammiefilter to set the
variable before running.  I'm doing that currently, but I think a
wrapper around a wrapper around a driver is pretty ugly.  ;)
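
A rough sketch of "try a number of pathnames and use the first one that
works", with the BAYESCUSTOMIZE variable taking priority when it is
set. The function name find_config is hypothetical, and the candidate
names passed to it are only examples.

```python
import os
import tempfile


def find_config(candidates, environ=os.environ):
    """Return the first existing ini file, or None if none is found."""
    override = environ.get("BAYESCUSTOMIZE")
    if override:
        candidates = [override] + list(candidates)
    for name in candidates:
        path = os.path.expanduser(name)
        if os.path.isfile(path):
            return path
    return None


with tempfile.TemporaryDirectory() as tmp:
    existing = os.path.join(tmp, "bayescustomize.ini")
    open(existing, "w").close()
    missing = os.path.join(tmp, ".hammierc")
    found = find_config([missing, existing], environ={})
    none_found = find_config([missing], environ={})
    via_env = find_config([missing], environ={"BAYESCUSTOMIZE": existing})

print(found == existing, none_found, via_env == existing)
```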

I'll check in some code that works in a Unix environment.  Take a look
at it and let me know what would make sense to try in a Windows
environment (or just submit a change if you can do that).

Neale

From neale@woozle.org  Sun Nov 17 21:43:10 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 13:43:10 -0800
Subject: [Spambayes] Re: [Spambayes-checkins]  spambayes
	hammiefilter.py,NONE,1.1 README.txt,1.42,1.43 hammie.py,1.38,1.39
	mboxutils.py,1.6,1.7
In-Reply-To: <15831.4608.910147.312797@montanaro.dyndns.org>
References: <E18DGKV-0003cB-00@usw-pr-cvs1.sourceforge.net>
	<15831.4608.910147.312797@montanaro.dyndns.org>
Message-ID: <w53wunb6ff5.fsf@woozle.org>

So then, Skip Montanaro <skip@pobox.com> is all like:

>     Neale> * hammie.py can now take messages on stdin, but it's ugly.  If
>     Neale>   you want to do this, you should look at hammiefilter.py
> 
> I'm not sure I get this.  I use hammie.py as a filter from my procmailrc
> file already.  What new feature did you add?  The ability to train on a
> message on stdin?

That's right.  You can now do

  hammie.py -s -

or

  hammie.py -g -

to train on a single message.  But immediately after I wrote this code,
I decided what we really needed was a cleaner front-end.  I think a lot
of people are using hammie.py on a large existing corpus, either for
training or testing.  hammiefilter is my attempt at something you'd use
as a callout from something else.
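
The "-" convention can be sketched as follows. This is not the code in
hammie.py; read_single_message is a hypothetical name, and the stdin
parameter exists only so the example can be run without a pipe.

```python
import email
import io
import sys


def read_single_message(name, stdin=None):
    """Parse one message from the named file, or from stdin when name is "-"."""
    if name == "-":
        source = stdin if stdin is not None else sys.stdin
        return email.message_from_file(source)
    with open(name) as fp:
        return email.message_from_file(fp)


demo = io.StringIO("Subject: test\n\nbody\n")
msg = read_single_message("-", stdin=demo)
print(msg["Subject"])  # test
```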

Neale

From mhammond@skippinet.com.au  Sun Nov 17 22:38:21 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 18 Nov 2002 09:38:21 +1100
Subject: [Spambayes] 
 More back-patting - my brain's first FP where bayes got it right
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEDNHMAA.mhammond@skippinet.com.au>

I got a strange looking mail today.  All HTML.  My brain was sure it was
spam.  Bayes scored it a pretty solid ham (0.003).

So I re-eyeballed the mail (this *was* before the first coffee of the day,
mind you), and was again sure it was spam.  Was about to hit the "Spam
Clues" to see what the story was, and marvelling at how they knew I was
"Mark" (most spammers don't).

Then I realized it actually *wasn't* spam, but a personally addressed mail.
No idea why they picked me though.

So - pre Bayes, I am *sure* this mail would have hit the bit-bucket.  My
brain was sure it was spam way past the threshold where I would have deleted
it.

[Interesting side-point: In the dollar values for bayes, we should assign a
value to the frailty of the brain based on the size or ratios in the corpus.
For example, if 50% of mail coming to my inbox is spam, I bet my brain makes
far more FP errors than if it were only 1%]

The text version of the mail is below.  It was HTML, gray background, blue
writing - big brain spam-clues <wink>

And-I'm-yet-to-see-a-bayes-FP ly,

Mark.

---
From: "Special Imaging Services" <valid_address_removed_by_markh>
To: <mhammond@skippinet.com.au>
Subject: Python...

>>Special Imaging Services

>>Secure Messaging Zone

>>Encryption Status: OFF


Hello there, Mark,

We specialise in military standard image enhancement and digital,
craniofacial reconstruction. Our default programming environment is Prolog.
How might one integrate Python into the mix?


Warmest regards,

<name>
---


From richie@entrian.com  Sun Nov 17 23:07:25 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 17 Nov 2002 23:07:25 +0000
Subject: [Spambayes] Testers needed with unbalanced spam::ham training
	data
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCEEFKCKAB.tim.one@comcast.net>
	<LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net>
Message-ID: <rhvftu80lgp3tsm6gtpgmrlegsj99b35p3@4ax.com>


> [Classifier]
> experimental_ham_spam_imbalance_adjustment: True

Four runs, with and without experimental_ham_spam_imbalance_adjustment, and
with a 10:1 ham:spam imbalance either way: 

lowham[_adj]:  timcv.py -n10 --ham=20  --spam=200 -s1
lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20  -s1

filename:   lowham lowham_adj
ham:spam:  200:2000
                   200:2000
fp total:       15       2
fp %:         7.50    1.00
fn total:        1       1
fn %:         0.05    0.05
unsure t:       37      42
unsure %:     1.68    1.91
real cost: $158.40  $29.40
best cost:  $67.20  $26.40
h mean:      17.41    8.38
h sdev:      31.13   20.20
s mean:      99.90   99.66
s sdev:       2.47    3.35
mean diff:   82.49   91.28
k:            2.46    3.88

filename:  lowspam lowspam_adj
ham:spam:  2000:200
                   2000:200
fp total:        0       1
fp %:         0.00    0.05
fn total:       10       1
fn %:         5.00    0.50
unsure t:       35      72
unsure %:     1.59    3.27
real cost:  $17.00  $25.40
best cost:  $10.80   $7.00
h mean:       0.18    1.61
h sdev:       2.08    7.13
s mean:      89.39   96.69
s sdev:      23.92   10.59
mean diff:   89.21   95.08
k:            3.43    5.37

The introduced fp in lowspam_adj is a very spammy HTML email from an ISP -
it's always showed up as an fp in my corpus.

-- 
Richie Hindle
richie@entrian.com


From tim.one@comcast.net  Sun Nov 17 23:31:26 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 17 Nov 2002 18:31:26 -0500
Subject: [Spambayes] Testers needed with unbalanced spam::ham training
	data
In-Reply-To: <rhvftu80lgp3tsm6gtpgmrlegsj99b35p3@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEGOCMAB.tim.one@comcast.net>

[Richie Hindle, trying

  [Classifier]
  experimental_ham_spam_imbalance_adjustment: True
]

Thank you!

> Four runs, with and without
> experimental_ham_spam_imbalance_adjustment, and
> with a 10:1 ham:spam imbalance either way:
>
> lowham[_adj]:  timcv.py -n10 --ham=20  --spam=200 -s1
> lowspam[_adj]: timcv.py -n10 --ham=200 --spam=20  -s1
>
> filename:   lowham lowham_adj
> ham:spam:  200:2000
>                    200:2000
> fp total:       15       2
> fp %:         7.50    1.00
> fn total:        1       1
> fn %:         0.05    0.05
> unsure t:       37      42
> unsure %:     1.68    1.91
> real cost: $158.40  $29.40
> best cost:  $67.20  $26.40
> h mean:      17.41    8.38
> h sdev:      31.13   20.20
> s mean:      99.90   99.66
> s sdev:       2.47    3.35
> mean diff:   82.49   91.28
> k:            2.46    3.88

So the effect of the adjustment is to make everything less spammy:  both
means decrease, ham sdev decreases, spam sdev increases, FP get redeemed,
and FN get more likely but less so than Unsures get more likely.  The spread
is small enough that the bottom-line increase in k is important, and
everything works as hoped here.

> filename:  lowspam lowspam_adj
> ham:spam:  2000:200
>                    2000:200
> fp total:        0       1
> fp %:         0.00    0.05
> fn total:       10       1
> fn %:         5.00    0.50
> unsure t:       35      72
> unsure %:     1.59    3.27
> real cost:  $17.00  $25.40
> best cost:  $10.80   $7.00
> h mean:       0.18    1.61
> h sdev:       2.08    7.13
> s mean:      89.39   96.69
> s sdev:      23.92   10.59
> mean diff:   89.21   95.08
> k:            3.43    5.37

Now the effect is to make everything less hammy, so mirror image:  both
means increase, ham sdev increases, spam sdev decreases, FN get redeemed,
and FP get more likely but less so than Unsures get more likely.  So again
everything worked as hoped, and the bottom-line increase in k is again a
Good Thing.
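
For readers following the tables, the k row appears to be the gap
between the spam and ham means divided by the sum of the two standard
deviations. That formula is inferred from the numbers quoted above (it
reproduces all four k values), not stated in the thread; the function
name is made up for the example.

```python
def separation_k(h_mean, h_sdev, s_mean, s_sdev):
    """Mean difference divided by the sum of the standard deviations."""
    return (s_mean - h_mean) / (h_sdev + s_sdev)


# (h mean, h sdev, s mean, s sdev) copied from the four runs above.
runs = {
    "lowham": (17.41, 31.13, 99.90, 2.47),
    "lowham_adj": (8.38, 20.20, 99.66, 3.35),
    "lowspam": (0.18, 2.08, 89.39, 23.92),
    "lowspam_adj": (1.61, 7.13, 96.69, 10.59),
}
k_values = {name: round(separation_k(*row), 2) for name, row in runs.items()}
for name, k in k_values.items():
    print(name, k)
```

A larger k means the ham and spam score distributions overlap less,
which is why an increase in k is the bottom line in both comparisons.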

Great!  That's all I could have hoped for.  If you hoped for more, you were
being unrealistic <wink>.

Curious:  both before and after, you got better results training on a lot
more ham than spam than the reverse.  Most previous reports have been the
opposite (in my own tests, I haven't noted a reliable trend in either
direction there).

> The introduced fp in lowspam_adj is a very spammy HTML email from
> an ISP - it's always showed up as an fp in my corpus.

Since the after "best cost" was under $10, it's certain that the post-run
histogram analysis found cutoffs where you would have gotten no FP.  Whether
those are cutoffs you'd be comfortable with I can't say.
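
The reasoning can be checked with a little arithmetic. The weights
below ($10 per FP, $1 per FN, $0.20 per Unsure) are inferred from the
"real cost" rows in the tables, which they reproduce; the function
names are invented for the example.

```python
FP_COST, FN_COST, UNSURE_COST = 10.00, 1.00, 0.20


def total_cost(fp, fn, unsure):
    """Dollar cost of a run under the weights assumed above."""
    return fp * FP_COST + fn * FN_COST + unsure * UNSURE_COST


def max_false_positives(cost):
    """Largest number of FPs that a given total cost could include."""
    return int(cost // FP_COST)


print(total_cost(15, 1, 37))      # lowham row
print(total_cost(1, 1, 72))       # lowspam_adj row
print(max_false_positives(7.00))  # best cost for lowspam_adj
```

Since a single FP already costs $10, any total below $10 cannot include
one, which is the point being made about the $7.00 best cost.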


From rjdsnet@yahoo.com  Thu Nov  7 19:22:34 2002
From: rjdsnet@yahoo.com (Ranieri J D Severiano)
Date: Thu, 7 Nov 2002 17:22:34 -0200
Subject: [Spambayes] hammie's dbm file has changed
Message-ID: <20021107192234.GA974@uyrapuru>

Hi,
my last CVS synchronization has generated the attached CVS/Entries.
These are the upgrades:
	hammie.py:    1.29 -> 1.35 -> 1.38
	hammiesrv.py: 1.9  -> 1.10

When I execute any program which tries to access the pickle-DB, I
get this exception:


ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail
Traceback (most recent call last):
  File "./hammie.py", line 497, in ?
    main()
  File "./hammie.py", line 459, in main
    bayes = createbayes(pck, usedb, mode)
  File "./hammie.py", line 401, in createbayes
    bayes = pickle.load(fp)
  File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor
    obj = base.__new__(cls, state)
TypeError: ('object.__new__(X): X is not a type object (class)', <function _reconstructor at 0x8148cd4>, (<class classifier.Bayes at 0x82103b4>, <type 'object'>, None))
ranieri@uyrapuru:spambayes$ python2.2
Python 2.2.1 (#1, Sep  7 2002, 14:34:30) 
[GCC 2.95.4 20011002 (Debian prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from cPickle import load
>>> f = open('hammie.db')
>>> o = load(f)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor
    obj = base.__new__(cls, state)
TypeError: ('object.__new__(X): X is not a type object (class)', <function _reconstructor at 0x8148b64>, (<class classifier.Bayes at 0x81ad6cc>, <type 'object'>, None))
>>>
ranieri@uyrapuru:spambayes$


I believe the pickle-DB file format has changed too.


Thanks,
Ranieri


> 
> From: "Neale Pickett" <neale@woozle.org>
> Date: Sun, 17 Nov 2002 03:49:26
> Subject: [Spambayes] hammie's dbm file has changed
> 
> I just want to make sure everyone is aware that hammie.py's dbm file
> format has changed now.  I sent a message out about it two days ago and
> didn't get any responses, so it's in now.
> 

-------------- next part --------------
/.cvsignore/1.3/Fri Sep 20 15:24:54 2002//
/HistToGNU.py/1.7/Fri Oct  4 03:01:29 2002//
/LICENSE.txt/1.1/Sun Sep 22 04:59:54 2002//
/TESTING.txt/1.1/Thu Sep  5 20:55:02 2002//
/cdb.py/1.4/Mon Sep 23 21:20:10 2002//
/cleanarch/1.1/Thu Sep  5 16:16:43 2002//
/cmp.py/1.17/Thu Sep 26 03:20:51 2002//
/fpfn.py/1.1/Wed Sep 25 01:01:49 2002//
/heapq.py/1.1/Sun Sep 22 06:58:36 2002//
/loosecksum.py/1.3/Mon Sep 23 21:20:10 2002//
/neilfilter.py/1.4/Wed Oct  2 16:05:27 2002//
/rates.py/1.7/Wed Sep 25 02:22:15 2002//
D/Outlook2000////
D/email////
/hammiecli.py/1.2/Sun Oct 27 05:13:54 2002//
/runtest.sh/1.9/Mon Nov  4 01:10:38 2002//
/setup.py/1.9/Mon Nov  4 01:10:38 2002//
/timtest.py/1.30/Mon Nov  4 01:10:39 2002//
/unheader.py/1.8/Mon Nov  4 01:10:44 2002//
/Histogram.py/1.7/Wed Nov  6 20:23:44 2002//
/INTEGRATION.txt/1.2/Wed Nov  6 20:23:44 2002//
/Options.py/1.70/Wed Nov  6 20:23:47 2002//
/README.txt/1.42/Wed Nov  6 20:23:48 2002//
/TestDriver.py/1.28/Wed Nov  6 20:23:49 2002//
/Tester.py/1.8/Wed Nov  6 20:23:49 2002//
/chi2.py/1.8/Wed Nov  6 20:23:50 2002//
/classifier.py/1.50/Wed Nov  6 20:23:52 2002//
/hammie.py/1.38/Result of merge//
/hammiesrv.py/1.10/Result of merge//
/mboxcount.py/1.3/Wed Nov  6 20:23:53 2002//
/mboxtest.py/1.10/Wed Nov  6 20:23:54 2002//
/mboxutils.py/1.6/Wed Nov  6 20:23:54 2002//
/msgs.py/1.6/Wed Nov  6 20:23:54 2002//
/neiltrain.py/1.4/Wed Nov  6 20:23:55 2002//
/optimize.py/1.2/Sun Nov 10 19:59:22 2002//
/pop3proxy.py/1.15/Wed Nov  6 20:23:59 2002//
/rebal.py/1.9/Wed Nov  6 20:24:00 2002//
/sets.py/1.2/Wed Nov  6 20:24:01 2002//
/split.py/1.2/Wed Nov  6 20:24:01 2002//
/splitn.py/1.4/Wed Nov  6 20:24:01 2002//
/splitndirs.py/1.7/Wed Nov  6 20:24:01 2002//
/table.py/1.5/Wed Nov  6 20:24:02 2002//
/timcv.py/1.12/Wed Nov  6 20:24:02 2002//
/tokenizer.py/1.68/Wed Nov  6 20:24:07 2002//
/weakloop.py/1.2/Mon Nov 11 01:59:06 2002//
/weaktest.py/1.3/Sun Nov 10 19:59:22 2002//
From popiel@wolfskeep.com  Mon Nov 18 02:24:57 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 17 Nov 2002 18:24:57 -0800
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> 
	<LNBBLJKPBEHFEDALKOLCAEDKCMAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEDKCMAB.tim.one@comcast.net> 
Message-ID: <20021118022457.CC988F58D@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCAEDKCMAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>
>[Tim]
>> ...
>> The "missing test" here is exact bigrams (no hash convolutions).  I'll
>> try that later; may not have enough RAM for that, but should.

I haven't been able to do a big run of this, but here's my
results:

filename:      org  orgbix
ham:spam:  1000:1000      
                   1000:1000
fp total:        3       2
fp %:         0.30    0.20
fn total:       10       7
fn %:         1.00    0.70
unsure t:       27      28
unsure %:     1.35    1.40
real cost:  $45.40  $32.60
best cost:  $24.00  $24.20
h mean:       0.43    0.50
h sdev:       5.64    5.95
s mean:      97.94   98.28
s sdev:      11.59   10.45
mean diff:   97.51   97.78
k:            5.66    5.96

This is from a five-fold cross validation run.  Looks very nice.

- Alex

From neale@woozle.org  Mon Nov 18 02:48:46 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 18:48:46 -0800
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <20021107192234.GA974@uyrapuru>
References: <20021107192234.GA974@uyrapuru>
Message-ID: <w53smxz619t.fsf@woozle.org>

So then, Ranieri J D Severiano <rjdsnet@yahoo.com> is all like:

> ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail
> Traceback (most recent call last):
>   File "./hammie.py", line 497, in ?
>     main()
>   File "./hammie.py", line 459, in main
>     bayes = createbayes(pck, usedb, mode)
>   File "./hammie.py", line 401, in createbayes
>     bayes = pickle.load(fp)
>   File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor
>     obj = base.__new__(cls, state)
> TypeError: ('object.__new__(X): X is not a type object (class)', <function _reconstructor at 0x8148cd4>, (<class classifier.Bayes at 0x82103b4>, <type 'object'>, None))

Yikes!

I'm pretty sure I didn't change anything that would affect the way
pickles are stored (they don't use PersistentGrahamBayes or the DBDict
classes), but it sure does look like *something* has changed for you.

Unfortunately, SF CVS is down for the day, so I can't check to see
what's changed between those versions.

> >>> from cPickle import load
> >>> f = open('hammie.db')
> >>> o = load(f)

Could you try the same thing, importing load from pickle instead?  It
will give a better traceback.  I don't know enough yet about how the
pickling works to be able to diagnose this without some more information
first.
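
A different technique from the one suggested above, offered only as a
sketch: the pickletools module (added to the standard library after the
Python 2.2 used in this thread) can list which classes a pickle refers
to without instantiating any of them. The example pickles a
fractions.Fraction purely as a stand-in; for the problem at hand one
would read the bytes of hammie.db instead.

```python
import io
import pickle
import pickletools
from fractions import Fraction

# Stand-in data; a real investigation would use open("hammie.db", "rb").read().
data = pickle.dumps(Fraction(1, 2), protocol=1)

# Disassemble the pickle into a text listing without running its constructors.
out = io.StringIO()
pickletools.dis(data, out=out)
listing = out.getvalue()

# GLOBAL lines name the module and class the pickle expects to find.
for line in listing.splitlines():
    if "GLOBAL" in line:
        print(line.strip())
```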

Neale

From neale@woozle.org  Mon Nov 18 02:51:06 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 18:51:06 -0800
Subject: [Spambayes] small vulnerability patch
In-Reply-To: <1037427084.31134.17.camel@localhost>
References: <1037427084.31134.17.camel@localhost>
Message-ID: <w53of8n615x.fsf@woozle.org>

So then, Todd Mokros <niltsiar@neo.rr.com> is all like:

> here's a small patch to fix a small header vulnerability.  If a piece of
> spam spoofs the header added by hammie, then procmail recipes could
> match on the spoofed header.  This deletes the hammie header before
> filtering.

Good catch, Todd!  I'll check this into CVS as soon as it comes back up
and I'm in front of a computer :)
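
The idea behind the patch, sketched with the standard email package:
remove any copy of the filter's header that arrived with the message
before adding the real one. The header name and the tag function are
assumptions for illustration; the patch itself uses whatever header
hammie is configured with.

```python
import email

HEADER = "X-Hammie-Disposition"  # assumed header name


def tag(msg, disposition, header=HEADER):
    """Replace any pre-existing (possibly spoofed) header with our own."""
    # Deleting removes every occurrence and is a no-op if none exist.
    del msg[header]
    msg[header] = disposition
    return msg


spoofed = email.message_from_string(
    "X-Hammie-Disposition: No; 0.00\nSubject: Buy now\n\nbody\n"
)
tagged = tag(spoofed, "Yes; 0.99")
print(tagged.get_all(HEADER))  # ['Yes; 0.99']
```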

Thanks

Neale

> 
> 
> --- ../../cvs-tracking/spambayes/hammie.py      2002-11-14
> 17:00:15.000000000 -0500
> +++ hammie.py   2002-11-16 00:44:50.000000000 -0500
> @@ -272,6 +272,8 @@
>          """
>  
>          msg = mboxutils.get_message(msg)
> +        if msg.has_key(header):
> +            del msg[header]
>          prob, clues = self._scoremsg(msg, True)
>          if prob < ham_cutoff:
>              disp = options.header_ham_string
> 
> 
> -- 
> Todd Mokros <niltsiar@neo.rr.com>
> 
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org
> http://mail.python.org/mailman/listinfo/spambayes

From tim@fourstonesExpressions.com  Mon Nov 18 02:52:49 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 17 Nov 2002 20:52:49 -0600
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <w53smxz619t.fsf@woozle.org>
Message-ID: <OI1UQM64TSED84XWDLK5ZGA4QP94DA.3dd85601@riven>

Here's a diff between hammie 1.38 and 1.39

cvs diff -r 1.38 -r 1.39 hammie.py (in directory C:\sourceforge\spambayes\)
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.38
retrieving revision 1.39
diff -r1.38 -r1.39
13c13
<         Can be specified more than once.
---
>         Can be specified more than once, or use - for stdin.
16c16
<         Can be specified more than once.
---
>         Can be specified more than once, or use - for stdin.
42a43
> import types
112,113c113,120
<         if self.hash.has_key(key):
<             return pickle.loads(self.hash[key])
---
>         v = self.hash[key]
>         if v[0] == 'W':
>             val = pickle.loads(v[1:])
>             # We could be sneaky, like pickle.Unpickler.load_inst,
>             # but I think that's overly confusing.
>             obj = classifier.WordInfo(0)
>             obj.__setstate__(val)
>             return obj
115c122
<             raise KeyError(key)
---
>             return pickle.loads(v)
118c125,129
<         v = pickle.dumps(val, 1)
---
>         if isinstance(val, classifier.WordInfo):
>             val = val.__getstate__()
>             v = 'W' + pickle.dumps(val, 1)
>         else:
>             v = pickle.dumps(val, 1)

*****CVS exited normally with code 1*****
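
The storage scheme in this diff can be sketched in isolation: WordInfo
values are stored as a 'W' marker followed by the pickle of their state,
and everything else is stored as an ordinary pickle. The classes below
are simplified stand-ins written for this example (a dict replaces the
dbm file, and WordInfo carries only two counts), not the real hammie.py
or classifier.py code.

```python
import pickle


class WordInfo:
    """Simplified stand-in holding two counts."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state


class PrefixedStore:
    """Dict-backed sketch of the 'W'-prefixed value format."""

    def __init__(self):
        self.hash = {}

    def __setitem__(self, key, val):
        if isinstance(val, WordInfo):
            # Store only the state, marked so it can be recognised on read.
            self.hash[key] = b"W" + pickle.dumps(val.__getstate__(), 1)
        else:
            self.hash[key] = pickle.dumps(val, 1)

    def __getitem__(self, key):
        v = self.hash[key]  # raises KeyError for unknown keys
        if v[:1] == b"W":
            obj = WordInfo()
            obj.__setstate__(pickle.loads(v[1:]))
            return obj
        return pickle.loads(v)


store = PrefixedStore()
store["viagra"] = WordInfo(spamcount=5, hamcount=0)
store["nham"] = 10
word = store["viagra"]
print(word.spamcount, word.hamcount, store["nham"])
```

Storing the bare state rather than the instance keeps the class name
out of every record, so each stored value is smaller.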

- TimS

11/17/2002 8:48:46 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Ranieri J D Severiano <rjdsnet@yahoo.com> is all like:
>
>> ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail
>> Traceback (most recent call last):
>>   File "./hammie.py", line 497, in ?
>>     main()
>>   File "./hammie.py", line 459, in main
>>     bayes = createbayes(pck, usedb, mode)
>>   File "./hammie.py", line 401, in createbayes
>>     bayes = pickle.load(fp)
>>   File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor
>>     obj = base.__new__(cls, state)
>> TypeError: ('object.__new__(X): X is not a type object (class)', <function _reconstructor at 0x8148cd4>, (<class classifier.Bayes at 0x82103b4>, <type 'object'>, None))
>
>Yikes!
>
>I'm pretty sure I didn't change anything that would affect the way
>pickles are stored (they don't use PersistentGrahamBayes or the DBDict
>classes), but it sure does look like *something* has changed for you.
>
>Unfortunately, SF CVS is down for the day, so I can't check to see
>what's changed between those versions.
>
>> >>> from cPickle import load
>> >>> f = open('hammie.db')
>> >>> o = load(f)
>
>Could you try the same thing, importing load from pickle instead?  It
>will give a better traceback.  I don't know enough yet about how the
>pickling works to be able to diagnose this without some more information
>first.
>
>Neale
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Mon Nov 18 03:16:45 2002
From: neale@woozle.org (Neale Pickett)
Date: 17 Nov 2002 19:16:45 -0800
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <OI1UQM64TSED84XWDLK5ZGA4QP94DA.3dd85601@riven>
References: <OI1UQM64TSED84XWDLK5ZGA4QP94DA.3dd85601@riven>
Message-ID: <w53k7jb5zz6.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> Here's a diff between hammie 1.38 and 1.39

Ah, I see SF is back up.  Thanks, Tim :)

Ranieri, I can't find anything which would affect the pickling in the
diff between revisions 1.29 and 1.39 of hammie.py.  Maybe a traceback
from the pickle module will offer some more clues as to what's gone
wrong.

Neale

From tim@fourstonesExpressions.com  Mon Nov 18 04:44:42 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 17 Nov 2002 22:44:42 -0600
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <w53bs4n5wbu.fsf@woozle.org>
Message-ID: <05SNRPSM08A5E0VREDHCZYTVRSQKFC7.3dd8703a@riven>

I've just finished updating the DBDictBayes class to implement the read/store 
semantic.  I looked at the DBDict class, but I decided making it do that would 
be a bit more of a problem than making one of its delegators do it...

More comments below...

- TimS

11/17/2002 10:35:33 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> My Bayes module stuff has a store() method that is to store the
>> wordinfo.  This is a requirement for Richie's pop3proxy.  Right now
>> with DBDictBayes it's pretty much of a noop, only adding nham and
>> nspam to the persistent dictionary.  Can we alter dbdict, or make a
>> subclass, that accommodates this behavior?
>
>Hmm.  I just updated and got your new Bayes.py file.  I like!  This
>looks like what hammiefilter.py should be using, instead of hammie's
>DBDict.  Feel free to move that out and into your Bayes class, it looks
>like that's where it belongs.  

We could move the DBDict class to the Bayes module, or to its own little 
module... it really is a more generally useful class.

>Just make sure everything else still
>works :)  You'll probably have to modify hammie, hammiesrv, and
>hammiefilter.  Maybe hammiefilter should be renamed to just filter, if
>it's not going to use hammie.py anymore.
>
>Would that solve your problem?  If I understand it correctly, it
>should.  I think the hammie module ought to be split up into separate
>pieces anyway.

My problem with updating hammie is that I'm not too well equipped to test the 
mods...  I can certainly take a look at it, and give it a spin, but I doubt 
that I can test all the scenarios on my simple wynd0ze snoozer.  ;)

>
>Mind if we take this discussion onto the list?  I'm sure Richie will
>have some good input on the subject.
>
>Neale
>
>
>
>
- Tim
www.fourstonesExpressions.com 


From mhammond@skippinet.com.au  Thu Nov 14 07:13:51 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 14 Nov 2002 18:13:51 +1100
Subject: [Spambayes] Outlook users should update
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEMACKAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEDOHLAA.mhammond@skippinet.com.au>

And I just checked in a few changes too.  Of most note is that the plugin
should correctly filter all "unread, unscored" messages in your watch
folders at startup.  Works for me - let me know if it does for you too
<wink>

Mark.


From tim.one@comcast.net  Mon Nov 18 06:12:19 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:12:19 -0500
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <w53smxz619t.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEIACMAB.tim.one@comcast.net>

[Ranieri J D Severiano <rjdsnet@yahoo.com>]
> ranieri@uyrapuru:spambayes$ ./hammie.py -s ~/Mail/bulkmail
> Traceback (most recent call last):
>   File "./hammie.py", line 497, in ?
>     main()
>   File "./hammie.py", line 459, in main
>     bayes = createbayes(pck, usedb, mode)
>   File "./hammie.py", line 401, in createbayes
>     bayes = pickle.load(fp)
>   File "/usr/lib/python2.2/copy_reg.py", line 40, in _reconstructor
>     obj = base.__new__(cls, state)
> TypeError: ('object.__new__(X): X is not a type object
>   (class)', <function _reconstructor at 0x8148cd4>, (<class
>   classifier.Bayes at 0x82103b4>, <type 'object'>, None))

This is what happens if you try to load a Bayes pickle created before
classifier.Bayes changed from a new-style class to an old-style class.
You're best off retraining from scratch.  If you're desperate to retrieve
the old data, you can change

class Bayes:

back to

class Bayes(object):

and the pickle should load again.
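
The tuple in the traceback reflects how new-style instances are pickled
under the old protocols: the pickle records a call to _reconstructor with
(class, base, state), and loading replays that call, which ends in
object.__new__(class).  A minimal sketch in current Python, where the module
is named copyreg and every class is new-style, so the failure itself cannot
be reproduced.  The Bayes class here is an empty stand-in, not the real
classifier.Bayes:

```python
import copyreg


class Bayes(object):
    """Empty stand-in for classifier.Bayes (hypothetical, for illustration)."""


# Protocols 0 and 1 reduce a plain instance to a call to _reconstructor.
reduced = Bayes().__reduce_ex__(1)
func, args = reduced[0], reduced[1]
print(func is copyreg._reconstructor)
print(args == (Bayes, object, None))

# Unpickling replays that call; it only works while Bayes is still a type
# that object.__new__ accepts.
restored = func(*args)
print(type(restored) is Bayes)
```

The args tuple has the same shape as the one in the TypeError above, which
is why changing what "Bayes" refers to breaks previously stored pickles.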


From tim.one@comcast.net  Mon Nov 18 06:15:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:15:25 -0500
Subject: [Spambayes] hammie's dbm file has changed
In-Reply-To: <w53k7jb5zz6.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEIACMAB.tim.one@comcast.net>

[Neale Pickett]
> Ah, I see SF is back up.  Thanks, Tim :)

You're welcome.  Feel emboldened, for my next act I'm thinking of making the
last digit of the calendar year change in, oh, about a month and a half.

> Ranieri, I can't find anything which would affect the pickling in the
> diff between revisions 1.29 and 1.39 of hammie.py.  Maybe a traceback
> from the pickle module will offer some more clues as to what's gone
> wrong.

It's got nothing to do with hammie -- Bayes changed from a new-style to an
old-style class to make Jeremy's ZODB life easier, and a pickle of a
new-style class instance can't be loaded after the class has changed in this
way.


From tim.one@comcast.net  Mon Nov 18 06:48:54 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:48:54 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <3DD74E5F.9060606@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEICCMAB.tim.one@comcast.net>

[Rob Hooft]
> I collected some unigram statistics yesterday: training hammie on my
> 2x10 sets in the corpus one by one, and after each 1600ham+580spam set
> run a program that reports the would-be collisions using the python 32
> bit hash function:
>
> Set1 : 109280
> Set2 : 183560
> Set3 : 227699 (2 clashes)
> Set4 : 277253 (3 clashes)
> Set5 : 329662 (5)
> Set6 : 362847 (7)
> Set7 : 394585 (12)
> Set8 : 422898 (12)
> Set9 : 448767 (16)
> Set10: 481393 (22)

I'm assuming the "big numbers" there are the number of distinct tokens,
rather than the number of distinct 32-bit hash codes.  If so, they're all a
little better than could be expected from a truly random 32-bit hash code.

BTW, after tossing N balls into M buckets, the number of occupied
buckets has

mean       M - M*(1-1/M)**N
variance   M*(M-1)*(1-2/M)**N + M*(1-1/M)**N - M**2*(1-1/M)**(2*N)

Unfortunately, those expressions are numerically intractable using double
precision when M and N get large.  The exact distribution is intractable
even in theory; Knuth gives an iterative algorithm for computing it given
specific M and N, which takes time super-linear in N.  I have software left
over for this stuff from Python's years-ago experiments.

> ...
> Here, the number of hash collisions is still fairly low, but subtract
> bits, and see it explode.....

Oh yes!  The biggest pickle I've got sitting around has 327,439 tokens.
Using the last 20 bits of the hash code means using 2**20 ~= a million
buckets, and the mean number of collisions then is expected to be 46193.8,
with sdev 174.5 (nb: it's not a normal distribution).  Actually doing this
gave

    46,184  collisions using the low 20 bits of hash()
    46,481  collisions using the low 20 bits of binascii.crc32()

So either way is "random enough" for this range of numbers.
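
A rough sketch of that kind of count, assuming "collisions" means tokens
whose truncated code was already taken by another token.  It uses
binascii.crc32 only, because current Python randomizes str hash() per
process; the token list is a made-up stand-in for a real vocabulary.

```python
import binascii


def low_bit_collisions(tokens, bits):
    """Count tokens whose truncated crc32 was already taken by another token."""
    mask = (1 << bits) - 1
    distinct = set(tokens)
    codes = {binascii.crc32(tok.encode("utf-8")) & mask for tok in distinct}
    return len(distinct) - len(codes)


tokens = ["token%d" % i for i in range(5000)]  # hypothetical vocabulary
for bits in (32, 20, 12):
    print(bits, low_bit_collisions(tokens, bits))
```

Dropping bits can only merge codes, never split them, so the count never
goes down as the number of bits shrinks.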

> Another thing that I learned from this, is that the number of distinct
> words with this test does not increase with the sqrt of the number of
> messages.

Perhaps not, but we're going to pretend that it is anyway because that's
such a pretty & quotable result <wink>.


From tim.one@comcast.net  Mon Nov 18 06:57:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 01:57:12 -0500
Subject: [Spambayes] Seeking a giant idle machine w/ a miserable corpus
In-Reply-To: <20021118022457.CC988F58D@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEICCMAB.tim.one@comcast.net>

[T. Alexander Popiel, tries "exact bigrams"]
> I haven't been able to do a big run of this, but here's my
> results:

Thank you!

> filename:      org  orgbix
> ham:spam:  1000:1000
>                    1000:1000
> fp total:        3       2
> fp %:         0.30    0.20
> fn total:       10       7
> fn %:         1.00    0.70
> unsure t:       27      28
> unsure %:     1.35    1.40
> real cost:  $45.40  $32.60
> best cost:  $24.00  $24.20
> h mean:       0.43    0.50
> h sdev:       5.64    5.95
> s mean:      97.94   98.28
> s sdev:      11.59   10.45
> mean diff:   97.51   97.78
> k:            5.66    5.96
>
> This is from a five-fold cross validation run.  Looks very nice.

Yet the "best cost" measure increased; add that to the list of mysteries.
I'd be keener about it if it were clearer how to make the time and database
burdens reasonable.  A less anal way of searching for the strongest unigrams
and bigrams would probably take care of time (Gary suggested something
cheaper to begin with, but that could miss some high-strength bigrams in
favor of lower-value unigrams, and I wanted more to see the ultimate
potential here).  The database bloat is jaw-dropping, though, and I'm still
unsure why that is.  Hash codes are right out, IMO -- the goofy mistakes
they lead to are intolerable.


From tim.one@comcast.net  Mon Nov 18 07:17:38 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 02:17:38 -0500
Subject: [Spambayes] Outlook users should update
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEDOHLAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIFCMAB.tim.one@comcast.net>

[Mark Hammond]
> And I just checked in a few changes too.  Of most note is that the
> plugin should correctly filter all "unread, unscored" messages in your
> watch folders at startup.  Works for me - let me know if it does
> for you too <wink>

I never had this problem, & am happy to report that I still don't <wink>.

One other change:  the new (and I hope temporary)
experimental_ham_spam_imbalance_adjustment option is enabled by default in
the Outlook client now (this is specific to the Outlook client:  it's
disabled by default for everyone else).

It won't do you any good (or harm ...) until you convince the client to
update probabilities, though.  You do *not* need to retrain your database
from scratch.  You just need to convince it to call the classifier's
update_probabilities() method once.  The easiest way to do that may be to
drag a ham to your spam folder, then select that ham in your spam folder,
and click "Recover from spam".


From Paul.Moore@atosorigin.com  Mon Nov 18 10:00:48 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 10:00:48 -0000
Subject: [Spambayes] Just for fun
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
> Good!  On my tiny still-hapax-driven purely-mistake-based at-home
> classifier (which is up to 79 each of ham and spam trained on) it
> fared much worse:

Mine got some interesting results... The DB is trained on 366 good,
496 spam, which came mostly from collected spam over a week or so,
plus the contents of my Inbox, and then training on mistakes (not
many). My Inbox is, in some senses, a *lousy* source of ham, as it's
mainly stuff I couldn't find a better home for. So it is 99% internal
mail (ie, from Exchange rather than Internet mail) and probably
comprises a spammier-than-average slice of my ham. But if I train
on all my ham (across multiple folders) I get a massive ham:spam
imbalance. (When I next get to CVS update, I'll try Tim's new tweak to
compensate for imbalances).

I'm not good at interpreting this stuff yet, but it came out as
solidly unsure, with some interesting features. The 'sender:no real
name:2**0' as a solid ham clue is almost certainly due to Exchange
(basically because Exchange doesn't do real headers, I expect) - I see
most internet headers as good spam clues, which is mildly worrying,
although hasn't caused any real issues yet.

The obvious implication is that getting a really good training corpus
is *hard*. Probably beyond the means of the average user. But as a
lousy corpus still gives good results, it's hard to decide whether or
not to care.

Here's the clues.

Spam Score: 0.349681


word                                spamprob         #ham  #spam
'*H*'                               0.998703            -      -
'*S*'                               0.698066            -      -
'sender:no real name:2**0'          0.00884086         25      0
'subject:['                         0.0155709          14      0
'url:mailman'                       0.0167286          13      0
'url:listinfo'                      0.0180723          12      0
'specific'                          0.0196507          11      0
'is.'                               0.0238095           9      0
'url:python'                        0.0266272           8      0
'to:addr:python.org'                0.0412844           5      0
'them,'                             0.0505618           4      0
'sender:addr:python.org'            0.0505618           4      0
'problem'                           0.0521891          44      3
'url:org'                           0.0567176          28      2
'know,'                             0.0652174           3      0
'email addr:python.org'             0.0652174           3      0
'skip:_ 40'                         0.0652174           3      0
'delivery'                          0.0676112          13      1
'updated'                           0.0676112          13      1
"can't"                             0.0683657          43      4
'running'                           0.0727202          12      1
'set'                               0.0789344          54      6
'date'                              0.0912609          24      3
'mission'                           0.0918367           2      0
'sorted'                            0.0918367           2      0
'host'                              0.104237            8      1
'base'                              0.116911            7      1
'various'                           0.121676           12      2
'using'                             0.125907           73     14
'content-type:text/plain'           0.128685          326     65
'however'                           0.133102            6      1
'back'                              0.145642           40      9
'ask'                               0.149462           22      5
'site.'                             0.154513            5      1
'solve'                             0.154513            5      1
'contains'                          0.154992            9      2
'net.'                              0.155172            1      0
'url:spambayes'                     0.155172            1      0
'sender:addr:spambayes-bounces'     0.155172            1      0
'spambayes'                         0.155172            1      0
'weekly.'                           0.155172            1      0
'subject:email'                     0.155172            1      0
'shut'                              0.155172            1      0
'second.'                           0.155172            1      0
'policies'                          0.155172            1      0
'parameters'                        0.155172            1      0
'emails.'                           0.155172            1      0
'email name:spambayes'              0.155172            1      0
'duplicate'                         0.155172            1      0
'together'                          0.170569            8      2
'current'                           0.175793           25      7
'closed'                            0.184169            4      1
'paying'                            0.184169            4      1
'data'                              0.184776           17      5
'there'                             0.184986           95     29
'meet'                              0.189638            7      2
'close'                             0.189638            7      2
'need'                              0.190325           98     31
'site'                              0.190699           32     10
'being'                             0.192172           41     13
'directly'                          0.204069           15      5
'they'                              0.206035           66     23
'may'                               0.206235          100     35
'use'                               0.208203           96     34
'been'                              0.223143           93     36
'have'                              0.229481          221     89
'header:Received:9'                 0.238618           24     10
'just'                              0.251571           97     44
'like'                              0.253328           70     32
'product'                           0.261037           15      7
'not'                               0.263917          200     97
'can'                               0.263933          165     80
'reply-to:none'                     0.266838          343    169
'noheader:reply-to'                 0.266838          343    169
'will'                              0.268784          175     87
'only'                              0.27345            67     34
'that'                              0.284503          223    120
'come'                              0.28675            26     14
'for'                               0.287229          253    138
'against'                           0.292477           11      6
'down'                              0.294767           25     14
'once'                              0.294767           25     14
'new'                               0.299542           90     52
'campaign'                          0.299577            2      1
'reliable'                          0.299577            2      1
'find'                              0.304047           56     33
'already'                           0.30613            22     13
'service'                           0.308224           40     24
'well'                              0.308669           30     18
'see'                               0.308851           63     38
'way'                               0.310625           33     20
'many'                              0.311285           28     17
"don't"                             0.317645           89     56
'again.'                            0.683667            4     12
'subject:.'                         0.697304           22     69
'card'                              0.698247            5     16
'low'                               0.70042             4     13
'totally'                           0.703898            3     10
'header:Errors-To:1'                0.718457           21     73
'header:Date:1'                     0.720317          142    496
'header:From:1'                     0.720317          142    496
'us,'                               0.72912             4     15
'header:Return-Path:1'              0.737732          130    496
'to:2**0'                           0.74404           123    485
'proto:http'                        0.772652           75    346
'to:no real name:2**0'              0.775127           90    421
'net'                               0.776394            2     10
'price'                             0.776394            2     10
'sites'                             0.776394            2     10
'success.'                          0.796678            1      6
'visit'                             0.797988           15     81
'url:www'                           0.805641           49    276
'matter'                            0.81794             2     13
'url:com'                           0.818306           50    306
'effective'                         0.819813            1      7
'marketing'                         0.831585            3     21
'companies'                         0.838229            1      8
'price.'                            0.844828            0      1
'subject:Bullet'                    0.844828            0      1
'time!'                             0.844828            0      1
'relax'                             0.844828            0      1
'proof'                             0.844828            0      1
'from:addr:concentric.net'          0.844828            0      1
'friendly'                          0.844828            0      1
'campaigns.'                        0.844828            0      1
'bullet'                            0.844828            0      1
'beautiful'                         0.844828            0      1
'$200'                              0.844828            0      1
'credit'                            0.871816            3     29
'emails'                            0.875534            3     30
'income'                            0.878287            2     21
'offer'                             0.890749            7     79
'merchant'                          0.908163            0      2
'complaints'                        0.908163            0      2
'cheap'                             0.908163            0      2
'header:Mime-Version:1'             0.923952           16    266
'url:mail'                          0.935447            3     62
'adult'                             0.958716            0      5
'lowest'                            0.958716            0      5
'advertise'                         0.969799            0      7
'prices'                            0.973373            0      8
'gambling'                          0.973373            0      8
'$500'                              0.97619             0      9
'hundreds'                          0.980349            0     11
'dollars'                           0.983271            0     13
'guarantee'                         0.983271            0     13
'thousands'                         0.984429            0     14
'million'                           0.990405            0     23
'bulk'                              0.990405            0     23
'advertising'                       0.990405            0     23
'advertised'                        0.995627            0     51
'websites'                          0.995942            0     55
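
For anyone else learning to read these tables, the spamprob column can be
reproduced from the two count columns.  The sketch below is an
after-the-fact reconstruction, not code lifted from the classifier: it
assumes a per-token ratio pulled toward a prior x = 0.5 with strength
s = 0.45, values chosen because they reproduce the rows above, together
with the 366 ham / 496 spam training totals.  The function name is made up.

```python
def word_spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Per-token spam probability, adjusted toward the prior x.

    Assumes the token was seen at least once (hamcount + spamcount > 0).
    """
    hamratio = min(1.0, hamcount / float(nham))
    spamratio = min(1.0, spamcount / float(nspam))
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    # Few sightings (small n) leave the result close to x.
    return (s * x + n * prob) / (s + n)


# Counts taken from rows of the table above; 366 ham and 496 spam trained.
for word, ham, spam in [("sender:no real name:2**0", 25, 0),
                        ("problem", 44, 3),
                        ("price.", 0, 1)]:
    print("%-28s %.6g" % (word, word_spamprob(ham, spam, 366, 496)))
```

This also shows why a token seen once in spam and never in ham sits at about
0.84 rather than 1.0: a single sighting is not allowed to outweigh the prior.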

Paul.

From msergeant@startechgroup.co.uk  Mon Nov 18 10:40:25 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 18 Nov 2002 10:40:25 +0000
Subject: [Spambayes] Just for fun
References: <Pine.LNX.4.33L2.0211151155540.4913-100000@dev.itsite.com>
Message-ID: <3DD8C399.8000305@startechgroup.co.uk>

Derek Simkowiak said the following on 15/11/02 17:35:
>>Remember that this project /is/ the first instance of a decent spam filter :),
>>so we can hardly blame the spammers for being a little behind.
> 
> 	Let's not forget that SpamBayes only works for individuals or
> workgroups who have the same definition of "ham".  It doesn't help much
> in enterprise-level settings with tens of thousands of users, since the
> ham of such a large and varied group of people would dilute the definition
> of spam too much to be useful.

I think you're exaggerating. It most certainly *does* help, and it 
helps a lot. For a large diverse group of users statistical analysis is 
still about 90% correct. It's not the 99.9% correct you get for an 
individual's mailbox, but as part of the bigger picture (involving 
statistics, rules, DNSBL's, etc) it's a huge help.

> 	I bet that playing the numbers game one could "show" that the
> helpdesk and maintenance costs of supporting a Python installation plus a
> per-person ham training procedure would be more expensive (for a Uni or
> Mega-Corp.) than just living with spam.  (Pure conjecture on my part, but
> it is easily imagined.)

Depends how you calculate the cost of spam. For me it's an interruption, 
and for my work (which involves intense periods of coding, maths, and 
reading) an interruption means I have to start again a lot of the time. 
For a highly paid programmer that cost could be about £20. Per spam.

And ignoring my email is often not an option: I have to support a spam 
solution for over a million users.

> 	There's another Python-based spam filter that might work better
> for SMTP server-wide deployment, called "Active Spam Killer", or ASK.
> 
> http://www.paganini.net/ask/
> 
> 	Its shtick is that it maintains a whitelist of people who may
> email you.  When an email from a new sender comes in, it holds the email
> for you, sends the person a simple confirmation message (to which they
> simply hit Reply;Send), and then that person is added to your whitelist
> and their original message is sent to you (and they are never ASKed
> again).  There's also some very practical regex stuff, some migration
> tools, and an ignorelist and blacklist (for situations like
> http://www.psychoexgirlfriend.com/).

This is the same as TMDA. I have evidence for you that it doesn't work. 
Case in point being direct from me: someone mailed me asking a technical 
question about one of my perl modules. I mailed him back a response, on 
my own free time. I got a TMDA bounce saying I had to confirm that I was 
a real person. Well frankly, sod that. I never replied. I never used his 
web page to confirm. I just ignored it and I'm sure he never got the 
reply to his question.

Now imagine extending that to corporations, where people would be even 
less inclined to add themselves to somebody's whitelist. TMDA doesn't scale.

Matt.


From Paul.Moore@atosorigin.com  Mon Nov 18 11:17:50 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 11:17:50 -0000
Subject: [Spambayes] A kinder, gentler hammie
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]

> So then, Paul Moore <lists@morpheus.demon.co.uk> is all like:
>
> > On Windows (which I use) "~" isn't handled by the OS. Applications
> > which use it often set the HOME environment variable properly,
> > though, so it *can* make sense. To make this work involves passing
> > filenames through os.path.expanduser().
>
> Hey cool! The audience for hammiefilter suddenly got larger. How
> would you use something like this on a Windows box? Can some MTAs
> run messages through an external filter one at a time? I thought
> only we Unix wonks were able to do that ;)

At home, I use an application called Hamster, which is a local
NNTP/POP/IMAP server and a NNTP/POP client, downloading from my ISP
and serving the stuff up locally. Like leafnode on Unix, but for POP
as well as NNTP.

It can run filters on each mail as it comes through, but I don't do
that...

I use Gnus as a client, and for that I use popget.py (which I posted
earlier) to grab mail from Hamster and add the appropriate header.
It's slightly more convenient for me than pop3proxy, but has no
training interface.

I'd use hammiefilter for training - set a command up in Gnus to
pipe the current message out to hammiefilter as spam or ham, as
appropriate.

> I suppose what we could do is try a number of pathnames for the ini
> file, and use the first one that works. If there's some reasonable
> way to figure out what to use for a "home directory" on Windows or
> Mac, without relying on non-standard environment variables, it would
> look there.

Sounds about right. Basically, using relative pathnames in a GUI
environment is error-prone, because you can't be sure what the current
directory is. Windows' normal solution is to use the registry to hold
absolute filenames, but that tends to be messy and generate registry
bloat (as well as being very naive-user-unfriendly).

[Thinking about this, isn't it a problem for Unix as well? What's the
current directory for a procmail filter?]

The standard environment variables which *can* be used for this sort
of thing are

1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the
   equivalent of Unix's $HOME. But for nearly all cases, these end
   up being C:\, which to my mind is a bad default.
2. USERPROFILE - %USERPROFILE% is a user-specific directory suitable
   for config information. But by default it's a directory with spaces
   in the name, which can be awkward for some purposes. It's also hard
   to navigate to in Windows explorer, which makes files stored there
   a little "hidden".

Also, many Unix ports (like XEmacs/Gnus) expect the user to set HOME,
so that they can work just like the cosy Unix environment they are
used to. (Python sort of supports this, with os.path.expanduser()).
I personally don't set a default HOME, but set HOME within each
application that expects it (via wrappers or startup scripts or
whatever).

I think "try a number of pathnames" is a sensible approach. I'd
suggest:

    %HOME%\bayescustomize.ini  --  will normally fail, as HOME is not
        set, but helps Unix compatibility for people who care, as well
        as offering an "application-specific" answer for people like me
        who use HOME that way
    %USERPROFILE%\bayescustomize.ini  --  the expected answer for
        people sophisticated enough to want to customise the
        application via an INI file.
    bayescustomize.ini  --  as a final fallback, for commandline use if
        nothing else.

Personally, I think that having an extra stage, where the INI file is
looked for relative to the application files, would be good, too. (Ie, look
in the same directory as sys.argv[0]). But opinions on this are often
divided.
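
The search order suggested above could be sketched roughly as follows.  The
function name and its parameters are made up for illustration, and putting
the script-directory stage last is an assumption, since the order of that
extra stage is left open.

```python
import os
import sys


def find_ini(environ=os.environ, exists=os.path.exists, script_dir=None,
             name="bayescustomize.ini"):
    """Return the first candidate path that exists, or None."""
    candidates = []
    # HOME first (Unix habit), then USERPROFILE (the usual Windows answer).
    for var in ("HOME", "USERPROFILE"):
        base = environ.get(var)
        if base:
            candidates.append(os.path.join(base, name))
    candidates.append(name)  # current directory, for command-line use
    if script_dir:
        # Optional extra stage: next to the application files.
        candidates.append(os.path.join(script_dir, name))
    for path in candidates:
        if exists(path):
            return path
    return None


here = os.path.dirname(os.path.abspath(sys.argv[0]))
print(find_ini(script_dir=here))
```

Passing environ and exists as arguments is only there to make the lookup
easy to try out without touching the real environment or filesystem.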

> Of course, it could just rely on the BAYESCUSTOMIZE environment
> variable like it does now. But then you'd have to wrap hammiefilter
> to set the variable before running. I'm doing that currently, but I
> think a wrapper around a wrapper around a driver is pretty ugly. ;)

I could use that in my Gnus setup. But what the heck, it would be nice
if it worked in a way that other people could use as well :-)

> I'll check in some code that works in a Unix environment. Take a
> look at it and let me know what would make sense to try in a Windows
> environment (or just submit a change if you can do that).

Will do. I'll let you know, as I don't have commit privs. (I'll send
you a patch file).

Paul.

From sjoerd@acm.org  Mon Nov 18 11:29:32 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Mon, 18 Nov 2002 12:29:32 +0100
Subject: [Spambayes] Testers needed with unbalanced spam::ham training
	data
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCKEFOCMAB.tim.one@comcast.net> 
Message-ID: <20021118112937.7A3CD74B08@indus.ins.cwi.nl>

On Sun, Nov 17 2002 Tim Peters wrote:

> If you have a strong imbalance between the # of ham and # of spam in your
> training data (or even if you don't but can spare the effort), please do a
> before-and-after test, where after adds the new option:
> 
> [Classifier]
> experimental_ham_spam_imbalance_adjustment: True
> 
> I expect this option to go away and become the default, but it needs testing
> first before I'll do that.

It doesn't look like a win for me:

cv1 is all default, cv2 is with
experimental_ham_spam_imbalance_adjustment: True

filename:      cv1     cv2
ham:spam:  14600:4000
                   14600:4000
fp total:        8      16
fp %:         0.05    0.11
fn total:        3       3
fn %:         0.07    0.07
unsure t:       97     108
unsure %:     0.52    0.58
real cost: $102.40 $184.60
best cost:  $43.60 $137.80
h mean:       0.24    0.40
h sdev:       3.64    4.80
s mean:      99.44   99.65
s sdev:       5.00    3.92
mean diff:   99.20   99.25
k:           11.48   11.38
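
The derived rows in the table above follow from the raw counts. Here is a
minimal sketch of that arithmetic, assuming cost weights of $10 per false
positive, $1 per false negative and $0.20 per unsure. Those weights are an
assumption, not something stated in this message, though they do reproduce
the "real cost" row. The function name is made up for illustration.

```python
def summarize(n_ham, n_spam, fp, fn, unsure,
              fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Recompute the derived rows of a cv summary from raw counts.

    The three cost weights are assumed defaults, not taken from the thread.
    """
    return {
        "fp %": 100.0 * fp / n_ham,          # false positives are ham
        "fn %": 100.0 * fn / n_spam,         # false negatives are spam
        "unsure %": 100.0 * unsure / (n_ham + n_spam),
        "real cost": fp * fp_cost + fn * fn_cost + unsure * unsure_cost,
    }


cv1 = summarize(14600, 4000, fp=8, fn=3, unsure=97)
cv2 = summarize(14600, 4000, fp=16, fn=3, unsure=108)
```

Doubling the false positives from 8 to 16 accounts for $80 of the
difference between the two real costs, which is why the option does not
look like a win on this corpus.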

-- Sjoerd Mullender <sjoerd@acm.org>

From francois.granger@free.fr  Mon Nov 18 14:19:35 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 18 Nov 2002 15:19:35 +0100
Subject: [Spambayes] Classify issue with pop3proxy
Message-ID: <B9FEB586.5CCAB%francois.granger@free.fr>

If I cut and paste a message into the box, it gets classified. If I open it
through the [file...] button, it gets the following result:

===========================
Spam probability: 0.52423052

Clues:

*H*    0.58188342
*S*    0.63034446
x-mailer:none    0.21414650
content-type:text/plain    0.24312113
message-id:invalid    0.93478261
===========================

My guess is that this is a MacOS line ending issue. But this works for
training both ways. The difference I see is line 774 in onTrain, which is
not in onClassify. I suggest adding it at line 793.

Tested here, it works.


In this morning's CVS, pop3proxy.py line 763 et seq.:

    def onTrain(self, params):
        """Train on an uploaded or pasted message."""
        # Upload or paste?  Spam or ham?
        message = params.get('file') or params.get('text')
        isSpam = (params['which'] == 'Train as Spam')

        # Append the message to a file, to make it easier to rebuild
        # the database later.   This is a temporary implementation -
        # it should keep a Corpus (from Tim Stone's forthcoming message
        # management module) to manage a cache of messages.  It needs
        # to keep them for the HTML retraining interface anyway.
        message = message.replace('\r\n', '\n').replace('\r', '\n') #<====
        if isSpam:
            f = open("_pop3proxyspam.mbox", "a")
        else:
            f = open("_pop3proxyham.mbox", "a")
        f.write("From pop3proxy@spambayes.org Sat Jan 31 00:00:00 2000\n")
        f.write(message)
        f.write("\n\n")
        f.close()

        # Train on the message.
        tokens = tokenizer.tokenize(message)
        self.bayes.learn(tokens, isSpam, True)
        self.push("<p>OK. Return <a href='/'>Home</a> or train another:</p>")
        self.push(self.pageSection % ('Train another', self.train))

    def onClassify(self, params):
        """Classify an uploaded or pasted message."""
        message = params.get('file') or params.get('text')
        tokens = tokenizer.tokenize(message)               #<====
        prob, clues = self.bayes.spamprob(tokens, evidence=True)
        self.push("<p>Spam probability: <b>%.8f</b></p>" % prob)
        self.push("<table class='sectiontable' cellspacing='0'>")
        self.push("<tr><td class='sectionheading'>Clues:</td></tr>\n")
        self.push("<tr><td class='sectionbody'><table>")
        for w, p in clues:
            self.push("<tr><td>%s</td><td>%.8f</td></tr>\n" % (w, p))
        self.push("</table></td></tr></table>")
        self.push("<p>Return <a href='/'>Home</a> or classify another:</p>")
        self.push(self.pageSection % ('Classify another', self.classify))
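
The suggested fix amounts to applying the same line-ending normalization
in onClassify that onTrain already does before tokenizing. A minimal
standalone sketch of that step follows. The helper name is hypothetical;
in pop3proxy.py the replace() calls are written inline.

```python
def normalize_line_endings(message):
    """Convert CRLF and bare CR line endings to LF."""
    # Order matters: handle CRLF first, otherwise each CRLF would
    # turn into two newlines.
    return message.replace('\r\n', '\n').replace('\r', '\n')


# A message saved with classic Mac line endings (bare CR).
mac_message = "Subject: hello\rMessage-Id: <1@example.com>\r\rbody text\r"
unix_message = normalize_line_endings(mac_message)
```

With bare CRs the headers would presumably not be split into separate
lines, which would fit the odd 'message-id:invalid' clue in the report
above.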

-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From Paul.Moore@atosorigin.com  Mon Nov 18 15:57:40 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 15:57:40 -0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com>

Um, is it just me, or does hammiefilter not save the database if
you're using a pickle?

A patch is attached. It doesn't feel "clean" (it would be nice if
PersistentBayes covered both pickles and DB files, as well as any
other cases which may appear) but it'll do for now. Maybe I'll look at
the wider issue when I have some time...

Paul.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: hammiefilter.patch
Type: application/octet-stream
Size: 1070 bytes
Desc: hammiefilter.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021118/ca757103/hammiefilter.exe
From tim@fourstonesExpressions.com  Mon Nov 18 16:03:04 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 18 Nov 2002 10:03:04 -0600
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID: <WSPMVS7YWIFFEUP87SM65PNXV419NJ.3dd90f38@riven>

It's not you... you have to manually save it for the moment.

import pickle

fp = open(dbname, 'wb')     # dbname: path of your pickle file
pickle.dump(bayes, fp, 1)   # bayes: the classifier instance you trained
fp.close()

We're beginning to work on making hammie* use Bayes.PersistentBayes to take 
care of this kind of stuff for you, but we're not there yet.

- TimS

11/18/2002 9:57:40 AM, "Moore, Paul" <Paul.Moore@atosorigin.com> wrote:

>Um, is it just me, or does hammiefilter not save the database if
>you're using a pickle?
>
>A patch is attached. It doesn't feel "clean" (it would be nice if
>PersistentBayes covered both pickles and DB files, as well as any
>other cases which may appear) but it'll do for now. Maybe I'll look at
>the wider issue when I have some time...
>
>Paul.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
- Tim
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Mon Nov 18 16:13:00 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 18 Nov 2002 10:13:00 -0600
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DDC@UKDCX001.uk.int.atosorigin.com>
Message-ID: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>

I think applying your patch will break more things than it'll fix at the 
moment.  There's a lot of in-memory training that is done for testing purposes 
that's never saved, at least that's what I believe.

The Bayes.* classes have a load/store semantic, which is fully implemented for 
PickledBayes.  We can *probably* simply change hammie.createbayes to create a 
PickledBayes object instead of a Bayes object, and everything will work 
nicely.  I can't do that till the end of this week, though.  You might check 
with Neale and see what he thinks.

BTW, the Bayes.* classes do not auto-save, but they do offer a store() method 
that you can call which makes storing the stuff easy.
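
To illustrate what a load/store semantic without auto-save looks like,
here is a minimal sketch in current Python. The class PickleStore and its
attributes are hypothetical stand-ins, not the actual Bayes.PickledBayes
code.

```python
import os
import pickle


class PickleStore:
    """Hypothetical stand-in for a classifier with load/store semantics."""

    def __init__(self, db_name):
        self.db_name = db_name
        self.wordinfo = {}
        self.load()

    def load(self):
        # A missing file simply means "start with an empty database".
        if os.path.exists(self.db_name):
            with open(self.db_name, 'rb') as fp:
                self.wordinfo = pickle.load(fp)

    def store(self):
        # Nothing reaches the disk until the caller asks for it.
        with open(self.db_name, 'wb') as fp:
            pickle.dump(self.wordinfo, fp, 1)
```

A caller that trains in memory for testing purposes just never calls
store(), so that kind of use keeps working.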

- TimS

11/18/2002 10:05:59 AM, "Moore, Paul" <Paul.Moore@atosorigin.com> wrote:

>From: Tim Stone - Four Stones Expressions
>
>> It's not you... you have to manually save it for the moment.
>> 
>> fp = open(<dbname string>, 'wb')
>> pickle.dump(<bayes instance>, fp, 1)
>> fp.close()
>>
>> We're beginning to work on making hammie* use Bayes.PersistentBayes
>> to take care of this kind of stuff for you, but we're not there yet.
>
>Thanks. If work on this is already under way, I'll keep out of the way :-)
>Unless there's anything I can do to help?
>
>Paul.
>
>PS Is my patch worth applying as an interim measure?
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Mon Nov 18 16:39:43 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 08:39:43 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB885E2DDB@UKDCX001.uk.int.atosorigin.com>
Message-ID: <w531y5i6ddc.fsf@woozle.org>

So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

> Um, is it just me, or does hammiefilter not save the database if
> you're using a pickle?

Ah, no, it wouldn't do that.  As Tim Stone says, a clean solution is
pending.

In the meantime, though, I'm curious about how you're using
hammiefilter.  Loading up the entire pickle is painfully slow compared
to the dbm method, and as hammiefilter is made specifically to run once
per message and then go away, the pickle is a particularly bad fit.
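
The difference between the two storage methods can be sketched with the
standard pickle and dbm modules. This illustrates the access pattern only;
the function names and the one-record-per-word layout are assumptions, not
hammie's actual database format.

```python
import dbm
import pickle


def lookup_from_pickle(path, word):
    # The whole dictionary is deserialized to answer a single lookup,
    # so start-up cost grows with the size of the database.
    with open(path, 'rb') as fp:
        table = pickle.load(fp)
    return table.get(word)


def lookup_from_dbm(path, word):
    # Only the requested key is read from disk.
    with dbm.open(path, 'r') as db:
        key = word.encode('utf-8')
        return pickle.loads(db[key]) if key in db else None
```

For a process that starts, scores one message and exits, the second
pattern only touches the handful of words in that message.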

Are you running hammiefilter from procmail?  How big is your pickle?

Neale


From tim@fourstonesExpressions.com  Mon Nov 18 16:43:06 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 18 Nov 2002 10:43:06 -0600
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <w531y5i6ddc.fsf@woozle.org>
Message-ID: <98FBOLZVB8HGBGB1XNON2YZYIE6YT.3dd9189a@riven>

On a related subject, has there been any work done to persist the training 
into a ZODB?  DBM has its own set of limitations: e.g. very large database 
file. If there is ZODB work, how would I get ahold of that stuff?  I don't see 
anything like that in cvs anywhere, or maybe I'm just missing it.

- TimS

11/18/2002 10:39:43 AM, Neale Pickett <neale@woozle.org> wrote:

>So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:
>
>> Um, is it just me, or does hammiefilter not save the database if
>> you're using a pickle?
>
>Ah, no, it wouldn't do that.  As Tim Stone says, a clean solution is
>pending.
>
>In the meantime, though, I'm curious about how you're using
>hammiefilter.  Loading up the entire pickle is painfully slow compared
>to the dbm method, and as hammiefilter is made specifically to run once
>per message and then go away, the pickle is a particularly bad fit.
>
>Are you running hammiefilter from procmail?  How big is your pickle?
>
>Neale
>
- Tim
www.fourstonesExpressions.com 


From Paul.Moore@atosorigin.com  Mon Nov 18 16:54:38 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 18 Nov 2002 16:54:38 -0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]
> So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

>> Um, is it just me, or does hammiefilter not save the database if
>> you're using a pickle?

> Ah, no, it wouldn't do that.  As Tim Stone says, a clean solution is
> pending.

> In the meantime, though, I'm curious about how you're using
> hammiefilter.  Loading up the entire pickle is painfully slow compared
> to the dbm method, and as hammiefilter is made specifically to run once
> per message and then go away, the pickle is a particularly bad fit.

> Are you running hammiefilter from procmail?  How big is your pickle?

I'm not using hammiefilter in its "filter" mode at all. I was planning on
using it for single-message incremental training ("Train as spam/ham") by
piping the message to "hammiefilter -[gs]". I guess that "hammie -[gs] -"
is just as good for this usage (well, better - it works!)

I realise that this area is currently in a state of flux. I don't have a
problem with changing things as I go. It's just a case of "what's best
right now?"

As for why I'm using pickles, it's simply because that's the default. I
don't have enough feel for things (or a large enough base of messages) to
have a problem either way. (My popget.py goes via hammie.py, and so loads
the pickle once per scan through the POP mailbox. Performance for this
has been fine so far, but I've only just started, so don't read too much
into that...)

This would be much easier if I wasn't working in "batch mode" - mild-
mannered (as if!) Exchange/Outlook user by day, masked Gnus/POP3 user
by night :-)

Paul.

From neale@woozle.org  Mon Nov 18 16:54:56 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 08:54:56 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>
References: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>
Message-ID: <w53u1ie4y3j.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> I think applying your patch will break more things than it'll fix at
> the moment.  There's a lot of in-memory training that is done for
> testing purposes that's never saved, at least that's what I believe.

I'm not sure what you're thinking of here, but regardless of whether or
not there's wasted effort, unless the pickle is written out again, any
training is useless.  I've put a variation of Paul's patch in, and as
soon as SF CVS comes back up again, I'll check it all in.

> The Bayes.* classes have a load/store semantic, which is fully
> implemented for PickledBayes.  We can *probably* simply change
> hammie.createbayes to create a PickledBayes object instead of a Bayes
> object, and everything will work nicely.  I can't do that till the end
> of this week, though.  You might check with Neale and see what he
> thinks.

Neale thinks this is the right way to do it.  If the Bayes.* classes
write out their state on destruction, we can treat them all the same.
That's easy enough, just have them call self.store() in the __del__
method.

Neale will try and see if he can check in something that does that.
Neale's having problems with SF CVS today, though.

Yours truly,

Me


From tim.one@comcast.net  Mon Nov 18 16:56:12 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 11:56:12 -0500
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <98FBOLZVB8HGBGB1XNON2YZYIE6YT.3dd9189a@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCIELECMAB.tim.one@comcast.net>

[Tim Stone]
> On a related subject, has there been any work done to persist the
> training into a ZODB?  DBM has its own set of limitations: e.g. very
> large database file. If there is ZODB work, how would I get ahold of
> that stuff?  I don't see anything like that in cvs anywhere, or maybe
> I'm just missing it.

The project's pspam directory contains Jeremy Hylton's work on integrating
the classifier with ZODB and ZEO.


From neale@woozle.org  Mon Nov 18 17:09:19 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 09:09:19 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB88619940@UKDCX001.uk.int.atosorigin.com>
Message-ID: <w53ptt24xfk.fsf@woozle.org>

So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

> I realise that this area is currently in a state of flux. I don't have a
> problem with changing things as I go. It's just a case of "what's best
> right now?"

CVS is back up, so the answer is now "hammiefilter".

Wait, no, it's down again.  The answer is now "hammie".

Your setup is nearly identical to mine.  Here is some elisp to bind "B
h" to "train as ham and move to another folder" and "B s" to "train as
spam and move to the spam folder".  Drop it into .gnus.

  (defun pipe-message (command)
    (interactive "sCommand: ")
    (save-window-excursion
      (gnus-summary-show-article 'raw)
      (gnus-summary-select-article-buffer)
      (shell-command-on-region (point-min) (point-max) command))
    (gnus-summary-show-article))

  (defun spam ()
      (interactive)
        (pipe-message "/home/neale/bin/hammie -s")
        (gnus-summary-move-article 1 "spam"))

  (defun notspam ()
      (interactive)
        (pipe-message "/home/neale/bin/hammie -g")
        (gnus-summary-move-article 1))

  (add-hook
   'gnus-sum-load-hook
   (lambda nil
     (define-key gnus-summary-mode-map [(B) (h)] 'notspam)
     (define-key gnus-summary-mode-map [(B) (s)] 'spam)))

> As for why I'm using pickles, it's simply because that's the default.

Ah, that's what I figured.  So the new hammiefilter is going to change
the default to the dbm method, but *only* for hammiefilter.  That sucks,
cause now there's this weird disparity between the two.  Maybe we should
consider changing the default in Options.py...

> This would be much easier if I wasn't working in "batch mode" - mild-
> mannered (as if!) Exchange/Outlook user by day, masked Gnus/POP3 user
> by night :-)

Everyone has their own dirty little secret, but please spare us the
details of what you do in those phone booths ;)

Considering getting a cell phone now,

Neale

From neale@woozle.org  Mon Nov 18 17:18:37 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 09:18:37 -0800
Subject: [Spambayes] Just for fun
In-Reply-To: <3DD8C399.8000305@startechgroup.co.uk>
References: <Pine.LNX.4.33L2.0211151155540.4913-100000@dev.itsite.com>
	<3DD8C399.8000305@startechgroup.co.uk>
Message-ID: <w53lm3q4x02.fsf@woozle.org>

So then, Matt Sergeant <msergeant@startechgroup.co.uk> is all like:

> This is the same as TMDA. I have evidence for you that it doesn't
> work. Case in point being direct from me: someone mailed me asking a
> technical question about one of my perl modules. I mailed him back a
> response, on my own free time. I got a TMDA bounce saying I had to
> confirm that I was a real person. Well frankly, sod that. I never
> replied. I never used his web page to confirm. I just ignored it and I'm
> sure he never got the reply to his question.

I seem to recall about five or so years back, some guy asked a question
on comp.lang.python, and his email address was something like
"bob@nospam.sittingduck.com".  Then some guy named "Guido" replied to
the question with a very good answer, but the reply bounced because of
the "nospam." part.  The "Guido" chap declared that he wasn't inclined
to jump through hoops for the privilege of answering the guy's question.

That guy never got his question answered either, and I don't imagine
Guido has changed his stance since then.  It may seem like a good idea
at first, but you ignore mail at your peril.  Other people are busy
enough with their own spam problems, don't make them responsible for
yours too.

Neale


From tim.one@comcast.net  Mon Nov 18 17:28:32 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 12:28:32 -0500
Subject: [Spambayes] Just for fun
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCOELHCMAB.tim.one@comcast.net>

[Moore, Paul]
> ...
> I'm not good at interpreting this stuff yet, but it came out as
> solidly unsure, with some interesting features. The 'sender:no real
> name:2**0' as a solid ham clue is almost certainly due to Exchange
> (basically because Exchange doesn't do real headers, I expect)

If there is no Sender header, no token is generated.  You get 'sender:no
real name:2**0' only if there *is* a Sender header (and it doesn't contain a
real name).  The Outlook client's _GetFakeHeaders() doesn't synthesize a
Sender header, either.  So that token must come from internet mail.  It may
be a ham clue for you because some mailing lists create a Sender field
without a real name.  For example, the mailing-list version of
comp.lang.python adds this to its headers:

    Sender: python-list-admin@python.org

So that makes 'sender:no real name:2**0' a ham clue for me too.  That's
fine!  In my corpus, it is a ham indicator.
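
The rule described above (no header, no token; a header without a real
name, one token) can be sketched as follows. This is loosely modelled on
that description and is not the tokenizer's actual code. The function
name is made up, and reading the "2**0" suffix as a power-of-two bucket of
the number of nameless addresses is an assumption.

```python
import email
import math
from email.utils import getaddresses


def sender_name_tokens(msg, field='sender'):
    """Yield a 'no real name' token only if the header is present."""
    addrlist = msg.get_all(field, [])
    if not addrlist:
        return  # header absent: no token at all
    noname_count = 0
    for name, addr in getaddresses(addrlist):
        if not name:
            noname_count += 1
    if noname_count:
        # Bucket the count by powers of two; one nameless address -> 2**0.
        bucket = round(math.log2(noname_count))
        yield "%s:no real name:2**%d" % (field, bucket)


list_mail = email.message_from_string(
    "Sender: python-list-admin@python.org\nSubject: x\n\nbody\n")
tokens = list(sender_name_tokens(list_mail))
```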

> - I see most internet headers as good spam clues, which is mildly
> worrying, although hasn't caused any real issues yet.

If your spam comes from the internet, it's appropriate <wink>.

> The obvious implication is that getting a really good training corpus
> is *hard*. Probably beyond the means of the average user.

The best possible training corpus is the email they actually get, correctly
classified.  If they know their own judgment about ham vs spam, all the rest
should happen by magic.  It's still hard for clients to do that, though.


From richie@entrian.com  Mon Nov 18 18:01:48 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 18:01:48 +0000
Subject: [Spambayes] A kinder, gentler hammie
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com>
Message-ID: <079itu84n9lil7sqae4j9gge1sgppps34h@4ax.com>

Hi Paul,

> The standard [Windows] environment variables which *can* be used for
> this sort of thing are
> 
> 1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the
>    equivalent of Unix's $HOME. But for nearly all cases, these end
>    up being C:\, which to my mind is a bad default.
> 2. USERPROFILE - %USERPROFILE% is a user-specific directory suitable
>    for config information. But by default it's a directory with spaces
>    in the name, which can be awkward for some purposes. It's also hard
>    to navigate to in Windows explorer, which makes files stored there
>    a little "hidden".

Not true on 98:

C:\WIN98>set
PROMPT=$p$g
winbootdir=C:\WIN98
COMSPEC=C:\COMMAND.COM
TMPDIR=c:\win98\temp
TEMP=C:\win98\temp
TMP=c:\win98\temp
HOME=D:\BIN\richie
PATH=[paranoia]
QTDIR=C:\qt
TMAKEPATH=C:\qt\tmake\lib\win32-msvc
windir=C:\WIN98

and the only reason 'HOME' is there is that I manually added it - possibly
for the sake of Cygwin, but certainly for something that your typical
Windows user won't have.

Having said that, I agree with this:

> I think "try a number of pathnames" is a sensible approach.

...but is there a fallback that *always* works?  I'm not sure whether there
is - is argv[0] guaranteed to work, even in frozen / py2exe'd / Installer'd
/ cx_Frozen / Squeezed / etc. applications?
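
The "try a number of pathnames" idea might look something like this
sketch. The order of preference and the function name are assumptions.
The argv[0] fallback is included only as a candidate, since whether it
works in frozen applications is exactly the open question above, and the
current directory is added as a last resort.

```python
import os
import sys


def candidate_config_dirs(environ=None, argv0=None):
    """Return existing directories to try, most preferred first."""
    environ = os.environ if environ is None else environ
    argv0 = sys.argv[0] if argv0 is None else argv0
    candidates = [
        environ.get('HOME'),         # Unix, or set by hand on Windows
        environ.get('USERPROFILE'),  # per-user directory on NT-family Windows
    ]
    if environ.get('HOMEDRIVE') and environ.get('HOMEPATH'):
        candidates.append(environ['HOMEDRIVE'] + environ['HOMEPATH'])
    # Fallbacks: the directory holding the script, then the current one.
    candidates.append(os.path.dirname(os.path.abspath(argv0)))
    candidates.append(os.getcwd())
    # Drop unset variables and directories that do not exist.
    return [d for d in candidates if d and os.path.isdir(d)]
```

On a Windows 98 box like the one shown above, only HOME (if set by hand)
and the fallbacks would survive the filter.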

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Mon Nov 18 18:02:07 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 18:02:07 +0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <w53u1ie4y3j.fsf@woozle.org>
References: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>
	<w53u1ie4y3j.fsf@woozle.org>
Message-ID: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>

Hi Neale,

> Neale thinks this is the right way to do it.  If the Bayes.* classes
> write out their state on destruction, we can treat them all the same.
> That's easy enough, just have them call self.store() in the __del__
> method.

Richie thinks this is a bad move.  Here's a minor rant I sent to Tim Stone
when he did exactly this in his Bayes module:

--------------------------------------------------------------------------

PersistentBayes.__del__() calls store() - this seems like a bad thing for
three reasons.  One is that I might not want to save my changes to the
database - pop3proxy has explicit "Save & Shutdown" and "Shutdown"
buttons to give the user control over whether the database is saved or not
(to let you do speculative training and discard the results, for instance).
[This is the least important of the three reasons.  Four, four reasons!]
Also, the pop3proxy self-test uses an in-memory bayes instance that it
never wants to write to disk.  Secondly, it's unpredictable when __del__
will be called, or even *whether* it will be called - this:

class A:
    def __del__(self):
        print "A.__del__"

class B:
    def __del__(self):
        print "B.__del__"

a = A()
b = B()
a.b = b
b.a = a
print "Exiting..."

won't call either __del__ method in the current CPython implementation.

Thirdly, if users of PersistentBayes explicitly call store() - which seems
like the right thing to do - the database will be written out twice.  [And
that can take *a long time*.]

[snip]

I've found another reason why PersistentBayes.__del__() is a bad thing -
self.db_name isn't set in the case where a PickledBayes is created using a
filename that doesn't exist (which is done by the pop3proxy self-test) -
that was leading to exceptions being thrown from __del__, which is a
notoriously hard problem to track down.

--------------------------------------------------------------------------

I'd much rather have an explicit store() method and document the fact that
storage may be pre-empted by certain implementations.  Relying on __del__
is nasty.
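
A minimal sketch of the explicit-store alternative, in current Python.
ExplicitStore, its dirty flag and train_and_save are hypothetical names,
and the counter stands in for a real disk write.

```python
class ExplicitStore:
    """Hypothetical persistent object that is only saved on request."""

    def __init__(self):
        self.data = {}
        self.dirty = False
        self.writes = 0   # stands in for real (slow) disk writes

    def learn(self, word):
        self.data[word] = self.data.get(word, 0) + 1
        self.dirty = True

    def store(self):
        # Skip the write when nothing has changed, so a second
        # store() call costs nothing.
        if self.dirty:
            self.writes += 1
            self.dirty = False


def train_and_save(db, words, save=True):
    try:
        for word in words:
            db.learn(word)
    finally:
        # Runs even if training raises, but only if the caller wants it.
        if save:
            db.store()
```

The try/finally gives the "flush even if training blows up" behaviour
without depending on when, or whether, __del__ runs, and save=False
covers speculative training and self-tests.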

-- 
Richie Hindle
richie@entrian.com


From neale@woozle.org  Mon Nov 18 18:44:18 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 10:44:18 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>
References: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>
	<w53u1ie4y3j.fsf@woozle.org>
	<45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>
Message-ID: <w53bs4m4t19.fsf@woozle.org>

So then, Richie Hindle <richie@entrian.com> is all like:

> Richie thinks this is a bad move.  Here's a minor rant I sent to Tim Stone
> when he did exactly this in his Bayes module:

Neale thinks Richie makes some good points here.

My original reason for wanting to have the DB flush itself on deletion
was something to do with exceptions while training large corpora.  I
think it's time for that hack to go now.  There's nothing wrong with
explicitly calling store() on shutdown, as you say, it's cleaner and
more predictable.  So let's agree to do that.  Rather, I'll cave to
what's right and modify my dodgy code to do what yours is already doing
:)

I think it may finally be time to give hammie a big makeover--it should
just provide the Hammie class, and not be executable.  I'll ponder this
and post a big diff to see what you all think.

Neale


From richie@entrian.com  Mon Nov 18 19:06:18 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 19:06:18 +0000
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <w53bs4m4t19.fsf@woozle.org>
References: <HEQOOYSVPSQ2YQZX053WWR41GCNGA.3dd9118c@riven>
	<w53u1ie4y3j.fsf@woozle.org> <45aituot7k3emkpek6l60i228qbcu184mu@4ax.com>
	<w53bs4m4t19.fsf@woozle.org>
Message-ID: <63eituclg186qi547gha5vruvpmq5fq1rt@4ax.com>

Hi Neale,

> I think it may finally be time to give hammie a big makeover--it should
> just provide the Hammie class, and not be executable.

You might be right.  Especially given that Hammie can be used remotely via
XML-RPC, I wonder whether Tim Stone's Bayes class and Hammie should be
rolled into one, and any client (including pop3proxy) that currently uses
classifier.Bayes or Bayes.XXXBayes should use the new class(es) - that
would unify the API across all the clients, and make that API available
remotely for (almost) free.  We could even document it... 8-)

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Mon Nov 18 19:17:43 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 19:17:43 +0000
Subject: [Spambayes] Classify issue with pop3proxy
In-Reply-To: <B9FEB586.5CCAB%francois.granger@free.fr>
References: <B9FEB586.5CCAB%francois.granger@free.fr>
Message-ID: <ph9itu4a9s6fgemgbr1o1hbpps9okvq9tp@4ax.com>

Hi François,

> The difference I see is line 774 in onTrain, which is not
> in onClassify. I suggest adding it at line 793.

Thanks for that - I've checked in your patch.  Could you (or someone else
on a Mac) check that it works?  Many thanks.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Mon Nov 18 19:18:57 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 19:18:57 +0000
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <63bitu8kflib5at5tplk79qgm4deo4bohp@4ax.com>

Hi,

I've just checked in a new web training interface for pop3proxy.  It keeps
a cache of all the messages that it's proxied (using Tim Stone's Corpus
modules), and presents a web page with these untrained messages on, one
page per day's messages.  You check a Ham/Spam/Discard box next to each one
and submit them for training.  It also keeps trained messages but there's
no interface for *re*training yet - that and automatic training will come
soon, along with cache expiry.

I've put up a mockup at http://entrian.com/review2.html - none of the
buttons or links there works, but you can see what it looks like.  What I
want to do soon is auto-train on 'sure' spams and hams, and split the
training interface into 'Review hams', 'Review spams' and 'Review unsure'.
Or something.  I probably need to look at the way the Outlook stuff does
this.

One consequence of this is that pop3proxy will create three subdirectories
under its working directory in which to keep its caches:
pop3proxy-spam-cache, pop3proxy-ham-cache and pop3proxy-unknown-cache.  In
the somewhat unlikely event that you already have directories with these
names (!) you can configure them in bayescustomize.ini.

I've also fixed some problems that François was having on the Mac, whereby
it was falling over trying to re-open the log file, and uploading of
messages to classify wasn't working.

-- 
Richie Hindle
richie@entrian.com


From dereks@itsite.com  Mon Nov 18 17:11:21 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Mon, 18 Nov 2002 12:11:21 -0500 (EST)
Subject: [Spambayes] Just for fun
In-Reply-To: <w53lm3q4x02.fsf@woozle.org>
Message-ID: <Pine.LNX.4.33L2.0211181149520.6678-100000@dev.itsite.com>

> "bob@nospam.sittingduck.com".  Then some guy named "Guido" replied to

	Some guy named "Guido" on comp.lang.python?  What a coincidence!

	I think it would be best if we took the discussion of the merits
of ASK off this list.  I only wanted to mention it to get people thinking
and give my perspective on INBOX-based filtering... not to start a
long-running discussion of something that is not SpamBayes.


> the "nospam." part.  The "Guido" chap declared that he wasn't inclined
> to jump through hoops for the privilege of answering the guy's question.

	For the record: I never suggested that ASK should be used for
addresses where you EXPECT to get unsolicited emails.  Published addresses
like "sales@foo.com", "info@foo.com", and any email address used on a list
or newsgroup would of course be bad candidates for ASK.

	But for any email address where you do not expect unsolicited
emails (spam or not), I think it would be a reasonable protection.  If
somebody NEEDS to get a message through to you, then asking them to hit
"Reply;Send" one time is not unreasonable (in my opinion).

	If they won't take the time and effort to do that then I don't
really want their message... which is exactly why this technique works for
spam.  (Note that all friends, family, biz associates, etc. are
automatically added by pointing the tool at your pre-existing INBOX.)

	(I only wish I could get this system for my home telephone!)


>  Other people are busy enough with their own spam problems, don't make
> them responsible for yours too.

	I never suggested that anyone be made responsible for anything.


--Derek


From skip@pobox.com  Mon Nov 18 20:41:17 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 18 Nov 2002 14:41:17 -0600
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <15833.20589.376685.686723@montanaro.dyndns.org>


    Richie> I've put up a mockup at http://entrian.com/review2.html...

Some suggestions:

    * I think you need a 'defer' choice in addition to discard/ham/spam.  I
      may well want to train on some obvious ones right now, but don't have
      the time to investigate others which will require some thought.

    * It would be nice if the subject was 'hot' so you can click on it and
      view the entire message in a new window.

    * Given that the time to classify a message is pretty cheap, it would
      also be nice if your interface preset the radio buttons based on an
      initial classification of each message.  This suggests you need an
      'unsure' radio button as well.
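
The last suggestion could be sketched as a simple mapping from the
classifier's spam probability to the radio button to pre-select. The
function name is made up, and the two cutoff values are assumptions
chosen for illustration.

```python
def preset_choice(prob, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam probability to the radio button to pre-select."""
    if prob < ham_cutoff:
        return 'ham'
    if prob > spam_cutoff:
        return 'spam'
    # Anything in between is left for the user to decide.
    return 'unsure'


# A mid-range score lands in the 'unsure' bucket.
choice = preset_choice(0.52423052)
```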

Skip

From tim.one@comcast.net  Mon Nov 18 21:09:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 16:09:37 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <63bitu8kflib5at5tplk79qgm4deo4bohp@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCENJCMAB.tim.one@comcast.net>

[Richie Hindle]
> ...
> What I want to do soon is auto-train on 'sure' spams and hams, and
> split the training interface into 'Review hams', 'Review spams' and
> 'Review unsure'.  Or something.  I probably need to look at the way the
> Outlook stuff does this.

The Outlook client doesn't train by magic on anything correctly classified
(yet).  Ham folders display a "Delete as Spam" button, Spam folders a
"Recover from Spam" button, and the Unsure folder has both.  Msgs are
trained by magic only when selected and you click one of those buttons, or
when you drag a msg from one kind of folder to another.  In part this simply
reflects our inability so far to decide on "the best" way to train.  Manual
training in the Outlook client sucks up entire folders.


From richie@entrian.com  Mon Nov 18 22:52:04 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 18 Nov 2002 22:52:04 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <15833.20589.376685.686723@montanaro.dyndns.org>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
Message-ID: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>


[Skip]
> * I think you need a 'defer' choice in addition to discard/ham/spam.  I
>   may well want to train on some obvious ones right now, but don't have
>   the time to investigate others which will require some thought.

Good idea, yes.

> * It would be nice if the subject was 'hot' so you can click on it and
>   view the entire message in a new window.

Already on the to-do list - see pop3proxy.py.  8-)

> * Given that the time to classify a message is pretty cheap, it would
>   also be nice if your interface preset the radio buttons based on an
>   initial classification of each message.  This suggests you need an
>   'unsure' radio button as well.

I did think of that, then I thought that I was far more likely to make a
mistake just scanning down the list thinking "yeah, yeah, yeah, looks ok"
than actually having to click something for each message.  I also thought
of highlighting the classification decisions in some other way, like
colouring the rows, but decided against that for the same reason.  I think
this whole issue will go away in version two - see below.

[Tim]
> The Outlook client doesn't train by magic on anything correctly classified
> (yet).  Ham folders display a "Delete as Spam" button, Spam folders a
> "Recover from Spam" button, and the Unsure folder has both.  Msgs are
> trained by magic only when selected and you click one of those buttons, or
> when you drag a msg from one kind of folder to another.

I've been thinking that the next version of the web interface would work
the same way - rather than a single page of untrained messages, you'd get
three pages for ham-judged, spam-judged and unsure.  There might need to be
an initial training period where this didn't happen.  Or maybe not - I ran
this at work today "from scratch" with an empty database, training as I
went, and in about 30 messages I had lots of unsures, no fps and only about
three fns.  Being only one working day it had a large ham bias - we'll see
what happens tomorrow after I train it on the night's spam.  I reckon it'll
do well.

By presenting the messages as three pre-judged lists, am I contradicting my
own statement that the messages shouldn't show up as prejudged in the
current 'unclassified' list?  8-)  I don't think so, because spotting a ham
in a bunch of spams, or vice versa, is much easier than spotting whether
any of a whole mixture of messages is misclassified.

-- 
Richie Hindle
richie@entrian.com


From lists@morpheus.demon.co.uk  Mon Nov 18 23:15:44 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Mon, 18 Nov 2002 23:15:44 +0000
Subject: [Spambayes] Just for fun
References: 
	<16E1010E4581B049ABC51D4975CEDB8861993D@UKDCX001.uk.int.atosorigin.com>
	<LNBBLJKPBEHFEDALKOLCOELHCMAB.tim.one@comcast.net>
Message-ID: <n2m-g.65uusc4f.fsf@morpheus.demon.co.uk>

Tim Peters <tim.one@comcast.net> writes:

>> - I see most internet headers as good spam clues, which is mildly
>> worrying, although hasn't caused any real issues yet.
>
> If your spam comes from the internet, it's appropriate <wink>.

A good chunk of ham comes from the Internet, too, but that chunk isn't
available in my training set. It could be (to an extent) but see below.

>> The obvious implication is that getting a really good training corpus
>> is *hard*. Probably beyond the means of the average user.
>
> The best possible training corpus is the email they actually get, correctly
> classified.  If they know their own judgment about ham vs spam, all the rest
> should happen by magic.  It's still hard for clients to do that, though.

Agreed (on both points - it's the best and it's hard).

In practice, I'm not completely comfortable with the approach of
starting from nothing and training only on new mail [1]. But collecting a
truly representative corpus isn't easy. The overhead of religiously
collecting and manually classifying all mail for a reasonable period
is prohibitive, and any attempt to just grab existing filed mails will
always introduce bias [2].

I'm really just trying to get to grips with what can be done to ease
the "entry cost" of the system.

Paul.

[1] It works (pretty much any training method works remarkably well)
    but as has been reported here before, unsures are surprising. And
    worse than that, in my experience, is the fact that training on an
    error or unsure and then rescoring it can show it still as
    unsure. This is *very* offputting - you just told the system it is
    spam, how come the system ignored you? (I know the answer, but
    it's almost impossible to make it feel like reasonable behaviour).

[2] The main forms of bias I see with my mail are on the one hand,
    massive imbalance in numbers, because I keep all sorts of ancient
    junk whereas I (used to) delete spam instantly. On the other hand,
    taking just my inbox excludes almost all ham which originated from
    the internet (as a simple example). Tomorrow, I'm hoping to try
    your new option to compensate for imbalance. Let's hit it with a
    truly massive ratio and see how it goes!

-- 
This signature intentionally left blank

From tim.one@comcast.net  Mon Nov 18 23:29:07 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 18:29:07 -0500
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
 got it right
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEDNHMAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEPHCMAB.tim.one@comcast.net>

[Mark Hammond, humbled by amazing ham-sniffing powers]

I suppose this would be a good time to confess that I seed each database
with "craniofacial reconstruction" as killer-strong ham clues?

> ...
> The text version of the mail is below.  It was HTML, gray background,
> blue writing - big brain spam-clues <wink>

Fortunately or not, because the tokenizer strips HTML decorations, the
classifier is blind to info about colors, font styles, and font sizes.  I
usually don't mind because there are so many other spammy things about spam,
but it's still an abstract nag.

> And-I'm-yet-to-see-a-bayes-FP ly,

You will.  I confess that I zip thru my Spam folder faster each day,
though -- there's never ham in it anymore.

BTW, I gave up on my mistake-driven classifier experiment.  I kept getting
several porn spam as Unsure every day, and got tired of digging thru it.
Now I'm training on each spam that doesn't score 100, and each ham that
doesn't score 0.  Amazingly, that's added a hell of a lot more spam than ham
to the training data -- now up to 99 ham and 149 spam.  Porn spam no longer
rates as Unsure, and I'm happier.  Perhaps that's just due to the drop in
forced stimulation, though <wink>.
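
The two regimes can be sketched as selection rules, assuming scores are the
rounded 0-100 values the client displays.  The function names and the
example cutoffs are hypothetical, not anything in the Outlook code.

```python
def mistake_driven(score, is_spam, ham_cutoff=20, spam_cutoff=90):
    """Train only on messages the classifier got wrong or was unsure of."""
    if is_spam:
        return score < spam_cutoff
    return score >= ham_cutoff


def train_unless_extreme(score, is_spam):
    """Train on each spam that doesn't score 100 and each ham that doesn't score 0."""
    if is_spam:
        return score < 100
    return score > 0


def select_for_training(judged, rule=train_unless_extreme):
    """Filter (msg, score, is_spam) triples down to the ones worth training on."""
    return [(msg, is_spam) for msg, score, is_spam in judged
            if rule(score, is_spam)]
```

A spam scoring 97 is skipped by the first rule but trained on by the second,
which is one way the second regime can pile up more spam in the training
data.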


From tim@fourstonesExpressions.com  Tue Nov 19 00:10:47 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 18 Nov 2002 18:10:47 -0600
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <63eituclg186qi547gha5vruvpmq5fq1rt@4ax.com>
Message-ID: <NLRPO63VGBRNQ2ZHBHDE9SRNKNJRN.3dd98187@riven>

I think we've got some real potential for a great little api here.  I do have
some questions about the data storage.  We've agreed that an explicit store is 
the way we want to go, which I think is correct.  However, dbm really doesn't 
support this.  I fooled with a couple ideas (hacks) to make DBDict behave in a 
load/store fashion, and the best thing I can come up with is to actually make 
a working copy of the dbm file, which is then used for the session.  When 
store() is called, the original is replaced with the working copy.   There are 
some difficulties with this approach.  If store is never called, then there is 
no guaranteed way to clean up the working copy.  Replacing the original with 
the working copy may be a bit difficult, because dbm doesn't support a close 
method...
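
Roughly what I have in mind, as a sketch only.  It is written for a current
Python 3 rather than what we run today, it treats the database as a single
plain file (some dbm flavours use several), and the class name WorkingCopy
is made up.  Using it in a with-block is one way around the cleanup problem
when store() is never called.

```python
import os
import shutil
import tempfile


class WorkingCopy:
    """Edit a private copy of a database file; publish it only on store()."""

    def __init__(self, path):
        self.path = os.path.abspath(path)
        # Keep the copy beside the original so that os.replace() stays on
        # one filesystem.
        fd, self.work_path = tempfile.mkstemp(
            prefix=os.path.basename(self.path) + ".",
            suffix=".work",
            dir=os.path.dirname(self.path))
        os.close(fd)
        if os.path.exists(self.path):
            shutil.copyfile(self.path, self.work_path)

    def store(self):
        """Replace the original with a snapshot of the working copy."""
        snapshot = self.work_path + ".commit"
        shutil.copyfile(self.work_path, snapshot)
        os.replace(snapshot, self.path)

    def close(self):
        """Remove the working copy, whether or not store() was called."""
        if os.path.exists(self.work_path):
            os.remove(self.work_path)

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False


def demo():
    """Show that the original changes only after store()."""
    with tempfile.TemporaryDirectory() as tmp:
        db = os.path.join(tmp, "hammie.db")
        with open(db, "w") as f:
            f.write("old")
        with WorkingCopy(db) as wc:
            with open(wc.work_path, "w") as f:
                f.write("new")
            with open(db) as f:
                before = f.read()
            wc.store()
            with open(db) as f:
                after = f.read()
            work_path = wc.work_path
        return before, after, os.path.exists(work_path)
```

The session would open wc.work_path with whatever dbm module is in use, and
the handle would have to be flushed or closed before store() copies it.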

SOOOOO... Tim Stone's question is: "Should I go ahead and do that?"

- TimS

11/18/2002 1:06:18 PM, Richie Hindle <richie@entrian.com> wrote:

>Hi Neale,
>
>> I think it may finally be time to give hammie a big makeover--it should
>> just provide the Hammie class, and not be executable.
>
>You might be right.  Especially given that Hammie can be used remotely via
>XML-RPC, I wonder whether Tim Stone's Bayes class and Hammie should be
>rolled into one, and any client (including pop3proxy) that currently uses
>classifier.Bayes or Bayes.XXXBayes should use the new class(es) - that
>would unify the API across all the clients, and make that API available
>remotely for (almost) free.  We could even document it... 8-)
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Tue Nov 19 00:35:55 2002
From: neale@woozle.org (Neale Pickett)
Date: 18 Nov 2002 16:35:55 -0800
Subject: [Spambayes] proposed changes to hammie & co.
Message-ID: <w53wuna2y6s.fsf@woozle.org>

okay, here's the big diff I was talking about.  This would take all
hammie functionality out of hammie.  So there would need to be yet
another hammie*.py file, a front-end to this new hammie class which acts
like the all-singing, all-dancing program that hammie is currently.

This moves everything but the Hammie class out of hammie.py.  DBDict
goes into its own module, which you could take out and use elsewhere if
you wanted.  PersistentBayes goes away, replaced by the DBDictBayes
class in Bayes.py.  I haven't had time to implement the rest of the
stuff yet, but that would be what'd go into the new front-end.

So the happy hammie family would then stand at:

  hammie.py
  |-- hammiefilter.py
  |-- pop3proxy.py
  |-- hammiesrv.py
  \-- hammie-new-front-end.py

This change appears to work fine with hammiefilter and pop3proxy.  But
it's a pretty big change, so I'd like to hear what at least Richie and
Tim Stone think before I commit anything.

Neale
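
For anyone who wants the gist of the standalone DBDict idea without reading
the diff, here is a cut-down sketch.  It is not the code in the patch: it is
written for a current Python 3, uses dbm.dumb instead of dbhash, and the
class name PickleDict is made up.  It shows the two things that matter,
pickled values behind a dict-style interface, and a list of keys that
iteration skips while keys() still reports everything.

```python
import dbm.dumb
import os
import pickle
import tempfile


class PickleDict:
    """Dict-like wrapper around a dbm file that pickles every value."""

    def __init__(self, dbname, mode="c", iterskip=()):
        self.db = dbm.dumb.open(dbname, mode)
        self.iterskip = tuple(iterskip)

    def __getitem__(self, key):
        return pickle.loads(self.db[key])

    def __setitem__(self, key, val):
        self.db[key] = pickle.dumps(val)

    def __delitem__(self, key):
        del self.db[key]

    def __contains__(self, key):
        return key in self.db

    def keys(self):
        # dbm.dumb hands keys back as bytes; decode them for the caller.
        return [k.decode("utf-8") for k in self.db.keys()]

    def iterkeys(self):
        # Only iteration honours iterskip; keys() lists everything.
        for key in self.keys():
            if key not in self.iterskip:
                yield key

    def close(self):
        self.db.close()


def demo():
    """Mirror the doctest in the DBDict docstring."""
    with tempfile.TemporaryDirectory() as tmp:
        d = PickleDict(os.path.join(tmp, "goober"), "c", ("skipme",))
        d["skipme"] = "booga"
        d["countme"] = "wakka"
        result = (sorted(d.keys()), list(d.iterkeys()), d["countme"])
        d.close()
    return result
```

The skip list is what lets the classifier keep its "saved state" record in
the same file as the word records without tripping over it when walking the
words.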

? Outlook2000
? diff
? email
? hammiebatch.py
Index: Bayes.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
retrieving revision 1.5
diff -u -r1.5 Bayes.py
--- Bayes.py	18 Nov 2002 13:04:20 -0000	1.5
+++ Bayes.py	19 Nov 2002 00:24:57 -0000
@@ -56,11 +56,10 @@
 all the spambayes contributors."
 
 import Corpus
-from classifier import Bayes
+import classifier
 from Options import options
-from hammie import DBDict     # hammie only for DBDict, which should
-                              # probably really be somewhere else
 import cPickle as pickle
+import dbdict
 import errno
 import copy
 import anydbm
@@ -69,7 +68,7 @@
 NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
 UPDATEPROBS = True       # Probabilities will be autoupdated with training
 
-class PersistentBayes(Bayes):
+class PersistentBayes(classifier.Bayes):
     '''Persistent Bayes database object'''
 
     def __init__(self, db_name):
@@ -169,12 +168,49 @@
         self.wordinfo, self.nspam, self.nham = t[1:]
 
 
+class WIDict(dbdict.DBDict):
+    """DBDict optimized for holding lots of WordInfo objects.
+
+    Normally, the pickler can figure out that you're pickling the same
+    type thing over and over, and will just tag the type with a new
+    byte, thus reducing Administrative Pickle Bloat(R).  Since the
+    DBDict continually creates new picklers, however, nothing ever gets
+    the chance to do this optimization.
+
+    The WIDict class forces this optimization by stealing the
+    (currently) unused 'W' pickle type for WordInfo objects.  This
+    results in about a 50% reduction in database size.
+
+    """
+
+    def __getitem__(self, key):
+        v = self.hash[key]
+        if v[0] == 'W':
+            val = pickle.loads(v[1:])
+            # We could be sneaky, like pickle.Unpickler.load_inst,
+            # but I think that's overly confusing.
+            obj = classifier.WordInfo(0)
+            obj.__setstate__(val)
+            return obj
+        else:
+            return pickle.loads(v)
+
+    def __setitem__(self, key, val):
+        if isinstance(val, classifier.WordInfo):
+            val = val.__getstate__()
+            v = 'W' + pickle.dumps(val, 1)
+        else:
+            v = pickle.dumps(val, 1)
+        self.hash[key] = v
+
+
 class DBDictBayes(PersistentBayes):
     '''Bayes object persisted in a hammie.DB_Dict'''
 
-    def __init__(self, db_name):
+    def __init__(self, db_name, mode='c'):
         '''Constructor(database name)'''
 
+        self.mode = mode
         self.db_name = db_name
         self.statekey = "saved state"
 
@@ -186,7 +222,8 @@
         if Corpus.Verbose:
             print 'Loading state from',self.db_name,'DB_Dict'
 
-        self.wordinfo = DBDict(self.db_name, 'c')
+        self.wordinfo = WIDict(self.db_name, self.mode,
+                               iterskip=[self.statekey])
 
         if self.wordinfo.has_key(self.statekey):
 
@@ -216,7 +253,7 @@
 
     def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
         '''Constructor(Bayes, \
-                       Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
+            Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
 
         self.bayes = bayes
         self.trainertype = trainertype
@@ -286,4 +323,4 @@
 
 
 if __name__ == '__main__':
-    print >>sys.stderr, __doc__
\ No newline at end of file
+    print >>sys.stderr, __doc__
Index: dbdict.py
===================================================================
RCS file: dbdict.py
diff -N dbdict.py
--- /dev/null	1 Jan 1970 00:00:00 -0000
+++ dbdict.py	19 Nov 2002 00:24:57 -0000
@@ -0,0 +1,92 @@
+#! /usr/bin/env python
+
+from __future__ import generators
+import dbhash
+try:
+    import cPickle as pickle
+except ImportError:
+    import pickle
+
+class DBDict:
+    """Database Dictionary.
+
+    This wraps a dbhash database to make it look even more like a
+    dictionary, much like the built-in shelf class.  The difference is
+    that a DBDict supports all dict methods.
+
+    Call it with the database.  Optionally, you can specify a list of
+    keys to skip when iterating.  This only affects iterators; things
+    like .keys() still list everything.  For instance:
+
+    >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo'))
+    >>> d['skipme'] = 'booga'
+    >>> d['countme'] = 'wakka'
+    >>> print d.keys()
+    ['skipme', 'countme']
+    >>> for k in d.iterkeys():
+    ...     print k
+    countme
+
+    """
+
+    def __init__(self, dbname, mode, iterskip=()):
+        self.hash = dbhash.open(dbname, mode)
+        self.iterskip = iterskip
+
+    def __getitem__(self, key):
+        return pickle.loads(self.hash[key])
+
+    def __setitem__(self, key, val):
+        self.hash[key] = pickle.dumps(val, 1)
+
+    def __delitem__(self, key):
+        del(self.hash[key])
+
+    def __iter__(self, fn=None):
+        k = self.hash.first()
+        while k != None:
+            key = k[0]
+            val = self.__getitem__(key)
+            if key not in self.iterskip:
+                if fn:
+                    yield fn((key, val))
+                else:
+                    yield (key, val)
+            try:
+                k = self.hash.next()
+            except KeyError:
+                break
+
+    def __contains__(self, name):
+        return self.has_key(name)
+
+    def __getattr__(self, name):
+        # Pass the buck
+        return getattr(self.hash, name)
+
+    def get(self, key, dfl=None):
+        if self.has_key(key):
+            return self[key]
+        else:
+            return dfl
+
+    def iteritems(self):
+        return self.__iter__()
+
+    def iterkeys(self):
+        return self.__iter__(lambda k: k[0])
+
+    def itervalues(self):
+        return self.__iter__(lambda k: k[1])
+
+open = DBDict
+
+def _test():
+    import doctest
+    import dbdict
+
+    doctest.testmod(dbdict)
+
+if __name__ == '__main__':
+    _test()
+
Index: hammie.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
retrieving revision 1.40
diff -u -r1.40 hammie.py
--- hammie.py	18 Nov 2002 18:13:54 -0000	1.40
+++ hammie.py	19 Nov 2002 00:24:57 -0000
@@ -1,57 +1,11 @@
 #! /usr/bin/env python
 
-# A driver for the classifier module and Tim's tokenizer that you can
-# call from procmail.
-
-"""Usage: %(program)s [options]
-
-Where:
-    -h
-        show usage and exit
-    -g PATH
-        mbox or directory of known good messages (non-spam) to train on.
-        Can be specified more than once, or use - for stdin.
-    -s PATH
-        mbox or directory of known spam messages to train on.
-        Can be specified more than once, or use - for stdin.
-    -u PATH
-        mbox of unknown messages.  A ham/spam decision is reported for each.
-        Can be specified more than once.
-    -r
-        reverse the meaning of the check (report ham instead of spam).
-        Only meaningful with the -u option.
-    -p FILE
-        use file as the persistent store.  loads data from this file if it
-        exists, and saves data to this file at the end.
-        Default: %(DEFAULTDB)s
-    -d
-        use the DBM store instead of cPickle.  The file is larger and
-        creating it is slower, but checking against it is much faster,
-        especially for large word databases. Default: %(USEDB)s
-    -D
-        the reverse of -d: use the cPickle instead of DBM
-    -f
-        run as a filter: read a single message from stdin, add an
-        %(DISPHEADER)s header, and write it to stdout.  If you want to
-        run from procmail, this is your option.
-"""
-
-from __future__ import generators
-
-import sys
-import os
-import types
-import getopt
-import mailbox
-import glob
-import email
-import errno
-import anydbm
-import cPickle as pickle
 
+import dbdict
 import mboxutils
-import classifier
+import Bayes
 from Options import options
+from tokenizer import tokenize
 
 try:
     True, False
@@ -60,166 +14,14 @@
     True, False = 1, 0
 
 
-program = sys.argv[0] # For usage(); referenced by docstring above
-
-# Name of the header to add in filter mode
-DISPHEADER = options.hammie_header_name
-DEBUGHEADER = options.hammie_debug_header_name
-DODEBUG = options.hammie_debug_header
-
-# Default database name
-DEFAULTDB = options.persistent_storage_file
-
-# Probability at which a message is considered spam
-SPAM_THRESHOLD = options.spam_cutoff
-HAM_THRESHOLD = options.ham_cutoff
-
-# Probability limit for a clue to be added to the DISPHEADER
-SHOWCLUE = options.clue_mailheader_cutoff
-
-# Use a database? If False, use a pickle
-USEDB = options.persistent_use_database
-
-# Tim's tokenizer kicks far more booty than anything I would have
-# written.  Score one for analysis ;)
-from tokenizer import tokenize
-
-class DBDict:
-
-    """Database Dictionary.
-
-    This wraps an anydbm to make it look even more like a dictionary.
-
-    Call it with the name of your database file.  Optionally, you can
-    specify a list of keys to skip when iterating.  This only affects
-    iterators; things like .keys() still list everything.  For instance:
-
-    >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
-    >>> d['skipme'] = 'booga'
-    >>> d['countme'] = 'wakka'
-    >>> print d.keys()
-    ['skipme', 'countme']
-    >>> for k in d.iterkeys():
-    ...     print k
-    countme
-
-    """
-
-    def __init__(self, dbname, mode, iterskip=()):
-        self.hash = anydbm.open(dbname, mode)
-        self.iterskip = iterskip
-
-    def __getitem__(self, key):
-        v = self.hash[key]
-        if v[0] == 'W':
-            val = pickle.loads(v[1:])
-            # We could be sneaky, like pickle.Unpickler.load_inst,
-            # but I think that's overly confusing.
-            obj = classifier.WordInfo(0)
-            obj.__setstate__(val)
-            return obj
-        else:
-            return pickle.loads(v)
-
-    def __setitem__(self, key, val):
-        if isinstance(val, classifier.WordInfo):
-            val = val.__getstate__()
-            v = 'W' + pickle.dumps(val, 1)
-        else:
-            v = pickle.dumps(val, 1)
-        self.hash[key] = v
-
-    def __delitem__(self, key, val):
-        del(self.hash[key])
-
-    def __iter__(self, fn=None):
-        k = self.hash.first()
-        while k != None:
-            key = k[0]
-            val = self.__getitem__(key)
-            if key not in self.iterskip:
-                if fn:
-                    yield fn((key, val))
-                else:
-                    yield (key, val)
-            try:
-                k = self.hash.next()
-            except KeyError:
-                break
-
-    def __contains__(self, name):
-        return self.has_key(name)
-
-    def __getattr__(self, name):
-        # Pass the buck
-        return getattr(self.hash, name)
-
-    def get(self, key, dfl=None):
-        if self.has_key(key):
-            return self[key]
-        else:
-            return dfl
-
-    def iteritems(self):
-        return self.__iter__()
-
-    def iterkeys(self):
-        return self.__iter__(lambda k: k[0])
-
-    def itervalues(self):
-        return self.__iter__(lambda k: k[1])
-
-
-class PersistentBayes(classifier.Bayes):
-
-    """A persistent Bayes classifier.
-
-    This is just like classifier.Bayes, except that the dictionary is a
-    database.  You take less disk this way and you can pretend it's
-    persistent.  The tradeoffs vs. a pickle are: 1. it's slower
-    training, but faster checking, and 2. it needs less memory to run,
-    but takes more space on the hard drive.
+class Hammie:
+    """A spambayes mail filter.
 
-    On destruction, an instantiation of this class will write its state
-    to a special key.  When you instantiate a new one, it will attempt
-    to read these values out of that key again, so you can pick up where
-    you left off.
+    This implements the basic functionality needed to score, filter, or
+    train.  
 
     """
 
-    # XXX: Would it be even faster to remember (in a list) which keys
-    # had been modified, and only recalculate those keys?  No sense in
-    # going over the entire word database if only 100 words are
-    # affected.
-
-    # XXX: Another idea: cache stuff in memory.  But by then maybe we
-    # should just use ZODB.
-
-    def __init__(self, dbname, mode):
-        classifier.Bayes.__init__(self)
-        self.statekey = "saved state"
-        self.wordinfo = DBDict(dbname, mode, (self.statekey,))
-        self.dbmode = mode
-
-        self.restore_state()
-
-    def __del__(self):
-        #super.__del__(self)
-        self.save_state()
-
-    def save_state(self):
-        if self.dbmode != 'r':
-            self.wordinfo[self.statekey] = (self.nham, self.nspam)
-
-    def restore_state(self):
-        if self.wordinfo.has_key(self.statekey):
-            self.nham, self.nspam = self.wordinfo[self.statekey]
-
-
-class Hammie:
-
-    """A spambayes mail filter"""
-
     def __init__(self, bayes):
         self.bayes = bayes
 
@@ -262,9 +64,9 @@
             import traceback
             traceback.print_exc()
 
-    def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
-               ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
-               debug=DODEBUG):
+    def filter(self, msg, header=None, spam_cutoff=None,
+               ham_cutoff=None, debugheader=None,
+               debug=None):
         """Score (judge) a message and add a disposition header.
 
         msg can be a string, a file object, or a Message object.
@@ -282,6 +84,17 @@
 
         """
 
+        if header == None:
+            header = options.hammie_header_name
+        if spam_cutoff == None:
+            spam_cutoff = options.spam_cutoff
+        if ham_cutoff == None:
+            ham_cutoff = options.ham_cutoff
+        if debugheader == None:
+            debugheader = options.hammie_debug_header_name
+        if debug == None:
+            debug = options.hammie_debug_header
+
         msg = mboxutils.get_message(msg)
         try:
             del msg[header]
@@ -348,163 +161,47 @@
 
         self.train(msg, True)
 
-    def update_probabilities(self):
+    def update_probabilities(self, store=True):
         """Update probability values.
 
         You would want to call this after a training session.  It's
         pretty slow, so if you have a lot of messages to train, wait
         until you're all done before calling this.
 
+        Unless store is false, the persistent store will be written after
+        updating probabilities.
+
         """
 
         self.bayes.update_probabilities()
+        if store:
+            self.store()
 
+    def store(self):
+        """Write out the persistent store.
+
+        This makes sure the persistent store reflects what is currently
+        in memory.  You would want to do this after a write and before
+        exiting.
+
+        """
+
+        self.bayes.store()
+
+
+def open(filename, usedb=True, mode='r'):
+    """Open a file, returning a Hammie instance.
+
+    If usedb is False, open as a pickle instead of a DBDict.  mode is
+
+    used as the flag to open DBDict objects.  'c' for read-write (create
+    if needed), 'r' for read-only, 'w' for read-write.
+
+    """
 
-def train(hammie, msgs, is_spam):
-    """Train bayes with all messages from a mailbox."""
-    mbox = mboxutils.getmbox(msgs)
-    i = 0
-    for msg in mbox:
-        i += 1
-        # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
-        # back in the day.  Maybe it's a line-printer-ism ;)
-        sys.stdout.write("\r%6d" % i)
-        sys.stdout.flush()
-        hammie.train(msg, is_spam)
-    print
-
-def score(hammie, msgs, reverse=0):
-    """Score (judge) all messages from a mailbox."""
-    # XXX The reporting needs work!
-    mbox = mboxutils.getmbox(msgs)
-    i = 0
-    spams = hams = 0
-    for msg in mbox:
-        i += 1
-        prob, clues = hammie.score(msg, True)
-        if hasattr(msg, '_mh_msgno'):
-            msgno = msg._mh_msgno
-        else:
-            msgno = i
-        isspam = (prob >= SPAM_THRESHOLD)
-        if isspam:
-            spams += 1
-            if not reverse:
-                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
-                print hammie.formatclues(clues)
-        else:
-            hams += 1
-            if reverse:
-                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
-                print hammie.formatclues(clues)
-    return (spams, hams)
-
-def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
-    """Create a Bayes instance for the given pickle (which
-    doesn't have to exist).  Create a PersistentBayes if
-    usedb is True."""
     if usedb:
-        bayes = PersistentBayes(pck, mode)
+        b = Bayes.DBDictBayes(filename, mode)
     else:
-        bayes = None
-        try:
-            fp = open(pck, 'rb')
-        except IOError, e:
-            if e.errno <> errno.ENOENT: raise
-        else:
-            bayes = pickle.load(fp)
-            fp.close()
-        if bayes is None:
-            bayes = classifier.Bayes()
-    return bayes
-
-def usage(code, msg=''):
-    """Print usage message and sys.exit(code)."""
-    if msg:
-        print >> sys.stderr, msg
-        print >> sys.stderr
-    print >> sys.stderr, __doc__ % globals()
-    sys.exit(code)
-
-def main():
-    """Main program; parse options and go."""
-    try:
-        opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
-    except getopt.error, msg:
-        usage(2, msg)
-
-    if not opts:
-        usage(2, "No options given")
-
-    pck = DEFAULTDB
-    good = []
-    spam = []
-    unknown = []
-    reverse = 0
-    do_filter = False
-    usedb = USEDB
-    mode = 'r'
-    for opt, arg in opts:
-        if opt == '-h':
-            usage(0)
-        elif opt == '-g':
-            good.append(arg)
-            mode = 'c'
-        elif opt == '-s':
-            spam.append(arg)
-            mode = 'c'
-        elif opt == '-p':
-            pck = arg
-        elif opt == "-d":
-            usedb = True
-        elif opt == "-D":
-            usedb = False
-        elif opt == "-f":
-            do_filter = True
-        elif opt == '-u':
-            unknown.append(arg)
-        elif opt == '-r':
-            reverse = 1
-    if args:
-        usage(2, "Positional arguments not allowed")
-
-    save = False
-
-    bayes = createbayes(pck, usedb, mode)
-    h = Hammie(bayes)
-
-    for g in good:
-        print "Training ham (%s):" % g
-        train(h, g, False)
-        save = True
-
-    for s in spam:
-        print "Training spam (%s):" % s
-        train(h, s, True)
-        save = True
-
-    if save:
-        h.update_probabilities()
-        if not usedb and pck:
-            fp = open(pck, 'wb')
-            pickle.dump(bayes, fp, 1)
-            fp.close()
-
-    if do_filter:
-        msg = sys.stdin.read()
-        filtered = h.filter(msg)
-        sys.stdout.write(filtered)
-
-    if unknown:
-        (spams, hams) = (0, 0)
-        for u in unknown:
-            if len(unknown) > 1:
-                print "Scoring", u
-            s, g = score(h, u, reverse)
-            spams += s
-            hams += g
-        print "Total %d spam, %d ham" % (spams, hams)
-
+        b = Bayes.PickledBayes(filename)
+    return Hammie(b)
 
-if __name__ == "__main__":
-    main()
Index: hammiefilter.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
retrieving revision 1.2
diff -u -r1.2 hammiefilter.py
--- hammiefilter.py	18 Nov 2002 18:14:04 -0000	1.2
+++ hammiefilter.py	19 Nov 2002 00:24:57 -0000
@@ -51,43 +51,37 @@
     print >> sys.stderr, __doc__ % globals()
     sys.exit(code)
 
-def jar_pickle(h):
-    if not options.persistent_use_database:
-        import pickle
-        fp = open(options.persistent_storage_file, 'wb')
-        pickle.dump(h.bayes, fp, 1)
-        fp.close()
-    
-
-def hammie_open(mode):
-    b = hammie.createbayes(options.persistent_storage_file,
-                           options.persistent_use_database,
-                           mode)
-    return hammie.Hammie(b)
-
 def newdb():
-    h = hammie_open('n')
-    jar_pickle(h)
+    h = hammie.open(options.persistent_storage_file,
+                    options.persistent_use_database,
+                    'n')
+    h.store()
     print "Created new database in", options.persistent_storage_file
 
 def filter():
-    h = hammie_open('r')
+    h = hammie.open(options.persistent_storage_file,
+                    options.persistent_use_database,
+                    'r')
     msg = sys.stdin.read()
     print h.filter(msg)
 
 def train_ham():
-    h = hammie_open('w')
+    h = hammie.open(options.persistent_storage_file,
+                    options.persistent_use_database,
+                    'w')
     msg = sys.stdin.read()
     h.train_ham(msg)
     h.update_probabilities()
-    jar_pickle(h)    
+    h.store()
 
 def train_spam():
-    h = hammie_open('w')
+    h = hammie.open(options.persistent_storage_file,
+                    options.persistent_use_database,
+                    'w')
     msg = sys.stdin.read()
     h.train_spam(msg)
     h.update_probabilities()
-    jar_pickle(h)    
+    h.store()
 
 def main():
     action = filter

From tim.one@comcast.net  Tue Nov 19 02:54:09 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 21:54:09 -0500
Subject: [Spambayes] Better optimization loop
In-Reply-To: <3DD7532A.8020906@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEALCNAB.tim.one@comcast.net>

[Rob Hooft, simplifying simplex]
> ...
> I decided that we have a perfect way to optimize the ham and spam
> cutoff values in timcv already, so that I can remove these from the
> simplex optimization.

Good observation!  That should help.  simplex isn't fast in the best of
cases, and in this case ...

> To that goal I added a "delayed" flexcost to the CostCounter module
> that can use the optimal cutoffs calculated at the end of timcv.py.

Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
and spam_cutoff of 0.995 to get rid of "impossible" FP.

> And there are only three variables left to optimize using simplex
>
> I then ran one optimization on my complete (16000+5800) corpus. The
> result is that it is fighting very hard to remove fp's while
> introducing lots of unsure messages:
>
> At the start:
>
> -> <stat> all runs false positives: 15
> -> <stat> all runs false negatives: 7
> -> <stat> all runs unsure: 189
> Standard Cost: $194.80
> Flex Cost: $607.41
> Delayed-Standard Cost: $98.80
> Delayed-Flex Cost: $310.05
> x=0.4990 p=0.1002 s=0.4537 310.05
>
> And near the end:
>
> -> <stat> all runs false positives: 5
> -> <stat> all runs false negatives: 6
> -> <stat> all runs unsure: 342
> -> <stat> all runs false positive %: 0.03125
> -> <stat> all runs false negative %: 0.103448275862
> -> <stat> all runs unsure %: 1.56880733945
> -> <stat> all runs cost: $124.40
> Standard Cost: $124.40
> Flex Cost: $589.16
> Delayed-Standard Cost: $98.60
> Delayed-Flex Cost: $212.28
> x=0.3515 p=0.2861 s=0.2467 212.28
>
> At this stage it actually managed to get the delayed standard cost
> lower by $0.20 (it has been higher than the starting value during much
> of the optimization). The Delayed-Flex cost is lowered by about 30%.
> But look at the hugely different parameters it had to use! Can someone
> else run with these parameters and confirm that this is an extreme
> that is only warranted by my particular corpses?

I can try <wink>.  Here's a 10-fold CV with 6K random ham and 6K random spam
from my c.l.py test data;  baseline on the left, while the right has

[Classifier]
unknown_word_prob: 0.3515
minimum_prob_strength: 0.2861
unknown_word_strength: 0.2467

filename:     base    simp
ham:spam:  6000:6000 6000:6000
fp total:        2       1
fp %:         0.03    0.02
fn total:        0       0
fn %:         0.00    0.00
unsure t:       46     101
unsure %:     0.38    0.84
real cost:  $29.20  $30.20
best cost:  $12.80  $11.80
h mean:       0.42    0.71
h sdev:       3.65    4.81
s mean:      99.96   99.89
s sdev:       1.21    1.94
mean diff:   99.54   99.18
k:           20.48   14.69

It did a little better here too.  The best-cost analyses show that it's also
nuking FP at the expense of unsures:

base:

-> best cost for all runs: $12.80
-> achieved at 2 cutoff pairs
-> smallest ham & spam cutoffs 0.52 & 0.95
->     fp 1; fn 1; unsure ham 2; unsure spam 7
->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
-> largest ham & spam cutoffs 0.525 & 0.95
->     fp 1; fn 1; unsure ham 2; unsure spam 7
->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%

simp:

-> best cost for all runs: $11.80
-> achieved at ham & spam cutoffs 0.495 & 0.995
->     fp 0; fn 0; unsure ham 10; unsure spam 49
->     fp rate 0%; fn rate 0%; unsure rate 0.492%
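
The "real cost" and "best cost" figures above are consistent with a simple
weighted count of mistakes.  A minimal sketch, assuming $10 per false
positive (the figure mentioned later in this message) and, inferred from
the reported totals rather than stated anywhere here, $1 per false negative
and $0.20 per unsure.  The function name is made up for illustration:

```python
def standard_cost(fp, fn, unsure,
                  fp_weight=10.0, fn_weight=1.0, unsure_weight=0.20):
    """Weighted error count; the fn and unsure weights are inferred, not confirmed."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# Totals taken from the comparison table in this message.
base = standard_cost(fp=2, fn=0, unsure=46)
simp = standard_cost(fp=1, fn=0, unsure=101)
print("base $%.2f  simp $%.2f" % (base, simp))
```

Under those weights, trading one false positive for 55 extra unsures costs
a dollar, which is why the "real cost" barely moves between the two runs.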


> Please note that to get a delayed flex cost that is this much lower
> actually means that in the unsure area there is "50% more order" than
> before the optimization!
>
> At some point Tim (was it you?) has reported that in other optimization
> techniques it has proven to be very bad to "focus" on the persistent
> and hopeless fp/fn messages. I fear this might bother me here.

Ya, I reported that from a paper wrestling with boosting, but it's a common
observation.  Even in simple settings!  Say you're doing a least-squares
linear regression on this data:

x  f(x)
-  ----
1   1.9
2   4.1
3   5.9
4 -10.0
5  10.1
6  12.1
7  13.8

If you throw out (4, -10), you get an excellent fit to everything that
remains.  If you leave it in, you still get "an answer", but it's not a good
fit to anything.  A 6th-degree polynomial fits all the data perfectly, but
the resulting snaky curve is almost certainly a terrible fit to the
population from which this sample was taken.  A few spam and ham are just
unlike their brethren, but from what I've seen of those, no mechanical
gimmick is going to classify them correctly.  Give up and be happy <wink>.
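
For anyone who wants to see the effect, here is a small sketch of the
straight-line fit on the table above, with and without the (4, -10) point.
It uses the textbook closed-form least-squares formulas; the helper name is
just for illustration:

```python
def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


data = [(1, 1.9), (2, 4.1), (3, 5.9), (4, -10.0),
        (5, 10.1), (6, 12.1), (7, 13.8)]
trimmed = [p for p in data if p != (4, -10.0)]

for label, pts in (("all points", data), ("outlier dropped", trimmed)):
    slope, intercept = least_squares_line(pts)
    # Residuals are measured on the well-behaved points only.
    worst = max(abs(y - (slope * x + intercept)) for x, y in trimmed)
    print("%-16s slope=%.3f intercept=%.3f worst residual=%.3f"
          % (label, slope, intercept, worst))
```

Because the outlier happens to sit at the mean of x, it leaves the slope
alone and drags the intercept down, so every ordinary point ends up off by
roughly 2.5 instead of a fraction of a unit.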

> I just started another optimization run, but lowered the cost of a fp
> from $10 to $2, and introduced another cost function that I called
> flex**2 cost because it changes the cost function for an unsure message
> from a linear function to a square function. Oops, two changes at the
> same time; but it takes such a long time to run....

When I try a new thing, I usually start with several runs but on *much* less
data per run.  If at least 3 of 5 show the effect I was hoping for, I may
push on; but if 3 of 5 don't, I either give up on it, or change the rules to
4 of 7 (if I'm really in love with the idea <wink>).

it's-almost-impossible-not-to-cheat-sometimes-ly y'rs  - tim


From msurface@myvine.com  Tue Nov 19 02:50:00 2002
From: msurface@myvine.com (Mitchell Surface)
Date: Mon, 18 Nov 2002 21:50:00 -0500
Subject: [Spambayes] Training questions
Message-ID: <20021119025000.GA17060@brewer.fwn.fortwayne.com>

I've been lurking here for a while and I finally decided to give this a
try. I've read the docs and how to do the initial training seems pretty
clear as does setting up a procmail recipe to handle the filtered
messages.

I do have a couple of questions that I don't remember seeing. Once past
the initial training, how do you train on additional ham and spam? Does
hammie.py just append new data to what's already there? If so, how can
you untrain a misclassified message?

Thanks for all the work everybody's put in to this.

-- 
Mitchell Surface N9OSL
Fort Wayne, IN USA

The Bible is not my book, and Christianity is not my religion.  I could never
give assent to the long, complicated statements of Christian dogma.
	-- Abraham Lincoln

From tim.one@comcast.net  Tue Nov 19 04:21:36 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 18 Nov 2002 23:21:36 -0500
Subject: [Spambayes] RE: chi-combining
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEAHCNAB.tim.one@comcast.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEBACNAB.tim.one@comcast.net>

In an offline thread with Greg Louis (who's working on bogofilter), I tried
an experiment using just the S, then just the H, components of our spamprob
calculation.  We currently return (1+S-H)/2.  The "justs" result here just
returns S, the "justh" just returns 1-H.  justs is a comparative disaster,
but the more I stare at it, the more I think justh did surprisingly well:


filename:     base   justs   justh
ham:spam:  6000:6000 6000:6000 6000:6000
fp total:        2       8       2
fp %:         0.03    0.13    0.03
fn total:        0       0       4
fn %:         0.00    0.00    0.07
unsure t:       40      59       6
unsure %:     0.33    0.49    0.05
real cost:  $28.00  $91.80  $25.20
best cost:   $4.00  $22.40   $6.60
h mean:       0.38    0.69    0.08
h sdev:       3.53    5.81    2.18
s mean:      99.96   99.99   99.92
s sdev:       1.41    0.45    2.58
mean diff:   99.58   99.30   99.84
k:           20.16   15.86   20.97

Similar results were obtained from another trial on different 6K samples
from my c.l.py test data.  If you hate FP a lot, and would rather suffer a
few FN in return for skipping lots of unsures, justh looks like it may be a
viable strategy.  Despite that H is less sensitive to high-spamprob words
than to low-spamprob words (and S the reverse), at least on this data spam
still scores very high under H.

If you want to try this, in chi2_spamprob replace

            prob = (S-H + 1.0) / 2.0

with

            prob = 1.0 - H


From rob@hooft.net  Tue Nov 19 10:27:22 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Tue, 19 Nov 2002 11:27:22 +0100
Subject: [Spambayes] RE: chi-combining
References: <LNBBLJKPBEHFEDALKOLCIEBACNAB.tim.one@comcast.net>
Message-ID: <3DDA120A.8070606@hooft.net>

Tim Peters wrote:
> In an offline thread with Greg Louis (who's working on bogofilter), I tried
> an experiment using just the S, then just the H, components of our spamprob
> calculation.  We currently return (1+S-H)/2.  The "justs" result here just
> returns S, the "justh" just returns 1-H.  justs is a comparative disaster,
> but the more I stare at it, the more I think justh did surprisingly well:

Try your "invisible ham" spam with this. I'm sure it will score 
rock-solid ham. By using "justh" you're basically telling spammers that 
you're not sensitive to spam words, as long as there is enough of the 
message that looks like ham!

The two cases where this makes a difference are

    H=1 S=1 : this is the case I just described: A message that looks
              like both ham and spam would be unsure before, but will now
              result in a Ham score.
    H=0 S=0 : A message that doesn't look like anything seen before used
              to result in an unsure, but will now result in a "Spam"
              disposition.

I suspect that 1-H is easier to counter for the ephemeral "smart 
spammer" than (1+S-H)/2. It is another form of cancellation disease.
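
The two corner cases can be checked directly from the formulas quoted
above.  A tiny sketch (the function names are only labels for the two
rules):

```python
def base(S, H):
    """Current rule: average the spam evidence against the ham evidence."""
    return (1.0 + S - H) / 2.0


def justh(S, H):
    """Proposed rule: ignore S and use only the ham evidence."""
    return 1.0 - H


for S, H in ((1.0, 1.0), (0.0, 0.0)):
    print("S=%.0f H=%.0f  base=%.2f  justh=%.2f"
          % (S, H, base(S, H), justh(S, H)))
```

In both cases the current rule lands in the middle of the scale, while
justh commits to one end of it.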

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tdickenson@devmail.geminidataloggers.co.uk  Tue Nov 19 11:13:58 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Tue, 19 Nov 2002 11:13:58 +0000
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
	got it right
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEPHCMAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCEEPHCMAB.tim.one@comcast.net>
Message-ID: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk>

On Monday 18 November 2002 11:29 pm, Tim Peters wrote:

> BTW, I gave up on my mistake-driven classifier experiment.  I kept getting
> several porn spam as Unsure every day, and got tired of digging thru it.
> Now I'm training on each spam that doesn't score 100, and each ham that
> doesn't score 0.  Amazingly, that's added a hell of a lot more spam than
> ham to the training data -- now up to 99 ham and 149 spam.  Porn spam no
> longer rates as Unsure, and I'm happier.  Perhaps that's just due to the
> drop in forced stimulation, though <wink>.

Why exclude spams that score 100 from training?  Even these really spammy
spams might contain clues that would help to classify other more marginal
spam.


From francois.granger@free.fr  Tue Nov 19 11:51:56 2002
From: francois.granger@free.fr (François Granger)
Date: Tue, 19 Nov 2002 12:51:56 +0100
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
Message-ID: <B9FFE46C.5CD97%francois.granger@free.fr>

on 18/11/02 23:52, Richie Hindle at richie@entrian.com wrote:

> By presenting the messages as three pre-judged lists, am I contradicting my
> own statement that the messages shouldn't show up as prejudged in the
> current 'unclassified' list?  8-)  I don't think so, because spotting a ham
> in a bunch of spams, or vice versa, is much easier than spotting whether
> any of a whole mixture of messages is misclassified.

What if you showed the raw spamprob number next to the buttons?

It would give a clue about what the system found the message to be.

-- 
Mail is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies. For clean mail:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From skip@pobox.com  Tue Nov 19 13:30:42 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 07:30:42 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
        <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
Message-ID: <15834.15618.465106.671756@montanaro.dyndns.org>


    >> * Given that the time to classify a message is pretty cheap, it would
    >> also be nice if your interface preset the radio buttons based on an
    >> initial classification of each message.  This suggests you need an
    >> 'unsure' radio button as well.

    Richie> I did think of that, then I thought that I was far more likely
    Richie> to make a mistake just scanning down the list thinking "yeah,
    Richie> yeah, yeah, looks ok" than actually having to click something
    Richie> for each message.

That suggests to me that you need to group messages together based upon
their initial classification.  That way, instead of a haphazard arrangement
of button settings:


they are clumped:

    D   H   S   U
                *
            *
            *
            *
        *
        *
        *
        *

perhaps with a <hr> between sections.  By clumping things together like this
I think it makes it easier to detect an outlier within the group.  (The
background color should probably alternate between light grey and white to
help direct the eyes from the subject to the proper radio button when
changes are needed.  I say this without ever having seen or used pop3proxy,
and can't recall from your mockup if you already do this or not.)
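
A rough sketch of the clumping idea, assuming each message arrives as a
(subject, score) pair with the score between 0 and 1.  The cutoffs and
function names are placeholders, not pop3proxy's:

```python
from itertools import groupby


def classify(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a spam score in [0, 1] to a label; the cutoffs are placeholders."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"


def clump(messages):
    """Group (subject, score) pairs by label: unsure, then spam, then ham."""
    order = {"unsure": 0, "spam": 1, "ham": 2}
    # sorted() is stable, so arrival order is kept inside each group.
    labelled = sorted(messages, key=lambda m: order[classify(m[1])])
    return [(label, [subject for subject, _ in group])
            for label, group in groupby(labelled, key=lambda m: classify(m[1]))]


inbox = [("a", 0.05), ("b", 0.99), ("c", 0.50), ("d", 0.01), ("e", 0.95)]
for label, subjects in clump(inbox):
    print(label, subjects)
```

Each group could then be rendered as its own block of rows with the
matching radio button preselected.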

    Richie> I've been thinking that the next version of the web interface
    Richie> would work the same way - rather than a single page of untrained
    Richie> messages, you'd get three pages for ham-judged, spam-judged and
    Richie> unsure.

Sure, same idea.  A spam slipped through into my python mailbox yesterday.
Stood out like a sore thumb.

-- 
Skip Montanaro - skip@pobox.com
http://www.mojam.com/
http://www.musi-cal.com/

From jm@jmason.org  Tue Nov 19 14:14:11 2002
From: jm@jmason.org (Justin Mason)
Date: Tue, 19 Nov 2002 14:14:11 +0000
Subject: [Spambayes] Another software in the field 
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	<20021115182039.BD3F3F54C@cashew.wolfskeep.com> 
Message-ID: <20021119141417.6FC4B16F16@jmason.org>


(a bit late in replying! I suffered from inbox overload ;)

T. Alexander Popiel said:
> If the received parser were a little smarter about parsing iPlanet
> received lines, it would have "pcp736393pcs.reston01.va.comcast.net"
> instead of "cj569191b" as the first element in the sequence, and
> the match list would have been 2 -> 1 -> 2 -> 0 -> 0, yielding:
> 
>   message-id-generation:skipped 0
> 
> I suspect that high skipped numbers would be a strong spam indicator,
> showing where message ids were omitted in the sent mail and/or received
> headers naively forged to prevent backtracking.

It would be interesting to test this; we do something similar in
SpamAssassin to find possibly-forged hostnames in the Received
headers, and we do try to figure out where in the Received chain
the Message-id was added.

Two problems we've seen:

  - some totally-legit senders, especially auto-generated mails, have a
    bad habit of leaving out the Message-Id until it gets to *your* MX.
    Annoying, but allowed by the RFCs.  This test would have to figure
    this out in some way; maybe by adding the sender's hostname or domain
    to the token, so the legit folks gain ham hits, but spammers remain
    as 1-spam 0-ham hapaxes?

  - some senders use e.g. hostname "mylittlecompany.com" on their desktop
    machine or home LAN, then connect via a commodity-DSL connection,
    resulting in a reverse-lookup of "dsl43-234.bigisp.net".  In other
    words, the rDNS does not match what the sender wishes it did ;)
    Not a problem in this case, but worth noting when talking about
    Received-header parsing.

--j.

From Paul.Moore@atosorigin.com  Tue Nov 19 15:58:03 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 19 Nov 2002 15:58:03 -0000
Subject: [Spambayes] Training
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com>

One thing I've just noticed using the Outlook client (although I think
it's a feature of the algorithm, rather than specific to the client).

A couple of hams ended up in my "Unsure" folder. No problem, I trained
on them as ham. But they hit my inbox with a spam score *still* in the
25-35 region. So if I refilter, they pop up in my unsure folder again.
Nothing I can do will make the messages score as ham.

Refiling based on spam scores is a rare operation (I only noticed
this because I forgot to tick the "only unscored messages" checkbox
in the filter dialog), but the behaviour is annoying, as well as being
unnerving. I don't think there's anything which can be done at the
algorithm level (the algorithm is effectively saying "OK, I know
you're saying it's ham, but it still looks pretty odd to me...") but
at the client/user interface level, maybe there should be an extra
property "Trained", which says that this message has been specifically
confirmed as ham or spam, so that it won't get filtered. I'm not sure
how, or if, this would translate to other types of client.

Paul.

From neale@woozle.org  Tue Nov 19 17:13:51 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 09:13:51 -0800
Subject: [Spambayes] Hammiefilter doesn't write out the pickle
In-Reply-To: <NLRPO63VGBRNQ2ZHBHDE9SRNKNJRN.3dd98187@riven>
References: <NLRPO63VGBRNQ2ZHBHDE9SRNKNJRN.3dd98187@riven>
Message-ID: <w53of8l32k0.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> I think we've got some real potential for a great little api here.  I
> do have some questions about the data storage.  We've agreed that an
> explicit store is the way we want to go, which I think is correct.
> However, dbm really doesn't support this.  I fooled with a couple
> ideas (hacks) to make DBDict behave in a load/store fashion, and the
> best thing I can come up with is to actually make a working copy of
> the dbm file, which is then used for the session.  When store() is
> called, the original is replaced with the working copy.  There are
> some difficulties with this approach.  If store is never called, then
> there is no guaranteed way to clean up the working copy.  Replacing
> the original with the working copy may be a bit difficult, because dbm
> doesn't support a close method...

Yeah.  I ran into the same problem yesterday.  As I thought about it, I
realized this must have been why I implemented the __del__ method of
DBDict.

The problem, really, with DBDict is that there is this meta-information
it has to store (nham, nspam).  If individual db entries are updated but
the meta-info isn't, your database is corrupt, game over.  That problem
manifests itself in two ways:

1. You need to be very careful about when you hit ^C when running hammie
2. The pop3proxy's "store" method doesn't really do anything

But couldn't this be adequately explained by merely stating that the
DBDict method stores things instantaneously?  If we're careful to always
update nham and nspam *before* writing any new wordinfo, then the worst
you can do would be start training, then hit ^C right away--equivalent
to training on an empty message.  And people running the pop3proxy would
have to be aware that the way the proxy is working is always in sync
with what's on the disk.  I don't see either of these as a huge
problem.

So we need to write out nham and nspam before writing out the new
WordInfo counts.  I don't think it'd be much of a penalty to do this
before every message in a batch training run, and of course for the
pickle method it's no difference at all whether you add one before or
after training on a message.
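
Here is a toy sketch of that write ordering.  It is not DBDict: a plain
dict stands in for the dbm file, the key naming is invented, and the
"interrupt" is simulated by a dict subclass that raises after a fixed
number of writes:

```python
class CountingStore:
    """Toy stand-in for a dbm-backed store; `db` is any dict-like mapping."""

    def __init__(self, db):
        self.db = db

    def train(self, tokens, is_spam):
        kind = "spam" if is_spam else "ham"
        # 1. Bump the message count first and write it straight away.
        meta_key = "meta:n" + kind
        self.db[meta_key] = self.db.get(meta_key, 0) + 1
        # 2. Only then write the per-word counts, one entry at a time.
        for token in sorted(set(tokens)):
            key = kind + ":" + token
            self.db[key] = self.db.get(key, 0) + 1


class InterruptAfter(dict):
    """Dict that raises KeyboardInterrupt once `limit` writes have happened."""

    def __init__(self, limit):
        super().__init__()
        self.limit = limit

    def __setitem__(self, key, value):
        if self.limit == 0:
            raise KeyboardInterrupt
        self.limit -= 1
        super().__setitem__(key, value)


db = InterruptAfter(limit=1)
store = CountingStore(db)
try:
    store.train(["cheap", "pills"], is_spam=True)
except KeyboardInterrupt:
    pass
print(dict(db))
```

With this ordering an interrupted run can leave a message counted with
some or all of its words missing, but never a word count larger than the
message count.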

Neale

From neale@woozle.org  Tue Nov 19 17:26:02 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 09:26:02 -0800
Subject: [Spambayes] Training questions
In-Reply-To: <20021119025000.GA17060@brewer.fwn.fortwayne.com>
References: <20021119025000.GA17060@brewer.fwn.fortwayne.com>
Message-ID: <w53k7j931zp.fsf@woozle.org>

So then, Mitchell Surface <msurface@myvine.com> is all like:

> I've been lurking here for a while and I finally decided to give this
> a try. I've read the docs and how to do the initial training seems
> pretty clear as does setting up a procmail recipe to handle the
> filtered messages.

Thanks for the report!  It's always good to hear that someone can
actually *use* the blasted thing :)

> I do have a couple of questions that I don't remember seeing. Once
> past the initial training, how do you train on additional ham and
> spam?  Does hammie.py just append new data to what's already there? If
> so, how can you untrain a misclassified message?

You'll want to run "hammie.py -g" on ham, and "hammie.py -s" on spam.
That will tokenize the new messages you give it, and increment the
frequency counts for those tokens in your database.  In a sense, it is
appending the new data to what's already there.

hammie.py currently doesn't have a way to untrain messages.  But I'll
add that in the next generation hammie!  Thanks for pointing that out!
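
In the meantime, here is a toy illustration of what training and
untraining amount to.  This is not hammie's code or data layout; the class,
the whitespace tokenizer and the counting scheme are all simplifications
for the sake of the example:

```python
from collections import Counter


class ToyTrainer:
    """Keeps per-token message counts; not hammie's actual data structures."""

    def __init__(self):
        self.nham = self.nspam = 0
        self.ham_counts = Counter()
        self.spam_counts = Counter()

    @staticmethod
    def tokenize(text):
        # Whitespace split, counting each token once per message.
        return set(text.lower().split())

    def train(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.update(self.tokenize(text))
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def untrain(self, text, is_spam):
        counts = self.spam_counts if is_spam else self.ham_counts
        counts.subtract(self.tokenize(text))
        # Drop tokens whose count fell to zero so no stale entries remain.
        for token in [t for t, c in counts.items() if c <= 0]:
            del counts[token]
        if is_spam:
            self.nspam -= 1
        else:
            self.nham -= 1


trainer = ToyTrainer()
msg = "Meeting at noon"
trainer.train(msg, is_spam=True)     # oops, misclassified
trainer.untrain(msg, is_spam=True)   # undo the mistake
trainer.train(msg, is_spam=False)    # retrain correctly
print(trainer.nham, trainer.nspam, dict(trainer.ham_counts))
```

Untraining only works if it is given exactly the message that was trained
on, under the same label; otherwise the counts drift.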

Neale

From neale@woozle.org  Tue Nov 19 17:46:27 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 09:46:27 -0800
Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database
Message-ID: <w53fztx311o.fsf@woozle.org>

It seems like we're getting a fair number of people using hammie who
just want it to filter their mail.  These folks, I am guessing, are just
accepting the default values for things, assuming those must be a good
place to start.

Unfortunately, if you're running hammie out of procmail, the pickle
method is going to start to get really slow as your training set gets
larger.  As fast as the pickler is, it's still having to slurp in the
entire file every time you run it.  I'm talking several orders of
magnitude here.

On the other hand, pop3proxy probably works best when using a pickle,
since it starts up once and can score many emails.  hammiesrv works
similarly, but I don't think anyone is using that :)
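
The difference between the two storage styles can be sketched like this.
It is only an illustration: dbm.dumb is used because it ships with every
Python, not because it is what hammie uses (and it is not fast), and the
file names and word counts are made up:

```python
import dbm.dumb
import os
import pickle
import tempfile

word_counts = {"word%d" % i: i for i in range(1000)}
wanted = ["word7", "word42"]  # tokens of the one message being scored

with tempfile.TemporaryDirectory() as tmp:
    # Pickle: one blob, so every run reads and rebuilds the whole dictionary.
    pickle_path = os.path.join(tmp, "hammie.pck")
    with open(pickle_path, "wb") as fp:
        pickle.dump(word_counts, fp, pickle.HIGHEST_PROTOCOL)
    with open(pickle_path, "rb") as fp:
        from_pickle = pickle.load(fp)
    pickle_hits = [from_pickle[w] for w in wanted]

    # dbm: keyed storage, so values are looked up by key on demand.
    db = dbm.dumb.open(os.path.join(tmp, "hammie.db"), "c")
    try:
        for word, count in word_counts.items():
            db[word] = str(count)
        db_hits = [int(db[w]) for w in wanted]
    finally:
        db.close()

print(pickle_hits, db_hits)
```

A one-shot filter run from procmail pays the full unpickling cost for
every message, while a long-running proxy pays it once at startup.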

So, what would you say to moving the persistent_use_database option into
per-service configuration?  Specifically:

Index: Options.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Options.py,v
retrieving revision 1.72
diff -u -r1.72 Options.py
--- Options.py  18 Nov 2002 19:14:48 -0000      1.72
+++ Options.py  19 Nov 2002 17:44:16 -0000
@@ -348,10 +348,11 @@
 # The default database path used by hammie
 persistent_storage_file: hammie.db
 
-# hammie can use either a database (quick to score one message) or a pickle
-# (quick to train on huge amounts of messages). Set this to True to use a
-# database by default.
-persistent_use_database: False
+[hammiefilter]
+# hammiefilter can use either a database (quick to score one message) or
+# a pickle (quick to train on huge amounts of messages). Set this to
+# True to use a database by default.
+hammiefilter_persistent_use_database: False
 
 [pop3proxy]
 # pop3proxy settings - pop3proxy also respects the options in the
 # Hammie
@@ -366,6 +367,7 @@
 pop3proxy_spam_cache: pop3proxy-spam-cache
 pop3proxy_ham_cache: pop3proxy-ham-cache
 pop3proxy_unknown_cache: pop3proxy-unknown-cache
+pop3proxy_persistent_use_database: False
 
 [html_ui]
 html_ui_port: 8880
@@ -440,6 +442,8 @@
                'hammie_debug_header': boolean_cracker,
                'hammie_debug_header_name': string_cracker,
                },
+    'hammiefilter' : {'hammiefilter_persistent_use_database': boolean_cracker,
+                      },
     'pop3proxy': {'pop3proxy_server_name': string_cracker,
                   'pop3proxy_server_port': int_cracker,
                   'pop3proxy_port': int_cracker,
@@ -448,6 +452,7 @@
                   'pop3proxy_spam_cache': string_cracker,
                   'pop3proxy_ham_cache': string_cracker,
                   'pop3proxy_unknown_cache': string_cracker,
+                  'pop3proxy_persistent_use_database': boolean_cracker,
                   },
     'html_ui': {'html_ui_port': int_cracker,
                 'html_ui_launch_browser': boolean_cracker,
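
To show how the per-service options might be read, here is a sketch using
the standard library's configparser.  It does not use the project's
Options module or its "cracker" functions; only the section and option
names come from the patch above, and the values are sample settings:

```python
import configparser

SAMPLE = """
[hammiefilter]
hammiefilter_persistent_use_database: True

[pop3proxy]
pop3proxy_persistent_use_database: False
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Each service looks up its own copy of the option, parsed as a boolean.
use_db = {
    section: parser.getboolean(section, section + "_persistent_use_database")
    for section in ("hammiefilter", "pop3proxy")
}
print(use_db)

# Reading the same value as a plain string would be a trap:
# the string "False" is non-empty, so bool("False") is True.
raw = parser.get("pop3proxy", "pop3proxy_persistent_use_database")
print(repr(raw), bool(raw))
```

A filter run from procmail could then default to the database while the
proxy keeps the pickle, without either setting affecting the other.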


From rob@hooft.net  Tue Nov 19 17:47:02 2002
From: rob@hooft.net (Rob Hooft)
Date: Tue, 19 Nov 2002 18:47:02 +0100
Subject: [Spambayes] Better optimization loop
References: <LNBBLJKPBEHFEDALKOLCCEALCNAB.tim.one@comcast.net>
Message-ID: <3DDA7916.5010102@hooft.net>

Tim Peters wrote:
> [Rob Hooft, simplifying simplex]
> 
>>...
>>I decided that we have a perfect way to optimize the ham and spam
>>cutoff values in timcv already, so that I can remove these from the
>>simplex optimization.
> 
> 
> Good observation!  That should help.  simplex isn't fast in the best of
> cases, and in this case ...

Anyone that has a faster optimization algorithm lying around is welcome 
to replace my Simplex code.

>>To that goal I added a "delayed" flexcost to the CostCounter module
>>that can use the optimal cutoffs calculated at the end of timcv.py.
> 
> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of 0.99
> and spam_cutoff of 0.995 to get rid of "impossible" FP.

They are in any case better than any other alternative I could think of. 
But if you disagree, you can change the order in which the 
CostCounter.default() builds up the cost counters; the optimization 
always uses the last one.

> It did a little better here too.  The best-cost analyses show that it's also
> nuking FP at the expense of unsures:
> 
> base:
> 
> -> best cost for all runs: $12.80
> -> achieved at 2 cutoff pairs
> -> smallest ham & spam cutoffs 0.52 & 0.95
> ->     fp 1; fn 1; unsure ham 2; unsure spam 7
> ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
> -> largest ham & spam cutoffs 0.525 & 0.95
> ->     fp 1; fn 1; unsure ham 2; unsure spam 7
> ->     fp rate 0.0167%; fn rate 0.0167%; unsure rate 0.075%
> 
> simp:
> 
> -> best cost for all runs: $11.80
> -> achieved at ham & spam cutoffs 0.495 & 0.995
> ->     fp 0; fn 0; unsure ham 10; unsure spam 49
> ->     fp rate 0%; fn rate 0%; unsure rate 0.492%

Very similar to my case. I'm seriously thinking about removing the 
"hopeless" and "almost hopeless" messages from my corpses. I agree with 
the bayesian statistics that they can't be correctly classified.

> Ya, I reported that from a paper wrestling with boosting, but it's a common
> observation.  Even in simple settings!  Say you're doing a least-squares
> linear regression on this data:
> 
> x  f(x)
> -  ----
> 1   1.9
> 2   4.1
> 3   5.9
> 4 -10.0
> 5  10.1
> 6  12.1
> 7  13.8
> 
> If you throw out (4, -10), you get an excellent fit to everything that
> remains.  If you leave it in, you still get "an answer", but it's not a good
> fit to anything.

Press et al. describe a "robust fit", which is not a least-squares fit
but a least-absolute-deviations fit. It is insensitive to outliers.
Is there an analogous idea for us?

> When I try a new thing, I usually start with several runs but on *much* less
> data per run.  If at least 3 of 5 show the effect I was hoping for, I may
> push on; but if 3 of 5 don't, I either give up on it, or change the rules to
> 4 of 7 (if I'm really in love with the idea <wink>).

These optimizations are very sensitive to step-functions, so I need lots 
of data to run them. With a small data set it will stop wherever you 
start it.

Further results I obtained: My idea of running with an fp cost of $2 and 
a square cost function didn't work. It doesn't optimize to a consistent 
position. Increasing the cost of an fp back to $10 and running with the 
same square function did do a reasonable job, it optimized to:

[Classifier]
unknown_word_prob = 0.520415
minimum_prob_strength = 0.315104
unknown_word_strength = 0.215393

So the unknown_word_prob is now back to 0.5 again! I just committed my 
changes to the optimization code, any hints on improvements are welcome.

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tdickenson@devmail.geminidataloggers.co.uk  Tue Nov 19 18:07:54 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Tue, 19 Nov 2002 18:07:54 +0000
Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database
In-Reply-To: <w53fztx311o.fsf@woozle.org>
References: <w53fztx311o.fsf@woozle.org>
Message-ID: <200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk>

On Tuesday 19 November 2002 5:46 pm, Neale Pickett wrote:

> hammiesrv works
> similarly, but I don't think anyone is using that :)

I'm using hammiesrv out of procmail, but I would be eager to change if I was
the only one.


From neale@woozle.org  Tue Nov 19 18:48:31 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 10:48:31 -0800
Subject: [Spambayes] hammie, pop3proxy, and persistent_use_database
In-Reply-To: <200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk>
References: <w53fztx311o.fsf@woozle.org>
	<200211191807.54767.tdickenson@devmail.geminidataloggers.co.uk>
Message-ID: <w53bs4l2y68.fsf@woozle.org>

So then, Toby Dickenson <tdickenson@devmail.geminidataloggers.co.uk> is all like:

> On Tuesday 19 November 2002 5:46 pm, Neale Pickett wrote:
> 
> > hammiesrv works
> > similarly, but I don't think anyone is using that :)
> 
> I'm using hammiesrv out of procmail, but I would be eager to change if I was
> the only one.

Oh, neat!  Well there's no need to remove it, although I think when the
number of file names starting with "hammie" exceeds ten, I may get
kicked off the project ;)

I'm assuming it's working for you; is that a well-placed assumption?

Neale

From lists@morpheus.demon.co.uk  Tue Nov 19 19:19:01 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Tue, 19 Nov 2002 19:19:01 +0000
Subject: [Spambayes] Offtopic - getting bounce messages for spam
Message-ID: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>

Sorry, this is offtopic, but I'm hoping that the concentration of spam
experts on this group may be able to help me.

I've just started receiving undeliverable message reports for spam,
sent to people I've never heard of. It looks to me like someone is
managing to impersonate me when they send spam out. I'm fairly sure
I'm not running an open relay (is there a way of checking for
certain?), so I guess someone is spoofing headers or something. I've
heard of this sort of thing before, but never experienced this myself.

Two questions, really:

a) Is this something I should worry about (am I likely to end up on
   blacklists or the like)?
b) What can I do about it in any case?

Once again, apologies for this being offtopic, but I don't want to
just glibly ignore it, and I wasn't sure where else to ask...

Paul.
-- 
This signature intentionally left blank

From rob@hooft.net  Tue Nov 19 19:35:58 2002
From: rob@hooft.net (Rob Hooft)
Date: Tue, 19 Nov 2002 20:35:58 +0100
Subject: [Spambayes] Offtopic - getting bounce messages for spam
References: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
Message-ID: <3DDA929E.1060003@hooft.net>

Paul Moore wrote:
> Sorry, this is offtopic, but I'm hoping that the concentration of spam
> experts on this group may be able to help me.
> 
> I've just started receiving undeliverable message reports for spam,
> sent to people I've never heard of. 

Same here. Happened to me twice this week. I'm also worried about being
flooded by this kind of message.... Do I have to train my filter to
consider this as spam?

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From skip@pobox.com  Tue Nov 19 19:42:25 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 13:42:25 -0600
Subject: [Spambayes] Offtopic - getting bounce messages for spam
In-Reply-To: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
References: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
Message-ID: <15834.37921.335242.13553@montanaro.dyndns.org>


    Paul> I've just started receiving undeliverable message reports for
    Paul> spam, sent to people I've never heard of. 

I awoke this morning to 350+ such messages in my unsure mailbox.  All but a
couple were bounces, like you said, for email addresses I didn't know.  I
scanned a few to see what they were, trained on a few, then modified an
Emacs macro I use for bulk deletion of this sort of stuff.  Having it sniff
for "wipe out your credit card debt" and "yahoo.co.jp.webhosting_hotpicks"
seemed to catch all of them.
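
A rough sketch of the same phrase sniffing in Python rather than an Emacs macro. The function names are made up for illustration, the code is written for Python 3, and only the two phrases come from the paragraph above:

```python
# Hypothetical helpers: flag bounce messages by looking for telltale phrases.
TELLTALE_PHRASES = (
    "wipe out your credit card debt",
    "yahoo.co.jp.webhosting_hotpicks",
)


def looks_like_spam_bounce(text, phrases=TELLTALE_PHRASES):
    """Return True if any telltale phrase occurs in the message text."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in phrases)


def split_bounces(messages):
    """Partition an iterable of message texts into (to_delete, to_keep)."""
    to_delete, to_keep = [], []
    for text in messages:
        if looks_like_spam_bounce(text):
            to_delete.append(text)
        else:
            to_keep.append(text)
    return to_delete, to_keep
```

A fixed phrase list like this only works for as long as the spam run keeps the same wording.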

    Paul> a) Is this something I should worry about (am I likely to end up
    Paul>    on blacklists or the like)?

I don't know.  As far as I know that hasn't happened to mail.mojam.com.

    Paul> b) What can I do about it in any case?

Besides train on enough of them so they are reliably caught as spam, I
suspect there's little you can do.

Skip

From skip@pobox.com  Tue Nov 19 19:58:55 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 13:58:55 -0600
Subject: [Spambayes] CipherTrust?
Message-ID: <15834.38911.435802.71321@montanaro.dyndns.org>

This ad came at the head of my eWEEK Security mailing:

    CHOKING ON SPAM? Stop spam! -- Learn the TOP 10 Techniques 
    To Control Spam. Spam used to be annoying, now it is a
    critical business problem. Reclaim your mail server. PROTECT
    YOUR EMAIL SYSTEM against spam and other threats before they
    reach your mail server(s). FREE White Paper shows you how!
    http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSyU0EWaTF0E4J0sQU0Ax

Any idea what they do and how they do it?  The link was to an information
signup page.  I didn't feel like asking for even more mail so I didn't
submit.

-- 
Skip Montanaro - skip@pobox.com
http://www.mojam.com/
http://www.musi-cal.com/

From tim.one@comcast.net  Tue Nov 19 19:59:41 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 19 Nov 2002 14:59:41 -0500
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
 got it right
In-Reply-To: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk>
Message-ID: <BIEJKCLHCIOIHAGOKOLHAEMIDPAA.tim.one@comcast.net>

[Tim]
> BTW, I gave up on my mistake-driven classifier experiment.  I
> kept getting several porn spam as Unsure every day, and got tired
> of digging thru it.  Now I'm training on each spam that doesn't
> score 100, and each ham that doesn't score 0. ...

[Toby Dickenson]
> Why exclude spams that score 100 from training?  Even these really spammy
> spams might contain clues that would help to classify other more marginal
> spam.

Absolutely, but that's a different experiment.  I've already done "proper"
training and know it works great for me.  These are experiments in doing
silly training.  A vast majority of spam scores 100 (on the Outlook client's
0..100 integer scale), and a vast majority of ham scores 0.  Training on
everything that doesn't score at an extreme is a less-extreme variant of
mistake-based training, which, left to their own devices, is what real
people are almost certainly going to do.  I'm trying to get a feel for what
the system does then.

Purely mistake-based training with reasonable cutoff values turned out to
work very well wrt the FN and FP rates, but not so well wrt the Unsure rate,
and the Unsures remained surprising the entire time I tried it.  While it
wasn't prone to outright mistakes after the first day, the Unsures remained
irritatingly obvious (to human eyes) after two weeks.

Training on the 83 and 96 etc spam too appears to be fixing that rapidly.
Curiously, I'm finding much less non-0 ham than non-100 spam (my training
ratio on new msgs has gone from about 1:1 spam:ham (purely mistake-based) to
about 11:1 spam to ham (training on all non-extremes)).
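
The two training regimes compared above can be sketched as a single decision function. This is only an illustration in Python 3: the function name, the policy labels, and the cutoff values are assumptions, not anything taken from the Outlook client.

```python
# Illustrative cutoffs on the 0..100 integer scale; these values are assumed.
HAM_CUTOFF = 20
SPAM_CUTOFF = 90


def should_train(score, is_spam, policy):
    """Decide whether to train on a message the user has already judged.

    score   -- classifier score on the 0..100 scale
    is_spam -- the human's verdict
    policy  -- "mistake" or "non-extreme" (hypothetical labels)
    """
    if policy == "mistake":
        # Train only when the message did not land in its correct bucket,
        # i.e. it was Unsure or an outright FN/FP.
        if is_spam:
            return score < SPAM_CUTOFF
        return score > HAM_CUTOFF
    if policy == "non-extreme":
        # Train on every spam that doesn't score 100 and every ham
        # that doesn't score 0.
        if is_spam:
            return score != 100
        return score != 0
    raise ValueError("unknown policy: %r" % (policy,))
```

Under these assumed cutoffs a spam scoring 96 is skipped by the mistake-based policy but trained on by the non-extreme one, which is the difference described above.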


From skip@pobox.com  Tue Nov 19 20:09:32 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 14:09:32 -0600
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
	got it right
In-Reply-To: <BIEJKCLHCIOIHAGOKOLHAEMIDPAA.tim.one@comcast.net>
References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk>
        <BIEJKCLHCIOIHAGOKOLHAEMIDPAA.tim.one@comcast.net>
Message-ID: <15834.39548.393074.819295@montanaro.dyndns.org>


    Tim> [Toby Dickenson]
    >> Why exclude spams that score 100 from training?  Even these really
    >> spammy spams might contain clues that would help to classify other
    >> more marginal spam.

    Tim> Absolutely, but that's a different experiment.  I've already done
    Tim> "proper" training and know it works great for me.  These are
    Tim> experiments in doing silly training.  

If you're taking notes on this in various files in CVS I wouldn't call it
"silly training".  How about "realistic training"?

Skip

From tim.one@comcast.net  Tue Nov 19 20:29:22 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 19 Nov 2002 15:29:22 -0500
Subject: [Spambayes] Offtopic - getting bounce messages for spam
In-Reply-To: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
Message-ID: <BIEJKCLHCIOIHAGOKOLHMEMLDPAA.tim.one@comcast.net>

[Paul Moore]
> Sorry, this is offtopic, but I'm hoping that the concentration of spam
> experts on this group may be able to help me.
>
> I've just started receiving undeliverable message reports for spam,
> sent to people I've never heard of.

It's even more fun when real people write to you demanding to be taken off
your porn (whatever) list.

> It looks to me like someone is managing to impersonate me when they
> send spam out.

Stare at the headers:  it's usually a very shallow impersonation.  For
example, the Received headers are likely to point back to machines you've
never heard of -- or even countries.
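
A minimal sketch of pulling out the Received chain for that kind of staring, using the standard `email` package in Python 3. The function name is hypothetical:

```python
from email import message_from_string


def received_chain(raw_message):
    """Return the Received header values in the order they appear.

    Each relay prepends its own header, so the first entry is the most
    recent hop and the last entry is the one closest to the sender.
    """
    msg = message_from_string(raw_message)
    return msg.get_all("Received", [])
```

Only the headers added by servers you trust are reliable; anything below them could have been forged by the sender.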

> I'm fairly sure I'm not running an open relay (is there a way of
> checking for certain?),

Turn your machine off for a week and see if it stops <wink>.

> so I guess someone is spoofing headers or something. I've
> heard of this sort of thing before, but never experienced
> this myself.
>
> Two questions, really:
>
> a) Is this something I should worry about (am I likely to end up on
>    blacklists or the like)?

Not on a well-run blacklist.  This kind of spoofing is common.

> b) What can I do about it in any case?

Nothing that I know of, unless you want to pee away hours digging thru the
headers for clues.  By the time you find the perpetrators (if ever), they
will have moved on.

your-email-address-is-just-a-string-of-characters-ly y'rs  - tim


From lists@morpheus.demon.co.uk  Tue Nov 19 21:30:33 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Tue, 19 Nov 2002 21:30:33 +0000
Subject: [Spambayes] A kinder, gentler hammie
References: 
	<16E1010E4581B049ABC51D4975CEDB8861993E@UKDCX001.uk.int.atosorigin.com>
	<079itu84n9lil7sqae4j9gge1sgppps34h@4ax.com>
Message-ID: <n2m-g.el9hckna.fsf@morpheus.demon.co.uk>

Richie Hindle <richie@entrian.com> writes:

> Hi Paul,
>
>> The standard [Windows] environment variables which *can* be used for
>> this sort of thing are
>> 
>> 1. HOMEDRIVE and HOMEPATH - %HOMEDRIVE%%HOMEPATH% is basically the
>>    equivalent of Unix's $HOME. But for nearly all cases, these end
>>    up being C:\, which to my mind is a bad default.
>> 2. USERPROFILE - %USERPROFILE% is a user-specific directory
>>    suitable for config information. But by default it's a directory
>>    with spaces in the name, which can be awkward for some
>>    purposes. It's also hard to navigate to in Windows explorer,
>>    which makes files stored there a little "hidden".
>
> Not true on 98:

*sigh*. I forgot about Win98.

> Having said that, I agree with this:
>
>> I think "try a number of pathnames" is a sensible approach.
>
> ...but is there a fallback that *always* works?  I'm not sure
> whether there is - is argv[0] guaranteed to work, even in frozen /
> py2exe'd / Installer'd / cx_Frozen / Squeezed / etc. applications?

I think there probably isn't. After all, you can't even guarantee that
argv[0] is on a writable medium. :-(
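
For what it's worth, the "try a number of pathnames" idea might look something like the sketch below (Python 3). The function names and the order of the candidates are assumptions for illustration, and it can still come back with nothing, which is exactly the problem:

```python
import os


def candidate_dirs(environ=os.environ):
    """Yield possible config directories; the ordering here is assumed."""
    if environ.get("USERPROFILE"):
        yield environ["USERPROFILE"]
    if environ.get("HOMEDRIVE") and environ.get("HOMEPATH"):
        yield environ["HOMEDRIVE"] + environ["HOMEPATH"]
    if environ.get("HOME"):
        yield environ["HOME"]
    # Last resort: wherever we happen to be running from.
    yield os.getcwd()


def first_writable_dir(candidates):
    """Return the first candidate that exists and is writable, else None."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    return None
```

A caller would use `first_writable_dir(candidate_dirs())` and has to cope with a `None` result.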

Paul.

-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Tue Nov 19 21:59:46 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 20 Nov 2002 08:59:46 +1100
Subject: [Spambayes] Training
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEMOHMAA.mhammond@skippinet.com.au>

[Paul]
> Refiling based on spam scores is a rare operation (I only noticed
> this because I forgot to tick the "only unscored messages" checkbox
> in the filter dialog), but the behaviour is annoying, as well as being
> unnerving. I don't think there's anything which can be done at the
> algorithm level (the algorithm is effectively saying "OK, I know
> you're saying it's ham, but it still looks pretty odd to me...") but
> at the client/user interface level, maybe there should be an extra
> property "Trained", which says that this message has been specifically
> confirmed as ham or spam, so that it won't get filtered. I'm not sure
> how, or if, this would translate to other types of client.

To be honest, I am less worried about "re-filtering" as that should be very
rare.

My concern is almost identical though - the *next* email that looks the
same.  Let's say I subscribe to a weekly newsletter.  This week's comes in,
gets marked as unsure, so I train.  Next week's comes in - again, it scores
as unsure.  Repeat ad nauseam.

I saw this a lot when I had a high ham:spam imbalance - training had no
obvious effect.  I am still hoping to try Tim's new adjustment, but I wonder
if somehow similar maths could be exploited.  For example, manually training
a message could be seen as "intense training", whereas a normal train is -
well - normal.  The point of manual training is that the system got it
wrong, and the user wants to see the error stop.  "Normal" training is just
giving the system fairly "general" instructions.
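
One naive way to picture "intense training" is a per-message weight, sketched below in Python 3 on a toy word-count store. Nothing here is the Spambayes classifier: the class, the `weight` parameter, and the value of `MANUAL_WEIGHT` are all hypothetical.

```python
from collections import Counter

MANUAL_WEIGHT = 3  # arbitrary illustrative value


class ToyCounts:
    """Toy per-word ham/spam counts, standing in for a real classifier."""

    def __init__(self):
        self.spam_words = Counter()
        self.ham_words = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, words, is_spam, weight=1):
        """Count each distinct word once per message, scaled by weight."""
        counts = self.spam_words if is_spam else self.ham_words
        for word in set(words):
            counts[word] += weight
        if is_spam:
            self.nspam += weight
        else:
            self.nham += weight
```

With integer weights this is equivalent to training on the same message several times, so a manual correction made with `weight=MANUAL_WEIGHT` simply counts as three ordinary ones.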

The only reason I mention this is because last time I mentioned something
that demonstrated my ignorance, Tim promptly replied confirming it, then
subsequently made the change anyway <wink>.

Mark.


From tim@fourstonesExpressions.com  Tue Nov 19 21:56:52 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 15:56:52 -0600
Subject: [Spambayes] proposed changes to hammie & co.
Message-ID: <TR51NJJHB8UQD0FBL6FAD9ZVMKIGA9.3ddab3a4@riven>

Neale, I'm ok with these changes.  I have more to make, but go ahead and make 
these alterations.  Particularly, I've got a dbdict class that supports 
load/store, so we don't have to worry about training that blows up before we 
save nham and nspam.

I think we should think about where the WordInfo class goes...

I'm not sure I like having mode on the dbdict constructor, although I 
understand why you have it.  No harm done, as it defaults anyway.

I think we should take Bayes out of classifier and put it in Bayes.py.

I like widict as a class, but it could be abstracted another notch by simply 
specifying the class to instantiate when you find a 'w' in the pickle, as an 
operand on the constructor.
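
A rough sketch of that abstraction, written for Python 3 with a plain mapping standing in for the dbhash file. The names `TaggedDict` and `Counts` are invented for illustration; `Counts` is only a stand-in for WordInfo.

```python
import pickle


class TaggedDict:
    """Dict-like wrapper that stores only the state of one chosen class.

    The class to instantiate when a tagged record is read back is passed
    to the constructor instead of being hard-wired.
    """

    TAG = b"W"  # the unused pickle opcode borrowed as a type marker

    def __init__(self, store, tagged_class):
        self.store = store                # any mapping of key -> bytes
        self.tagged_class = tagged_class

    def __setitem__(self, key, val):
        if isinstance(val, self.tagged_class):
            self.store[key] = self.TAG + pickle.dumps(val.__getstate__(), 1)
        else:
            self.store[key] = pickle.dumps(val, 1)

    def __getitem__(self, key):
        raw = self.store[key]
        if raw[:1] == self.TAG:
            # Rebuild the object from its bare state, skipping __init__.
            obj = self.tagged_class.__new__(self.tagged_class)
            obj.__setstate__(pickle.loads(raw[1:]))
            return obj
        return pickle.loads(raw)


class Counts:
    """Stand-in for WordInfo with explicit state methods."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount

    def __getstate__(self):
        return (self.spamcount, self.hamcount)

    def __setstate__(self, state):
        self.spamcount, self.hamcount = state
```

Usage would be along the lines of `d = TaggedDict({}, Counts)`, with a real database object in place of the empty dict.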

I'll wait for your checkin, and do some more work on the dbdict module to add 
my load/store stuff...

Travelling, and taking a break from meeting... more when I get to the hotel.

- TimS

11/18/2002 6:35:55 PM, Neale Pickett <neale@woozle.org> wrote:

>okay, here's the big diff I was talking about.  This would take all
>hammie functionality out of hammie.  So there would need to be yet
>another hammie*.py file, a front-end to this new hammie class which acts
>like the all-singing, all-dancing program that hammie is currently.
>
>This moves everything but the Hammie class out of hammie.py.  DBDict
>goes into its own module, which you could take out and use elsewhere if
>you wanted.  PersistentBayes goes away, replaced by the DBDictBayes
>class in Bayes.py.  I haven't had time to implement the rest of the
>stuff yet, but that would be what'd go into the new front-end.
>
>So the happy hammie family would then stand at:
>
>  hammie.py
>  |-- hammiefilter.py
>  |-- pop3proxy.py
>  |-- hammiesrv.py
>  \-- hammie-new-front-end.py
>
>This change appears to work fine with hammiefilter and pop3proxy.  But
>it's a pretty big change, so I'd like to hear what at least Richie and
>Tim Stone think before I commit anything.
>
>Neale
>
>? Outlook2000
>? diff
>? email
>? hammiebatch.py
>Index: Bayes.py
>===================================================================
>RCS file: /cvsroot/spambayes/spambayes/Bayes.py,v
>retrieving revision 1.5
>diff -u -r1.5 Bayes.py
>--- Bayes.py	18 Nov 2002 13:04:20 -0000	1.5
>+++ Bayes.py	19 Nov 2002 00:24:57 -0000
>@@ -56,11 +56,10 @@
> all the spambayes contributors."
> 
> import Corpus
>-from classifier import Bayes
>+import classifier
> from Options import options
>-from hammie import DBDict     # hammie only for DBDict, which should
>-                              # probably really be somewhere else
> import cPickle as pickle
>+import dbdict
> import errno
> import copy
> import anydbm
>@@ -69,7 +68,7 @@
> NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
> UPDATEPROBS = True       # Probabilities will be autoupdated with training
> 
>-class PersistentBayes(Bayes):
>+class PersistentBayes(classifier.Bayes):
>     '''Persistent Bayes database object'''
> 
>     def __init__(self, db_name):
>@@ -169,12 +168,49 @@
>         self.wordinfo, self.nspam, self.nham = t[1:]
> 
> 
>+class WIDict(dbdict.DBDict):
>+    """DBDict optimized for holding lots of WordInfo objects.
>+
>+    Normally, the pickler can figure out that you're pickling the same
>+    type thing over and over, and will just tag the type with a new
>+    byte, thus reducing Administrative Pickle Bloat(R).  Since the
>+    DBDict continually creates new picklers, however, nothing ever gets
>+    the chance to do this optimization.
>+
>+    The WIDict class forces this optimization by stealing the
>+    (currently) unused 'W' pickle type for WordInfo objects.  This
>+    results in about a 50% reduction in database size.
>+
>+    """
>+
>+    def __getitem__(self, key):
>+        v = self.hash[key]
>+        if v[0] == 'W':
>+            val = pickle.loads(v[1:])
>+            # We could be sneaky, like pickle.Unpickler.load_inst,
>+            # but I think that's overly confusing.
>+            obj = classifier.WordInfo(0)
>+            obj.__setstate__(val)
>+            return obj
>+        else:
>+            return pickle.loads(v)
>+
>+    def __setitem__(self, key, val):
>+        if isinstance(val, classifier.WordInfo):
>+            val = val.__getstate__()
>+            v = 'W' + pickle.dumps(val, 1)
>+        else:
>+            v = pickle.dumps(val, 1)
>+        self.hash[key] = v
>+
>+
> class DBDictBayes(PersistentBayes):
>     '''Bayes object persisted in a hammie.DB_Dict'''
> 
>-    def __init__(self, db_name):
>+    def __init__(self, db_name, mode='c'):
>         '''Constructor(database name)'''
> 
>+        self.mode = mode
>         self.db_name = db_name
>         self.statekey = "saved state"
> 
>@@ -186,7 +222,8 @@
>         if Corpus.Verbose:
>             print 'Loading state from',self.db_name,'DB_Dict'
> 
>-        self.wordinfo = DBDict(self.db_name, 'c')
>+        self.wordinfo = WIDict(self.db_name, self.mode,
>+                               iterskip=[self.statekey])
> 
>         if self.wordinfo.has_key(self.statekey):
> 
>@@ -216,7 +253,7 @@
> 
>     def __init__(self, bayes, trainertype, updateprobs=NO_UPDATEPROBS):
>         '''Constructor(Bayes, \
>-                       Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
>+            Corpus.SPAM|Corpus.HAM), updprobs(True|False)'''
> 
>         self.bayes = bayes
>         self.trainertype = trainertype
>@@ -286,4 +323,4 @@
> 
> 
> if __name__ == '__main__':
>-    print >>sys.stderr, __doc__
>\ No newline at end of file
>+    print >>sys.stderr, __doc__
>Index: dbdict.py
>===================================================================
>RCS file: dbdict.py
>diff -N dbdict.py
>--- /dev/null	1 Jan 1970 00:00:00 -0000
>+++ dbdict.py	19 Nov 2002 00:24:57 -0000
>@@ -0,0 +1,92 @@
>+#! /usr/bin/env python
>+
>+from __future__ import generators
>+import dbhash
>+try:
>+    import cPickle as pickle
>+except ImportError:
>+    import pickle
>+
>+class DBDict:
>+    """Database Dictionary.
>+
>+    This wraps a dbhash database to make it look even more like a
>+    dictionary, much like the built-in shelf class.  The difference is
>+    that a DBDict supports all dict methods.
>+
>+    Call it with the database.  Optionally, you can specify a list of
>+    keys to skip when iterating.  This only affects iterators; things
>+    like .keys() still list everything.  For instance:
>+
>+    >>> d = DBDict('goober.db', 'c', ('skipme', 'skipmetoo'))
>+    >>> d['skipme'] = 'booga'
>+    >>> d['countme'] = 'wakka'
>+    >>> print d.keys()
>+    ['skipme', 'countme']
>+    >>> for k in d.iterkeys():
>+    ...     print k
>+    countme
>+
>+    """
>+
>+    def __init__(self, dbname, mode, iterskip=()):
>+        self.hash = dbhash.open(dbname, mode)
>+        self.iterskip = iterskip
>+
>+    def __getitem__(self, key):
>+        return pickle.loads(self.hash[key])
>+
>+    def __setitem__(self, key, val):
>+        self.hash[key] = pickle.dumps(val, 1)
>+
>+    def __delitem__(self, key):
>+        del(self.hash[key])
>+
>+    def __iter__(self, fn=None):
>+        k = self.hash.first()
>+        while k != None:
>+            key = k[0]
>+            val = self.__getitem__(key)
>+            if key not in self.iterskip:
>+                if fn:
>+                    yield fn((key, val))
>+                else:
>+                    yield (key, val)
>+            try:
>+                k = self.hash.next()
>+            except KeyError:
>+                break
>+
>+    def __contains__(self, name):
>+        return self.has_key(name)
>+
>+    def __getattr__(self, name):
>+        # Pass the buck
>+        return getattr(self.hash, name)
>+
>+    def get(self, key, dfl=None):
>+        if self.has_key(key):
>+            return self[key]
>+        else:
>+            return dfl
>+
>+    def iteritems(self):
>+        return self.__iter__()
>+
>+    def iterkeys(self):
>+        return self.__iter__(lambda k: k[0])
>+
>+    def itervalues(self):
>+        return self.__iter__(lambda k: k[1])
>+
>+open = DBDict
>+
>+def _test():
>+    import doctest
>+    import dbdict
>+
>+    doctest.testmod(dbdict)
>+
>+if __name__ == '__main__':
>+    _test()
>+
>Index: hammie.py
>===================================================================
>RCS file: /cvsroot/spambayes/spambayes/hammie.py,v
>retrieving revision 1.40
>diff -u -r1.40 hammie.py
>--- hammie.py	18 Nov 2002 18:13:54 -0000	1.40
>+++ hammie.py	19 Nov 2002 00:24:57 -0000
>@@ -1,57 +1,11 @@
> #! /usr/bin/env python
> 
>-# A driver for the classifier module and Tim's tokenizer that you can
>-# call from procmail.
>-
>-"""Usage: %(program)s [options]
>-
>-Where:
>-    -h
>-        show usage and exit
>-    -g PATH
>-        mbox or directory of known good messages (non-spam) to train on.
>-        Can be specified more than once, or use - for stdin.
>-    -s PATH
>-        mbox or directory of known spam messages to train on.
>-        Can be specified more than once, or use - for stdin.
>-    -u PATH
>-        mbox of unknown messages.  A ham/spam decision is reported for each.
>-        Can be specified more than once.
>-    -r
>-        reverse the meaning of the check (report ham instead of spam).
>-        Only meaningful with the -u option.
>-    -p FILE
>-        use file as the persistent store.  loads data from this file if it
>-        exists, and saves data to this file at the end.
>-        Default: %(DEFAULTDB)s
>-    -d
>-        use the DBM store instead of cPickle.  The file is larger and
>-        creating it is slower, but checking against it is much faster,
>-        especially for large word databases. Default: %(USEDB)s
>-    -D
>-        the reverse of -d: use the cPickle instead of DBM
>-    -f
>-        run as a filter: read a single message from stdin, add an
>-        %(DISPHEADER)s header, and write it to stdout.  If you want to
>-        run from procmail, this is your option.
>-"""
>-
>-from __future__ import generators
>-
>-import sys
>-import os
>-import types
>-import getopt
>-import mailbox
>-import glob
>-import email
>-import errno
>-import anydbm
>-import cPickle as pickle
> 
>+import dbdict
> import mboxutils
>-import classifier
>+import Bayes
> from Options import options
>+from tokenizer import tokenize
> 
> try:
>     True, False
>@@ -60,166 +14,14 @@
>     True, False = 1, 0
> 
> 
>-program = sys.argv[0] # For usage(); referenced by docstring above
>-
>-# Name of the header to add in filter mode
>-DISPHEADER = options.hammie_header_name
>-DEBUGHEADER = options.hammie_debug_header_name
>-DODEBUG = options.hammie_debug_header
>-
>-# Default database name
>-DEFAULTDB = options.persistent_storage_file
>-
>-# Probability at which a message is considered spam
>-SPAM_THRESHOLD = options.spam_cutoff
>-HAM_THRESHOLD = options.ham_cutoff
>-
>-# Probability limit for a clue to be added to the DISPHEADER
>-SHOWCLUE = options.clue_mailheader_cutoff
>-
>-# Use a database? If False, use a pickle
>-USEDB = options.persistent_use_database
>-
>-# Tim's tokenizer kicks far more booty than anything I would have
>-# written.  Score one for analysis ;)
>-from tokenizer import tokenize
>-
>-class DBDict:
>-
>-    """Database Dictionary.
>-
>-    This wraps an anydbm to make it look even more like a dictionary.
>-
>-    Call it with the name of your database file.  Optionally, you can
>-    specify a list of keys to skip when iterating.  This only affects
>-    iterators; things like .keys() still list everything.  For instance:
>-
>-    >>> d = DBDict('/tmp/goober.db', ('skipme', 'skipmetoo'))
>-    >>> d['skipme'] = 'booga'
>-    >>> d['countme'] = 'wakka'
>-    >>> print d.keys()
>-    ['skipme', 'countme']
>-    >>> for k in d.iterkeys():
>-    ...     print k
>-    countme
>-
>-    """
>-
>-    def __init__(self, dbname, mode, iterskip=()):
>-        self.hash = anydbm.open(dbname, mode)
>-        self.iterskip = iterskip
>-
>-    def __getitem__(self, key):
>-        v = self.hash[key]
>-        if v[0] == 'W':
>-            val = pickle.loads(v[1:])
>-            # We could be sneaky, like pickle.Unpickler.load_inst,
>-            # but I think that's overly confusing.
>-            obj = classifier.WordInfo(0)
>-            obj.__setstate__(val)
>-            return obj
>-        else:
>-            return pickle.loads(v)
>-
>-    def __setitem__(self, key, val):
>-        if isinstance(val, classifier.WordInfo):
>-            val = val.__getstate__()
>-            v = 'W' + pickle.dumps(val, 1)
>-        else:
>-            v = pickle.dumps(val, 1)
>-        self.hash[key] = v
>-
>-    def __delitem__(self, key, val):
>-        del(self.hash[key])
>-
>-    def __iter__(self, fn=None):
>-        k = self.hash.first()
>-        while k != None:
>-            key = k[0]
>-            val = self.__getitem__(key)
>-            if key not in self.iterskip:
>-                if fn:
>-                    yield fn((key, val))
>-                else:
>-                    yield (key, val)
>-            try:
>-                k = self.hash.next()
>-            except KeyError:
>-                break
>-
>-    def __contains__(self, name):
>-        return self.has_key(name)
>-
>-    def __getattr__(self, name):
>-        # Pass the buck
>-        return getattr(self.hash, name)
>-
>-    def get(self, key, dfl=None):
>-        if self.has_key(key):
>-            return self[key]
>-        else:
>-            return dfl
>-
>-    def iteritems(self):
>-        return self.__iter__()
>-
>-    def iterkeys(self):
>-        return self.__iter__(lambda k: k[0])
>-
>-    def itervalues(self):
>-        return self.__iter__(lambda k: k[1])
>-
>-
>-class PersistentBayes(classifier.Bayes):
>-
>-    """A persistent Bayes classifier.
>-
>-    This is just like classifier.Bayes, except that the dictionary is a
>-    database.  You take less disk this way and you can pretend it's
>-    persistent.  The tradeoffs vs. a pickle are: 1. it's slower
>-    training, but faster checking, and 2. it needs less memory to run,
>-    but takes more space on the hard drive.
>+class Hammie:
>+    """A spambayes mail filter.
> 
>-    On destruction, an instantiation of this class will write its state
>-    to a special key.  When you instantiate a new one, it will attempt
>-    to read these values out of that key again, so you can pick up where
>-    you left off.
>+    This implements the basic functionality needed to score, filter, or
>+    train.  
> 
>     """
> 
>-    # XXX: Would it be even faster to remember (in a list) which keys
>-    # had been modified, and only recalculate those keys?  No sense in
>-    # going over the entire word database if only 100 words are
>-    # affected.
>-
>-    # XXX: Another idea: cache stuff in memory.  But by then maybe we
>-    # should just use ZODB.
>-
>-    def __init__(self, dbname, mode):
>-        classifier.Bayes.__init__(self)
>-        self.statekey = "saved state"
>-        self.wordinfo = DBDict(dbname, mode, (self.statekey,))
>-        self.dbmode = mode
>-
>-        self.restore_state()
>-
>-    def __del__(self):
>-        #super.__del__(self)
>-        self.save_state()
>-
>-    def save_state(self):
>-        if self.dbmode != 'r':
>-            self.wordinfo[self.statekey] = (self.nham, self.nspam)
>-
>-    def restore_state(self):
>-        if self.wordinfo.has_key(self.statekey):
>-            self.nham, self.nspam = self.wordinfo[self.statekey]
>-
>-
>-class Hammie:
>-
>-    """A spambayes mail filter"""
>-
>     def __init__(self, bayes):
>         self.bayes = bayes
> 
>@@ -262,9 +64,9 @@
>             import traceback
>             traceback.print_exc()
> 
>-    def filter(self, msg, header=DISPHEADER, spam_cutoff=SPAM_THRESHOLD,
>-               ham_cutoff=HAM_THRESHOLD, debugheader=DEBUGHEADER,
>-               debug=DODEBUG):
>+    def filter(self, msg, header=None, spam_cutoff=None,
>+               ham_cutoff=None, debugheader=None,
>+               debug=None):
>         """Score (judge) a message and add a disposition header.
> 
>         msg can be a string, a file object, or a Message object.
>@@ -282,6 +84,17 @@
> 
>         """
> 
>+        if header == None:
>+            header = options.hammie_header_name
>+        if spam_cutoff == None:
>+            spam_cutoff = options.spam_cutoff
>+        if ham_cutoff == None:
>+            ham_cutoff = options.ham_cutoff
>+        if debugheader == None:
>+            debugheader = options.hammie_debug_header_name
>+        if debug == None:
>+            debug = options.hammie_debug_header
>+
>         msg = mboxutils.get_message(msg)
>         try:
>             del msg[header]
>@@ -348,163 +161,47 @@
> 
>         self.train(msg, True)
> 
>-    def update_probabilities(self):
>+    def update_probabilities(self, store=True):
>         """Update probability values.
> 
>         You would want to call this after a training session.  It's
>         pretty slow, so if you have a lot of messages to train, wait
>         until you're all done before calling this.
> 
>+        Unless store is false, the persistent store will be written after
>+        updating probabilities.
>+
>         """
> 
>         self.bayes.update_probabilities()
>+        if store:
>+            self.store()
> 
>+    def store(self):
>+        """Write out the persistent store.
>+
>+        This makes sure the persistent store reflects what is currently
>+        in memory.  You would want to do this after a write and before
>+        exiting.
>+
>+        """
>+
>+        self.bayes.store()
>+
>+
>+def open(filename, usedb=True, mode='r'):
>+    """Open a file, returning a Hammie instance.
>+
>+    If usedb is False, open as a pickle instead of a DBDict.  mode is
>+
>+    used as the flag to open DBDict objects.  'c' for read-write (create
>+    if needed), 'r' for read-only, 'w' for read-write.
>+
>+    """
> 
>-def train(hammie, msgs, is_spam):
>-    """Train bayes with all messages from a mailbox."""
>-    mbox = mboxutils.getmbox(msgs)
>-    i = 0
>-    for msg in mbox:
>-        i += 1
>-        # XXX: Is the \r a Unixism?  I seem to recall it working in DOS
>-        # back in the day.  Maybe it's a line-printer-ism ;)
>-        sys.stdout.write("\r%6d" % i)
>-        sys.stdout.flush()
>-        hammie.train(msg, is_spam)
>-    print
>-
>-def score(hammie, msgs, reverse=0):
>-    """Score (judge) all messages from a mailbox."""
>-    # XXX The reporting needs work!
>-    mbox = mboxutils.getmbox(msgs)
>-    i = 0
>-    spams = hams = 0
>-    for msg in mbox:
>-        i += 1
>-        prob, clues = hammie.score(msg, True)
>-        if hasattr(msg, '_mh_msgno'):
>-            msgno = msg._mh_msgno
>-        else:
>-            msgno = i
>-        isspam = (prob >= SPAM_THRESHOLD)
>-        if isspam:
>-            spams += 1
>-            if not reverse:
>-                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
>-                print hammie.formatclues(clues)
>-        else:
>-            hams += 1
>-            if reverse:
>-                print "%6s %4.2f %1s" % (msgno, prob, isspam and "S" or "."),
>-                print hammie.formatclues(clues)
>-    return (spams, hams)
>-
>-def createbayes(pck=DEFAULTDB, usedb=False, mode='r'):
>-    """Create a Bayes instance for the given pickle (which
>-    doesn't have to exist).  Create a PersistentBayes if
>-    usedb is True."""
>     if usedb:
>-        bayes = PersistentBayes(pck, mode)
>+        b = Bayes.DBDictBayes(filename, mode)
>     else:
>-        bayes = None
>-        try:
>-            fp = open(pck, 'rb')
>-        except IOError, e:
>-            if e.errno <> errno.ENOENT: raise
>-        else:
>-            bayes = pickle.load(fp)
>-            fp.close()
>-        if bayes is None:
>-            bayes = classifier.Bayes()
>-    return bayes
>-
>-def usage(code, msg=''):
>-    """Print usage message and sys.exit(code)."""
>-    if msg:
>-        print >> sys.stderr, msg
>-        print >> sys.stderr
>-    print >> sys.stderr, __doc__ % globals()
>-    sys.exit(code)
>-
>-def main():
>-    """Main program; parse options and go."""
>-    try:
>-        opts, args = getopt.getopt(sys.argv[1:], 'hdDfg:s:p:u:r')
>-    except getopt.error, msg:
>-        usage(2, msg)
>-
>-    if not opts:
>-        usage(2, "No options given")
>-
>-    pck = DEFAULTDB
>-    good = []
>-    spam = []
>-    unknown = []
>-    reverse = 0
>-    do_filter = False
>-    usedb = USEDB
>-    mode = 'r'
>-    for opt, arg in opts:
>-        if opt == '-h':
>-            usage(0)
>-        elif opt == '-g':
>-            good.append(arg)
>-            mode = 'c'
>-        elif opt == '-s':
>-            spam.append(arg)
>-            mode = 'c'
>-        elif opt == '-p':
>-            pck = arg
>-        elif opt == "-d":
>-            usedb = True
>-        elif opt == "-D":
>-            usedb = False
>-        elif opt == "-f":
>-            do_filter = True
>-        elif opt == '-u':
>-            unknown.append(arg)
>-        elif opt == '-r':
>-            reverse = 1
>-    if args:
>-        usage(2, "Positional arguments not allowed")
>-
>-    save = False
>-
>-    bayes = createbayes(pck, usedb, mode)
>-    h = Hammie(bayes)
>-
>-    for g in good:
>-        print "Training ham (%s):" % g
>-        train(h, g, False)
>-        save = True
>-
>-    for s in spam:
>-        print "Training spam (%s):" % s
>-        train(h, s, True)
>-        save = True
>-
>-    if save:
>-        h.update_probabilities()
>-        if not usedb and pck:
>-            fp = open(pck, 'wb')
>-            pickle.dump(bayes, fp, 1)
>-            fp.close()
>-
>-    if do_filter:
>-        msg = sys.stdin.read()
>-        filtered = h.filter(msg)
>-        sys.stdout.write(filtered)
>-
>-    if unknown:
>-        (spams, hams) = (0, 0)
>-        for u in unknown:
>-            if len(unknown) > 1:
>-                print "Scoring", u
>-            s, g = score(h, u, reverse)
>-            spams += s
>-            hams += g
>-        print "Total %d spam, %d ham" % (spams, hams)
>-
>+        b = Bayes.PickledBayes(filename)
>+    return Hammie(b)
> 
>-if __name__ == "__main__":
>-    main()
>Index: hammiefilter.py
>===================================================================
>RCS file: /cvsroot/spambayes/spambayes/hammiefilter.py,v
>retrieving revision 1.2
>diff -u -r1.2 hammiefilter.py
>--- hammiefilter.py	18 Nov 2002 18:14:04 -0000	1.2
>+++ hammiefilter.py	19 Nov 2002 00:24:57 -0000
>@@ -51,43 +51,37 @@
>     print >> sys.stderr, __doc__ % globals()
>     sys.exit(code)
> 
>-def jar_pickle(h):
>-    if not options.persistent_use_database:
>-        import pickle
>-        fp = open(options.persistent_storage_file, 'wb')
>-        pickle.dump(h.bayes, fp, 1)
>-        fp.close()
>-    
>-
>-def hammie_open(mode):
>-    b = hammie.createbayes(options.persistent_storage_file,
>-                           options.persistent_use_database,
>-                           mode)
>-    return hammie.Hammie(b)
>-
> def newdb():
>-    h = hammie_open('n')
>-    jar_pickle(h)
>+    h = hammie.open(options.persistent_storage_file,
>+                    options.persistent_use_database,
>+                    'n')
>+    h.store()
>     print "Created new database in", options.persistent_storage_file
> 
> def filter():
>-    h = hammie_open('r')
>+    h = hammie.open(options.persistent_storage_file,
>+                    options.persistent_use_database,
>+                    'r')
>     msg = sys.stdin.read()
>     print h.filter(msg)
> 
> def train_ham():
>-    h = hammie_open('w')
>+    h = hammie.open(options.persistent_storage_file,
>+                    options.persistent_use_database,
>+                    'w')
>     msg = sys.stdin.read()
>     h.train_ham(msg)
>     h.update_probabilities()
>-    jar_pickle(h)    
>+    h.store()
> 
> def train_spam():
>-    h = hammie_open('w')
>+    h = hammie.open(options.persistent_storage_file,
>+                    options.persistent_use_database,
>+                    'w')
>     msg = sys.stdin.read()
>     h.train_spam(msg)
>     h.update_probabilities()
>-    jar_pickle(h)    
>+    h.store()
> 
> def main():
>     action = filter
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From bkc@murkworks.com  Tue Nov 19 22:17:34 2002
From: bkc@murkworks.com (Brad Clements)
Date: Tue, 19 Nov 2002 17:17:34 -0500
Subject: [Spambayes] re: ciphertrust
Message-ID: <3DDA711E.25949.343B3AFE@localhost>


http://www.ciphertrust.com/ironmail/anti-spam.htm


Is this it? I can't really understand it through all the marketing speak.


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From francois.granger@free.fr  Tue Nov 19 22:25:32 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Tue, 19 Nov 2002 23:25:32 +0100
Subject: [Spambayes] Training
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEEMOHMAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPEEMOHMAA.mhammond@skippinet.com.au>
Message-ID: <a05100322ba006aa4586b@[192.168.1.11]>

At 8:59 +1100 20/11/02, in message RE: [Spambayes] Training, Mark 
Hammond wrote:
>The only reason I mention this is because last time I mentioned something
>that demonstrated my ignorance, Tim promptly replied confirming it, then
>subsequently made the change anyway <wink>.

So, I am not the only one! ;-)
-- 
Electronic mail is a means of communication. People should think about
the political implications of their choices (or non-choices) of tools and
technologies.
For clean emails: http://minilien.com/?IXZneLoID0 -
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From skip@pobox.com  Tue Nov 19 22:33:53 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 16:33:53 -0600
Subject: [Spambayes] re: ciphertrust
In-Reply-To: <3DDA711E.25949.343B3AFE@localhost>
References: <3DDA711E.25949.343B3AFE@localhost>
Message-ID: <15834.48209.376350.996825@montanaro.dyndns.org>


    Brad> http://www.ciphertrust.com/ironmail/anti-spam.htm

    Brad> Is this it? I can't really understand it through all the marketing
    Brad> speak.

I suspect if there's any content it will be in the white paper which you
need to register to get.  Must be an expensive solution if they won't tell
you anything about it without getting your vital (sales) statistics.

Skip

From tim.one@comcast.net  Tue Nov 19 22:39:37 2002
From: tim.one@comcast.net (Tim Peters)
Date: Tue, 19 Nov 2002 17:39:37 -0500
Subject: [Spambayes] Training
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPEEMOHMAA.mhammond@skippinet.com.au>
Message-ID: <BIEJKCLHCIOIHAGOKOLHAENODPAA.tim.one@comcast.net>

[Mark Hammond]
> ...
> My concern is almost identical though - the *next* email that looks the
> same.  Let's say I subscribe to a weekly newsletter.  This week's comes in,
> gets marked as unsure, so I train.  Next week's comes in - again, it trains
> as unsure.  Repeat ad nauseam.
>
> I saw this a real lot when I had a high ham:spam imbalance -
> training had no obvious effect.

Complicating this, though, there were glitches in the Outlook client back then
that prevented retraining and/or rescoring from working as intended.

> I am still hoping to try Tim's new adjustment,

Note that it's already enabled in the Outlook client (but not in the general
codebase yet) -- the first time you do anything that recomputes the
probabilities, it will kick in with full force.

That's actually going to make the described problem worse:  when you have a
lot more ham than spam, the effect of the adjustment is to make everything
"less hammy" than it was.  This should help a lot when training on spam, but
makes training on ham *less* effective than it was.  In effect, it's saying
that training on new ham is much less valuable than training on new spam,
because you already have way more of the former.

> but I wonder if somehow similar maths could be exploited.  For example,
> manually training a message could be seen as "intense training", whereas a
> normal train is - well - normal.  The point of manual training is that the
> system got it wrong, and the user wants to see the error stop.  "normal"
> training is just giving the system fairly "general" instructions.

You could feed a msg into training more than once as ham (or spam).  The
classifier doesn't know the difference between training on a single msg N
times, and training on N different msgs.  We could even feed the msg in, in
a loop, until the score went out of Unsure territory.  That would be
novel -- picture the effects on the system if I were to do this with my
Nigerian-scam quote.  Brrr!

But no matter how we cut this, so long as there's more of one kind of data
than the other, the class with the lesser amount of data is the one that
limits potential accuracy.
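
A minimal sketch of the "feed the msg in, in a loop" idea, in modern
Python.  Everything here is hypothetical: ToyClassifier is a stand-in
counting classifier (not the spambayes Bayes class), the cutoffs are
invented, and scores are combined by a plain average rather than the
real combining scheme.

```python
from collections import Counter

# Hypothetical Unsure band; these cutoffs are assumptions for the sketch.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


class ToyClassifier:
    """Toy per-word counting classifier (a stand-in, not spambayes' Bayes)."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.hamcount = Counter()
        self.spamcount = Counter()

    def train(self, words, is_spam):
        # Count each distinct word once per trained message.
        if is_spam:
            self.nspam += 1
            self.spamcount.update(set(words))
        else:
            self.nham += 1
            self.hamcount.update(set(words))

    def spamprob(self, word):
        hamratio = self.hamcount[word] / max(self.nham, 1)
        spamratio = self.spamcount[word] / max(self.nspam, 1)
        if hamratio + spamratio == 0:
            return 0.5  # never-seen word
        return spamratio / (hamratio + spamratio)

    def score(self, words):
        # Plain average of word probabilities (a deliberate simplification).
        probs = [self.spamprob(w) for w in set(words)]
        return sum(probs) / len(probs)


def train_until_sure(clf, words, is_spam, max_rounds=100):
    """Retrain on one message until it leaves Unsure; return rounds used."""
    for rounds in range(max_rounds):
        score = clf.score(words)
        sure = score >= SPAM_CUTOFF if is_spam else score <= HAM_CUTOFF
        if sure:
            return rounds
        clf.train(words, is_spam)
    return max_rounds


def demo():
    clf = ToyClassifier()
    for i in range(10):
        clf.train(["meeting", "patch"], is_spam=False)
        spam_words = ["viagra"]
        if i < 5:
            spam_words.append("offer")
        if i < 2:
            spam_words.append("newsletter")
        clf.train(spam_words, is_spam=True)
    newsletter = ["newsletter", "offer", "weekly"]
    before = clf.score(newsletter)
    rounds = train_until_sure(clf, newsletter, is_spam=False)
    return before, rounds, clf.score(newsletter)
```

In this toy the newsletter starts out Unsure and needs many rounds to
leave that band, because every repetition also inflates the ham total.
The classifier cannot tell those repetitions from distinct messages,
which is why looping like this distorts the counts.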

> The only reason I mention this is because last time I mentioned something
> that demonstrated my ignorance, Tim promptly replied confirming it, then
> subsequently made the change anyway <wink>.

Familiar patterns are such a comfort to us all <wink>.


From rob@hooft.net  Tue Nov 19 22:41:14 2002
From: rob@hooft.net (Rob Hooft)
Date: Tue, 19 Nov 2002 23:41:14 +0100
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
 got it right
References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk>
	<BIEJKCLHCIOIHAGOKOLHAEMIDPAA.tim.one@comcast.net>
	<15834.39548.393074.819295@montanaro.dyndns.org>
Message-ID: <3DDABE0A.9090409@hooft.net>

Skip Montanaro wrote:
>     Tim> [Toby Dickenson]
>     >> Why exclude spams that score 100 from training?  Even these really
>     >> spammy spams might contain clues that would help to classify other
>     >> more marginal spam.
> 
>     Tim> Absolutely, but that's a different experiment.  I've already done
>     Tim> "proper" training and know it works great for me.  These are
>     Tim> experiments in doing silly training.  
> 
> If you're taking notes on this in various files in CVS I wouldn't call it
> "silly training".  How about "realistic training"?

Why realistic? Minimalistic?

My favorite has already been discussed, but I'd like to see more 
statistics on it: train on all ham/spam messages automatically, 
without any user interaction, after an initial training phase of 
at least 10-30 messages. This should automatically adapt to gradual 
changes. If this really worked, it would be my realistic variant... 
Integration into the MUA could only make it better.

Hm. I just adapted weaktest to be a bit more flexible, such that all 
these strategies can be tested. There are four new flags to the 
weaktest.py program:

  -d <key>: selects the "decisionmaker"; i.e. the strategy used to decide
       whether a message is trained on. There is a choice between:
        all : train on all messages
        allbut0and100 : train on all spam < 0.995 and ham > 0.005
        unsureandfalses : train on Unsure and fp/fn only
        unsureonly : train on Unsure only.
  -u <key>: selects the "update strategy".
        always : updates counts after every trained message
        sometimes : trains every 10th
  -m int : uses the first "int" messages for training only (default 10)
  -v : increases verbosity.
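
The four "decisionmaker" keys above could be sketched as predicates
like the ones below.  This is an illustration in modern Python, not the
weaktest.py code: the cutoff values, the function names and the
(score, is_spam) signature are all assumptions, with scores taken to
lie in the range 0.0 to 1.0.

```python
# Hypothetical cutoffs delimiting the Unsure band.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90


def is_unsure(score):
    return HAM_CUTOFF < score < SPAM_CUTOFF


def is_false(score, is_spam):
    # fp: a ham scored as spam; fn: a spam scored as ham.
    if is_spam:
        return score <= HAM_CUTOFF
    return score >= SPAM_CUTOFF


def train_all(score, is_spam):
    return True


def train_allbut0and100(score, is_spam):
    # Skip spam that is already at ~1.0 and ham that is already at ~0.0.
    return score < 0.995 if is_spam else score > 0.005


def train_unsureandfalses(score, is_spam):
    return is_unsure(score) or is_false(score, is_spam)


def train_unsureonly(score, is_spam):
    return is_unsure(score)


DECISION_MAKERS = {
    "all": train_all,
    "allbut0and100": train_allbut0and100,
    "unsureandfalses": train_unsureandfalses,
    "unsureonly": train_unsureonly,
}


def should_train(key, score, is_spam):
    """Return True if the strategy named by key would train on this msg."""
    return DECISION_MAKERS[key](score, is_spam)
```

Keeping the strategies in a dictionary keyed by the -d argument means a
new strategy only needs one more function and one more entry.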

I'm open to ideas (and results).

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From bkc@murkworks.com  Tue Nov 19 22:45:43 2002
From: bkc@murkworks.com (Brad Clements)
Date: Tue, 19 Nov 2002 17:45:43 -0500
Subject: [Spambayes] re: ciphertrust
In-Reply-To: <15834.48209.376350.996825@montanaro.dyndns.org>
References: <3DDA711E.25949.343B3AFE@localhost>
Message-ID: <3DDA77B8.19706.345501F9@localhost>

On 19 Nov 2002 at 16:33, Skip Montanaro wrote:

> I suspect if there's any content it will be in the white paper which you
> need to register to get.  Must be an expensive solution if they won't tell
> you anything about it without getting your vital (sales) statistics.

I've registered for it. Probably can't pass it on but I'll summarize.


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From noreply@sourceforge.net  Tue Nov 19 22:43:30 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Tue, 19 Nov 2002 14:43:30 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-639312 ] fix for outlook CompareEntryIDs bug
Message-ID: <E18EH5e-0005qv-00@sc8-sf-web4.sourceforge.net>

Patches item #639312, was opened at 2002-11-16 23:35
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Piers Haken (piersh)
>Assigned to: Mark Hammond (mhammond)
Summary: fix for outlook CompareEntryIDs bug

Initial Comment:
This patch reenables the CompareEntryIDs for 
comparing folder IDs. It passes both the MAPI Session 
and the Outlook Session into the dialog, one for retrieving 
the exchange-compatible IDs and the other for 
comparing them.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639312&group_id=61702

From noreply@sourceforge.net  Tue Nov 19 22:44:07 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Tue, 19 Nov 2002 14:44:07 -0800
Subject: [Spambayes] [ spambayes-Patches-639310 ] fix for outlook 'spam' field
Message-ID: <E18EH6F-00012v-00@sc8-sf-web2.sourceforge.net>

Patches item #639310, was opened at 2002-11-16 23:32
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Piers Haken (piersh)
>Assigned to: Mark Hammond (mhammond)
Summary: fix for outlook 'spam' field

Initial Comment:
1) firstly it changes the Class of the 'Spam' field to 
olPercent, which I believe is much more appropriate than 
olCombination. The problem with olCombination is that 
you have to manually change the field type in outlook in 
order to get anything to show up. With olPercent, the 
column shows up with a nice '%' sign which makes it 
more obvious what the number actually means.

2) secondly it adds a checkbox 'Update spam scores' to 
the training dialog. Checking this box causes the trainer 
to update the spam field for ALL messages in your 
training folders (in a second pass, if necessary). This 
means that ALL messages in your inbox have an entry 
in that field, not just those that arrived since you 
installed the plugin. This was a huge win for me since it 
allowed me to sort by the spam field and throw away 
about 20 spams from my inbox that I had missed during 
my initial manual pruning.


The only issue here is that in order for this to work right, 
you'll have to manually delete your existing spam fields, 
restart outlook and then 'rescore'.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639310&group_id=61702

From lists@morpheus.demon.co.uk  Tue Nov 19 23:18:30 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Tue, 19 Nov 2002 23:18:30 +0000
Subject: [Spambayes] Training
References: 
	<16E1010E4581B049ABC51D4975CEDB88619943@UKDCX001.uk.int.atosorigin.com>
	<LCEPIIGDJPKCOIHOBJEPEEMOHMAA.mhammond@skippinet.com.au>
Message-ID: <n2m-g.3cpxcfnd.fsf@morpheus.demon.co.uk>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> My concern is almost identical though - the *next* email that looks the
> same.  Let's say I subscribe to a weekly newsletter.  This week's comes in,
> gets marked as unsure, so I train.  Next week's comes in - again, it trains
> as unsure.  Repeat ad nauseam.

Good point. That would be *really* annoying after a while.

> I saw this a real lot when I had a high ham:spam imbalance - training had no
> obvious effect.

This happened to me today, with Tim's new adjustment switched on, with
a 10:1 ham:spam imbalance. IIRC, Tim's change means that with this
sort of imbalance, ham clues will only have 10% of their normal
effect, so saying "This is ham" will be pretty much ignored :-(

> I am still hoping to try Tim's new adjustment, but I wonder if
> somehow similar maths could be exploited.  For example, manually
> training a message could be seen as "intense training", whereas a
> normal train is - well - normal.  The point of manual training is
> that the system got it wrong, and the user wants to see the error
> stop.  "normal" training is just giving the system fairly "general"
> instructions.

I'm not sure. All training is basically saying "these specific
messages *are* ham/spam". Whether this is done in bulk, or on an
individual basis, shouldn't matter. A naive view says that therefore
trained messages will score 0/100 "by definition". But the maths
doesn't work like that, and nothing is going to make it.

But I think it's a reasonable assumption that any messages which have
been explicitly trained will no longer hit the "unsure" range. I just
can't see a way of making even that assumption be true.

Paul.
-- 
This signature intentionally left blank

From neale@woozle.org  Tue Nov 19 23:39:45 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 15:39:45 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <TR51NJJHB8UQD0FBL6FAD9ZVMKIGA9.3ddab3a4@riven>
References: <TR51NJJHB8UQD0FBL6FAD9ZVMKIGA9.3ddab3a4@riven>
Message-ID: <w53ptt1164e.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> I'll wait for your checkin, and do some more work on the dbdict module
> to add my load/store stuff...

Okay Tim, I'll tell you what.  I'm going to create a branch and check in
everything I've got.  I'm branching because what I have right now breaks
some existing functionality.

In the branch, we can play around with moving things out of the
classifier, moving options, etc.  When we get something that we think is
stable, and everyone else okays it, we can merge it all back in to HEAD.

I've called the branch "hammie-playground".  To get to it, just 

  $ cvs update -r hammie-playground

The branch need not be around for a long time, just long enough to work
out all these changes.

Neale

From skip@pobox.com  Tue Nov 19 23:40:08 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 19 Nov 2002 17:40:08 -0600
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
	got it right
In-Reply-To: <3DDABE0A.9090409@hooft.net>
References: <200211191113.58375.tdickenson@devmail.geminidataloggers.co.uk>
        <BIEJKCLHCIOIHAGOKOLHAEMIDPAA.tim.one@comcast.net>
        <15834.39548.393074.819295@montanaro.dyndns.org>
        <3DDABE0A.9090409@hooft.net>
Message-ID: <15834.52184.806714.753216@montanaro.dyndns.org>


    Tim> Absolutely, but that's a different experiment.  I've already done
    Tim> "proper" training and know it works great for me.  These are
    Tim> experiments in doing silly training.

    Skip> If you're taking notes on this in various files in CVS I wouldn't
    Skip> call it "silly training".  How about "realistic training"?

    Rob> Why realistic? Minimalistic?

Realistic in the sense that the sort of training Tim is trying now probably
mimics what you can expect from average users over time.  You can't expect
people to always train on everything.  Even with a slick user interface that
won't be much better than just hitting "delete" for each spam.  You have to
assume people are going to be gung-ho at the beginning, then taper off when
either performance gets good enough or the novelty wears off.  One stop on
the way to not training at all is to only train on FPs, FNs and unsures.

Maybe "real world" is a better term than "realistic".

Skip

From richie@entrian.com  Tue Nov 19 23:59:26 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 19 Nov 2002 23:59:26 +0000
Subject: [Spambayes] Training from scratch
Message-ID: <b81ltuohkf0sdgoa4f5e425u9ikvuc08ot@4ax.com>


I started a new database from scratch yesterday morning at work, and
trained it via the web interface as the messages arrived.  Courtesy of the
shiny new pop3graph.py (as yet uncommitted), this is how it behaved over
the first 36 hours:


   . - Number of messages over time
   * - Number of correctly classified messages over time


 |                                                 . 99
 |                                                .
 |                                               .
 |                                              .
 |                                             .
 |                                            .
 |                                           .
 |                                          .
 |                                         .
 |                                        .
 |                                       .
 |                                      .
 |                                     .           * 74
 |                                    .           *
 |                                   .           *
 |                                  .          **
 |                                 .          *
 |                                .          *
 |                               .          *
 |                              .          *
 |                             .          *
 |                            .          *
 |                           .         **
 |                          .        **
 |                         .        *
 |                        .        *
 |                       .       **
 |                      .       *
 |                     .       *
 |                    .       *
 |                   .       *
 |                  .      **
 |                 .     **
 |                .     *
 |               .     *
 |              .     *
 |             .    **
 |            .    *
 |           .    *
 |          .   **
 |         .   *
 |        .   *
 |       .   *
 |      .   *
 |     .  **
 |    . **
 |   ***
 |  *
 | *
 ___________________________________________________


(that should really plot the derivative of the second line as well, but you
 can see that it very quickly got close to parallel with the total number).

This is utterly unscientific I know, but very encouraging.  Not one of the
misclassifications was an FP!  Though that's probably down to the fact that
most of the early messages I trained it on were hams.  This could be worth
bearing in mind when thinking about training strategies (if I'm right) -
since FPs are more damaging than FNs, maybe people should be encouraged
(forced?) to train on a bunch of hams before any spams.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Tue Nov 19 23:59:53 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 19 Nov 2002 23:59:53 +0000
Subject: [Spambayes] Offtopic - getting bounce messages for spam
In-Reply-To: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
References: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
Message-ID: <5liltuk2j7o5uc0l85s97cdq1890jvjgsi@4ax.com>

Hi Paul,

> I've just started receiving undeliverable message reports for spam,
> sent to people I've never heard of. [...]
> What can I do about it in any case?

This happened to me last year.  I received 28,000 bounce emails in a period
of about two weeks.  I wrote a web-based POP3 gateway that lets you
batch-delete messages using regular expressions:
http://entrian.com/cgi-bin/pop3.py

This is very dangerous - you can wipe all your emails by abusing it, or
simply by misunderstanding it.  It also passes your POP3 password in plain
text across the internet, if that worries you.  And it may be buggy.  And
if anyone can think of any other reasons why people shouldn't use it,
please post them.  But when your email account is rendered completely
useless by all this, it will be a godsend.  8-)

-- 
Richie Hindle
richie@entrian.com


From neale@woozle.org  Wed Nov 20 00:02:58 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 16:02:58 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <TR51NJJHB8UQD0FBL6FAD9ZVMKIGA9.3ddab3a4@riven>
References: <TR51NJJHB8UQD0FBL6FAD9ZVMKIGA9.3ddab3a4@riven>
Message-ID: <w53lm3p151p.fsf@woozle.org>

I just realized that I failed to respond to your points :)


So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> Neale, I'm ok with these changes.  I have more to make, but go ahead
> and make these alterations.  Particularly, I've got a dbdict class
> that supports load/store, so we don't have to worry about training
> that blows up before we save nham and nspam.

I'm curious about how you're doing this.  I briefly had a DBDict which
cached anything you tried to write to it, but it didn't seem like an
improvement so I dropped it, figuring ZODB was probably a better
solution.

> I think we should think about where WordInfo class goes...

That's rather unorthodox.  Why?

> I'm not sure I like having mode on the dbdict constructor, although I
> understand why you have it.  No harm done, as it defaults anyway.

I'm not sure I like it either, but I didn't know where else to put it.
If you think of a better solution, feel free to change it.

> I think we should take Bayes out of classifier and put it in Bayes.py

Now that's downright heretical!  ;)  It makes sense, I think, Bayes.py
being where all the Bayes stuff hangs out.  But if you take WordInfo out
of classifier, and you take Bayes out of classifier, all you'll have
left is two constants.  Maybe you just want to rename classifier.py.  I
wonder what the other Tim thinks about this idea...

> I like widict as a class, but it could be abstracted another notch by
> simply specifying the class to instantiate when you find a 'w' in the
> pickle, as an operand on the constructor.

I'm leaning heavily toward ditching WIDict and subclassing
Pickler/Unpickler; I think that's the Right Thing.  It will be slower
running, but maybe not significantly so.  I'll run some trials when I
get home.

Neale

From tim@fourstonesExpressions.com  Wed Nov 20 03:39:01 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 21:39:01 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53ptt1164e.fsf@woozle.org>
Message-ID: <FB945TNFDLKPKO76GBMA7VU65SO.3ddb03d5@riven>

So then, Neale Pickett is all like:

11/19/2002 5:39:45 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> I'll wait for your checkin, and do some more work on the dbdict module
>> to add my load/store stuff...
>
>Okay Tim, I'll tell you what.  I'm going to create a branch and check in
>everything I've got.  I'm branching because what I have right now breaks
>some existing functionality.

Ok, I've got the branch right now... I'll make my little tweaks.  You'll see 
how I do the load/store stuff with the dbm in LSDBDict(DBDict).  Basically, 
it keeps a working file...

I have a few other tweaks to the Corpus stuff that are really unrelated to 
this work, more to do with Richie's needs.  I'll put them in the playground as 
well, just for consistency's sake.

>
>In the branch, we can play around with moving things out of the
>classifier, moving options, etc.  When we get something that we think is
>stable, and everyone else okays it, we can merge it all back in to HEAD.
>
>I've called the branch "hammie-playground".  To get to it, just 
>
>  $ cvs update -r hammie-playground
>
>The branch need not be around for a long time, just long enough to work
>out all these changes.
>

>> I think we should think about where WordInfo class goes...
>> I think we should take Bayes out of classifier and put it in Bayes.py
>
>That's rather unorthodox.  Why?
>Now that's downright heretical!  ;)  It makes sense, I think, Bayes.py
>being where all the Bayes stuff hangs out.  But if you take WordInfo out
>of classifier, and you take Bayes out of classifier, all you'll have
>left is two constants.  Maybe you just want to rename classifier.py.  I
>wonder what the other Tim thinks about this idea...
>

Yeah, the more I think about it, the more I realize my issue is that 
classifier kinda doesn't tell me what's in there.  WordInfo and Bayes 
superclass... Doesn't really matter to me, but would make more sense to me 
from a packaging point of view to simply have one file to distribute rather 
than two...

>I'm leaning heavily toward ditching WIDict and subclassing
>Pickler/Unpickler; I think that's the Right Thing.  It will be slower
>running, but maybe not significantly so.  I'll run some trials when I
>get home.

I don't see WIDict in the playground, so I assume you've ditched it already?  
But I don't see a pickle subclass either... am I missing something?  Haven't 
tried running anything yet, so maybe it will become obvious to me when I do... 
<wink>

>Neale
>
>
- Tim
www.fourstonesExpressions.com 


From tim_one@email.msn.com  Wed Nov 20 03:43:00 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 19 Nov 2002 22:43:00 -0500
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53lm3p151p.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEJMCNAB.tim_one@email.msn.com>

[Tim Stone]
>> I think we should take Bayes out of classifier and put it in Bayes.py

[Neale Pickett]
> Now that's downright heretical!  ;)  It makes sense, I think, Bayes.py
> being where all the Bayes stuff hangs out.  But if you take WordInfo out
> of classifier, and you take Bayes out of classifier, all you'll have
> left is two constants.  Maybe you just want to rename classifier.py.  I
> wonder what the other Tim thinks about this idea...

Heresy is fine, but I don't understand what the goal of this is.

Jeremy previously added

    WordInfoClass = WordInfo

to the Bayes class so that subclasses (of Bayes) could specify the kind of
WordInfo structure they want to use.  The methods in Bayes never call
WordInfo directly, they always invoke self.WordInfoClass().  Bayes doesn't
care, so long as whatever the WordInfoClass() factory returns supports the
attributes the classifier accesses.
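
A rough sketch of that factory pattern, written in modern Python.  Only
the names Bayes, WordInfo and WordInfoClass come from the discussion;
learn_word, the count attributes and the Timestamped* classes are
invented here for illustration and are not claimed to match
classifier.py.

```python
import time


class WordInfo:
    """Per-word record; the attribute names are illustrative."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class Bayes:
    # Subclasses override this to choose their own record type.
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}

    def learn_word(self, word, is_spam):
        """Hypothetical helper: bump the count for one word."""
        record = self.wordinfo.get(word)
        if record is None:
            # Always go through the factory, never WordInfo() directly.
            record = self.wordinfo[word] = self.WordInfoClass()
        if is_spam:
            record.spamcount += 1
        else:
            record.hamcount += 1
        return record


class TimestampedWordInfo(WordInfo):
    """Variant record that also remembers when it was created."""

    def __init__(self):
        super().__init__()
        self.created = time.time()


class TimestampedBayes(Bayes):
    WordInfoClass = TimestampedWordInfo
```

Because learn_word only touches the attributes it needs, any record
type that supplies spamcount and hamcount can be swapped in without
changing the base class.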

Subclassing is a clean & correct way to provide variants.  It's unfortunate
that Bayes also became an old-style class for a different reason, as
subclassing is much more efficient with new-style classes.  Then again,
classifier methods aren't called that often, so it's hard to get excited
about that.

Taking the classifier class out of classifier.py doesn't make sense to me on
the face of it, but maybe it would if I understood the goal.


From tim_one@email.msn.com  Wed Nov 20 03:52:26 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 19 Nov 2002 22:52:26 -0500
Subject: [Spambayes] Training from scratch
In-Reply-To: <b81ltuohkf0sdgoa4f5e425u9ikvuc08ot@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEJOCNAB.tim_one@email.msn.com>

[Richie Hindle]
> ...
> This could be worth bearing in mind when thinking about training
> strategies (if I'm right) - since FPs are more damaging than FNs, maybe
> people should be encouraged (forced?) to train on a bunch of hams before
> any spams.

If they don't, just about everything will come out as spam (every word
trained on will have a by-counting spamprob of 1.0, and the Bayesian
adjustment will move that closer to 0.5 but not to less than 0.5).

The Outlook client doesn't allow you to train before specifying at least one
ham and one spam folder.  That doesn't stop a determined idiot from
specifying empty folders, though.


From tim_one@email.msn.com  Wed Nov 20 04:01:40 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 19 Nov 2002 23:01:40 -0500
Subject: [Spambayes] More back-patting - my brain's first FP where bayes
	got it right
In-Reply-To: <15834.52184.806714.753216@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEJPCNAB.tim_one@email.msn.com>

[Skip Montanaro]
> ...
> You have to assume people are going to be gung-ho at the beginning,
> then taper off when either performance gets good enough or the novelty
> wears off.  One stop on the way to not training at all is to only train
> on FPs, FNs and unsures.
>
> Maybe "real world" is a better term than "realistic".

Failing to account for human behavior would be a failing of the client,
then.  For this to work superbly, the client is going to have to train on
msgs without the *user*'s guidance.  I ran a quick experiment on that
earlier (the classifier training on its own decisions, simply assuming they
were correct), even to the extent of training on false positives as spam
assuming the user doesn't look at their spam folder at all after a while.
The results were indeed superb, but it so happened there were no false
positives during the test run (and I haven't had time to continue with more
of those tests, alas).

I doubt by-hand training will work except for geeks; they'll end up doing
mistake-based training, after an initial flurry of training on 5-year-old
ham <0.9 wink>.


From tim_one@email.msn.com  Wed Nov 20 04:14:05 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 19 Nov 2002 23:14:05 -0500
Subject: [Spambayes] Training
In-Reply-To: <n2m-g.3cpxcfnd.fsf@morpheus.demon.co.uk>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEJPCNAB.tim_one@email.msn.com>

[Paul Moore]
> ...
> This happened to me today, with Tim's new adjustment switched on, with
> a 10:1 ham:spam imbalance. IIRC, Tim's change means that with this
> sort of imbalance, ham clues will only have 10% of their normal
> effect, so saying "This is ham" will be pretty much ignored :-(

It affects only the Bayesian adjustment to the by-counting spamprob
estimates, and the adjustment isn't a linear function, so 10:1 -> 10% isn't
what happens.  For what really happens, study update_probabilities <wink>.

The effect of the Bayesian adjustment is *always* to move a by-counting
estimate closer to 0.5 (unknown_word_prob).  It can never increase the
distance of a by-counting estimate from 0.5.  So even if the Bayesian
adjustment weren't done at all, a hamprob can only get as low as the data
says it should get, and that's purely a matter of how often the word has
been seen in trained ham and trained spam.  Doing better than that would
require major psychic powers.
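
The "never moves away from 0.5" claim can be checked numerically.  This
sketch assumes the adjustment has the form (s*x + n*p) / (s + n), with x
standing for unknown_word_prob and s for unknown_word_strength.  Under that
assumption, adjusted - x = (n / (s + n)) * (p - x), and the factor
n / (s + n) is always below 1.  The helper names are made up for
illustration:

```python
def bayesian_adjust(prob, n, s=0.45, x=0.5):
    """Assumed form of the adjustment: a weighted mean of x and prob."""
    return (s * x + n * prob) / (s + n)


def moves_toward_x(prob, n, s=0.45, x=0.5):
    """True if the adjusted value is no farther from x than prob was."""
    adjusted = bayesian_adjust(prob, n, s, x)
    # Small tolerance for floating-point rounding.
    return abs(adjusted - x) <= abs(prob - x) + 1e-12


# Sweep by-counting estimates from 0.00 to 1.00 and word counts 1..49.
all_closer = all(moves_toward_x(p / 100.0, n)
                 for p in range(0, 101)
                 for n in range(1, 50))
print(all_closer)
```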

> I'm not sure. All training is basically saying "these specific
> messages *are* ham/spam". Whether this is done in bulk, or on an
> individual basis, shouldn't matter. A naive view says that therefore
> trained messages will score 0/100 "by definition". But the maths
> doesn't work like that, and nothing is going to make it.

You could train on a message over and over and over ... again, until the
score became arbitrarily close to 0 or 100.  It would probably ruin the
classifier for most other msgs, though.

> But I think it's a reasonable assumption that any messages which have
> been explicitly trained will no longer hit the "unsure" range. I just
> can't see a way of making even that assumption be true.

I have an FP that's an entire Nigerian scam msg, prefaced by a one-line
comment saying something like "Jeez, here's another Nigerian wire scam --
this has been around for 20 years".  Think about it <wink>.


From tim_one@email.msn.com  Wed Nov 20 04:18:12 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Tue, 19 Nov 2002 23:18:12 -0500
Subject: [Spambayes] CipherTrust?
In-Reply-To: <15834.38911.435802.71321@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEKACNAB.tim_one@email.msn.com>

>     http://eletters1.ziffdavis.com/cgi-bin10/flo?y=eSyU0EWaTF0E4J0sQU0Ax

Rummage around on the site it points to:

    http://www.ciphertrust.com/ironmail/anti-spam.htm

I like the "surgical precision in spam-blocking to eliminate false
positives" bit.  Screw these percentage error rates!  From now on we're
surgically precise (not to mention anatomically correct).


From tim@fourstonesExpressions.com  Wed Nov 20 04:39:19 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 22:39:19 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJMCNAB.tim_one@email.msn.com>
Message-ID: <NID0WSC0D8X3W42D0NJ76FDLKLJE993.3ddb11f7@riven>

11/19/2002 9:43:00 PM, "Tim Peters" <tim_one@email.msn.com> wrote:

>[Tim Stone]
>>> I think we should take Bayes out of classifier and put it in Bayes.py
>
>[Neale Pickett]
>> Now that's downright heretical!  ;)  It makes sense, I think, Bayes.py
>> being where all the Bayes stuff hangs out.  But if you take WordInfo out
>> of classifier, and you take Bayes out of classifier, all you'll have
>> left is two constants.  Maybe you just want to rename classifier.py.  I
>> wonder what the other Tim thinks about this idea...
>
>Heresy is fine, but I don't understand what the goal of this is.

See below

>
>Jeremy previously added
>
>    WordInfoClass = WordInfo
>
>to the Bayes class so that subclasses (of Bayes) could specify the kind of
>WordInfo structure they want to use.  The methods in Bayes never call
>WordInfo directly, they always invoke self.WordInfoClass().  Bayes doesn't
>care, so long as whatever the WordInfoClass() factory returns supports the
>attributes the classifier accesses.

Looks like we've got an impedance mismatch here.  The WIDict class that Neale 
made always assumes WordInfo.  We'll have to fix that.  If Neale subclasses 
Pickle to do this, it'll still need to know what class to instantiate.  
Interesting.
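
The WordInfoClass hook described in the quoted text can be sketched as
follows.  This is only an illustration of the pattern: the attribute names
on WordInfo and the TimestampedWordInfo subclass are hypothetical, and the
real classes carry more state than shown here.

```python
class WordInfo(object):
    """Per-word record; only the two counts are modelled here."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class Bayes(object):
    # Subclasses override this to choose a different record type.
    WordInfoClass = WordInfo

    def __init__(self):
        self.wordinfo = {}

    def learn(self, words, is_spam):
        for word in words:
            record = self.wordinfo.get(word)
            if record is None:
                # Never name WordInfo directly; always use the factory.
                record = self.WordInfoClass()
                self.wordinfo[word] = record
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1


class TimestampedWordInfo(WordInfo):
    """Hypothetical variant that carries one extra attribute."""

    def __init__(self):
        WordInfo.__init__(self)
        self.touched = 0


class TimestampedBayes(Bayes):
    WordInfoClass = TimestampedWordInfo


demo = TimestampedBayes()
demo.learn(["viagra"], True)
print(type(demo.wordinfo["viagra"]).__name__)
```

A container such as WIDict would need the same indirection: it has to be
told which class to instantiate rather than assuming WordInfo.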

>
>Subclassing is a clean & correct way to provide variants.  It's unfortunate
>that Bayes also became an old-style class for a different reason, as
>subclassing is much more efficient with new-style classes.  Then again,
>classifier methods aren't called that often, so it's hard to get excited
>about that.
>
>Taking the classifier class out of classifier.py doesn't make sense to me on
>the face of it, but maybe it would if I understood the goal.

I don't have strong feelings about it... we could just as easily put all the 
stuff that's in Bayes.py into classifier.py.  One file is better than two, at 
least in this instance.

>
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Wed Nov 20 04:42:36 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 20:42:36 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEJMCNAB.tim_one@email.msn.com>
References: <LNBBLJKPBEHFEDALKOLCGEJMCNAB.tim_one@email.msn.com>
Message-ID: <w53d6p026o3.fsf@woozle.org>

So then, "Tim Peters" <tim_one@email.msn.com> is all like:

> Heresy is fine, but I don't understand what the goal of this is.
>
> [snip]
>
> Taking the classifier class out of classifier.py doesn't make sense to
> me on the face of it, but maybe it would if I understood the goal.

Right now we have only one classifier, a Bayesian classifier, so when
Tim Stone consolidated all the PersistentBayes classes into a Bayes
class, it seemed (to me) like all the things called "Bayes" should be
found there.

Having gotten home and ingested some cabbage pie, I think the classifier
module is fine, and we should instead rename the Bayes module to
something like "Persistent".  In the future, conceivably, there may be
another non-Bayesian classifier that we will still want to wrap with our
cool persistence classes.  So the misnomer is the Bayes module, not the
classifier module.

I think I'll have humble pie for dessert ;)

Neale

From tim_one@email.msn.com  Wed Nov 20 05:01:51 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 20 Nov 2002 00:01:51 -0500
Subject: [Spambayes] Better optimization loop
In-Reply-To: <3DDA7916.5010102@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>

[Tim]
>> Good observation!  That should help.  simplex isn't fast in the best of
>> cases, and in this case ...

[Rob Hooft]
> Anyone that has a faster optimization algorithm lying around is welcome
> to replace my Simplex code.

Twasn't a criticism, just an observation about downhill Simplex, in anyone's
implementation.  Multidimensional optimization is a darned hard problem, and
this approach is at least pretty robust.

>>> To that goal I added a "delayed" flexcost to the CostCounter module
>>> that can use the optimal cutoffs calculated at the end of timcv.py.

>> Those can be pretty extreme; e.g., I've seen it suggest ham_cutoff of
>> 0.99 and spam_cutoff of 0.995 to get rid of "impossible" FP.

> They are in any case better than any other alternative I could think of.
> But if you disagree, you can change the order in which the
> CostCounter.default() builds up the cost counters; the optimization
> always uses the last one.

I don't disagree.  The point was that the "optimal cutoffs" are *also*
working like mad to accommodate outliers at the expense of everything else.
So long as FP are viewed as an approximation to the end of the world, all
attempts to optimize settings are going to focus on them.

> ...
> Very similar to my case. I'm seriously thinking about removing the
> "hopeless" and "almost hopeless" messages from my corpses. I agree with
> the bayesian statistics that they can't be correctly classified.

Whether it's a good idea to remove them depends on the goal <wink>.  I keep
mine in my test data so that the error rates reflect real life.  But there
are about 10 ham in my c.l.py data I simply don't care about, and it doesn't
bother me a bit if they pop back into my FP set (indeed, the last few rounds
of changes boosted my c.l.py total from 1 FP to 3 FP -- BFD!  FP Happen, and
the last few rounds of changes had helpful effects on almost everything
else).  In that sense, it's wholly unrealistic (but perhaps pragmatically
necessary) to say that each FP (and FN, and Unsure) has exactly the same
cost as every other.  Some FP simply don't matter, while others matter a
lot.  Likewise, I find some kinds of spam much more irritating than others,
and although my c.l.py data has no FN remaining, there are about 50 spam
there I really enjoy so I'd like to penalize the system for not letting me
see them <wink>.

> ...
> Press et al. report about a "robust fit", which is not a least squares
> but a least absolute deviates fit. It is insensitive to outliers.
> Is there an analog idea for us?

I don't know, but am not sanguine:  there's a specific cost function we're
trying to minimize, and despite that it's unrealistic it's better than
nothing.  Introducing this cost measure was a real help!  Trying to squeeze
the last penny out of it probably isn't, though -- it's not that good a
model of reality.  It does *generally* help us by saying FP are worse than
FN are worse than Unsure, and attaching a concrete figure of merit to that
aggregate judgment, but I don't take that number as more than an indicator
where "a lot smaller is better".  Small changes in it don't bother or cheer
me.
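
A rough sketch of the kind of cost measure under discussion, to show why
optimal cutoffs chase outliers.  The $10 FP cost comes from this thread;
the $1 FN and $0.20 Unsure costs, the function name and the toy scores are
assumptions for illustration only:

```python
def total_cost(ham_scores, spam_scores, ham_cutoff, spam_cutoff,
               fp_cost=10.0, fn_cost=1.0, unsure_cost=0.2):
    """Sum the cost of every FP, FN and Unsure for the given cutoffs."""
    cost = 0.0
    for score in ham_scores:
        if score >= spam_cutoff:
            cost += fp_cost          # ham classified as spam
        elif score > ham_cutoff:
            cost += unsure_cost      # ham left in the middle ground
    for score in spam_scores:
        if score <= ham_cutoff:
            cost += fn_cost          # spam classified as ham
        elif score < spam_cutoff:
            cost += unsure_cost      # spam left in the middle ground
    return cost


ham = [0.01, 0.30, 0.97]    # one ham outlier scoring 0.97
spam = [0.99, 0.60, 0.05]

cost_loose = total_cost(ham, spam, ham_cutoff=0.2, spam_cutoff=0.90)
cost_extreme = total_cost(ham, spam, ham_cutoff=0.2, spam_cutoff=0.98)
print(cost_loose, cost_extreme)
```

Raising spam_cutoff just far enough to turn the single ham outlier from an
FP into an Unsure cuts the total cost sharply, which is the pull toward
extreme cutoffs described above.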

> ...
> Further results I obtained: My idea of running with an fp cost of $2 and
> a square cost function didn't work. It doesn't optimize to a consistent
> position. Increasing the cost of an fp back to $10 and running with the
> same square function did do a reasonable job, it optimized to:
>
> [Classifier]
> unknown_word_prob = 0.520415
> minimum_prob_strength = 0.315104
> unknown_word_strength = 0.215393
>
> So the unknown_word_prob is now back to 0.5 again!

More, I bet 0.52 is closer to the true unknown-word probability in your data
(take all the words that have appeared at least, say, 5 times, and average
their spamprobs; that's about the best guess we can make for the spamprob of
a word we see for the first time; in the three corpora I measured this on,
0.52 was the smallest empirical value I saw).  The other two act to look
only at very extreme words, and to keep words extreme longer in the face of
contrary evidence (a hapax is strong enough to survive minimum_prob_strength
of 0.3 even with s at the default 0.45; they're even more extreme at s
0.22).  Guessing "the true spamprob" may have room for improvement.  OTOH,
if you have more ham than spam, then x=0.52 is acting to make things "less
hammy", and a benefit may come from that.  In that case, enabling the new
ham/spam imbalance adjustment option may help even more.
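
The empirical guess for the unknown-word probability can be sketched as
below.  It reads "appeared at least 5 times" as hamcount + spamcount >= 5
and averages by-counting spamprobs; the function name and the toy counts
are hypothetical:

```python
def estimate_unknown_word_prob(wordinfo, nham, nspam, min_count=5):
    """Average the by-counting spamprobs of all reasonably common words.

    wordinfo maps word -> (hamcount, spamcount); nham and nspam are the
    numbers of trained ham and spam messages.
    """
    probs = []
    for hamcount, spamcount in wordinfo.values():
        if hamcount + spamcount < min_count:
            continue                      # too rare to trust
        hamratio = hamcount / float(nham)
        spamratio = spamcount / float(nspam)
        probs.append(spamratio / (hamratio + spamratio))
    if not probs:
        return 0.5                        # nothing to go on; stay neutral
    return sum(probs) / len(probs)


toy_counts = {
    "python": (9, 1),
    "free": (2, 8),
    "the": (10, 10),
    "rare": (1, 0),      # skipped: seen fewer than 5 times
}
x_estimate = estimate_unknown_word_prob(toy_counts, nham=10, nspam=10)
print(x_estimate)
```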


From tim@fourstonesExpressions.com  Wed Nov 20 05:06:56 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 23:06:56 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53d6p026o3.fsf@woozle.org>
Message-ID: <DAY9KILI96DALH7343215YXWE9A0FD.3ddb1870@riven>

11/19/2002 10:42:36 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, "Tim Peters" <tim_one@email.msn.com> is all like:
>
>> Heresy is fine, but I don't understand what the goal of this is.
>>
>> [snip]
>>
>> Taking the classifier class out of classifier.py doesn't make sense to
>> me on the face of it, but maybe it would if I understood the goal.
>
>Right now we have only one classifier, a Bayesian classifier, so when
>Tim Stone consolidated all the PersistentBayes classes into a Bayes
>class, it seemed (to me) like all the things called "Bayes" should be
>found there.
>
>Having gotten home and ingested some cabbage pie, I think the classifier
>module is fine, and we should instead rename the Bayes module to
>something like "Persistent".  In the future, conceivably, there may be
>another non-Bayesian classifier that we will still want to wrap with our
>cool persistence classes.  So the misnomer is the Bayes module, not the
>classifier module.

How about PersistentClassifier?

>
>I think I'll have humble pie for dessert ;)
>
>Neale
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Wed Nov 20 05:18:45 2002
From: neale@woozle.org (Neale Pickett)
Date: 19 Nov 2002 21:18:45 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <DAY9KILI96DALH7343215YXWE9A0FD.3ddb1870@riven>
References: <DAY9KILI96DALH7343215YXWE9A0FD.3ddb1870@riven>
Message-ID: <w53vg2szumi.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> How about PersistentClassifier?

Yech.  Since the things are kinda doing what the standard shelve module
does, and we keep calling them "stores", how about "store"?


From tim@fourstonesExpressions.com  Wed Nov 20 05:28:55 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 23:28:55 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w538yzo25f4.fsf@woozle.org>
Message-ID: <SNA9YT06TPZWSMGBF1WC8ZT286X3.3ddb1d97@riven>

11/19/2002 11:09:35 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> Neale, I just checked in dbdict and Bayes.  Lemme know what you think.  
>
>Okay, so you're just copying the file and then renaming it later.  It
>looks like you're trying to wrap the dbm file with a transactional
>model.  Copying isn't an atomic operation though, so locking will be a
>problem.  

See below... I don't think Richie is so much after transactionality as 'forget 
it' mode.

I agree that locking is a problem.  I don't like the implementation too 
much... I experimented with keeping an in-memory cache, but that makes memory 
consumption hard to manage.  These bayes databases might get kinda large... So 
I figured I'd let the dbm implementation manage memory.  If it's too stupid to 
do a good job, then we (someone) should fix that.

Perhaps in the long run, ZODB is the final answer.  But pickles in particular 
are so portable... dbm files are so fast... different strokes for different 
folks, I guess.
 
>
>I still don't understand why a DBDict needs load/store.  It'd be so much
>easier to just have store() call self.db.sync() and make load() a noop.  Is
>there something out there which depends on the disk version being
>different from the memory version?

As nearly as I can tell, the dbm implementations vary on when they write stuff 
to persistent storage.  Sync only offers the guarantee that the memory and 
persistent versions match.  Richie has presented the requirement that the 
dictionary be able to forget what has happened...
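
For comparison, here is a sketch of the simpler design Neale suggests, in
which store() only syncs and load() does nothing.  It uses the standard
shelve module as a stand-in for dbdict; the class name SyncingStore and the
file name are hypothetical:

```python
import os
import shelve
import tempfile


class SyncingStore(object):
    """Hypothetical dict-like store backed by a shelve/dbm file."""

    def __init__(self, path):
        self.db = shelve.open(path)

    def load(self):
        # No-op: reads always go straight to the underlying dbm file.
        pass

    def store(self):
        # Ask the dbm layer to flush whatever it has not written yet.
        self.db.sync()

    def __getitem__(self, key):
        return self.db[key]

    def __setitem__(self, key, value):
        self.db[key] = value

    def close(self):
        self.db.close()


path = os.path.join(tempfile.mkdtemp(), "bayes_demo")
store = SyncingStore(path)
store["viagra"] = (0, 7)      # (hamcount, spamcount) for one word
store.store()
store.close()

reopened = SyncingStore(path)
count = reopened["viagra"]
reopened.close()
print(count)
```

Nothing in this design can forget: a write may reach the disk before
store() is ever called, depending on the dbm implementation, so discarding
changes needs something extra, such as working on a copy of the file.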

>
>> Also, I tried pop3proxy with the playground branch, and it doesn't
>> work.  It looks like we got a back level of Options.py.  I'm not sure
>> how to get it up to snuff...
>
>There was a thinko in pop3proxy, but now I'm getting a weird
>AssertionError.  Is this something with the Persistence classes, maybe?
>It looks like nspam isn't getting updated:
>
>Traceback (most recent call last):
>  File "/usr/lib/python2.3/threading.py", line 410, in __bootstrap
>    self.run()
>  File "/usr/lib/python2.3/threading.py", line 398, in run
>    apply(self.__target, self.__args, self.__kwargs)
>  File "./pop3proxy.py", line 1306, in runProxy
>    state.bayes.learn(tokenizer.tokenize(spam1), True)
>  File "classifier.py", line 298, in learn
>    self.update_probabilities()
>  File "classifier.py", line 345, in update_probabilities
>    assert spamcount <= nspam
>AssertionError
>
>Workin' on it.
>
>
- Tim
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Wed Nov 20 05:32:19 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 19 Nov 2002 23:32:19 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53vg2szumi.fsf@woozle.org>
Message-ID: <ZURNUP732UUS73IHID4VQBATR8731C7.3ddb1e63@riven>

persistent might be better.  This gives us:

class PersistentBayes(classifier.Bayes):

class DBDictBayes(PersistentBayes):

bayes = persistent.DBDictBayes('mydb')

11/19/2002 11:18:45 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> How about PersistentClassifier?
>
>Yech.  Since the things are kinda doing what the standard shelve module
>does, and we keep calling them "stores", how about "store"?
>
>
>
- Tim
www.fourstonesExpressions.com 


From Paul.Moore@atosorigin.com  Wed Nov 20 09:04:18 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 20 Nov 2002 09:04:18 -0000
Subject: [Spambayes] Outlook weirdness
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619944@UKDCX001.uk.int.atosorigin.com>

This morning I started Outlook. I hadn't upgraded spambayes - it's the
same version as yesterday. But the training data was completely gone!
The manager dialog said that there was no training data, and filters
were disabled.

But the pickle was there, and yesterday everything was working fine.
And even stranger, while I was watching a message came in and got
filed in "Unsure".

The only thing I can think of which may be relevant is that after I
had shutdown Outlook cleanly (or so it looked) last night, when I shut
my machine down, I got a message saying Outlook was not responding
and was being closed down. Looks like some form of rogue instance of
Outlook... Whether that had an effect, I don't know.

I'm sure all of these strange effects I get are related to my using
Exchange with Active Directory as my server, but I've no idea how to
diagnose them :-(

Paul.

From mhammond@skippinet.com.au  Wed Nov 20 09:43:32 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 20 Nov 2002 20:43:32 +1100
Subject: [Spambayes] Training from scratch
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEJOCNAB.tim_one@email.msn.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPEEPGHMAA.mhammond@skippinet.com.au>

> The Outlook client doesn't allow you to train before specifying
> at least one
> ham and one spam folder.  That doesn't stop a determined idiot from
> specifying empty folders, though.

It doesn't let you enable filtering until there are at least 5 ham and 5
spam in the database though.

Not sure why I bothered with that - you should never underestimate the
determination of an idiot <wink>

Mark.


From msergeant@startechgroup.co.uk  Wed Nov 20 10:22:35 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 20 Nov 2002 10:22:35 +0000
Subject: [Spambayes] Offtopic - getting bounce messages for spam
References: <n2m-g.of8lcqqi.fsf@morpheus.demon.co.uk>
Message-ID: <3DDB626B.90703@startechgroup.co.uk>

Paul Moore said the following on 19/11/02 19:19:
> Sorry, this is offtopic, but I'm hoping that the concentration of spam
> experts on this group may be able to help me.
> 
> I've just started receiving undeliverable message reports for spam,
> sent to people I've never heard of. It looks to me like someone is
> managing to impersonate me when they send spam out. I'm fairly sure
> I'm not running an open relay (is there a way of checking for
> certain?), so I guess someone is spoofing headers or something. I've
> heard of this sort of thing before, but never experienced this myself.

It's called a Joe-Job. I get these *all* the time. See the Spam-L FAQ 
(google will find it for you) for details.

> Two questions, really:
> 
> a) Is this something I should worry about (am I likely to end up on
>    blacklists or the like)?
> b) What can I do about it in any case?

There's little you can do except try and detect it and dump the mails, 
unless you want to spend the effort finding which relay it came 
through/from and trying to get the IP added to various DNSBL's. Though 
the problem is you have to know what DNSBL's the mail server that the 
bounce is coming from uses.

Generally I ignore them - they tend to last a day or so before moving on 
to annoy someone else.

Matt.


From richie@entrian.com  Wed Nov 20 10:28:52 2002
From: richie@entrian.com (richie@entrian.com)
Date: Wed, 20 Nov 2002 10:28:52 +0000
Subject: [Spambayes] Re: proposed changes to hammie & co.
Message-ID: <E18ES6G-00074e-0U@anchor-post-35.mail.demon.net>


[Tim Stone]
> Richie has presented the requirement that the 
> dictionary be able to forget what has happened...

This isn't a huge requirement - it's nice that the pop3proxy's test code
doesn't write anything to the disk, but that's now been achieved by losing
__del__.  I mentioned that people might want to do speculative training and
not save the results, but that can always be achieved by specifying a
temporary DB name on the command line.  The ability to forget is a nice-to-
have, not a requirement.  Quoting myself: "I'd much rather have an explicit
store() method and document the fact that storage may be pre-empted by
certain implementations."

-- 
Richie Hindle
richie@entrian.com


From msergeant@startechgroup.co.uk  Wed Nov 20 10:25:54 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 20 Nov 2002 10:25:54 +0000
Subject: [Spambayes] re: ciphertrust
References: <3DDA711E.25949.343B3AFE@localhost>
Message-ID: <3DDB6332.3040406@startechgroup.co.uk>

Brad Clements said the following on 19/11/02 22:17:
> http://www.ciphertrust.com/ironmail/anti-spam.htm
> 
> 
> Is this it? I can't really understand it through all the marketing speak.

I met with these guys a few weeks ago. Basically it's a custom rule set. 
A bit like SpamAssassin. They use customer feedback to expand their 
ruleset. They also use Razor and I think some DNSBL's.

Let me know offline if you want any more info.

Matt.


From piersh@friskit.com  Wed Nov 20 10:57:26 2002
From: piersh@friskit.com (Piers Haken)
Date: Wed, 20 Nov 2002 02:57:26 -0800
Subject: [Spambayes] Outlook weirdness
Message-ID: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com>

I have seen this a couple of times, too. I have noticed (by watching
PythonWin) that Outlook can take some time to actually shutdown, while
saving the db, after the UI has been closed. I have 3500:2050 ham:spam.

Piers.

> -----Original Message-----
> From: Moore, Paul [mailto:Paul.Moore@atosorigin.com]
> Sent: Wednesday, November 20, 2002 1:04 AM
> To: Spambayes (E-mail)
> Subject: [Spambayes] Outlook weirdness
>
>
> This morning I started Outlook. I hadn't upgraded spambayes -
> it's the same version as yesterday. But the training data was
> completely gone! The manager dialog said that there was no
> training data, and filters were disabled.
>
> But the pickle was there, and yesterday everything was
> working fine. And even stranger, while I was watching a
> message came in and got filed in "Unsure".
>
> The only thing I can think of which may be relevant is that
> after I had shutdown Outlook cleanly (or so it looked) last
> night, when I shut my machine down, I got a message saying
> Outlook was not responding and was being closed down. Looks
> like some form of rogue instance of Outlook... Whether that
> had an effect, I don't know.
>
> I'm sure all of these strange effects I get are related to my
> using Exchange with Active Directory as my server, but I've
> no idea how to diagnose them :-(
>
> Paul.
>
> _______________________________________________
> Spambayes mailing list
> Spambayes@python.org http://mail.python.org/mailman/listinfo/spambayes
>

From rob@hooft.net  Wed Nov 20 12:17:01 2002
From: rob@hooft.net (Rob W.W. Hooft)
Date: Wed, 20 Nov 2002 13:17:01 +0100
Subject: [Spambayes] Better optimization loop
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
Message-ID: <3DDB7D3D.1020306@hooft.net>

Tim Peters wrote:
> [Tim]

>>>Good observation!  That should help.  simplex isn't fast in the best of
>>>cases, and in this case ...

> [Rob Hooft]

>>Anyone that has a faster optimization algorithm lying around is welcome
>>to replace my Simplex code.

[Tim]

> Twasn't a criticism, just an observation about downhill Simplex, in anyone's
> implementation.  Multidimensional optimization is a darned hard problem, and
> this approach is at least pretty robust.

It wasn't anger, it was a genuine invitation.... ;-) I'm running these 
tests, and they are taking daaaayysss, so I really welcome anyone that 
has alternatives.

One alternative I thought of is to keep the wordcounts lying around, and 
only call the update function once before starting scoring. But I'm not 
sure I would be the best person to try that (read: I'm sure someone else 
can do that 10x faster than I can).

Another speedup I could use is a version of Bayes that calculates the 
spamprob from the numbers on demand instead of calculating them for all 
words every time. This pays off in all cases where the training batch is 
very small (~1 message).
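
A sketch of what such an on-demand variant might look like: training only
touches counts, and a word's spamprob is worked out when it is asked for.
The class name LazyBayes, its attributes and the adjustment formula
(s*x + n*p) / (s + n) with s=0.45, x=0.5 are assumptions, not the existing
classifier code:

```python
class LazyBayes(object):
    """Hypothetical classifier core that computes spamprobs on demand."""

    def __init__(self, s=0.45, x=0.5):
        self.s = s              # strength given to the prior guess
        self.x = x              # spamprob assumed for unknown words
        self.nham = 0
        self.nspam = 0
        self.counts = {}        # word -> [hamcount, spamcount]

    def learn(self, words, is_spam):
        # Training is cheap: bump counters, never recompute every word.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[1 if is_spam else 0] += 1

    def spamprob(self, word):
        record = self.counts.get(word)
        if record is None:
            return self.x
        hamcount, spamcount = record
        hamratio = hamcount / float(self.nham or 1)
        spamratio = spamcount / float(self.nspam or 1)
        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        return (self.s * self.x + n * prob) / (self.s + n)


lazy = LazyBayes()
lazy.learn(["cheap", "pills"], True)
lazy.learn(["meeting", "pills"], False)
print(lazy.spamprob("cheap"), lazy.spamprob("pills"), lazy.spamprob("new"))
```

The trade-off is that scoring a large batch repeats the arithmetic for
frequently seen words, so a small cache may be worth adding when training
batches are large.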

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie@entrian.com  Wed Nov 20 12:46:20 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 12:46:20 +0000
Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts
In-Reply-To: <PL41LI3WZVVSMH3174ZYWGD9B8PN85.3dda24c0@riven>
References: <E18Dh1N-0001Tb-0U@anchor-post-35.mail.demon.net>
	<PL41LI3WZVVSMH3174ZYWGD9B8PN85.3dda24c0@riven>
Message-ID: <f10ntuklm4ghuofiu6hom8e3ceqpkqvuv8@4ax.com>


[Tim Stone]
> One thing to think about... I'm going to be running more than one of these, 
> because I have several pop3 accounts.  Would it be possible to make pop3proxy 
> configurable to proxy more than one pop3 account?  I'd like to share the same 
> bayes database between them all...

This is now done.  It creates a listening port for each account [I don't
like the popular idea of munging the POP3 username to include the hostname,
because it complicates the proxy - simple is good].  See Options.py for the
new ini-file settings - you give it a list of POP3 servers and a
corresponding list of local ports to listen on.  The old settings still
work but are deprecated and give a warning.  It passes this test:

[pop3proxy]
# Evil self-proxying test.
pop3proxy_servers: localhost:8110, localhost:111
pop3proxy_ports: 111, 110

where localhost:8110 is a local test POP3 server a la "pop3proxy.py -t".
This makes it run two proxies, one pointing at the other.  The messages
come back classified twice (because it doesn't yet strip existing
X-Hammie-Disposition headers - must fix that 8-)
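
For readers who want to see how the two lists pair up, here is a small
sketch that parses the settings above with the standard configparser
module.  It is not the Options.py code; the function name, the tuple layout
and the default remote port of 110 are assumptions:

```python
import configparser

INI_TEXT = """
[pop3proxy]
# Evil self-proxying test.
pop3proxy_servers: localhost:8110, localhost:111
pop3proxy_ports: 111, 110
"""


def parse_proxy_pairs(ini_text):
    """Return a list of (local_port, remote_host, remote_port) tuples."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    raw_servers = parser.get("pop3proxy", "pop3proxy_servers")
    raw_ports = parser.get("pop3proxy", "pop3proxy_ports")
    servers = [item.strip() for item in raw_servers.split(",")]
    ports = [int(item) for item in raw_ports.split(",")]
    if len(servers) != len(ports):
        raise ValueError("need exactly one local port per POP3 server")
    pairs = []
    for server, local_port in zip(servers, ports):
        host, _, remote_port = server.partition(":")
        # Assume the standard POP3 port when none is given.
        pairs.append((local_port, host, int(remote_port or 110)))
    return pairs


pairs = parse_proxy_pairs(INI_TEXT)
print(pairs)
```

In the self-proxying test, the proxy listening on 110 forwards to the one
on 111, which in turn forwards to the test server on 8110.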

And this from a single-threaded program.  Sometimes asyncore winds me up,
but sometimes I really have to take my hat off to it.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Wed Nov 20 12:46:16 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 12:46:16 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <15834.15618.465106.671756@montanaro.dyndns.org>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
Message-ID: <gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>

Hi Skip,

> That suggests to me that you need to group messages together based upon
> their initial classification.  That way [...] they are clumped:
> 
>     D   H   S   U
>                 *
>             *
>             *
>             *
>         *
>         *

Excellent plan.  I've just committed this, with a heading between each
clump.  For those not using the proxy, there's a mockup at
http://www.entrian.com/review3.html

> The background color should probably alternate between light grey and white

Implemented in the very first version using the time machine - thanks for
the suggestion!  8-)

-- 
Richie Hindle
richie@entrian.com


From seant@iname.com  Wed Nov 20 12:46:23 2002
From: seant@iname.com (Sean True)
Date: Wed, 20 Nov 2002 07:46:23 -0500
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com>
Message-ID: <MJEHLHJKGINLONDMMKNEKEGCHIAA.seant@iname.com>

>
> I have seen this a couple of times, too. I have noticed (by watching
> PythonWin) that Outlook can take some time to actually shutdown, while
> saving the db, after the UI has been closed. I have 3500:2050 ham:spam.

I have a pretty persistent Outlook shutdown problem. I have a 6K Spam, 7K
Ham training set, and an Outlook that commonly uses 40-50MB of memory.
Often when I close Outlook, it will stay in memory (leaving an icon in the
task bar, too). When I "restart" it, meaning, I think, restart the UI, I
get no Spam manager icons, even though the addin is still running
cheerfully and filtering.

For a while I thought this happened only after the addin had thrown an
exception, but that does not appear to be the case. I am suspicious of
interactions with my virus scanner (McAfee), but have had problems even
with McAfee disabled. I haven't been aggressive enough to try uninstalling
McAfee -- I'd rather give up the addin, much as I like it!

> The only thing I can think of which may be relevant is that
> after I had shutdown Outlook cleanly (or so it looked) last
> night, when I shut my machine down, I got a message saying
> Outlook was not responding and was being closed down. Looks
> like some form of rogue instance of Outlook... Whether that
> had an effect, I don't know.
>

If that happens again, take a look at the task manager, and see if it is
really gone. If it's not, I find that terminating the task with prejudice
from task manager usually kills it, at the expense of a very long start up
as Outlook reconstructs the mailbox database (I think).

> I'm sure all of these strange effects I get are related to my
> using Exchange with Active Directory as my server, but I've
> no idea how to diagnose them :-(

I run in a pure internet environment (pop3 servers only). Outlook is a very
difficult and unforgiving platform to work with (thanks, Mark!) even without
Exchange in the picture.

-- Sean


From jeremy@alum.mit.edu  Wed Nov 20 12:52:22 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Wed, 20 Nov 2002 07:52:22 -0500
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <SNA9YT06TPZWSMGBF1WC8ZT286X3.3ddb1d97@riven>
References: <w538yzo25f4.fsf@woozle.org>
	<SNA9YT06TPZWSMGBF1WC8ZT286X3.3ddb1d97@riven>
Message-ID: <15835.34182.266661.912333@slothrop.zope.com>

>>>>> "TS" == Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

  TS> Perhaps in the long run, ZODB is the final answer.  But pickles
  TS> in particular are so portable... dbm files are so
  TS> fast... different strokes for different folks, I guess.

ZODB uses pickles, so is about as portable.  Don't know how its speed
compares to dbm files, but would expect it's roughly comparable.  What
are your concerns other than portability and as-fast-as-dbm-files?  The
advantage of ZODB is that it takes about two dozen lines of code to
make the classifier persistent.

One likely concern is that your users have to install ZODB or you
have to package it for them.

  >> I still don't understand why a DBDict needs load/store.  It'd be
  >> so much easier to just have store() call self.db.sync() and make
  >> load() a noop.  Is there something out there which depends on the
  >> disk version being different from the memory version?

  TS> As nearly as I can tell, the dbm implementations vary on when
  TS> they write stuff to persistent storage.  Sync only offers the
  TS> guarantee that the memory and persistent versions match.  Richie
  TS> has presented the requirement that the dictionary be able to
  TS> forget what has happened...

Another advantage of ZODB is that it's transactional, which makes it
possible to forget what has happened.  It also makes it possible for
multiple processes to share the database in a sane way.
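
To make the load/store question concrete, here is a minimal sketch of a
dictionary that can "forget what has happened".  It is not hammie's DBDict
and it is not ZODB: the class name ForgetfulDict and its methods are
invented for illustration, and a plain dict stands in for the dbm file.
Changes are buffered in memory; store() commits them and load() throws
them away.

```python
class ForgetfulDict:
    """Buffer changes in memory until store(); load() throws them away.

    Hypothetical illustration only: `backing` stands in for a dbm file.
    """

    _DELETED = object()  # marks keys removed since the last store()

    def __init__(self, backing):
        self.backing = backing   # the "on disk" mapping
        self.pending = {}        # uncommitted writes and deletes

    def __getitem__(self, key):
        if key in self.pending:
            value = self.pending[key]
            if value is self._DELETED:
                raise KeyError(key)
            return value
        return self.backing[key]

    def __setitem__(self, key, value):
        self.pending[key] = value

    def __delitem__(self, key):
        self[key]  # raises KeyError if the key is absent
        self.pending[key] = self._DELETED

    def store(self):
        """Commit pending changes to the backing mapping."""
        for key, value in self.pending.items():
            if value is self._DELETED:
                self.backing.pop(key, None)
            else:
                self.backing[key] = value
        self.pending.clear()

    def load(self):
        """Forget everything done since the last store()."""
        self.pending.clear()


disk = {"viagra": 10}
words = ForgetfulDict(disk)
words["viagra"] = 11
words.load()                  # roll back the uncommitted change
rolled_back = words["viagra"]
words["python"] = 3
words.store()                 # commit
```

With a design like this, sync() alone is not enough, because the point is
that the in-memory and on-disk versions are allowed to differ until
store() is called.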

Jeremy


From fgranger@teleprosoft.com  Wed Nov 20 11:58:41 2002
From: fgranger@teleprosoft.com (François Granger)
Date: Wed, 20 Nov 2002 12:58:41 +0100
Subject: [Spambayes] Another soft for the collection
Message-ID: <BA013781.5CED5%fgranger@teleprosoft.com>

I did not see it on the web page:
http://spambayes.sourceforge.net/related.html

I got it from:
http://db.tidbits.com/getbits.acgi?tbart=06994

The site of the product

http://www.c-command.com/spamsieve/

Salutations,
Francois Granger
-- 
fgranger@teleprosoft.com - <http://www.teleprosoft.com>
tel: +33 1 41 88 48 00 - Fax: + 33 1 41 88 48 48


From jeremy@alum.mit.edu  Wed Nov 20 13:13:26 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Wed, 20 Nov 2002 08:13:26 -0500
Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts
In-Reply-To: <f10ntuklm4ghuofiu6hom8e3ceqpkqvuv8@4ax.com>
References: <E18Dh1N-0001Tb-0U@anchor-post-35.mail.demon.net>
	<PL41LI3WZVVSMH3174ZYWGD9B8PN85.3dda24c0@riven>
	<f10ntuklm4ghuofiu6hom8e3ceqpkqvuv8@4ax.com>
Message-ID: <15835.35446.435871.609607@slothrop.zope.com>

>>>>> "RH" == Richie Hindle <richie@entrian.com> writes:

  RH> This is now done.  It creates a listening port for each account
  RH> [I don't like the popular idea of munging the POP3 username to
  RH> include the hostname, because it complicates the proxy - simple
  RH> is good].

Oh, I see you're aware of the approach.  It's a trivial amount of code
in the proxy.  You're already using asyncore so you can't really be
worried about complexity <wink>.

It's much easier for a user to understand what's going on when the pop
client's configuration has some mention of the name of the real pop
server.  I started off using two different servers on two different
ports, but my client never gave me any indication of where the mail
came from.  With the new change, the status line tells me what server
it's getting mail from.

The implementation is this short.  It's mostly parsing the user name
into name, host, port.

    def read_user(self):
        # XXX This could be cleaned up a bit.
        line = self.rfile.readline()
        if line == "":
            return False
        parts = line.split()
        if parts[0] != "USER":
            self.wfile.write("-ERR Invalid command; must specify USER first\r\n")
            return False
        user = parts[1]
        i = user.rfind("@")
        username = user[:i]
        server = user[i+1:]
        i = server.find(":")
        if i == -1:
            server = server, 110
        else:
            port = int(server[i+1:])
            server = server[:i], port
        zLOG.LOG("POP3", zLOG.INFO, "Got connect for %s" % repr(server))
        self.connect_pop(server)
        self.pop_wfile.write("USER %s\r\n" % username)
        resp = self.pop_rfile.readline()
        # As long as the server responds OK, just swallow this response.
        if resp.startswith("+OK"):
            return True
        else:
            return False
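
The parsing step can also be pulled out and exercised on its own.  This is
a sketch, not part of the proxy: the function name parse_proxy_user and the
example addresses are made up.  It keeps the rfind("@") idea from the
method above, so a user name that itself contains "@" survives intact.

```python
def parse_proxy_user(user, default_port=110):
    """Split 'name@host[:port]' into (username, host, port).

    Hypothetical helper, not part of pop3proxy.  rfind() is used so that
    a username which itself contains '@' is kept intact.
    """
    i = user.rfind("@")
    if i == -1:
        raise ValueError("expected name@host[:port], got %r" % user)
    username, server = user[:i], user[i + 1:]
    host, sep, port = server.partition(":")
    if not host:
        raise ValueError("missing host in %r" % user)
    return username, host, int(port) if sep else default_port


plain = parse_proxy_user("jeremy@pop.example.com")
nested = parse_proxy_user("me@isp.example@mail.example.com:995")
```

Here plain is ("jeremy", "pop.example.com", 110) and nested is
("me@isp.example", "mail.example.com", 995).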

Jeremy


From Paul.Moore@atosorigin.com  Wed Nov 20 13:15:50 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 20 Nov 2002 13:15:50 -0000
Subject: [Spambayes] Outlook weirdness
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2DEC@UKDCX001.uk.int.atosorigin.com>

From: Sean True [mailto:seant@iname.com]
> I have a pretty persistent Outlook shutdown problem. I have a
> 6K Spam, 7K Ham training set, and an Outlook that commonly uses
> 40-50MB of memory. Often when I close Outlook, it will stay in
> memory. (Leaving an icon in the task bar, too). When I "restart" it,
> meaning, I think, restart the UI, I get no Spam manager icons, even
> though the addin is still running cheerfully and filtering.

Outlook has two exit options "Exit" and "Exit and log off". I've never
quite understood the difference, but I wonder if what you're seeing is
related - Outlook finishing, closing down the UI, but then remaining
for ages in the background while the addin saves the pickles and
tidies up. This would also tie in with the high memory footprint, as
the pickle method keeps the database in memory.

Would it be worth trying the DBM format for the database? I think
this would give faster startup/shutdown times, and lower memory
consumption, at the expense of on-disk database size and slower
filtering (although I doubt that this difference would be an issue).
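
The trade-off can be seen in miniature with the standard library.  This is
only a sketch: the word counts are invented, and shelve stands in for
whichever DBM wrapper the project ends up using.  The pickle has to be
loaded whole before a single word can be looked up, while the DBM file is
read one record at a time.

```python
import os
import pickle
import shelve
import tempfile

# Made-up (spamcount, hamcount) pairs, for illustration only.
counts = {"free": (120, 3), "python": (1, 450), "meeting": (2, 300)}

with tempfile.TemporaryDirectory() as tmp:
    # Pickle: one file, read and written as a whole.
    pickle_path = os.path.join(tmp, "words.pik")
    with open(pickle_path, "wb") as f:
        pickle.dump(counts, f)
    with open(pickle_path, "rb") as f:
        everything = pickle.load(f)      # the full dict is now in memory
    from_pickle = everything["python"]

    # DBM (via shelve): records are fetched from disk one key at a time.
    dbm_path = os.path.join(tmp, "words.db")
    with shelve.open(dbm_path) as db:
        db.update(counts)
    with shelve.open(dbm_path) as db:
        from_dbm = db["python"]          # only this record is unpickled
```

Both lookups return the same record; what differs is how much has to be
read at startup and held in memory while the program runs.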

Unfortunately, the addin doesn't hook into the main code's persistence
structure (as far as I can see) so switching formats isn't as simple
as just changing the INI file. I'll look into it and give it a try at
some point.

Paul.

From richie@entrian.com  Wed Nov 20 13:50:29 2002
From: richie@entrian.com (richie@entrian.com)
Date: Wed, 20 Nov 2002 13:50:29 +0000
Subject: [Spambayes] pop3proxy now supports multiple POP3 accounts
In-Reply-To: <15835.35446.435871.609607@slothrop.zope.com>
Message-ID: <E18EVFN-0002b3-0U@anchor-post-39.mail.demon.net>

Hi Jeremy,

> Oh, I see you're aware of the approach.  It's a trivial amount of
> code in the proxy.

It's not so much that - I've since realised that it simply won't work
with some email clients.  As your code says:

if parts[0] != "USER":
    self.wfile.write("-ERR Invalid command; must specify USER first")

but that's not always obeyed.  For instance, RFC 2449 adds extensions
to POP3, including the CAPA command:

> The POP3 CAPA command returns a list of capabilities supported
> by the POP3 server.  It is available in both the AUTHORIZATION
> and TRANSACTION states.

meaning that the first command given by the client might be CAPA.
This gets you into a chicken-and-egg situation whereby you need to
proxy the CAPA command but you don't know which server to connect
to because you haven't seen the USER command yet.  I've seen this
in the real world - François Granger's client sends CAPA.  This is
also why pop3proxy will proxy unknown commands.
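
With one listening port per account, the proxy knows which server to talk
to as soon as the client connects, so CAPA (or anything else) can be
forwarded before USER arrives.  A rough sketch of that lookup follows.  The
function names are invented, and the comma-separated format assumed for the
pop3proxy_servers and pop3proxy_ports options is a guess for illustration,
not a description of what pop3proxy actually parses.

```python
def build_port_map(servers, ports, default_pop_port=110):
    """Map each local listening port to the (host, port) it proxies.

    Assumes `servers` looks like "pop.a.example,pop.b.example:995" and
    `ports` like "1110,1111"; that format is a guess, not the real one.
    """
    server_list = [s.strip() for s in servers.split(",") if s.strip()]
    port_list = [int(p) for p in ports.split(",") if p.strip()]
    if len(server_list) != len(port_list):
        raise ValueError("need exactly one local port per server")
    port_map = {}
    for local_port, server in zip(port_list, server_list):
        host, sep, remote = server.partition(":")
        port_map[local_port] = (host, int(remote) if sep else default_pop_port)
    return port_map


def upstream_for(port_map, local_port):
    """The upstream is known at connect time, before USER or CAPA arrives."""
    return port_map[local_port]


port_map = build_port_map("pop.a.example, pop.b.example:995", "1110, 1111")
first_hop = upstream_for(port_map, 1111)
```

The cost is that the user has to set up one account per local port in the
mail client, which is the usability point raised earlier in the thread.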

> You're already using asyncore so you can't really be worried
> about complexity <wink>.

(-8  .helps which, demand on backwards work to brain my rewired I've

-- 
Richie Hindle
richie@entrian.com


From papaDoc@videotron.ca  Wed Nov 20 14:09:34 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Wed, 20 Nov 2002 09:09:34 -0500
Subject: [Spambayes] New  pop3proxy options
Message-ID: <3DDB979E.8050604@videotron.ca>

    Hi,

This is my first contribution  ;-)

This is a patch for the Options.py
363,364c363,364
< pop3proxy_servers: ""
< pop3proxy_ports: ""
---
 > pop3proxy_servers:
 > pop3proxy_ports:
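
One possible reason the quotes matter, assuming the defaults are read by
something like the standard library's config parser (the section name
"pop3proxy" below is made up for the example): the parser does not strip
quotes, so `""` comes back as a two-character string rather than an empty
one.

```python
import configparser


def read_servers(ini_text):
    """Return the raw pop3proxy_servers value from an INI-style string."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    return parser.get("pop3proxy", "pop3proxy_servers")


# Section name is an assumption; only the option name comes from the patch.
quoted = read_servers('[pop3proxy]\npop3proxy_servers: ""\n')
empty = read_servers('[pop3proxy]\npop3proxy_servers:\n')
```

Here quoted is the literal two-character string of double quotes, which is
true in a boolean test and would look like a server name, while empty is
really the empty string.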


By the way, I'm trying to use pop3proxy with Mozilla 1.1. I'm creating a
new account which points to localhost:110. I can retrieve the messages and
they are scored, but I can't display the body; I see only the Subject line.
(When you click on the subject line, the body should be displayed in the
subwindow below in Mozilla's mail tools.) But when I look in the Inbox,
the messages are there with their bodies ?????


From bkc@murkworks.com  Wed Nov 20 14:47:35 2002
From: bkc@murkworks.com (Brad Clements)
Date: Wed, 20 Nov 2002 09:47:35 -0500
Subject: [Spambayes] re: ciphertrust
In-Reply-To: <3DDB6332.3040406@startechgroup.co.uk>
Message-ID: <3DDB5925.32059.37C5A44E@localhost>

On 20 Nov 2002 at 10:25, Matt Sergeant wrote:

> I met with these guys a few weeks ago. Basically it's a custom rule set. A
> bit like SpamAssassin. They use customer feedback to expand their ruleset.
> They also use Razor and I think some DNSBL's.
> 
> Let me know offline if you want any more info.

So much for heuristics..


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From bkc@murkworks.com  Wed Nov 20 14:53:18 2002
From: bkc@murkworks.com (Brad Clements)
Date: Wed, 20 Nov 2002 09:53:18 -0500
Subject: [Spambayes] Another soft for the collection
In-Reply-To: <BA013781.5CED5%fgranger@teleprosoft.com>
Message-ID: <3DDB5A7C.2661.37CADF75@localhost>

On 20 Nov 2002 at 12:58, François Granger wrote:

> The site of the product
> 
> http://www.c-command.com/spamsieve/

In their screen shot, under Corpus.

It shows 3778 unused words. Huh?


Brad Clements,                bkc@murkworks.com   (315)268-1000
http://www.murkworks.com                          (315)268-9812 Fax
AOL-IM: BKClements


From seant@iname.com  Wed Nov 20 14:53:13 2002
From: seant@iname.com (Sean True)
Date: Wed, 20 Nov 2002 09:53:13 -0500
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2DEC@UKDCX001.uk.int.atosorigin.com>
Message-ID: <MJEHLHJKGINLONDMMKNECEGKHIAA.seant@iname.com>

> [PAUL] Outlook has two exit options "Exit" and "Exit and log off".
> I've never quite understood the difference, but I wonder if what
> you're seeing is related - Outlook finishing, closing down the UI,
> but then remaining for ages in the background while the addin saves
> the pickles and tidies up. This would also tie in with the high
> memory footprint, as the pickle method keeps the database in memory.

I believe that the "Exit and log off" option is Exchange specific. I don't
get that. If no messages are trained, the database won't be dirty, and
doesn't get written. So, that's not the likely culprit.

>
> Would it be worth trying the DBM format for the database? I think
> this would give faster startup/shutdown times, and lower memory
> consumption, at the expense of on-disk database size and slower
> filtering (although I doubt that this difference would be an issue).
>
Slower *training* would be an issue, however.

> Unfortunately, the addin doesn't hook into the main code's persistence
> structure (as far as I can see) so switching formats isn't as simple
> as just changing the INI file. I'll look into it and give it a try at
> some point.
Brave guy. Me, I'm a coward.

There are old coders, and there are bold coders --  but there are no old,
bold coders.

-- Sean


From B-Morgan@concentric.net  Wed Nov 20 15:08:46 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Wed, 20 Nov 2002 08:08:46 -0700
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <MJEHLHJKGINLONDMMKNECEGKHIAA.seant@iname.com>
Message-ID: <NABBJOOEOFODEALNMJAJOEGJHDAA.B-Morgan@concentric.net>

Outlook (Internet-only) has been occasionally "hanging around" on me long
before I tried any sort of spam filtering.  Sometimes it hangs around
collecting messages, sometimes it just prevents another version from
starting up.

I haven't found any pattern to the unsuccessful shutdowns and Microsoft
certainly hasn't either for all the patches they've put out.  IMO, it's their
bug plain and simple.

Regards,

Brad



From msergeant@startechgroup.co.uk  Wed Nov 20 15:28:21 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 20 Nov 2002 15:28:21 +0000
Subject: [Spambayes] re: ciphertrust
References: <3DDB5925.32059.37C5A44E@localhost>
Message-ID: <3DDBAA15.6000108@startechgroup.co.uk>

Brad Clements said the following on 20/11/02 14:47:
> On 20 Nov 2002 at 10:25, Matt Sergeant wrote:
> 
>>I met with these guys a few weeks ago. Basically it's a custom rule set. A
>>bit like SpamAssassin. They use customer feedback to expand their ruleset.
>>They also use Razor and I think some DNSBL's.
>>
>>Let me know offline if you want any more info.
> 
> So much for heuristics..

Rules == heuristics. Or are you using a different dictionary to me? :-)


From Paul.Moore@atosorigin.com  Wed Nov 20 16:05:54 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 20 Nov 2002 16:05:54 -0000
Subject: [Spambayes] Outlook weirdness
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619948@UKDCX001.uk.int.atosorigin.com>

From: Sean True [mailto:seant@iname.com]
>> Would it be worth trying the DBM format for the database? I think
>> this would give faster startup/shutdown times, and lower memory
>> consumption, at the expense of on-disk database size and slower
>> filtering (although I doubt that this difference would be an issue).
>>
> Slower *training* would be an issue, however.

I can't imagine the training getting much slower than it is at the
moment for me :-( The pickle isn't being dumped to disk when I hit
"Delete as spam", but the operation is taking over a second. No
idea why...

This is with 700 spam and 7000 ham (or so) in the DB, giving a 7M
pickle. Outlook's using 36M of RAM.

Maybe it's just Outlook being slow...

Paul.

From seant@iname.com  Wed Nov 20 16:15:11 2002
From: seant@iname.com (Sean True)
Date: Wed, 20 Nov 2002 11:15:11 -0500
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <NABBJOOEOFODEALNMJAJOEGJHDAA.B-Morgan@concentric.net>
Message-ID: <MJEHLHJKGINLONDMMKNEEEGPHIAA.seant@iname.com>


> Outlook (Internet-only) has been occasionally "hanging around" on me long
> before I tried any sort of spam filtering.  Sometimes it hangs around
> collecting messages, sometimes it just prevents another version from
> starting up.
>
> I haven't found any pattern to the unsuccessful shutdowns and Microsoft
> certainly hasn't either for all the patches they've put out.
> IMO, it's their bug plain and simple.
>

Most bugs are their fault. In this case, however, having the addin UI go
away is annoying, and most other products appear to avoid it. It's on the
long list of things to fix before this is no longer alpha code. <grin>

-- Sean


From skip@pobox.com  Wed Nov 20 17:15:45 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 20 Nov 2002 11:15:45 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
        <9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
        <15834.15618.465106.671756@montanaro.dyndns.org>
        <gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
Message-ID: <15835.49985.259869.725900@montanaro.dyndns.org>


    Richie> I've just committed this, with a heading between each clump.
    Richie> For those not using the proxy, there's a mockup at
    Richie> http://www.entrian.com/review3.html

Looks very nice.  Heck, I may have to actually figure out how to use this.
Of course, I need to figure out how to throw in an ssh tunnel and point
fetchmail at the proxy...

Skip


From richie@entrian.com  Wed Nov 20 17:23:02 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 17:23:02 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <15835.49985.259869.725900@montanaro.dyndns.org>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<15835.49985.259869.725900@montanaro.dyndns.org>
Message-ID: <p5hntucff32llijbqkl8k2tsm73aosinem@4ax.com>

Hi Skip,

>     Richie> http://www.entrian.com/review3.html
> 
> Looks very nice.  Heck, I may have to actually figure out how to use this.
> Of course, I need to figure out how to throw in an ssh tunnel and point
> fetchmail at the proxy...

Should you feel the urge to write a HOW-TO while you do that, I won't stop
you.  8-)

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Wed Nov 20 17:27:22 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 20 Nov 2002 11:27:22 -0600
Subject: [Spambayes] Another soft for the collection
In-Reply-To: <3DDB5A7C.2661.37CADF75@localhost>
References: <BA013781.5CED5%fgranger@teleprosoft.com>
        <3DDB5A7C.2661.37CADF75@localhost>
Message-ID: <15835.50682.656616.985380@montanaro.dyndns.org>


    >> http://www.c-command.com/spamsieve/

    Brad> In their screen shot, under Corpus.

    Brad> It shows 3778 unused words. Huh?

Maybe it ignores hapaxes?
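
For anyone unfamiliar with the term: a hapax is a word that has been seen
exactly once in the training data.  Whether SpamSieve really sets those
aside is only a guess, but the idea is easy to sketch.  The function name
and the three toy messages below are invented for the example.

```python
from collections import Counter


def split_hapaxes(messages):
    """Return (words seen more than once, words seen exactly once)."""
    counts = Counter(word for msg in messages for word in msg.lower().split())
    used = {w for w, n in counts.items() if n > 1}
    hapaxes = {w for w, n in counts.items() if n == 1}
    return used, hapaxes


used, hapaxes = split_hapaxes(["buy cheap pills", "cheap flights", "buy now"])
```

In this toy corpus "buy" and "cheap" are used, and "pills", "flights" and
"now" are hapaxes that a filter could leave out of scoring.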

S

From db3l@fitlinxx.com  Wed Nov 20 17:37:01 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 20 Nov 2002 12:37:01 -0500
Subject: [Spambayes] Re: Outlook users should update
References: <LNBBLJKPBEHFEDALKOLCKEMACKAB.tim.one@comcast.net>
	<LCEPIIGDJPKCOIHOBJEPAEDOHLAA.mhammond@skippinet.com.au>
Message-ID: <uk7j8ta6a.fsf@fitlinxx.com>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> And I just checked in a few changes too.  Of most note is that the plugin
> should correctly filter all "unread, unscored" messages in your watch
> folders at startup.  Works for me - let me know if it does for you too
> <wink>

For what it's worth, unfortunately it doesn't seem to work in my
environment with an Exchange server (at least not in my case), nor did
the prior version I was running from CVS.  I just get "Processing 0
missed spam in folder 'Inbox' took 7.46883ms" at startup.  I realize
that I'm in the minority of Outlook users around here as an Exchange
corporate user. :-)

Oh, and while I'm commenting about things without contributing a fix,
I may as well mention that since some recent training changes
(somewhere around a pull from CVS I did on the 14th, when it added
stuff to identify messages with no body and whatnot), some messages
fail to train with a traceback (in the trace window) like this (XXX is a
really long hex id):

Error training message '<MAPIMsgStoreMsg, 'Welcome to see me' (read) id=XXX>'
Traceback (most recent call last):
  File ".\spambayes\Outlook2000\train.py", line 67, in train_folder
    if train_message(message, isspam, mgr):
  File ".\spambayes\Outlook2000\train.py", line 42, in train_message
    stream = msg.GetEmailPackageObject()
  File ".\spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File ".\spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)


It's just the occasional message - 3 out of 3000 in my last training -
and it's reproducible on a given message, but there doesn't seem to
be obvious commonality - at least at the MUA level - to the messages.

I hesitated to comment before now in part since I wasn't sure of the
expected state of the earlier (~11/14) fetch I had done from CVS, and
since "real work" suddenly hit hard last week and I felt guilty about
not debugging further (I just manually filter at startup right now and
ignore the occasional training failure).  But if there is any specific
data you might want me to acquire I'd be happy to see what I could do.
I do intend to poke more deeply when I get a chance.

-- David


From db3l@fitlinxx.com  Wed Nov 20 17:53:29 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 20 Nov 2002 12:53:29 -0500
Subject: [Spambayes] Re: Outlook weirdness
References: <9891913C5BFE87429D71E37F08210CB9297516@zeus.sfhq.friskit.com>
	<MJEHLHJKGINLONDMMKNEKEGCHIAA.seant@iname.com>
Message-ID: <ufztwt9eu.fsf@fitlinxx.com>

"Sean True" <seant@iname.com> writes:

> I have a pretty persistent Outlook shutdown problem. I have a 6K Spam,
> 7K Ham training set, and an Outlook that commonly uses 40-50MB of
> memory. Often when I close Outlook, it will stay in memory. (Leaving an
> icon in the task bar, too). When I "restart" it, meaning, I think,
> restart the UI, I get no Spam manager icons, even though the addin is
> still running cheerfully and filtering.

Usually this is due to the addin generating an exception previously, or
Outlook for some reason thinking it failed and not wanting to reload it.
Although the fact that you say it's still filtering is definitely at
odds with that idea - unless it never really unloaded previously.

I've found it very useful to just keep a trace window (using the
win32traceutil module) running all the time.  Technically you can pick
up existing trace messages from before you started tracing as long as
the addin is still running, but I normally just leave the trace task
running at all times.  It lets me validate that the addin is shutting
down when I think it is and starting up when I think it is or check
for issues when something seems amiss.

Note that due to COM interaction, if anything with respect to the
Outlook process remains in memory, the addin will too.  But you can
tell that in the trace window since there won't be any disconnect
indication.

-- David


From DavidA@ActiveState.com  Wed Nov 20 18:08:37 2002
From: DavidA@ActiveState.com (David Ascher)
Date: Wed, 20 Nov 2002 10:08:37 -0800
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <15833.20589.376685.686723@montanaro.dyndns.org>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
Message-ID: <3DDBCFA5.8070607@ActiveState.com>

Richie Hindle wrote:

> Excellent plan.  I've just committed this, with a heading between each
> clump.  For those not using the proxy, there's a mockup at
> http://www.entrian.com/review3.html

Looks good!

Suggestions:

   - (if necessary) learn to love JavaScript and provide keyboard
     navigation, so that users can do "down,down,down,h,down,down,down,s,..."
     If you want an example of how this feels before you bother, you can get
     VPM from us (ActiveState), which ships with Komodo Professional (you
     can get a trial for free).  The JS is probably easy to find as well.

   - Make 'hovertips' that display the first few lines of the body
     (stripped of html and whitespace), to aid in classification for when
     the headers aren't enough.  If that's too hard, make a link on each
     message that shows a popup with the contents of the email.

What happens after you click on Train?  Does it go to the next day, or just 
refresh the current page?

--david


From richie@entrian.com  Wed Nov 20 20:01:45 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 20:01:45 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <3DDBCFA5.8070607@ActiveState.com>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<3DDBCFA5.8070607@ActiveState.com>
Message-ID: <aupntugk287b70mg37lcfg8uk8afed63fo@4ax.com>

Hi David,

> Looks good!

Thanks!

> (if necessary) learn to love JavaScript and provide keyboard navigation

Good plan.  My only concern is that it might break on some browsers, but I
can always limit it.  I'll add it to the to-do list.

> If you want an 
> example of how this feels before you bother, you can get VPM from us 
> (ActiveState), which ships with Komodo Professional (you can get a trial for 
> free).  The JS is probably easy to find as well.

Thanks - I'll have a look.

> Make 'hovertips' that display the first few lines of the body (stripped of 
> html and whitespace), to aid in classification for when the headers aren't 
> enough.  If that's too hard, make a link on each message that shows a popup with 
> the contents of the email.

Linking to the message is already on the to-do list, but I like the
hovertip idea as well.

> What happens after you click on Train?  Does it go to the next day, or just 
> refresh the current page?

It refreshes the current page if you deferred any messages, otherwise it
goes to the next or previous page.

-- 
Richie Hindle
richie@entrian.com


From francois.granger@free.fr  Wed Nov 20 20:11:06 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Wed, 20 Nov 2002 21:11:06 +0100
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <15835.34182.266661.912333@slothrop.zope.com>
References: <w538yzo25f4.fsf@woozle.org>
 <SNA9YT06TPZWSMGBF1WC8ZT286X3.3ddb1d97@riven>
 <15835.34182.266661.912333@slothrop.zope.com>
Message-ID: <a0510032bba019cae3fb5@[192.168.1.11]>

At 7:52 -0500 20/11/02, in message Re: [Spambayes] proposed changes 
to hammie & co., Jeremy Hylton wrote:
>
>One likely concern is that your users have to install ZODB or you
>have to package it for them.

This is a real issue for "normal" users......

-- 
E-mail is a means of communication. People should ask themselves
about the political implications of the choices (or non-choices) of their
tools and technologies.
For clean e-mail: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From rob@hooft.net  Wed Nov 20 20:05:11 2002
From: rob@hooft.net (Rob Hooft)
Date: Wed, 20 Nov 2002 21:05:11 +0100
Subject: [Spambayes] Outlook weirdness
References: <16E1010E4581B049ABC51D4975CEDB88619948@UKDCX001.uk.int.atosorigin.com>
Message-ID: <3DDBEAF7.7020709@hooft.net>

Moore, Paul wrote:
> From: Sean True [mailto:seant@iname.com]
> 
>>>Would it be worth trying the DBM format for the database? I think
>>>this would give faster startup/shutdown times, and lower memory
>>>consumption, at the expense of on-disk database size and slower
>>>filtering (although I doubt that this difference would be an issue).
>>>
>>
>>Slower *training* would be an issue, however.
> 
> 
> I can't imagine the training getting much slower than it is at the
> moment for me :-( The pickle isn't being dumped to disk when I hit
> "Delete as spam", but the operation is taking over a second. No
> idea why...

Isn't that the update_spamprob? It is updating ~300k spam probabilities, 
where you are going to use only a few every time. The current Bayes is 
optimized for training on hundreds of messages at a time, and then 
scoring hundreds. For "training one, scoring one" it would be more 
efficient to delay the calculation of the spam probs until they are needed.
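
A rough sketch of what "delay until needed" could look like is below.  The
class LazyBayes is invented for the example and the probability formula is
a deliberately simplified stand-in, not the one the real classifier uses.
Training only bumps counts and drops a cache; a probability is worked out
the first time somebody asks for that word.

```python
class LazyBayes:
    """Toy classifier that computes word probabilities only when asked.

    The probability formula is a simplified stand-in, not the one
    spambayes actually uses.
    """

    def __init__(self):
        self.nspam = self.nham = 0
        self.counts = {}   # word -> [spamcount, hamcount]
        self._cache = {}   # word -> probability, valid until next training
        self.computed = 0  # how many probabilities were actually calculated

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1
        # nspam/nham changed, so every cached value is stale; dropping the
        # cache is cheap compared with recomputing every word.
        self._cache.clear()

    def spamprob(self, word):
        if word not in self._cache:
            spam, ham = self.counts.get(word, (0, 0))
            spam_ratio = spam / self.nspam if self.nspam else 0.0
            ham_ratio = ham / self.nham if self.nham else 0.0
            total = spam_ratio + ham_ratio
            self._cache[word] = spam_ratio / total if total else 0.5
            self.computed += 1
        return self._cache[word]


bayes = LazyBayes()
bayes.learn(["cheap", "pills", "now"], is_spam=True)
bayes.learn(["meeting", "now"], is_spam=False)
p_now = bayes.spamprob("now")
p_pills = bayes.spamprob("pills")
```

After the two lookups only two probabilities have been computed, however
many words are in the database.  The catch is that the message totals feed
into every word's probability, so each training step invalidates the whole
cache.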

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie@entrian.com  Wed Nov 20 20:47:43 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 20:47:43 +0000
Subject: [Spambayes] New  pop3proxy options
In-Reply-To: <3DDB979E.8050604@videotron.ca>
References: <3DDB979E.8050604@videotron.ca>
Message-ID: <5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com>

Hi papaDoc,

> This is a patch for the Options.py

Thanks - not sure why those double-quotes were there, but it all seems to
work without them.  I'll check in the patch when I next do a check-in.

> By the way I'm trying to use pop3proxy with Mozilla 1.1. [...]
> I can't display the body I see only the Subject line.

I'm not sure I understand this.  Are you talking about the web training
interface, http://localhost:8880/review ?  You can't (yet) view the message
bodies in there - clicking on the Subject line *shouldn't* do anything yet.

> But when I look in the Inbox there are their with their body ?????

So in your email client, as opposed to your web browser, you do see the
whole message?  It sounds like it's working as intended - have I
misunderstood something?

-- 
Richie Hindle
richie@entrian.com


From papaDoc@videotron.ca  Wed Nov 20 21:00:34 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Wed, 20 Nov 2002 16:00:34 -0500
Subject: [Spambayes] New  pop3proxy options
References: <3DDB979E.8050604@videotron.ca>
 <5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com>
Message-ID: <3DDBF7F2.10502@videotron.ca>

Hi Richie,

>>By the way I'm trying to use pop3proxy with Mozilla 1.1. [...]
>>I can't display the body I see only the Subject line.
>
>I'm not sure I understand this.  Are you talking about the web training
>interface, http://localhost:8880/review ?  You can't (yet) view the message
>bodies in there - clicking on the Subject line *shouldn't* do anything yet.
>
>>But when I look in the Inbox there are their with their body ?????
>
>So in your email client, as opposed to your web browser, you do see the
>whole message?  It sounds like it's working as intended - have I
>misunderstood something?

In http://localhost:8880/review, as expected, I only see the Subject
(until someone gets linking to the real mail working ;-)  )

What I was trying to say is:

Using the mail tools provided with Mozilla 1.1, I see only the
Subject, Sender and Date when a mail is selected. Nothing in the
"body" window.

When I go into the directory where the mail is saved by Mozilla
(c:/something/) and look into the Inbox file, everything looks OK.

Mozilla retrieves the mail from localhost:110, and pop3proxy retrieves it
from pop.my_prodived.com:110.

When Mozilla retrieves the mail directly from pop.my_prodived.com:110
I have no problem.

papaDoc


From neale@woozle.org  Wed Nov 20 21:16:25 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 13:16:25 -0800
Subject: [Spambayes] Better optimization loop
In-Reply-To: <3DDB7D3D.1020306@hooft.net>
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
	<3DDB7D3D.1020306@hooft.net>
Message-ID: <w53r8dgymae.fsf@woozle.org>

So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:

> Another speedup I could use is a version of Bayes that calculates the
> spamprob from the numbers on demand instead of calculating them for
> all words every time. This pays off for all cases where the training
> batch is very small (~1 message).

Funny you should bring that up, Rob, because I happen to be working on
exactly that.  The only way I could think to do it was to pass in a new
option to Bayes.learn() and Bayes.unlearn().

I've therefore removed the update_probabilities option and replaced it
with update_word_probabilities.  My thinking here is that asking things to
run Bayes.update_probabilities() when they need it isn't too much of a
burden (most of them call it explicitly anyway), but learn() and
unlearn() are the *only* places that individual word rescoring can
happen.

The changed methods become:

    def learn(self, wordstream, is_spam, update_word_probabilities=True):
        self._add_msg(wordstream, is_spam, update_word_probabilities)

    def unlearn(self, wordstream, is_spam, update_word_probabilities=True):
        self._remove_msg(wordstream, is_spam, update_word_probabilities)

    def _add_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

    def _remove_msg(self, wordstream, is_spam, update_word_probabilities):
        ...

And inside the for loop in _add_msg() and _remove_msg() is this:

            if update_word_probabilities:
                self.update_word_probability(word, record)
            else:
                # Needed to tell a persistent DB that the content
                # changed.
                wordinfo[word] = record

I'll check all this in to the hammie-playground branch as soon as I can
be sure it doesn't break anything.  If we all think it's kosher, I'll
merge it into HEAD.

Neale


From richie@entrian.com  Wed Nov 20 21:25:28 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 21:25:28 +0000
Subject: [Spambayes] New  pop3proxy options
In-Reply-To: <3DDBF7F2.10502@videotron.ca>
References: <3DDB979E.8050604@videotron.ca>
	<5ksntu0ruug9br0m26g77q8pfqluaaaora@4ax.com> <3DDBF7F2.10502@videotron.ca>
Message-ID: <hbvntus6mhron32r0dk62ckmvgj7f77hi2@4ax.com>


> Using the mail tools provided with Mozilla 1.1 I see only the
> Subject, Sender, Date when one mail is selected. Nothing in the
> "body" window.
> 
> When I go into the directory where the mail is saved by Mozilla
> c:/something/ and look into the file Inbox, everything looks OK.

Ah, OK, I see!  Thanks for that - the proxy must be changing the messages
in some way that the Mozilla tools don't like.  I'll look into it.

-- 
Richie Hindle
richie@entrian.com


From rob@hooft.net  Wed Nov 20 21:28:51 2002
From: rob@hooft.net (Rob Hooft)
Date: Wed, 20 Nov 2002 22:28:51 +0100
Subject: [Spambayes] Better optimization loop
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
	<3DDB7D3D.1020306@hooft.net> <w53r8dgymae.fsf@woozle.org>
Message-ID: <3DDBFE93.4060600@hooft.net>

Neale Pickett wrote:
> So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:
> 
> 
>>Another speedup I could use is a version of Bayes that calculates the
>>spamprob from the numbers on demand instead of calculating them for
>>all words every time. This pays off for all cases where the training
>>batch is very small (~1 message).

> And inside the for loop in _add_msg() and _remove_msg() is this:
> 
>             if update_word_probabilities:
>                 self.update_word_probability(word, record)
>             else:
>                 # Needed to tell a persistent DB that the content
>                 # changed.
>                 wordinfo[word] = record

I was thinking along different lines: when the train size and the score 
size are both approximately 1 message, we can forget about the word 
probabilities altogether. Just don't store them anywhere anymore, and 
calculate the individual word probabilities from the raw counts while 
scoring. This will not only save time because lots of words that enter 
the database will "never" be used again (hapaxes...), but it should also 
shrink the database size. If it is too slow then we can make a cache out 
of a dictionary mapping raw count tuples to probabilities to speed it up.
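
A minimal sketch of that idea, written for a current Python 3 rather than
the 2.2 code in CVS.  The class name, the function name and the two
constants are made up for illustration, and the formula is my reading of
the Robinson-style adjustment, so treat it as an assumption rather than
what classifier.py does.  Only raw counts are stored; the probability is
worked out while scoring, with a cache keyed on the count tuple:

```python
from functools import lru_cache

# Assumed values for illustration; the real defaults live in Options.py.
UNKNOWN_WORD_PROB = 0.5
UNKNOWN_WORD_STRENGTH = 0.45


@lru_cache(maxsize=None)
def prob_from_counts(spamcount, hamcount, nspam, nham):
    """Spam probability for one word, computed from raw counts alone."""
    n = spamcount + hamcount
    if n == 0:
        return UNKNOWN_WORD_PROB
    spamratio = spamcount / nspam if nspam else 0.0
    hamratio = hamcount / nham if nham else 0.0
    if spamratio + hamratio == 0.0:
        return UNKNOWN_WORD_PROB
    p = spamratio / (spamratio + hamratio)
    # Pull the estimate towards the unknown-word prior when n is small.
    s, x = UNKNOWN_WORD_STRENGTH, UNKNOWN_WORD_PROB
    return (s * x + n * p) / (s + n)


class CountOnlyStore:
    """Hypothetical store that keeps counts and no probabilities."""

    def __init__(self):
        self.counts = {}  # word -> [spamcount, hamcount]
        self.nspam = self.nham = 0

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1

    def word_prob(self, word):
        spamcount, hamcount = self.counts.get(word, (0, 0))
        return prob_from_counts(spamcount, hamcount, self.nspam, self.nham)
```

All hapaxes share the count tuple (1, 0) or (0, 1), so the cache stays
small between training runs.  The message totals are part of the key
because they change every probability, which means each training step
starts a fresh set of cache entries.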

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie@entrian.com  Wed Nov 20 21:30:38 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 21:30:38 +0000
Subject: [Spambayes] Documentation...
Message-ID: <ddtntuo5m5gddnp835hdohlj2rrtllu3kl@4ax.com>


This may be premature, but as part of helping John Draper set up the
spambayes software I've made a start on some user documentation.  It could
go on the website, or maybe in with the source code - I'm not sure we're
ready to give the impression that this stuff is ready for "normal people"
to use yet.

This stuff refers to the current, unpackaged sources - if we ever package
it up, the documentation will be very different.  But I'm guessing that's a
long way off, and in the meantime we'll all be asked by friends and project
newcomers to explain how it all fits together and how to get it up and
running - this is an attempt to let us say "Here, read this!" when that
happens.

It tries to target both technical and non-technical users (though for some
fairly high values of "non-technical") and may well fall between two stools
as a result.  I'll check it in either with the sources or the website
depending on feedback.  If anyone spots glaring omissions, factual
inaccuracies or downright rudeness, either let me know or edit it after I
check it in - I'm not claiming any editorial rights!

It's also somewhat biased towards the POP3 proxy and the web interface (for
obvious reasons 8-) and lacks any detail on the Outlook plugin because I'm
not one of those lucky Outlook users... this is a not-at-all-veiled plea
for contributors or users who know about the lacking areas to step forward
and write some words!

--------------------------------------------------------------------------

> First some concepts:

 o 'Ham' is the opposite of 'Spam'. 8-)

 o At no point does any part of Spambayes delete emails.  All it does is
   classify them by adding a header that tells you whether they look like
   spam or not.  It's then up to you to use your email software to do
   something in response to that header (the Outlook plug-in does some of
   the work for you).

 o The header that the software adds is called X-Hammie-Disposition (mostly
   for historical reasons, and you can customise it) and has a value of
   Yes, No or Unsure.


> There are six main components to the Spambayes system:

 o A database.  Loosely speaking, this is a collection of words and
   associated spam and ham probabilities.  The database says "If a message
   contains the word 'Viagra' then there's a 98% chance that it's spam, and
   a 2% chance that it's ham."  This database is created by training - you
   give it messages, tell it whether those messages are ham or spam, and it
   adjusts its probabilities accordingly.  How to train it is covered
   below.  By default it lives in a file called "hammie.db".

 o The tokeniser/classifier.  This is the core engine of the system.  The
   tokeniser splits emails into tokens (words, roughly speaking), and the
   classifier looks at those tokens to determine whether the message looks
   like spam or not.  You don't use the tokeniser/classifier directly -
   it powers the other parts of the system.

 o The POP3 proxy.  This sits between your email client (Eudora, Outlook
   Express, etc) and your email server, and adds the classification header
   to emails as you download them.  A typical user's email setup looks
   like this:

       +-----------------+                              +-------------+
       | Outlook Express |      Internet or intranet    |             |
       |  (or similar)   | <--------------------------> | POP3 server |
       |                 |                              |             |
       +-----------------+                              +-------------+

   The POP3 server runs either at your ISP for internet mail, or somewhere
   on your internal network for corporate mail.  The POP3 proxy sits in the
   middle and adds the classification header as you retrieve your email:

       +-----------------+        +------------+        +-------------+
       | Outlook Express |        | Spambayes  |        |             |
       |  (or similar)   | <----> | POP3 proxy | <----> | POP3 server |
       |                 |        |            |        |             |
       +-----------------+        +------------+        +-------------+

   So where you currently have your email client configured to talk to
   say, "pop3.my-isp.com", you instead configure the *proxy* to talk to
   "pop3.my-isp.com" and configure your email client to talk to the proxy.
   The POP3 proxy can live on your PC, or on the same machine as the POP3
   server, or on a different machine entirely - it really doesn't matter.
   If it's living on your PC, you'd configure your email client to talk
   to "localhost".

 o The web interface.  This is a server that runs alongside the POP3 proxy
   and lets you control it through the web.  You can upload emails to it
   for training or classification, query the probabilities database ("How
   many of my emails really *do* contain the word Viagra?") and most
   importantly, train it on the emails you've received.  When you start
   using the system, unless you train it using the Hammie script it will
   classify most things as Unsure, and often make mistakes.  But it keeps
   copies of all the emails it's seen, and through the web interface you
   can train it by going through a list of all the emails you've received
   and checking a Ham/Spam box next to each one.  After training on a few
   messages (say 20 spams and 20 hams), you'll find that it's getting it
   right most of the time.   The web training interface automatically
   checks the Ham/Spam boxes according to what it thinks, so all you need
   to do is correct the odd mistake - it's very quick and easy.

 o The Outlook plug-in.  For Outlook 2000 users (not Outlook Express) this
   lets you manage the whole thing from within Outlook.  You set up a Ham
   folder and a Spam folder, and train it simply by dragging messages into
   those folders.  Alternatively there are buttons to do the same thing. 
   And it integrates into Outlook's filtering system to make it easy to
   file all the suspected spam into its own folder, for instance.

 o The Hammie script.  This does three jobs: command-line training,
   procmail filtering, and XML-RPC.  To train on a whole collection of
   messages, stored either as mbox files or as collections of message files
   in a directory, you run "hammie.py -g ham -s spam", where 'ham' is the
   mbox file or directory containing ham, and 'spam' is the mbox file or
   directory containing spam.  Procmail is a unix-based email
   filtering system - to use Hammie as a procmail filter, run it as
   "hammie.py -f" from a procmail rule.  It will read a message from its
   input, add the header, and write it to its output.  Hammie can also
   run as an XML-RPC server, so that a programmer can write code that uses
   a remote server to classify emails programmatically - see hammiesrv.py.
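
As a rough illustration of the filtering job described above (read a
message, add the header, write it back out), here is a sketch in current
Python 3.  It is not hammie.py's code: the function names and the two
cutoff values are assumptions made up for the example, and the score is
taken as given rather than computed by the classifier.

```python
import email

HEADER_NAME = "X-Hammie-Disposition"


def disposition(score, ham_cutoff=0.2, spam_cutoff=0.9):
    """Map a spam score in [0, 1] to Yes, No or Unsure.

    The cutoffs are illustrative assumptions, not documented defaults.
    """
    if score >= spam_cutoff:
        return "Yes"
    if score <= ham_cutoff:
        return "No"
    return "Unsure"


def add_disposition(raw_message, score):
    """Return the message text with the classification header added."""
    msg = email.message_from_string(raw_message)
    del msg[HEADER_NAME]  # drop any stale copy; no error if it is absent
    msg[HEADER_NAME] = disposition(score)
    return msg.as_string()
```

A filter built this way would pass its standard input through
add_disposition() and write the result to standard output.  The message
body is left untouched, in line with the rule that nothing is ever
deleted.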


> Where things live:

The Hammie script is called hammie.py.  The POP3 proxy and the web
interface live in pop3proxy.py.  The Outlook plug-in lives in the
Outlook2000 subdirectory - see the README.txt in that directory for more
information on that.

As well as these components, there's also a whole pile of utility scripts,
test harnesses and so on - see README.txt and TESTING.txt in the spambayes
distribution for more information.


> Configuration:

The system is configured through a file called "bayescustomize.ini".  In
here you can configure the name and type of your database, the POP3
server(s) you want to proxy to, the ports you want the proxy and the web
interface to run on, and so on.  You can also control details like how sure
you want the system to be that a message really is spam before it marks it
as such.  The default values for all the options, and the documentation for
them, all live in Options.py.  To change an option, create a
bayescustomize.ini and add the option to that - don't edit Options.py.


> Requirements:

To run the software, you need Python 2.2 or above.  You also need version
2.4.3 or above of the Python "email" package.  If you're running the CVS
version of Python (known as 2.3a0) then you already have this.  If not, you
can download it from http://mimelib.sf.net and install it - unpack the
archive, cd to the email-2.4.3 directory and type "python setup.py
install".  This will install it into your Python site-packages directory.
You'll also need to move aside the standard "email" library - go to your
Python "Lib" directory and rename "email" to "email_old".


> Setup on unix (Windows/Mac users can ignore this bit):

On a unix machine, unless you're running as root (which we strongly advise
you don't!) you can't run the proxy on port 110.  Besides, you quite
possibly already have a POP3 server running on that port.  You need to run
it on an unprivileged port, say 1110.  You do this by adding the line

pop3proxy_ports: 1110

to bayescustomize.ini - all will become clear in the next section.  Where
we talk about port 110, you use port 1110.


> Minimal setup for using the POP3 proxy and web interface:

The minimum you need to do to get started is create a bayescustomize.ini
containing the following:

[pop3proxy]
pop3proxy_servers: pop3.my-isp.com

where "pop3.my-isp.com" is wherever you currently have your email client
configured to collect mail from.
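
If you want to check what a bayescustomize.ini says before starting the
proxy, something along these lines may help.  It is a sketch using the
standard configparser module in current Python 3, not the Spambayes
Options module.  The helper name is made up, and both the
comma-separated list handling and the placement of pop3proxy_ports in
the [pop3proxy] section are assumptions.

```python
import configparser

SAMPLE_INI = """\
[pop3proxy]
pop3proxy_servers: pop3.my-isp.com
pop3proxy_ports: 1110
"""


def read_proxy_settings(ini_text):
    """Return (servers, ports) from the [pop3proxy] section."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    section = parser["pop3proxy"]
    # Assumption: several values, if allowed, are separated by commas.
    servers = [s.strip()
               for s in section.get("pop3proxy_servers", "").split(",")
               if s.strip()]
    # Fall back to 110, the standard POP3 port, when no port is given.
    ports = [int(p)
             for p in section.get("pop3proxy_ports", "110").split(",")
             if p.strip()]
    return servers, ports
```

Calling read_proxy_settings(SAMPLE_INI) gives one server name and one
port, which is the pairing the proxy needs for a single mail account.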

You can now run the proxy by running "python pop3proxy.py".  This will
print some status messages, which should include:

BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.

What that means is that the POP3 proxy is ready for your email client to
connect to it (110 is the standard port number for POP3 - you can use a
different one by adding a line to bayescustomize.ini - see Options.py) and
that the web interface is ready for your browser to connect to it.  The
address of the web interface is http://localhost:8880/ (or if you're
running it on a different machine, replace 'localhost' with the name of the
machine).  You can have a look at the web interface now, but it won't be
very interesting because the system is untrained and has seen no messages
yet.


> Reading emails and training the classifier:

You now need to configure your email client to talk to the proxy instead of
the real email server.  Change your equivalent of "pop3.my-isp.com" to
"localhost" (or to the name of the machine you're running the proxy on) in
your email client's setup.  Hit "Get new email" and look at the headers of
the emails (send yourself an email if you don't have any!) - there should
be an X-Hammie-Disposition header there.  It probably says "Unsure",
because you haven't done any training yet.  You should be able to create a
mail folder called "Suspected spam" and set up a filtering rule that puts
emails with an "X-Hammie-Disposition: Yes" heading into that folder.
(Eventually we should publish instructions on how to do this in all the
popular email clients).
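
For anyone scripting their own mail handling rather than using a
client's filter rules, the rule boils down to the following sketch in
current Python 3.  The function name and the "Inbox" folder name are
made up for the example.

```python
import email


def folder_for(raw_message, header_name="X-Hammie-Disposition"):
    """Pick a folder name from the classification header."""
    msg = email.message_from_string(raw_message)
    value = (msg.get(header_name) or "").strip().lower()
    if value == "yes":
        return "Suspected spam"
    # "No", "Unsure" and unclassified mail all stay in the inbox.
    return "Inbox"
```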

You can now train the system through the web interface - follow the "Review
messages" link and you'll see a list of the emails that the system has seen
so far.  Check the appropriate boxes and hit Train.  The messages disappear
(eventually you'll be able to get back to them, for instance to correct any
training mistakes) and if you go back to the home page you'll see that the
"Total emails trained" has increased.

Once you've done this on a few spams and a few hams, you'll find that the
X-Hammie-Disposition header is getting it right most of the time.  The more
you train it the more accurate it gets.  There's no need to train it on
every message you receive, but you should train on a few spams and a few
hams on a regular basis.  You should also try to train it on about the same
number of spams as hams.

You can train it on lots of messages in one go using the Hammie script, as
explained above.

--------------------------------------------------------------------------

-- 
Richie Hindle
richie@entrian.com


From mhammond@skippinet.com.au  Wed Nov 20 21:52:12 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 21 Nov 2002 08:52:12 +1100
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <NABBJOOEOFODEALNMJAJOEGJHDAA.B-Morgan@concentric.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPMEBHHNAA.mhammond@skippinet.com.au>

> Outlook (Internet-only) has been occasionally "hanging around" on me long
> before I tried any sort of spam filtering.  Sometimes it hangs around
> collecting messages, sometimes it just prevents another version from
> starting up.
>
> I haven't found any pattern to the unsuccessful shutdowns and Microsoft
> certainly hasn't either for all the patches they've put out.
> IMO, it's their
> bug plain and simple.

This is my experience too.  File->Exit *does* have more luck sometimes,
whereas just closing the window may not work as expected.  Add stuff like CE
Inbox synchronization tools and the various addins people install, and it
could be anything.  For example, if our addin keeps a reference to a certain
COM object, then we end up with a COM circular reference, and Outlook never
shuts down.  Hard to call COM circular references a bug any more than they
are in pre-GC Python.

I have seen this problem enough times in the past though that I am fairly
confident that we never cause it.

Mark.


From guido@python.org  Wed Nov 20 21:49:20 2002
From: guido@python.org (Guido van Rossum)
Date: Wed, 20 Nov 2002 16:49:20 -0500
Subject: [Spambayes] LJ article
Message-ID: <200211202149.gAKLnLw28459@pcp02138704pcs.reston01.va.comcast.net>

A while ago I promised Linux Journal an article about Spambayes.  I
got as far as setting up an outline when I found out I have no time to
write it.  Neither does Tim, who co-volunteered with me.  Gary
Robinson is still volunteering to write the sections about the math,
but most of the article is intended to be not very mathematical, and
he can't write that.  So...  Maybe there's someone here who is
interested in writing this article?  Fame and fortune for you and for
SpamBayes!  (Plus, I think LJ pays its authors.)

If you're interested, write to Don Marti <dmarti@ssc.com> for
details.  DON'T WRITE ME!

--Guido van Rossum (home page: http://www.python.org/~guido/)

From richie@entrian.com  Wed Nov 20 22:48:55 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 22:48:55 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <3DDBCFA5.8070607@ActiveState.com>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<3DDBCFA5.8070607@ActiveState.com>
Message-ID: <ju3otuofsrrr5gf5em75qs4bg45k3lug37@4ax.com>


[David Ascher]
> learn to love JavaScript and provide keyboard navigation

I looked at this, and the keyboard navigation is pretty good already: Tab
Tab Tab Left Tab Tab Right Right, etc.  One keystroke to move from message
to message, possibly multiple keystrokes on an arrow key rather than a
single 'h' or 's', but you don't need to move your hands to do it.  I've
left it on the to-do list, but I've persuaded myself that it's unnecessary.
Other people's browsers' mileage may vary.

> Make 'hovertips' that display the first few lines of the body

This is done.  The code to strip HTML content uses a regular expression
from tokenizer.py which is commented "Cheap-ass gimmick", so I'm interested
to see how well people find it works!  (Apologies to Tim - it seems to work
extremely well.)  Rest assured it's safe from HTML content leaking into the
web interface - the worst that will happen is that you'll see HTML source
in the hovertip.

-- 
Richie Hindle
richie@entrian.com


From lists@morpheus.demon.co.uk  Wed Nov 20 22:38:55 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Wed, 20 Nov 2002 22:38:55 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<3DDBCFA5.8070607@ActiveState.com>
	<aupntugk287b70mg37lcfg8uk8afed63fo@4ax.com>
Message-ID: <n2m-g.65ur27eo.fsf@morpheus.demon.co.uk>

Richie Hindle <richie@entrian.com> writes:

>> What happens after you click on Train?  Does it go to the next day, or just 
>> refresh the current page?
>
> It refreshes the current page if you deferred any messages, otherwise it
> goes to the next or previous page.

It's locking up for me. There are no messages in the command prompt
window - is there any way to get it to produce trace messages (looking
at the code, the answer seems to be "no"...)?

Do I need to rebuild the database after upgrading? I didn't, and the
user interface said "Total emails trained: Spam: 0 Ham: 0". This
doesn't tally with reality - I'd trained on a batch of messages (using
hammie.py) before starting (BTW, adding a bulk training interface
might be nice, although using hammie.py seems to work OK in practice).

Paul.
-- 
This signature intentionally left blank

From richie@entrian.com  Wed Nov 20 23:26:49 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 20 Nov 2002 23:26:49 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <n2m-g.65ur27eo.fsf@morpheus.demon.co.uk>
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<3DDBCFA5.8070607@ActiveState.com>
	<aupntugk287b70mg37lcfg8uk8afed63fo@4ax.com>
	<n2m-g.65ur27eo.fsf@morpheus.demon.co.uk>
Message-ID: <ht5otu8aj39i0umrsonuqvanh4vdgqc114@4ax.com>

Hi Paul,

> It's locking up for me.

Urk.  That's bad (and new - no-one's reported that before).  And this
happened when you hit the Train button, yes?

> There are no messages in the command prompt
> window - is there any way to get it to produce trace messages (looking
> at the code, the answer seems to be "no"...)?

It's "no".  8-)  Like I say, no-one's reported it locking before, and I've
never seen it.  You usually get a traceback when something goes wrong.  So
your console says something like:

Loading database... Done.
BayesProxyListener listening on port 110.
UserInterfaceListener listening on port 8880.

and nothing else, and the process is still running, but you can't get a
page served to your browser?  What error message do you get from the
browser?  If it's one of those pointless IE error pages, could you try
telnetting to port 8880 and saying "GET / HTTP/1.0"?  Can you even connect
with telnet?  How about port 110?
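
If telnet isn't to hand, the same check can be done from Python.  This
is a sketch in current Python 3 with made-up function names.  It
connects, optionally sends a request, and returns whatever comes back
first:

```python
import socket


def exchange(sock, request=b""):
    """Send the request (if any) and return the first chunk of the reply."""
    if request:
        sock.sendall(request)
    return sock.recv(4096)


def probe(host, port, request=b"", timeout=5.0):
    """Connect to host:port and report what the server says.

    Raises OSError if the connection is refused or times out.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        return exchange(sock, request)


# Web interface:  probe("localhost", 8880, b"GET / HTTP/1.0\r\n\r\n")
# POP3 proxy:     probe("localhost", 110)   # expect a greeting line
```

A refused connection or a timeout here would suggest the listener itself
is stuck, rather than the browser being at fault.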

> Do I need to rebuild the database after upgrading? I didn't, and the
> user interface said "Total emails trained: Spam: 0 Ham: 0". This
> doesn't tally with reality - I'd trained on a batch of messages (using
> hammie.py) before starting

You shouldn't need to do anything, and even if you did I'd expect it to
give an error rather than quietly failing.  Are you using a pickle or a
DBM?  In fact, could you send me your bayescustomize.ini and/or details of
the command line you're using (off-list if you prefer)?

> (BTW, adding a bulk training interface
> might be nice, although using hammie.py seems to work OK in practice).

I agree - training by uploading an mbox file is on the list.

-- 
Richie Hindle
richie@entrian.com


From popiel@wolfskeep.com  Wed Nov 20 23:31:28 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 20 Nov 2002 15:31:28 -0800
Subject: [Spambayes] Split DB and no update_probabilities
Message-ID: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com>

I've got a patch to split the word database into two pieces:
one with the ham and spam counts, and one with the spamprobs.
At the same time, I got rid of the timestamps and killcounts,
both of which are rarely used (and can be added back in by
subclasses if they're really wanted).

With this patch, the spamprobs can be cheaply discarded en masse,
leading to a friendlier interface for incremental training.  To
go along with this, update_probabilities is eliminated, and the
probabilities are generated on demand and merely cached in the
spamprob database.
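
The shape of it, stripped down to a toy in current Python 3 (not the
patch itself): countinfo, probinfo, nspam and nham are the names the
patch uses, while the class name and the compute_prob callable are made
up for the example and stand in for the real probability calculation.

```python
class SplitStore:
    """Toy model: counts are the data, probabilities are only a cache."""

    def __init__(self, compute_prob):
        self.countinfo = {}  # word -> (spamcount, hamcount)
        self.probinfo = {}   # word -> cached spamprob
        self.nspam = self.nham = 0
        # Hypothetical hook: f(spamcount, hamcount, nspam, nham) -> float
        self._compute_prob = compute_prob

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            spamcount, hamcount = self.countinfo.get(word, (0, 0))
            if is_spam:
                spamcount += 1
            else:
                hamcount += 1
            self.countinfo[word] = (spamcount, hamcount)
        # The totals changed, so every cached probability may be stale:
        # throw the whole cache away instead of rescoring each word.
        self.probinfo.clear()

    def spamprob(self, word):
        try:
            return self.probinfo[word]
        except KeyError:
            spamcount, hamcount = self.countinfo.get(word, (0, 0))
            prob = self._compute_prob(spamcount, hamcount,
                                      self.nspam, self.nham)
            self.probinfo[word] = prob
            return prob
```

Training on one message costs a dictionary clear, and scoring only pays
for the words that actually turn up in the message being scored.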

Unfortunately, this patch breaks all preexisting databases.
It wouldn't be too hard to write a bit of code to take an old
pickle and create a count database from it, but I'm too lazy
to do that at the moment.  (The spamprob database would, of
course, take care of itself.)  This patch also likely breaks
most client code, since update_probabilities no longer exists,
and the learn and unlearn methods no longer (optionally) take a
boolean to control whether update_probabilities is called.
I could have left in a noop update_probabilities and ignored
the optional arguments to learn and unlearn... but this is
alpha code, and a clean break from the old ways is probably
better in the long run.

Since this patch does break stuff rather severely, I'm just
including it in this email for people to look at and play
with, instead of checking it in and thereby forcing it
down people's throats.

Enjoy.

- Alex

Index: TestDriver.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/TestDriver.py,v
retrieving revision 1.30
diff -c -r1.30 TestDriver.py
*** TestDriver.py	19 Nov 2002 17:43:27 -0000	1.30
--- TestDriver.py	20 Nov 2002 23:18:17 -0000
***************
*** 304,325 ****
              prob, clues = c.spamprob(e, True)
              printmsg(e, prob, clues)
  
-         if options.show_best_discriminators > 0:
-             print
-             print "    best discriminators:"
-             stats = [(-1, None)] * options.show_best_discriminators
-             smallest_killcount = -1
-             for w, r in c.wordinfo.iteritems():
-                 if r.killcount > smallest_killcount:
-                     heapreplace(stats, (r.killcount, w))
-                     smallest_killcount = stats[0][0]
-             stats.sort()
-             for count, w in stats:
-                 if count < 0:
-                     continue
-                 r = c.wordinfo[w]
-                 print "        %r %d %g" % (w, r.killcount, r.spamprob)
- 
          if options.show_histograms:
              printhist("this pair:", local_ham_hist, local_spam_hist)
          self.trained_ham_hist += local_ham_hist
--- 304,309 ----
Index: Tester.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/Tester.py,v
retrieving revision 1.8
diff -c -r1.8 Tester.py
*** Tester.py	7 Nov 2002 22:30:04 -0000	1.8
--- Tester.py	20 Nov 2002 23:18:18 -0000
***************
*** 59,69 ****
          learn = self.classifier.learn
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Untrain the classifier on streams of ham and spam.  Updates
      # probabilities before returning, and resets test results.
--- 59,68 ----
          learn = self.classifier.learn
          if hamstream is not None:
              for example in hamstream:
!                 learn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 learn(example, True)
  
      # Untrain the classifier on streams of ham and spam.  Updates
      # probabilities before returning, and resets test results.
***************
*** 72,82 ****
          unlearn = self.classifier.unlearn
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True, False)
!         self.classifier.update_probabilities()
  
      # Run prediction on each sample in stream.  You're swearing that stream
      # is entirely composed of spam (is_spam True), or of ham (is_spam False).
--- 71,80 ----
          unlearn = self.classifier.unlearn
          if hamstream is not None:
              for example in hamstream:
!                 unlearn(example, False)
          if spamstream is not None:
              for example in spamstream:
!                 unlearn(example, True)
  
      # Run prediction on each sample in stream.  You're swearing that stream
      # is entirely composed of spam (is_spam True), or of ham (is_spam False).
Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.53
diff -c -r1.53 classifier.py
*** classifier.py	18 Nov 2002 18:23:09 -0000	1.53
--- classifier.py	20 Nov 2002 23:18:18 -0000
***************
*** 46,91 ****
  
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 1
  
! class WordInfo(object):
!     __slots__ = ('atime',     # when this record was last used by scoring(*)
!                  'spamcount', # # of spams in which this word appears
!                  'hamcount',  # # of hams in which this word appears
!                  'killcount', # # of times this made it to spamprob()'s nbest
!                  'spamprob',  # prob(spam | msg contains this word)
!                 )
! 
!     # Invariant:  For use in a classifier database, at least one of
!     # spamcount and hamcount must be non-zero.
!     #
!     # (*)atime is the last access time, a UTC time.time() value.  It's the
!     # most recent time this word was used by scoring (i.e., by spamprob(),
!     # not by training via learn()); or, if the word has never been used by
!     # scoring, the time the word record was created (i.e., by learn()).
!     # One good criterion for identifying junk (word records that have no
!     # value) is to delete words that haven't been used for a long time.
!     # Perhaps they were typos, or unique identifiers, or relevant to a
!     # once-hot topic or scam that's fallen out of favor.  Whatever, if
!     # a word is no longer being used, it's just wasting space.
! 
!     def __init__(self, atime, spamprob=options.unknown_word_prob):
!         self.atime = atime
!         self.spamcount = self.hamcount = self.killcount = 0
          self.spamprob = spamprob
  
      def __repr__(self):
!         return "WordInfo%r" % repr((self.atime, self.spamcount,
!                                     self.hamcount, self.killcount,
!                                     self.spamprob))
  
      def __getstate__(self):
!         return (self.atime, self.spamcount, self.hamcount, self.killcount,
!                 self.spamprob)
  
      def __setstate__(self, t):
!         (self.atime, self.spamcount, self.hamcount, self.killcount,
!          self.spamprob) = t
  
  class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
--- 46,82 ----
  
  LN2 = math.log(2)       # used frequently by chi-combining
  
! PICKLE_VERSION = 2
  
! class CountInfo(object):
!     __slots__ = ('spamcount', 'hamcount')
! 
!     def __init__(self):
!         self.spamcount = self.hamcount = 0
! 
!     def __repr__(self):
!         return "CountInfo%r" % repr((self.spamcount, self.hamcount))
! 
!     def __getstate__(self):
!         return (self.spamcount, self.hamcount)
! 
!     def __setstate__(self, t):
!         (self.spamcount, self.hamcount) = t
! 
! class ProbInfo(object):
!     __slots__ = ('spamprob')
! 
!     def __init__(self, spamprob=options.unknown_word_prob):
          self.spamprob = spamprob
  
      def __repr__(self):
!         return "ProbInfo%r" % repr((self.spamprob,))
  
      def __getstate__(self):
!         return (self.spamprob,)
  
      def __setstate__(self, t):
!         (self.spamprob,) = t
  
  class Bayes:
      # Defining __slots__ here made Jeremy's life needlessly difficult when
***************
*** 100,118 ****
      #            )
  
      # allow a subclass to use a different class for WordInfo
!     WordInfoClass = WordInfo
  
      def __init__(self):
!         self.wordinfo = {}
          self.nspam = self.nham = 0
  
      def __getstate__(self):
!         return PICKLE_VERSION, self.wordinfo, self.nspam, self.nham
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.wordinfo, self.nspam, self.nham = t[1:]
  
      # spamprob() implementations.  One of the following is aliased to
      # spamprob, depending on option settings.
--- 91,112 ----
      #            )
  
      # allow a subclass to use a different class for WordInfo
!     CountInfoClass = CountInfo
!     ProbInfoClass = ProbInfo
  
      def __init__(self):
!         self.countinfo = {}
!         self.probinfo = {}
          self.nspam = self.nham = 0
  
      def __getstate__(self):
!         return (PICKLE_VERSION, self.countinfo, self.probinfo,
!                 self.nspam, self.nham)
  
      def __setstate__(self, t):
          if t[0] != PICKLE_VERSION:
              raise ValueError("Can't unpickle -- version %s unknown" % t[0])
!         self.countinfo, self.probinfo, self.nspam, self.nham = t[1:]
  
      # spamprob() implementations.  One of the following is aliased to
      # spamprob, depending on option settings.
***************
*** 143,151 ****
          P = Q = 1.0
          Pexp = Qexp = 0
          clues = self._getclues(wordstream)
!         for prob, word, record in clues:
!             if record is not None:  # else wordinfo doesn't know about it
!                 record.killcount += 1
              P *= 1.0 - prob
              Q *= prob
              if P < 1e-200:  # move back into range
--- 137,143 ----
          P = Q = 1.0
          Pexp = Qexp = 0
          clues = self._getclues(wordstream)
!         for prob, word in clues:
              P *= 1.0 - prob
              Q *= prob
              if P < 1e-200:  # move back into range
***************
*** 232,240 ****
          Hexp = Sexp = 0
  
          clues = self._getclues(wordstream)
!         for prob, word, record in clues:
!             if record is not None:  # else wordinfo doesn't know about it
!                 record.killcount += 1
              S *= 1.0 - prob
              H *= prob
              if S < 1e-200:  # prevent underflow
--- 224,230 ----
          Hexp = Sexp = 0
  
          clues = self._getclues(wordstream)
!         for prob, word in clues:
              S *= 1.0 - prob
              H *= prob
              if S < 1e-200:  # prevent underflow
***************
*** 277,283 ****
      if options.use_chi_squared_combining:
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam, update_probabilities=True):
          """Teach the classifier by example.
  
          wordstream is a word stream representing a message.  If is_spam is
--- 267,273 ----
      if options.use_chi_squared_combining:
          spamprob = chi2_spamprob
  
!     def learn(self, wordstream, is_spam):
          """Teach the classifier by example.
  
          wordstream is a word stream representing a message.  If is_spam is
***************
*** 294,323 ****
          """
  
          self._add_msg(wordstream, is_spam)
-         if update_probabilities:
-             self.update_probabilities()
  
!     def unlearn(self, wordstream, is_spam, update_probabilities=True):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
          Pass the same arguments you passed to learn().
          """
  
          self._remove_msg(wordstream, is_spam)
-         if update_probabilities:
-             self.update_probabilities()
- 
-     def update_probabilities(self):
-         """Update the word probabilities in the spam database.
- 
-         This computes a new probability for every word in the database,
-         so can be expensive.  learn() and unlearn() update the probabilities
-         each time by default.  Thay have an optional argument that allows
-         to skip this step when feeding in many messages, and in that case
-         you should call update_probabilities() after feeding the last
-         message and before calling spamprob().
-         """
  
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
  
--- 284,300 ----
          """
  
          self._add_msg(wordstream, is_spam)
  
!     def unlearn(self, wordstream, is_spam):
          """In case of pilot error, call unlearn ASAP after screwing up.
  
          Pass the same arguments you passed to learn().
          """
  
          self._remove_msg(wordstream, is_spam)
  
+     # Compute the probability reflected by a set of counts.
+     def _compute_probability(self, record):
          nham = float(self.nham or 1)
          nspam = float(self.nspam or 1)
  
***************
*** 330,406 ****
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
!         for word, record in self.wordinfo.iteritems():
!             # Compute p(word) = prob(msg is spam | msg contains word).
!             # This is the Graham calculation, but stripped of biases, and
!             # stripped of clamping into 0.01 thru 0.99.  The Bayesian
!             # adjustment following keeps them in a sane range, and one
!             # that naturally grows the more evidence there is to back up
!             # a probability.
!             hamcount = record.hamcount
!             assert hamcount <= nham
!             hamratio = hamcount / nham
! 
!             spamcount = record.spamcount
!             assert spamcount <= nspam
!             spamratio = spamcount / nspam
! 
!             prob = spamratio / (hamratio + spamratio)
  
!             # Now do Robinson's Bayesian adjustment.
!             #
!             #         s*x + n*p(w)
!             # f(w) = --------------
!             #           s + n
!             #
!             # I find this easier to reason about like so (equivalent when
!             # s != 0):
!             #
!             #        x - p
!             #  p +  -------
!             #       1 + n/s
!             #
!             # IOW, it moves p a fraction of the distance from p to x, and
!             # less so the larger n is, or the smaller s is.
  
!             # Experimental:
!             # Picking a good value for n is interesting:  how much empirical
!             # evidence do we really have?  If nham == nspam,
!             # hamcount + spamcount makes a lot of sense, and the code here
!             # does that by default.
!             # But if, e.g., nham is much larger than nspam, p(w) can get a
!             # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!             # strong ham words (high hamcount) much stronger than strong
!             # spam words (high spamcount), and that makes the accidental
!             # appearance of a strong ham word in spam much more damaging than
!             # the accidental appearance of a strong spam word in ham.
!             # So we don't give hamcount full credit when nham > nspam (or
!             # spamcount when nspam > nham):  instead we knock hamcount down
!             # to what it would have been had nham been equal to nspam.  IOW,
!             # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!             # we don't "believe" any count to an extent more than
!             # min(nspam, nham) justifies.
! 
!             n = hamcount * spam2ham  +  spamcount * ham2spam
!             prob = (StimesX + n * prob) / (S + n)
! 
!             if record.spamprob != prob:
!                 record.spamprob = prob
!                 # The next seemingly pointless line appears to be a hack
!                 # to allow a persistent db to realize the record has changed.
!                 self.wordinfo[word] = record
! 
!     def clearjunk(self, oldesttime):
!         """Forget useless wordinfo records.  This can shrink the database size.
! 
!         A record for a word will be retained only if the word was accessed
!         at or after oldesttime.
!         """
  
!         wordinfo = self.wordinfo
!         tonuke = [w for w, r in wordinfo.iteritems() if r.atime < oldesttime]
!         for w in tonuke:
!             del wordinfo[w]
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
      # n>1 times in a single message, training added n to the word's hamcount
--- 307,373 ----
          S = options.unknown_word_strength
          StimesX = S * options.unknown_word_prob
  
!         # Compute p(word) = prob(msg is spam | msg contains word).
!         # This is the Graham calculation, but stripped of biases, and
!         # stripped of clamping into 0.01 thru 0.99.  The Bayesian
!         # adjustment following keeps them in a sane range, and one
!         # that naturally grows the more evidence there is to back up
!         # a probability.
!         hamcount = record.hamcount
!         assert hamcount <= nham
!         hamratio = hamcount / nham
! 
!         spamcount = record.spamcount
!         assert spamcount <= nspam
!         spamratio = spamcount / nspam
  
!         prob = spamratio / (hamratio + spamratio)
  
!         # Now do Robinson's Bayesian adjustment.
!         #
!         #         s*x + n*p(w)
!         # f(w) = --------------
!         #           s + n
!         #
!         # I find this easier to reason about like so (equivalent when
!         # s != 0):
!         #
!         #        x - p
!         #  p +  -------
!         #       1 + n/s
!         #
!         # IOW, it moves p a fraction of the distance from p to x, and
!         # less so the larger n is, or the smaller s is.
  
!         # Experimental:
!         # Picking a good value for n is interesting:  how much empirical
!         # evidence do we really have?  If nham == nspam,
!         # hamcount + spamcount makes a lot of sense, and the code here
!         # does that by default.
!         # But if, e.g., nham is much larger than nspam, p(w) can get a
!         # lot closer to 0.0 than it can get to 1.0.  That in turn makes
!         # strong ham words (high hamcount) much stronger than strong
!         # spam words (high spamcount), and that makes the accidental
!         # appearance of a strong ham word in spam much more damaging than
!         # the accidental appearance of a strong spam word in ham.
!         # So we don't give hamcount full credit when nham > nspam (or
!         # spamcount when nspam > nham):  instead we knock hamcount down
!         # to what it would have been had nham been equal to nspam.  IOW,
!         # we multiply hamcount by nspam/nham when nspam < nham; or, IOOW,
!         # we don't "believe" any count to an extent more than
!         # min(nspam, nham) justifies.
! 
!         n = hamcount * spam2ham  +  spamcount * ham2spam
!         prob = (StimesX + n * prob) / (S + n)
! 
!         return prob
! 
!     # Forget all the cached probability information.
!     # This is usually done in the process of learning or unlearning,
!     # since changing nham and nspam changes the probability of nearly
!     # every word in the database.
!     def _wipe_probinfo(self):
!         self.probinfo = {}
  
      # NOTE:  Graham's scheme had a strange asymmetry:  when a word appeared
      # n>1 times in a single message, training added n to the word's hamcount
***************
*** 427,447 ****
              self.nspam += 1
          else:
              self.nham += 1
  
!         wordinfo = self.wordinfo
!         wordinfoget = wordinfo.get
!         now = time.time()
          for word in Set(wordstream):
!             record = wordinfoget(word)
              if record is None:
!                 record = self.WordInfoClass(now)
  
              if is_spam:
                  record.spamcount += 1
              else:
                  record.hamcount += 1
              # Needed to tell a persistent DB that the content changed.
!             wordinfo[word] = record
  
      def _remove_msg(self, wordstream, is_spam):
          if is_spam:
--- 394,414 ----
              self.nspam += 1
          else:
              self.nham += 1
+         self._wipe_probinfo()
  
!         countinfo = self.countinfo
!         countinfoget = countinfo.get
          for word in Set(wordstream):
!             record = countinfoget(word)
              if record is None:
!                 record = self.CountInfoClass()
  
              if is_spam:
                  record.spamcount += 1
              else:
                  record.hamcount += 1
              # Needed to tell a persistent DB that the content changed.
!             countinfo[word] = record
  
      def _remove_msg(self, wordstream, is_spam):
          if is_spam:
***************
*** 452,462 ****
              if self.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
              self.nham -= 1
  
!         wordinfo = self.wordinfo
!         wordinfoget = wordinfo.get
          for word in Set(wordstream):
!             record = wordinfoget(word)
              if record is not None:
                  if is_spam:
                      if record.spamcount > 0:
--- 419,430 ----
              if self.nham <= 0:
                  raise ValueError("non-spam count would go negative!")
              self.nham -= 1
+         self._wipe_probinfo()
  
!         countinfo = self.countinfo
!         countinfoget = countinfo.get
          for word in Set(wordstream):
!             record = countinfoget(word)
              if record is not None:
                  if is_spam:
                      if record.spamcount > 0:
***************
*** 465,474 ****
                      if record.hamcount > 0:
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del wordinfo[word]
                  else:
                      # Needed to tell a persistent DB that the content changed.
!                     wordinfo[word] = record
  
      def _getclues(self, wordstream):
          mindist = options.minimum_prob_strength
--- 433,454 ----
                      if record.hamcount > 0:
                          record.hamcount -= 1
                  if record.hamcount == 0 == record.spamcount:
!                     del countinfo[word]
                  else:
                      # Needed to tell a persistent DB that the content changed.
!                     countinfo[word] = record
! 
!     # Handle the generation and caching of the spamprob values.
!     def _getprobability(self, word):
!         record = self.probinfo.get(word)
!         if record is None:
!             counts = self.countinfo.get(word)
!             if counts is None:
!                 return options.unknown_word_prob
!             record = self.ProbInfoClass()
!             record.spamprob = self._compute_probability(counts)
!             self.probinfo[word] = record
!         return record.spamprob
  
      def _getclues(self, wordstream):
          mindist = options.minimum_prob_strength
***************
*** 477,497 ****
          clues = []  # (distance, prob, word, record) tuples
          pushclue = clues.append
  
!         wordinfoget = self.wordinfo.get
!         now = time.time()
          for word in Set(wordstream):
!             record = wordinfoget(word)
!             if record is None:
!                 prob = unknown
!             else:
!                 record.atime = now
!                 prob = record.spamprob
              distance = abs(prob - 0.5)
              if distance >= mindist:
!                 pushclue((distance, prob, word, record))
  
          clues.sort()
          if len(clues) > options.max_discriminators:
              del clues[0 : -options.max_discriminators]
!         # Return (prob, word, record).
          return [t[1:] for t in clues]
--- 457,471 ----
          clues = []  # (distance, prob, word, record) tuples
          pushclue = clues.append
  
!         probget = self._getprobability
          for word in Set(wordstream):
!             prob = probget(word)
              distance = abs(prob - 0.5)
              if distance >= mindist:
!                 pushclue((distance, prob, word))
  
          clues.sort()
          if len(clues) > options.max_discriminators:
              del clues[0 : -options.max_discriminators]
!         # Return (prob, word).
          return [t[1:] for t in clues]

From popiel@wolfskeep.com  Wed Nov 20 23:33:45 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 20 Nov 2002 15:33:45 -0800
Subject: [Spambayes] Better optimization loop 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "20 Nov 2002 13:16:25 PST." <w53r8dgymae.fsf@woozle.org> 
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
	<3DDB7D3D.1020306@hooft.net>  <w53r8dgymae.fsf@woozle.org> 
Message-ID: <20021120233345.75B1DF5A7@cashew.wolfskeep.com>

In message:  <w53r8dgymae.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>So then, "Rob W.W. Hooft" <rob@hooft.net> is all like:
>
>> Another speedup I could use is a version of Bayes that calculates the
>> spamprob from the numbers on demand instead of calculating them for
>> all words every time. This pays off for all cases where the training
>> batch is very small (~1 message).
>
>Funny you should bring that up, Rob, because I happen to be working on
>exactly that.  The only way I could think to do it was to pass in a new
>option to Bayes.learn() and Bayes.unlearn().

Argh.  I was working on it, too... hence the patch I just sent out.
Oh, well... no big deal.  It looks like our implementations are
significantly different, though.  Might be worth looking at both
and seeing which is better.

- Alex

From neale@woozle.org  Thu Nov 21 00:13:44 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 16:13:44 -0800
Subject: [Spambayes] Better optimization loop
In-Reply-To: <20021120233345.75B1DF5A7@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
	<3DDB7D3D.1020306@hooft.net> <w53r8dgymae.fsf@woozle.org>
	<20021120233345.75B1DF5A7@cashew.wolfskeep.com>
Message-ID: <w53lm3nzsnb.fsf@woozle.org>

So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:

> Argh.  I was working on it, too... hence the patch I just sent out.
> Oh, well... no big deal.  It looks like our implementations are
> significantly different, though.  Might be worth looking at both
> and seeing which is better.

I think what you did is a little closer to what Rob suggested to me in
response.  It sounds like a pretty good idea to me.  What I've been
doing in my idle time for the past few hours is playing around with
having the WordInfo class compute its own probability.  I did this by
defining two new methods:

    def probability(self, nham, nspam):
        if not self.spamprob:
            self.update_probability(nham, nspam)
        return self.spamprob

    def update_probability(self, nham, nspam):
        [basically the same code as Bayes.update_probabilities]

My idea was that you'd have to score the probability for each word
whenever you use it first, but after that the probability is cached.
Long-running things like the pop proxy will get the benefit of the
cached probabilities, and short-lived things like hammiefilter get much
faster training, and only slightly slower scoring.  At least, that's
what I expect.  I haven't tested this yet.
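
The per-record caching idea above could be sketched roughly as follows.
This is only an illustration: the class name LazyWordInfo, the use of
None as the "not computed yet" marker, and the default values for s and
x are assumptions, and the probability formula is a simplified form of
the Robinson adjustment rather than the project's actual code.

```python
class LazyWordInfo(object):
    """Hypothetical word record that caches its own spam probability."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0
        self.spamprob = None          # None means "not computed yet"

    def probability(self, nham, nspam, s=0.45, x=0.5):
        # Compute on first use, then serve the cached value.
        if self.spamprob is None:
            self.update_probability(nham, nspam, s, x)
        return self.spamprob

    def update_probability(self, nham, nspam, s, x):
        n = self.hamcount + self.spamcount
        if n == 0:
            # No evidence at all: fall back to the unknown-word value.
            self.spamprob = x
            return
        hamratio = self.hamcount / float(nham or 1)
        spamratio = self.spamcount / float(nspam or 1)
        p = spamratio / (hamratio + spamratio)
        # Simplified Robinson adjustment: f(w) = (s*x + n*p) / (s + n)
        self.spamprob = (s * x + n * p) / (s + n)
```

Note that nothing in this sketch ever resets spamprob, so a cached value
survives later training; some separate invalidation step is still needed.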


From popiel@wolfskeep.com  Thu Nov 21 01:31:50 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 20 Nov 2002 17:31:50 -0800
Subject: [Spambayes] Better optimization loop 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "20 Nov 2002 16:13:44 PST." <w53lm3nzsnb.fsf@woozle.org> 
References: <LNBBLJKPBEHFEDALKOLCIEKCCNAB.tim_one@email.msn.com>
	<3DDB7D3D.1020306@hooft.net> <w53r8dgymae.fsf@woozle.org>
	<20021120233345.75B1DF5A7@cashew.wolfskeep.com>  <w53lm3nzsnb.fsf@woozle.org> 
Message-ID: <20021121013150.925D0F5A7@cashew.wolfskeep.com>

In message:  <w53lm3nzsnb.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>
>What I've been doing in my idle time for the past few hours is playing
>around with having the WordInfo class compute its own probability.

[snip]

>My idea was that you'd have to score the probability for each word
>whenever you use it first, but after that the probability is cached.
>Long-running things like the pop proxy will get the benefit of the
>cached probabilities, and short-lived things like hammiefilter get much
>faster training, and only slightly slower scoring.  At least, that's
>what I expect.  I haven't tested this yet.

What this seems to lack is a good (cheap) way to invalidate the
cache.  Since changing the amount of training data affects the
bayesian adjustment to the probability for just about every word
in the database, being able to invalidate the cache is important.
(Yes, I know I keep harping on this, but a lot of the ideas
circulating on this topic seem to ignore it.)
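
One cheap way to invalidate such a cache is to keep the counts and the
derived probabilities in separate dictionaries and throw the second one
away whenever training changes the totals. The sketch below assumes a
hypothetical TinyClassifier class and a simplified n (hamcount +
spamcount, without the ham/spam balancing used in the real classifier);
it is meant to show the shape of the idea, not the posted patch.

```python
class TinyClassifier(object):
    """Hypothetical classifier with a throwaway probability cache."""

    def __init__(self, s=0.45, x=0.5):
        self.counts = {}       # word -> [spamcount, hamcount]
        self.probcache = {}    # word -> cached probability
        self.nspam = self.nham = 0
        self.s, self.x = s, x  # assumed strength / unknown-word prob

    def learn(self, words, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        # nspam/nham changed, so nearly every probability is stale.
        # Replacing the dict invalidates all of them in one step.
        self.probcache = {}
        for word in set(words):
            record = self.counts.setdefault(word, [0, 0])
            record[0 if is_spam else 1] += 1

    def probability(self, word):
        if word in self.probcache:
            return self.probcache[word]
        record = self.counts.get(word)
        if record is None:
            return self.x      # never seen: unknown-word probability
        spamcount, hamcount = record
        spamratio = spamcount / float(self.nspam or 1)
        hamratio = hamcount / float(self.nham or 1)
        p = spamratio / (hamratio + spamratio)
        n = spamcount + hamcount
        prob = (self.s * self.x + n * p) / (self.s + n)
        self.probcache[word] = prob
        return prob
```

The trade-off is that every training call discards the whole cache, which
is cheap to do but means the next scoring pass recomputes each word it
touches.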

FWIW, I did a small time test on the patch I posted... and it seems
to run marginally faster than the original code in the classic timcv
setting.  I think that getting rid of tracking the timestamps (and
making the change non-optional, unlike the first buggy version I
mentioned about a week ago) offset the added work of checking multiple
places on a cache miss.

Of course, it'll be much faster than dealing with update_probabilities
in the fine-grained train-a-few, classify-a-few, train-a-few-again
setting... but I haven't actually tested that.  I need to do that.

- Alex

From gustav@morpheus.demon.co.uk  Wed Nov 20 22:57:56 2002
From: gustav@morpheus.demon.co.uk (Paul Moore)
Date: Wed, 20 Nov 2002 22:57:56 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: <15833.20589.376685.686723@montanaro.dyndns.org>
	<9qqitu09afl7kufdmr41kn28tfn9nfjpur@4ax.com>
	<15834.15618.465106.671756@montanaro.dyndns.org>
	<gp0ntu41535klnj60q5b68u1u21n7fmbi4@4ax.com>
	<3DDBCFA5.8070607@ActiveState.com>
	<aupntugk287b70mg37lcfg8uk8afed63fo@4ax.com>
	<65ur27eo.fsf@morpheus.demon.co.uk>
Message-ID: <n2m-g.zns3zw5n.fsf@morpheus.demon.co.uk>

Paul Moore <lists@morpheus.demon.co.uk> writes:

> Do I need to rebuild the database after upgrading? I didn't, and the
> user interface said "Total emails trained: Spam: 0 Ham: 0". This
> doesn't tally with reality - I'd trained on a batch of messages (using
> hammie.py) before starting (BTW, adding a bulk training interface
> might be nice, although using hammie.py seems to work OK in practice).

Rebuilding seems to have fixed things, although I don't know quite
why.

One thing, when I went to the review screen and trained on the
messages there, I got an exception:

>pop3proxy.py -l 8110 -d -b localhost
Loading database... Done.
BayesProxyListener listening on port 8110.
UserInterfaceListener listening on port 8880.
error: uncaptured python exception, closing channel <__main__.UserInterface connected at 0x8e5a20>
(exceptions.AttributeError:'tuple' object has no attribute 'hamcount'
 [C:\Python22\lib\asyncore.py|poll|99]
 [C:\Python22\lib\asyncore.py|handle_read_event|394]
 [C:\Python22\lib\asynchat.py|handle_read|112]
 [C:\Applications\Spambayes\pop3proxy.py|found_terminator|720]
 [C:\Applications\Spambayes\pop3proxy.py|onRequest|745]
 [C:\Applications\Spambayes\pop3proxy.py|onReview|980]
 [C:\Applications\Spambayes\classifier.py|update_probabilities|340])

Sorry, no time to diagnose further just now.
Paul.
-- 
This signature intentionally left blank

From tim.one@comcast.net  Thu Nov 21 02:12:01 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 20 Nov 2002 21:12:01 -0500
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <NABBJOOEOFODEALNMJAJOEGJHDAA.B-Morgan@concentric.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEAECOAB.tim.one@comcast.net>

[Brad Morgan]
> Outlook (Internet-only) has been occasionally "hanging around" on me
> long before I tried any sort of spam filtering.  Sometimes it hangs
> around collecting messages, sometimes it just prevents another
> version from starting up.

Same here, of course.  About a year ago, it did this every time for a solid
week, and lost all my toolbar customizations each time.  Sometimes it also
lost my rules, ditto my view customizations.  Then it went away.  Then it came
back.  Etc.  I haven't noticed any increase in frequency or severity since I
started using the addin.

> I haven't found any pattern to the unsuccessful shutdowns and
> Microsoft certainly hasn't either for all the patches they've put out.

I've noticed that it happens only after I bring up Outlook <wink>.

> IMO, it's their bug, plain and simple.

It seems that way.

We "should be" more paranoid, though.  For example, spin off a thread to
rewrite the pickle to disk whenever it gets dirty, and rename the last N
pickle files instead of overwriting them.  MS doesn't write 5 copies of the
registry for fun either <0.9 wink>.  A persistent DB may eventually make
more sense too.  I know!  Let's store the words and spamprobs in the
registry <aaaaackckck>.
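
The "rename the last N pickle files instead of overwriting them" part of
that suggestion could look something like the sketch below. The function
name save_with_backups, the ".1", ".2", ... suffix scheme, and the keep
parameter are all assumptions made for illustration; the background
thread is left out, and os.replace is a modern Python call that was not
available in 2002.

```python
import os
import pickle
import tempfile


def save_with_backups(obj, path, keep=3):
    """Pickle obj to path, keeping up to `keep` older copies.

    Older copies are named path.1 (newest) through path.<keep> (oldest).
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # Write the new pickle to a temp file first, so a crash mid-write
    # never damages the current copy.
    fd, tmpname = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f)

    # Shift existing backups up by one: path.(keep-1) -> path.keep, ...
    for i in range(keep - 1, 0, -1):
        older = "%s.%d" % (path, i)
        if os.path.exists(older):
            os.replace(older, "%s.%d" % (path, i + 1))
    if keep > 0 and os.path.exists(path):
        os.replace(path, path + ".1")

    # Finally move the freshly written file into place.
    os.replace(tmpname, path)
```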


From tim@fourstonesExpressions.com  Thu Nov 21 02:37:55 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 20 Nov 2002 20:37:55 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53vg2szumi.fsf@woozle.org>
Message-ID: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>

Neale, where are we on the playground stuff?  We're getting out of sync with 
pop3proxy... I suppose cvs will be able to merge to a point.

I'm ok with making load() a noop and store() = sync() in the dbdict class.  We 
could then do away with lsdbdict.  This is much cleaner.  Sans objections, 
I'll do that by tomorrow night.
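
As a rough picture of what "load() a noop and store() = sync()" means for
a database-backed store, here is a minimal sketch built on the standard
shelve module. The class name DBStore and its attributes are hypothetical
and are not the project's dbdict class; the point is only that a store
backed by an on-disk database has nothing to read up front and only needs
to flush on store().

```python
import shelve


class DBStore(object):
    """Hypothetical DB-backed store; not the project's dbdict class."""

    def __init__(self, filename):
        # The shelf reads entries lazily from disk as they are requested.
        self.db = shelve.open(filename)

    def load(self):
        # Nothing to do: there is no in-memory copy to populate.
        pass

    def store(self):
        # Flush pending writes to disk instead of rewriting a pickle.
        self.db.sync()

    def close(self):
        self.db.close()
```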

Once that's done, what's left?

- Tim

11/19/2002 11:18:45 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> How about PersistentClassifier?
>
>Yech.  Since the things are kinda doing what the standard shelve module
>does, and we keep calling them "stores", how about "store"?
>
>
>
- Tim
www.fourstonesExpressions.com 


From tim.one@comcast.net  Thu Nov 21 03:15:50 2002
From: tim.one@comcast.net (Tim Peters)
Date: Wed, 20 Nov 2002 22:15:50 -0500
Subject: [Spambayes] Just visiting
Message-ID: <LNBBLJKPBEHFEDALKOLCCEAKCOAB.tim.one@comcast.net>

Making major progress on Zope3 has become a top priority for my (and
Guido's, and Jeremy's, and Barry's) employer, to an extent that precludes
Zope Corp work on projects that don't directly contribute to that.  This
project doesn't have beans to do with Zope, so my involvement here has
become a strictly "spare time" thing.  Since good news always comes in
threes, the first alpha of Python 2.3 is due out at the end of the year, and
large parts of what I wanted to accomplish there have also become "spare
time".  In most of my spare time since learning this, I've been waiting for
the third piece of good news <wink>.

I won't vanish, but I have to cut waaaaaay back on this list.  I'm proud of
what we've all accomplished here, and would hate to see it die -- especially
while it still sucks too much for my sisters to use <wink>.  In case you're
wondering, I approve of switching the default WordInfo thingie to compute
probs on demand, and will take the remaining minute of today's spare time to
remind that 2.2's properties would still allow for using .spamprob notation.
Like

>>> class Whatever(object):
...     def _implementation_of_x_read_attr(self):
...         return 2.0 * 4
...     x = property(_implementation_of_x_read_attr)
...
>>> w = Whatever()
>>> w.x
8.0
>>>

s/x/spamprob/ and set your imagination free.  .spamprob can do anything, and
even different things when getting, setting, or deleting:

>>> print property.__doc__
property(fget=None, fset=None, fdel=None, doc=None) -> property attribute

fget is a function to be used for getting an attribute value, and likewise
fset is a function for setting, and fdel a function for del'ing, an
attribute.  Typical use is to define a managed attribute x:
class C(object):
    def getx(self): return self.__x
    def setx(self, value): self.__x = value
    def delx(self): del self.__x
    x = property(getx, setx, delx, "I'm the 'x' property.")
>>>
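
Applying the s/x/spamprob/ substitution suggested above, a record could
expose .spamprob as a property that computes on first read and can be
invalidated with del. This is a sketch under assumptions: the class name
WordRecord is made up, and the naive spamcount / total ratio is only a
stand-in for the real probability calculation.

```python
class WordRecord(object):
    """Hypothetical record whose .spamprob is computed on demand."""

    def __init__(self, spamcount=0, hamcount=0):
        self.spamcount = spamcount
        self.hamcount = hamcount
        self._spamprob = None     # cache; None means "not computed"

    def _get_spamprob(self):
        if self._spamprob is None:
            total = self.spamcount + self.hamcount
            # Stand-in formula, not the classifier's real calculation.
            if total == 0:
                self._spamprob = 0.5
            else:
                self._spamprob = self.spamcount / float(total)
        return self._spamprob

    def _del_spamprob(self):
        # "del record.spamprob" drops the cached value.
        self._spamprob = None

    spamprob = property(_get_spamprob, None, _del_spamprob,
                        "Spam probability, computed on first access.")
```

Callers keep writing record.spamprob exactly as before; only the class
knows that the value is now computed lazily.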


From rjdsnet@yahoo.com  Wed Nov 20 15:41:16 2002
From: rjdsnet@yahoo.com (Ranieri J D Severiano)
Date: Wed, 20 Nov 2002 13:41:16 -0200
Subject: [Spambayes] hammiefilter and hammiecli improvements
Message-ID: <20021120154116.GA1052@uyrapuru>

Hi,
What do you think about adding the "-d" option to hammiefilter.py?

-----------------
def main():
    action = filter
    opts, args = getopt.getopt(sys.argv[1:], 'hngsd') ### HERE
    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-g':
            action = train_ham
        elif opt == '-s':
            action = train_spam
        elif opt == "-n":
            action = newdb
        elif opt == '-d':    ###\
            global USEDB     ### - AND HERE
            USEDB = True     ###/
    action()
-----------------


My other suggestion is to fix the print statement in hammiecli.py:


-----------------
def main():
    msg = sys.stdin.read()
    try:
        x = xmlrpclib.ServerProxy(RPCBASE)
        m = xmlrpclib.Binary(msg)
        out = x.filter(m)
        print out.data    ### HERE
    except:
        if __debug__:
            import traceback
            traceback.print_exc()
        print msg
-----------------

Now, you can get the message and pass it to procmail
or another filter.


Thanks,
Ranieri J D Severiano

From neale@woozle.org  Thu Nov 21 04:04:26 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 20:04:26 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
Message-ID: <w53d6ozzhyt.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> Neale, where are we on the playground stuff?  We're getting out of sync with 
> pop3proxy... I suppose cvs will be able to merge to a point.

I'm currently entwined with mucking the heck out of WordInfo.  I've got
a neato scheme based on Alex's patch and comments where the WordInfo
classes still compute their own probabilities, but also keep a revision
number which is compared against a MetaInfo class.  The neato thing
here, at least from the perspective of DBDict, is that all the meta
information is now bundled up in a handy object.
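
A revision-stamped cache along these lines might be sketched as below.
The names MetaInfo.trained() and StampedWordInfo, the default s and x
values, and the simplified probability formula are assumptions made for
illustration; the actual scheme was still in progress when this was
written.

```python
class MetaInfo(object):
    """Hypothetical bundle of classifier-wide totals plus a revision."""

    def __init__(self):
        self.nspam = self.nham = 0
        self.revision = 0

    def trained(self, is_spam):
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.revision += 1        # every training step bumps the stamp


class StampedWordInfo(object):
    """Hypothetical record that remembers which revision it was scored at."""

    def __init__(self):
        self.spamcount = self.hamcount = 0
        self.spamprob = None
        self.revision = -1        # never matches a real revision

    def probability(self, meta, s=0.45, x=0.5):
        if self.revision != meta.revision:
            n = self.spamcount + self.hamcount
            if n == 0:
                self.spamprob = x
            else:
                spamratio = self.spamcount / float(meta.nspam or 1)
                hamratio = self.hamcount / float(meta.nham or 1)
                p = spamratio / (hamratio + spamratio)
                self.spamprob = (s * x + n * p) / (s + n)
            self.revision = meta.revision
        return self.spamprob
```

Compared with wiping a separate cache, this costs one extra stored number
per word, but stale entries are detected individually and lazily.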

> I'm ok with making load() a noop and store() = sync() in the dbdict
> class.  We could then do away with lsdbdict.  This is much cleaner.
> Sans objections, I'll do that by tomorrow night.

Yeah, go for it.  I knew I'd wear you down ;)

> Once that's done, what's left?

Nothing, really, we just have to present a summary of all the changes
we've made, get sign-off, and then I'll merge back into head.

I'd also like to take the hedge trimmers to the options class; option
names like "hammiefilter_persistent_storage_file" are a little long,
methinks.  But that can wait.

Neale

From neale@woozle.org  Thu Nov 21 04:38:43 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 20:38:43 -0800
Subject: [Spambayes] hammiefilter and hammiecli improvements
In-Reply-To: <20021120154116.GA1052@uyrapuru>
References: <20021120154116.GA1052@uyrapuru>
Message-ID: <w537kf7zgdo.fsf@woozle.org>

So then, Ranieri J D Severiano <rjdsnet@yahoo.com> is all like:

> Hi,
> What do you think about adding the "-d" option to hammiefilter.py?

Hi Ranieri.  Good eye!  I think, at least with hammiefilter, you
shouldn't even have an option--USEDB=True should be the only thing you can
do.  Pickling/unpickling is just too slow for something which is
supposed to run quickly and then exit.  I'm going to do this soon; I
just haven't figured out yet how to do this without messing up other
things like pop3proxy.

> My other suggestion is to fix the print statement in hammiecli.py:

>         print out.data    ### HERE

Yes, that is an important detail, isn't it?  I've checked in a fix.
Thanks!

Neale

From popiel@wolfskeep.com  Thu Nov 21 05:36:28 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 20 Nov 2002 21:36:28 -0800
Subject: [Spambayes] proposed changes to hammie & co. 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "20 Nov 2002 20:04:26 PST." <w53d6ozzhyt.fsf@woozle.org> 
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<w53d6ozzhyt.fsf@woozle.org> 
Message-ID: <20021121053628.DD924F5A7@cashew.wolfskeep.com>

In message:  <w53d6ozzhyt.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>
>I'm currently entwined with mucking the heck out of WordInfo.  I've got
>a neato scheme based on Alex's patch and comments where the WordInfo
>classes still compute their own probabilities, but also keep a revision
>number which is compared against a MetaInfo class.

Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)
But yes, revision-stamping everything will work just as well as my
patch... though I suspect it'll be slower, just 'cause it's digging
through more data.  Not that speed is a big issue for anything other
than bulk testing.

>The neato thing here, at least from the perspective of DBDict, is that
>all the meta information is now bundled up in a handy object.

This is unalloyed good.

- Alex (opinionated)

From neale@woozle.org  Thu Nov 21 06:08:00 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 22:08:00 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <20021121053628.DD924F5A7@cashew.wolfskeep.com>
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<w53d6ozzhyt.fsf@woozle.org>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com>
Message-ID: <w53y97nxxof.fsf@woozle.org>

So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:

> In message:  <w53d6ozzhyt.fsf@woozle.org>
>              Neale Pickett <neale@woozle.org> writes:
> >
> >I'm currently entwined with mucking the heck out of WordInfo.  I've got
> >a neato scheme based on Alex's patch and comments where the WordInfo
> >classes still compute their own probabilities, but also keep a revision
> >number which is compared against a MetaInfo class.
> 
> Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)

Ah, but the only thing *stored* is (spamcount, hamcount).  The
probability is calculated the first time you ask for it.  If you don't
update nspam or nham, the next time you ask for it it gives the cached
value.  So the database is small, but you still get the in-memory
probability caching if you're using a pickle or ZODB.
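
A minimal sketch of the scheme described above, assuming Python 3 and
a hypothetical class layout.  The method names (learned, stored,
spamprob), the computations counter, and the smoothing constants 0.45
and 0.5 are illustrative assumptions, not the code in the branch:

```python
class MetaInfo:
    """Training totals shared by all words, plus a revision counter."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.revision = 0

    def learned(self, is_spam):
        # Changing nspam or nham makes every cached probability stale,
        # so bump the revision instead of touching each word.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        self.revision += 1


class WordInfo:
    def __init__(self, meta, spamcount=0, hamcount=0):
        self.meta = meta
        self.spamcount = spamcount
        self.hamcount = hamcount
        self._prob = None
        self._prob_revision = None
        self.computations = 0  # instrumentation for this example only

    def stored(self):
        # The only data that would be written to the database.
        return (self.spamcount, self.hamcount)

    def spamprob(self, strength=0.45, unknown=0.5):
        # Recompute only if the totals changed since the cached value.
        if self._prob_revision != self.meta.revision:
            self._prob = self._compute(strength, unknown)
            self._prob_revision = self.meta.revision
        return self._prob

    def _compute(self, strength, unknown):
        self.computations += 1
        n = self.spamcount + self.hamcount
        if n == 0:
            return unknown
        spamratio = self.spamcount / max(self.meta.nspam, 1)
        hamratio = self.hamcount / max(self.meta.nham, 1)
        raw = spamratio / (spamratio + hamratio)
        return (strength * unknown + n * raw) / (strength + n)


meta = MetaInfo()
word = WordInfo(meta, spamcount=1)
meta.learned(is_spam=True)
first = word.spamprob()    # computed
second = word.spamprob()   # served from the in-memory cache
meta.learned(is_spam=False)
third = word.spamprob()    # revision changed, so recomputed
```

The cache lives only in memory, so a process that starts, scores one
message, and exits pays for the computation each time.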

But now that words are computing their own probabilities, the Bayes
class no longer does anything Bayesian.  I guess it's time to rename
that class to Classifier.

> >The neato thing here, at least from the perspective of DBDict, is
> >that all the meta information is now bundled up in a handy object.
> 
> This is unalloyed good.

"unalloyed" is a superb word, Alex.  It reminds me that I should be
studying for the GRE instead of hacking spam classifier code :)

Neale

From skip@pobox.com  Thu Nov 21 06:32:01 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 21 Nov 2002 00:32:01 -0600
Subject: [Spambayes] Is it safe to get back in the water?
Message-ID: <15836.32225.108330.152806@montanaro.dyndns.org>

I've been too timid to attempt an update from cvs followed by an install for
the last several days.  The changes have been flying too fast for me to keep
up with.  Assuming my hammie.db file (the anydbm version) hasn't been
modified in a week, can I assume enough breakage has occurred that I will
have to retrain from scratch?  What about hammie.py?  I currently execute

    hammie.py -f -d -p $HOME/hammie.db

from my procmailrc file.  Does a simple

    hammiefilter.py

do something similar?  How do I tell it what .db file to use?

thx,

Skip

From skip@pobox.com  Thu Nov 21 07:09:04 2002
From: skip@pobox.com (Skip Montanaro)
Date: Thu, 21 Nov 2002 01:09:04 -0600
Subject: [Spambayes] Joe-job update
Message-ID: <15836.34448.30442.967802@montanaro.dyndns.org>


This really has nothing to do with spambayes.  I'm just bringing personal
closure to the joe-job topic.  I kept getting flurries of bounces and they
seemed to all have one thing in common - the spammer was coming from some
host in the client.attbi.com domain:

    Received: from mx.boston.juno.com (12-254-180-157.client.attbi.com
            [12.254.180.157])
            by manatee.mojam.com (8.12.1/8.12.1) with ESMTP id gAL3osVN024610
            for <bigbsbfan@gate.net>; Wed, 20 Nov 2002 21:55:35 -0600

Funny thing, I'm an AT&T broadband customer, and in order to allow my laptop
at home to push mail through mail.mojam.com I had once upon a time added a
RELAY line to /etc/mail/access:

    client.attbi.com    RELAY

Damn!  It looks like my mail exchanger was being badly abused.

I also had

    montanaro.dyndns.org        RELAY

(the hostname of my laptop, which maps to whatever IP I happen to be at for
the moment), but it turns out this did no good.  Once I removed the
client.attbi.com line I discovered I couldn't route mail through
mail.mojam.com:

    MAIL From:<skip@montanaro.dyndns.org>
    250 2.1.0 <skip@montanaro.dyndns.org>... Sender ok
    RCPT To:<skip@pobox.com>
    550 5.7.1 <skip@pobox.com>... Relaying denied

After a moment's thought, the solution turned out to be simple.  Root's cron
now runs a little fix-access script that adjusts /etc/mail/access to allow
relaying from precisely the IP address I happen to be at for the moment:

    ip=`host montanaro.dyndns.org | sed -e 's/.* //'`
    cd /etc/mail
    echo "$ip       RELAY" > access.tmp
    cat /etc/mail/access.base >> access.tmp
    mv access.tmp access
    make

This should help stem the tide of my own joe-job flurries.  I don't know if
it will help anyone else, but I pass this solution along just in case.

Skip

From neale@woozle.org  Thu Nov 21 07:12:01 2002
From: neale@woozle.org (Neale Pickett)
Date: 20 Nov 2002 23:12:01 -0800
Subject: [Spambayes] Is it safe to get back in the water?
In-Reply-To: <15836.32225.108330.152806@montanaro.dyndns.org>
References: <15836.32225.108330.152806@montanaro.dyndns.org>
Message-ID: <w53ptszxupq.fsf@woozle.org>

So then, Skip Montanaro <skip@pobox.com> is all like:

> Assuming my hammie.db file (the anydbm version) hasn't been modified
> in a week, can I assume enough breakage has occurred that I will have
> to retrain from scratch?  

It *should* still work in the HEAD branch.  Back it up first though :)

The only change that will affect what you're doing is an optimization to
how hammie stores WordInfo objects.  So you should see the size of your
database drop by about 50% the next time you train it.

> What about hammie.py?  I currently execute
> 
>     hammie.py -f -d -p $HOME/hammie.db
> 
> from my procmailrc file.  Does a simple
> 
>     hammiefilter.py
> 
> do something similar?  How do I tell it what .db file to use?

If hammie.py is working for you, don't use hammiefilter.py yet.
Eventually, hammiefilter.py will be what you want, but I'll make a big
loud announcement before I check in anything that'll require you to
alter anything.

The heaps of checkins you've seen me make recently have all been in a
branch.  The idea is that we (well, just Tim Stone and I so far) will
get everything straightened out over there before we merge back in.  At
that point we should have a pretty good idea about what's going to be
messed up ;)

Neale

From bkc@murkworks.com  Thu Nov 21 00:00:42 2002
From: bkc@murkworks.com (Brad Clements)
Date: Wed, 20 Nov 2002 19:00:42 -0500
Subject: [Spambayes] LJ article
In-Reply-To: <200211202149.gAKLnLw28459@pcp02138704pcs.reston01.va.comcast.net>
Message-ID: <3DDBDBDA.29999.A120C12@localhost>

On 20 Nov 2002 at 16:49, Guido van Rossum wrote:

> If you're interested, write to Don Marti <dmarti@ssc.com> for
> details.  DON'T WRITE ME!


Perhaps you should also post on this list if you write to Don .. ?



From vanhorn@whidbey.com  Thu Nov 21 09:44:44 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Thu, 21 Nov 2002 01:44:44 -0800
Subject: [Spambayes] LJ article
References: <3DDBDBDA.29999.A120C12@localhost>
Message-ID: <3DDCAB0C.34D3C928@whidbey.com>

Brad Clements wrote:

> On 20 Nov 2002 at 16:49, Guido van Rossum wrote:
>
> > If you're interested, write to Don Marti <dmarti@ssc.com> for
> > details.  DON'T WRITE ME!
>
> Perhaps you should also post on this list if you write to Don .. ?
>
> --

Okay, here's what I sent to Don:


Don,

I haven't done a lot of published writing lately, but I used to write a
lot for Byte and a flock of lesser-known magazines (most of which are
dead and gone now) and I think I still know how it's done. I have been
reading the bulk of the Spambayes mailing list for several weeks anyway
(there were just under a hundred subscribers when I joined, if that
gives you a time frame).

Please let me know what the time frame, word count, deadline, and
compensation are and I can be more specific about what I think I could
do and how I would approach it.

Van


--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From francois.granger@free.fr  Thu Nov 21 09:55:07 2002
From: francois.granger@free.fr (François Granger)
Date: Thu, 21 Nov 2002 10:55:07 +0100
Subject: [Spambayes] Hourra for pop3proxy !
Message-ID: <BA026C0B.5D02E%francois.granger@free.fr>


I am pleased to announce that the current version works like a charm.

It is plug & play.  On the first try, with two POP servers, everything
worked as advertised.  The training interface is really good.  It just
needs some cosmetic improvements.  I'll try to come up with some patches
for this.

-- 
Mail is a means of communication.  People should think about
the political implications of their choice (or non-choice)
of tools and technologies.  For clean mail:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From sjoerd@acm.org  Thu Nov 21 10:08:25 2002
From: sjoerd@acm.org (Sjoerd Mullender)
Date: Thu, 21 Nov 2002 11:08:25 +0100
Subject: [Spambayes] Split DB and no update_probabilities
In-Reply-To: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> 
References: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com> 
Message-ID: <20021121100825.AFBD974B08@indus.ins.cwi.nl>

On Wed, Nov 20 2002 "T. Alexander Popiel" wrote:

> I've got a patch to split the word database into two pieces:
> one with the ham and spam counts, and one with the spamprobs.
> At the same time, I got rid of the timestamps and killcounts,
> both of which are rarely used (and can be added back in by
> subclasses if they're really wanted).

I have one suggestion: in _wipe_probinfo use
	self.probinfo.clear()
instead of
	self.probinfo = {}
That way it is easier to replace the implementation of self.probinfo
by something other than a dictionary.
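
A small sketch of the difference, assuming Python 3.  ProbStore and
DbBackedDict are hypothetical stand-ins for the classifier and for a
non-dictionary implementation of probinfo:

```python
class DbBackedDict(dict):
    """Stand-in for a mapping backed by a database."""


class ProbStore:
    def __init__(self, probinfo):
        self.probinfo = probinfo

    def wipe_by_rebinding(self):
        # Silently swaps the custom mapping for a plain dict and
        # leaves the old mapping (and its contents) untouched.
        self.probinfo = {}

    def wipe_by_clearing(self):
        # Empties whatever mapping is in use, keeping its type.
        self.probinfo.clear()


rebound_backing = DbBackedDict(viagra=0.99)
rebound = ProbStore(rebound_backing)
rebound.wipe_by_rebinding()

cleared_backing = DbBackedDict(viagra=0.99)
cleared = ProbStore(cleared_backing)
cleared.wipe_by_clearing()
```

After rebinding, the store holds a plain dict and the original mapping
still contains its entry; after clearing, the store still holds the
same DbBackedDict object, now empty.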

-- Sjoerd Mullender <sjoerd@acm.org>

From Paul.Moore@atosorigin.com  Thu Nov 21 10:11:48 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 21 Nov 2002 10:11:48 -0000
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>

(This is from memory, as it happened on my home setup and I'm at
work now, so I apologise if it's a bit vague).

From: Richie Hindle [mailto:richie@entrian.com]
> > It's locking up for me.
>
> Urk. That's bad (and new - no-one's reported that before). And this
> happened when you hit the Train button, yes?

Yes.

>
> > There are no messages in the command prompt window - is there any
> > way to get it to produce trace messages (looking at the code, the
> > answer seems to be "no"...)?
>
> It's "no". 8-) Like I say, no-one's reported it locking before, and
> I've never seen it. You usually get a traceback when something goes
> wrong. So your console says something like:
>
>     Loading database... Done
>     BayesProxyListener listening on port 110
>     UserInterfaceListener listening on port 8880
>
> and nothing else, and the process is still running, but you can't
> get a page served to your browser? What error message do you get
> from the browser? If it's one of those pointless IE error pages,
> could you try telnetting to port 8880 and saying "GET / HTTP/1.0"?
> Can you even connect with telnet? How about port 110?

Exactly as you describe. The browser (IE6) just sits there, doing
nothing, with the training interface page still up. The progress
bar at the bottom of the screen is (hardly? not?) moving (can't
recall for sure if it changed, but it certainly didn't move far
in the few minutes I left it...) Basically the sort of behaviour I
get when IE is waiting for a response that never comes.

I can't do a telnet at the moment to check. Unfortunately, I may not
even be able to do one tonight, as I tried rebuilding the database to
see if that helped, and it did... I can't recall if I kept the old
database, or overwrote it :-(

> > Do I need to rebuild the database after upgrading? I didn't,
> > and the user interface said "Total emails trained: Spam: 0 Ham:
> > 0". This doesn't tally with reality - I'd trained on a batch of
> > messages (using hammie.py) before starting
>
> You shouldn't need to do anything, and even if you did I'd expect it
> to give an error rather than quietly failing. Are you using a pickle
> or a DBM? In fact, could you send me your bayescustomize.ini and/or
> details of the command line you're using (off-list if you prefer)?

It was a dbm file.

The command line was pop3proxy.py -d -l 8110 localhost

(proxying a local POP server on port 110 with the proxy on port 8110
using a DBM file). Working directory was the directory of the program.
No bayescustomize.ini file.

Sorry if this is a bit vague - it all happened late last night, after
a pretty hectic evening, so I wasn't in the best mood for debugging :-)

Paul.

From mhammond@skippinet.com.au  Thu Nov 21 12:24:51 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 21 Nov 2002 23:24:51 +1100
Subject: [Spambayes] New export utility for Outlook users.
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEDNHNAA.mhammond@skippinet.com.au>

I just checked in Outlook2000/export.py.  This is a tool for exporting the
folders currently defined in the addin to a standard SpamBayes directory
structure.  The idea is that you can export your messages to text files, and
run the standard tests over your data, just like all the big-boys do <wink>

See the instructions in the script for more details.

Mark.


From tim@fourstonesExpressions.com  Thu Nov 21 16:51:22 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 21 Nov 2002 10:51:22 -0600
Subject: [Spambayes] http://www.spamarchive.org/
Message-ID: <6Z546JFPOFAYXD82DJ64C7OLTSKF.3ddd0f0a@riven>

New anti-spam site just launched at http://www.spamarchive.org/

Anybody know who this is?

- Tim
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Thu Nov 21 16:59:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 21 Nov 2002 08:59:27 -0800
Subject: [Spambayes] Split DB and no update_probabilities 
In-Reply-To: Message from Sjoerd Mullender <sjoerd@acm.org> 
	<20021121100825.AFBD974B08@indus.ins.cwi.nl> 
References: <20021120233128.C5A6FF5A7@cashew.wolfskeep.com>
	<20021121100825.AFBD974B08@indus.ins.cwi.nl> 
Message-ID: <20021121165927.CFE52F5AC@cashew.wolfskeep.com>

In message:  <20021121100825.AFBD974B08@indus.ins.cwi.nl>
             Sjoerd Mullender <sjoerd@acm.org> writes:
>On Wed, Nov 20 2002 "T. Alexander Popiel" wrote:
>
>> I've got a patch to split the word database into two pieces:
>> one with the ham and spam counts, and one with the spamprobs.
>> At the same time, I got rid of the timestamps and killcounts,
>> both of which are rarely used (and can be added back in by
>> subclasses if they're really wanted).
>
>I have one suggestion: in _wipe_probinfo use
>	self.probinfo.clear()
>instead of
>	self.probinfo = {}
>That way it is easier to replace the implementation of self.probinfo
>by something other than a dictionary.

Thanks.  This is just a bit of my python ignorance showing
through.  The reason I had broken _wipe_probinfo out into its
own method was to allow easier replacement of self.probinfo
(I normally despise one-statement methods), but your suggestion
makes it even cleaner.

- Alex

From msergeant@startechgroup.co.uk  Thu Nov 21 17:00:45 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Thu, 21 Nov 2002 17:00:45 +0000
Subject: [Spambayes] http://www.spamarchive.org/
References: <6Z546JFPOFAYXD82DJ64C7OLTSKF.3ddd0f0a@riven>
Message-ID: <3DDD113D.6000603@startechgroup.co.uk>

Tim Stone - Four Stones Expressions said the following on 21/11/02 16:51:
> New anti-spam site just launched at http://www.spamarchive.org/
> 
> Anybody know who this is?

It's IronMail, the guys someone asked about yesterday (or was it Tuesday?).

Not sure why a big funded company would need to do this, and hide the 
fact that it's IronMail doing it (not actively or anything, but I'm not 
sure why they didn't announce it that way - perhaps they got caught on 
the hop). Check the whois records if you're in doubt.


From popiel@wolfskeep.com  Thu Nov 21 17:18:00 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 21 Nov 2002 09:18:00 -0800
Subject: [Spambayes] proposed changes to hammie & co. 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "20 Nov 2002 22:08:00 PST." <w53y97nxxof.fsf@woozle.org> 
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<w53d6ozzhyt.fsf@woozle.org> <20021121053628.DD924F5A7@cashew.wolfskeep.com>
	<w53y97nxxof.fsf@woozle.org> 
Message-ID: <20021121171801.03720F5AC@cashew.wolfskeep.com>

In message:  <w53y97nxxof.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:
>
>> In message:  <w53d6ozzhyt.fsf@woozle.org>
>>              Neale Pickett <neale@woozle.org> writes:
>> >
>> >I'm currently entwined with mucking the heck out of WordInfo.  I've got
>> >a neato scheme based on Alex's patch and comments where the WordInfo
>> >classes still compute their own probabilities, but also keep a revision
>> >number which is compared against a MetaInfo class.
>> 
>> Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)
>
>Ah, but the only thing *stored* is (spamcount, hamcount).  The
>probability is calculated the first time you ask for it.  If you don't
>update nspam or nham, the next time you ask for it it gives the cached
>value.  So the database is small, but you still get the in-memory
>probability caching if you're using a pickle or ZODB.

Sounds like there is no caching benefit for one-message-per-invocation
situations like running out of procmail, then.  Ouch.  Unless I'm
mistaken, by the time that the probability is being computed in your
scheme, the identity of the word has been lost, and thus the probability
can't be stored in a secondary database like I had written, either.
I suppose that there are enough performance penalties in the procmail
scenario (python startup, options loading, other various overhead)
that computing all the probabilities from counts is small change.

- Alex (overly critical)

From neale@woozle.org  Thu Nov 21 22:53:17 2002
From: neale@woozle.org (Neale Pickett)
Date: 21 Nov 2002 14:53:17 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <20021121171801.03720F5AC@cashew.wolfskeep.com>
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<w53d6ozzhyt.fsf@woozle.org>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com>
	<w53y97nxxof.fsf@woozle.org>
	<20021121171801.03720F5AC@cashew.wolfskeep.com>
Message-ID: <w534raay1pe.fsf@woozle.org>

So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:

> Sounds like there is no caching benefit for one-message-per-invocation
> situations like running out of procmail, then.  Ouch.  Unless I'm
> mistaken, by the time that the probability is being computed in your
> scheme, the identity of the word has been lost, and thus the
> probability can't be stored in a secondary database like I had
> written, either.  I suppose that there's enough performance penalties
> in the procmail scenario (python startup, options loading, other
> various overhead) that computing all the probabilities from counts is
> small change.

Well it's easy enough to test.

I modified my WordInfo to store the probability in the store.  First
off, this makes the database twice as big (floats take more space to
pickle than ints), and adds about a second to scoring a batch of 200
messages.  So there is some overhead involved.

I trained 200 messages with both methods (storing prob and calculating
it each time).  Here were the times in the trials:

    storing    not
1   9.308      9.298
2   9.573      9.328
3   9.292      9.307
4   9.290      9.288
5   9.306      9.466

I don't think there's a clear winner here.  Given that we get similar
times, but a database half the size, I'm still inclined to go with not
storing probabilities.
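
As a rough check on the table above, the per-method means can be
compared with the run-to-run spread.  This is only a sketch using the
five trial values as reported; the variable names are made up for
illustration:

```python
from statistics import mean

storing = [9.308, 9.573, 9.292, 9.290, 9.306]
not_storing = [9.298, 9.328, 9.307, 9.288, 9.466]

mean_storing = mean(storing)
mean_not_storing = mean(not_storing)

# Difference between the two methods versus the variation seen
# within a single method across trials.
difference = abs(mean_storing - mean_not_storing)
spread_storing = max(storing) - min(storing)
spread_not_storing = max(not_storing) - min(not_storing)

print(f"mean storing:     {mean_storing:.4f}")
print(f"mean not storing: {mean_not_storing:.4f}")
print(f"difference:       {difference:.4f}")
print(f"spread (storing / not): {spread_storing:.3f} / {spread_not_storing:.3f}")
```

The difference between the means is far smaller than the variation
within either column, which is consistent with calling it a tie.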

Neale

From popiel@wolfskeep.com  Thu Nov 21 23:06:27 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Thu, 21 Nov 2002 15:06:27 -0800
Subject: [Spambayes] proposed changes to hammie & co. 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "21 Nov 2002 14:53:17 PST." <w534raay1pe.fsf@woozle.org> 
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<w53d6ozzhyt.fsf@woozle.org> <20021121053628.DD924F5A7@cashew.wolfskeep.com>
	<w53y97nxxof.fsf@woozle.org> <20021121171801.03720F5AC@cashew.wolfskeep.com>
	<w534raay1pe.fsf@woozle.org> 
Message-ID: <20021121230627.9A841F5AC@cashew.wolfskeep.com>

In message:  <w534raay1pe.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>
>I don't think there's a clear winner here.  Given that we get similar
>times, but a database half the size, I'm still inclined to go with not
>storing probabilities.

Sounds good.  (Not only would the probability have to be stored, but
the revision stamp would have to be stored, for even more DB bloat.
Blech.)

- Alex

From hupp@cs.wisc.edu  Fri Nov 22 00:27:13 2002
From: hupp@cs.wisc.edu (Adam Hupp)
Date: Thu, 21 Nov 2002 18:27:13 -0600
Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt
	macros
Message-ID: <20021122002713.GC29009@upl.cs.wisc.edu>


I've been working on a store for spambayes that uses the Numeric
python extension.  It's substantially faster than PersistentBayes and
the database is about half the size.  A comparison, training on 992 messages:

PersistentBayes:
training: 220s
update_prob: 3.2s
score 1 msg: .45s
score 6156 msgs: 58s

NumericBayes:
training: 14s
update_prob: 0.10s
score 1 msg: .59s
score 6156 msgs: 49s

There are no modifications to classifier.Bayes, it just uses a new
WordInfo class with properties.

I also modified hammiefilter to do untraining, retraining, and
training on filter results.  For example:

hammiefilter.py --filter --train

The incoming message is scored and filtered.  If the result is not
"Unsure" the classifier will be trained on it.


hammiefilter.py --reverse --good --train

The incoming message has previously been incorrectly marked as ham.
--reverse will untrain the classifier and --train will retrain it on
the message as spam.
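
The decision logic for the two invocations above can be sketched as
follows, assuming Python 3.  The cutoff values, ToyClassifier, and the
function names are hypothetical and only illustrate the flow; they are
not the code in the tarball:

```python
HAM_CUTOFF = 0.20   # assumed threshold
SPAM_CUTOFF = 0.90  # assumed threshold


class ToyClassifier:
    """Records training calls instead of doing real learning."""

    def __init__(self):
        self.log = []

    def learn(self, msg, is_spam):
        self.log.append(("learn", msg, is_spam))

    def unlearn(self, msg, is_spam):
        self.log.append(("unlearn", msg, is_spam))


def label(score):
    if score >= SPAM_CUTOFF:
        return "Spam"
    if score <= HAM_CUTOFF:
        return "Ham"
    return "Unsure"


def filter_and_train(clf, msg, score):
    # --filter --train: train on the verdict unless it is "Unsure".
    verdict = label(score)
    if verdict != "Unsure":
        clf.learn(msg, is_spam=(verdict == "Spam"))
    return verdict


def reverse_training(clf, msg, was_trained_as_spam):
    # --reverse ... --train: undo the wrong training, then retrain
    # the message as the opposite class.
    clf.unlearn(msg, is_spam=was_trained_as_spam)
    clf.learn(msg, is_spam=not was_trained_as_spam)


clf = ToyClassifier()
verdicts = [filter_and_train(clf, m, s)
            for m, s in [("a", 0.99), ("b", 0.50), ("c", 0.01)]]
reverse_training(clf, "c", was_trained_as_spam=False)
```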

With these tools it's straightforward to setup macros in mutt to
manage false negatives/positives and classify "Unsure" messages.

The modified files can be found at:

http://www.upl.cs.wisc.edu/~hupp/spambayes.tar.gz

hammiefilter requires Optik and the NumericBayes store requires
Numeric and MaskedArray (an optional part of Numeric).

-Adam

From tim@fourstonesExpressions.com  Fri Nov 22 00:42:57 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 21 Nov 2002 18:42:57 -0600
Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt
	macros
In-Reply-To: <20021122002713.GC29009@upl.cs.wisc.edu>
Message-ID: <4YA0NKEBJE4165ICXURO2B087374D.3ddd7d91@riven>

Sounds really good, Adam.  Neale Pickett and I have been working on this kind 
of stuff in a branch named hammie-playground.  There have been some 
substantial changes made there, that'll be merged into the main thread soon.  
You might want to check there and see how your changes would fit in...  I 
really like your results.  Size and speed have been consistent challenges for 
us.

- TimS

11/21/2002 6:27:13 PM, Adam Hupp <hupp@cs.wisc.edu> wrote:

>
>I've been working on a store for spambayes that uses the Numeric
>python extension.  It's substantially faster than PersistentBayes and
>the database is about half the size.  A comparison, training on 992 messages:
>
>PersistentBayes:
>training: 220s
>update_prob: 3.2s
>score 1 msg: .45s
>score 6156 msgs: 58s
>
>NumericBayes:
>training: 14s
>update_prob: 0.10s
>score 1 msg: .59s
>score 6156 msgs: 49s
>
>There are no modifications to classifier.Bayes, it just uses a new
>WordInfo class with properties.
>
>I also modified hammiefilter to do untraining, retraining, and
>training on filter results.  For example:
>
>hammiefilter.py --filter --train
>
>The incoming message is scored and filtered.  If the result is not
>"Unsure" the classifier will be trained on it.
>
>
>hammiefilter.py --reverse --good --train
>
>The incoming message has previously been incorrectly marked as ham.
>--reverse will untrain the classifier and --train will retrain it on
>the message as spam.
>
>With these tools it's straightforward to setup macros in mutt to
>manage false negatives/positives and classify "Unsure" messages.
>
>The modified files can be found at:
>
>http://www.upl.cs.wisc.edu/~hupp/spambayes.tar.gz
>
>hammiefilter requires Optik and the NumericBayes store requires
>Numeric and MaskedArray (an optional part of Numeric).
>
>-Adam
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Fri Nov 22 06:10:08 2002
From: neale@woozle.org (Neale Pickett)
Date: 21 Nov 2002 22:10:08 -0800
Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt
	macros
In-Reply-To: <20021122002713.GC29009@upl.cs.wisc.edu>
References: <20021122002713.GC29009@upl.cs.wisc.edu>
Message-ID: <w53el9ew2wv.fsf@woozle.org>

So then, Adam Hupp <hupp@cs.wisc.edu> is all like:

> PersistentBayes:
> training: 220s
> update_prob: 3.2s
> score 1 msg: .45s
> score 6156 msgs: 58s
> 
> NumericBayes:
> training: 14s
> update_prob: 0.10s
> score 1 msg: .59s
> score 6156 msgs: 49s

Holy cow!  That's impressive!

I'm no NumPy expert but it looks like you're taking advantage of some
sort of "do this on all elements of an array" function -- what the Cray
guys used to call vectorization.  I imagine NumPy can optimize that sort
of loop much better than straight CPython and you'd get speeds close to
that of compiled C.

This is a totally killer idea, except that we just decided to move
probability computation out to individual WordInfo objects!  The
thinking was--and testing seems to bear this out--that when most
transactions are small incremental updates and single message scoring
(instead of batches of messages), it's faster to compute individual
word probabilities as they're needed, since it saves a ton of I/O and
perhaps a lot of needless computation.

On the other hand, this could be of tremendous benefit to long-lived
processes like the pop3proxy and the Outlook plugin, which want to keep
the whole database around in memory.

Adam, would it be possible to abstract the bayesian part of the
algorithm (the part done in update_probabilities) so that it could be
called either with a NumPy vector operation, or in a one-at-a-time
fashion by individual WordInfo classes?  If you can think of a way to do
this, we can throw this in.  Even if you can't think of a way to do it,
I think it might be worth it to have two implementations of the same
algorithm just for this 15x speedup.

> I also modified hammiefilter to do untraining, retraining, and
> training on filter results.  For example:
> 
> hammiefilter.py --filter --train
> 
> The incoming message is scored and filtered.  If the result is not
> "Unsure" the classifier will be trained on it.
> 
> 
> hammiefilter.py --reverse --good --train
> 
> The incoming message has previously been incorrectly marked as ham.
> --reverse will untrain the classifier and --train will retrain it on
> the message as spam.
> 
> With these tools it's straightforward to setup macros in mutt to
> manage false negatives/positives and classify "Unsure" messages.

That's good stuff.  I'll have to check the list archives because I know
the issue of auto-training has been discussed and probably beaten into
the ground by now.  But first I want to get my branch merged in so
everybody else can witness my dementia ;)

Neale


From neale@woozle.org  Fri Nov 22 06:33:31 2002
From: neale@woozle.org (Neale Pickett)
Date: 21 Nov 2002 22:33:31 -0800
Subject: [Spambayes] Numeric python store, hammiefilter extension and mutt
	macros
In-Reply-To: <w53el9ew2wv.fsf@woozle.org>
References: <20021122002713.GC29009@upl.cs.wisc.edu>
	<w53el9ew2wv.fsf@woozle.org>
Message-ID: <w5365uqw1tw.fsf@woozle.org>

So then, Neale Pickett <neale@woozle.org> is all like:

> This is a totally killer idea, except that we just decided to move
> probability computation out to individual WordInfo objects!

I fired that off a little prematurely.  When I say "we" here, I actually
mean "Tim Stone, T. Alexander Popiel, and myself".  Although Tim Peters
has hinted that he thinks this is a good idea, we (Tim S and I) need to
get the nod from a few key people before this and the myriad other
changes in our (Tim S and I) little branch are checked in.

We (royal) hope that we (diminutive) have found this message
enlightening.

Weale

From rob@hooft.net  Fri Nov 22 08:58:06 2002
From: rob@hooft.net (Rob Hooft)
Date: Fri, 22 Nov 2002 09:58:06 +0100
Subject: [Spambayes] proposed changes to hammie & co.
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com>	<w53y97nxxof.fsf@woozle.org>
	<20021121171801.03720F5AC@cashew.wolfskeep.com>
Message-ID: <3DDDF19E.3000305@hooft.net>

T. Alexander Popiel wrote:
> In message:  <w53y97nxxof.fsf@woozle.org>
>              Neale Pickett <neale@woozle.org> writes:
> 
>>So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:
>>
>>
>>>In message:  <w53d6ozzhyt.fsf@woozle.org>
>>>             Neale Pickett <neale@woozle.org> writes:
>>>
>>>>I'm currently entwined with mucking the heck out of WordInfo.  I've got
>>>>a neato scheme based on Alex's patch and comments where the WordInfo
>>>>classes still compute their own probabilities, but also keep a revision
>>>>number which is compared against a MetaInfo class.
>>>
>>>Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)
>>
>>Ah, but the only thing *stored* is (spamcount, hamcount).  The
>>probability is calculated the first time you ask for it.  If you don't
>>update nspam or nham, the next time you ask for it it gives the cached
>>value.  So the database is small, but you still get the in-memory
>>probability caching if you're using a pickle or ZODB.
> 
> 
> Sounds like there is no caching benefit for one-message-per-invocation
> situations like running out of procmail, then.  

Is this calculation for the few words in one message really 
time-determining? There is another way of caching: Make a dictionary 
that maps count-tuples to spam probabilities.

  (1,0) -> 0.155
  (0,1) -> 0.844
etc.

I definitely wouldn't move the calculation into the wordinfo class. It 
is a different task, so it "should" (design) be a separate class....

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From Paul.Moore@atosorigin.com  Fri Nov 22 09:33:02 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Fri, 22 Nov 2002 09:33:02 -0000
Subject: [Spambayes] Outlook addin crash
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861994F@UKDCX001.uk.int.atosorigin.com>

Just got the following in the Outlook addin. No idea what caused it,
but the "Exception in thread xxxx" messages are probably the relevant
bits (I spent a while trying to get the "Filter Now" button to work
before I thought of starting traceutil).

The "Invalid window handle" message makes me think of a race condition
where Outlook hasn't opened a window by the time the addin needs it...

But the addin's UI is there (the extra buttons, and clicking them starts the
dialogs OK).

Paul.

>\Python22\Lib\site-packages\win32\lib\win32traceutil.py
Collecting Python Trace Output...
Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Applications\Spambayes\Outlook2000\default_bayes_database.pck'
Loaded message database from 'C:\Applications\Spambayes\Outlook2000\default_message_database.pck'
Bayes database initialized with 769 spam and 7166 good messages
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Processing 0 missed spam in folder 'Inbox' took 18.9974ms
Exception in thread Thread-1:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-2:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-3:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Exception in thread Thread-5:
Traceback (most recent call last):
  File "C:\Python22\Lib\threading.py", line 408, in __bootstrap
    self.run()
  File "C:\Python22\Lib\threading.py", line 396, in run
    apply(self.__target, self.__args, self.__kwargs)
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 115, in thread_target
    self._DoProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 374, in _DoProcess
    self.filterer(self.mgr, self.progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 84, in filterer
    this_dispositions = filter_folder(f, mgr, progress)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 64, in filter_folder
    disposition = filter_message(message, mgr, all_actions)
  File "C:\Applications\Spambayes\Outlook2000\filter.py", line 15, in filter_message
    prob = mgr.score(msg)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 269, in score
    email = msg.GetEmailPackageObject()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 535, in GetEmailPackageObject
    text = self._GetMessageText()
  File "C:\Applications\Spambayes\Outlook2000\msgstore.py", line 457, in _GetMessageText
    0)       # any # of results is fine
com_error: (-2147221246, 'Invalid window handle', None, None)

Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FolderSelector.py", line 359, in _DoUpdateStatus
    self.SetDlgItemText(IDC_STATUS1, status_string)
  File "C:\Python22\lib\site-packages\Pythonwin\pywin\mfc\object.py", line 23, in __getattr__
    raise win32ui.error, "The MFC object has died."
win32ui: The MFC object has died.
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FolderSelector.py", line 359, in _DoUpdateStatus
    self.SetDlgItemText(IDC_STATUS1, status_string)
  File "C:\Python22\lib\site-packages\Pythonwin\pywin\mfc\object.py", line 23, in __getattr__
    raise win32ui.error, "The MFC object has died."
win32ui: The MFC object has died.
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 135, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 402, in __getattr__
    if d is not None: return getattr(d, attr)
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 368, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr)
AttributeError: '<win32com.gen_py.Microsoft Outlook 9.0 Object Library._MailItem>' object has no attribute 'Folders'
win32ui: Error in Command Message handler for command ID 1100, Code 0
Traceback (most recent call last):
  File "C:\Applications\Spambayes\Outlook2000\dialogs\AsyncDialog.py", line 98, in OnStart
    self.StartProcess()
  File "C:\Applications\Spambayes\Outlook2000\dialogs\FilterDialog.py", line 365, in StartProcess
    self.mgr.EnsureOutlookFieldsForFolder(folder_id, config.include_sub)
  File "C:\Applications\Spambayes\Outlook2000\manager.py", line 135, in EnsureOutlookFieldsForFolder
    folders = item.Folders
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 402, in __getattr__
    if d is not None: return getattr(d, attr)
  File "C:\Python22\lib\site-packages\win32com\client\__init__.py", line 368, in __getattr__
    raise AttributeError, "'%s' object has no attribute '%s'" % (repr(self), attr)
AttributeError: '<win32com.gen_py.Microsoft Outlook 9.0 Object Library._MailItem>' object has no attribute 'Folders'
win32ui: Error in Command Message handler for command ID 1100, Code 0

From Paul.Moore@atosorigin.com  Fri Nov 22 10:04:05 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Fri, 22 Nov 2002 10:04:05 -0000
Subject: [Spambayes] Outlook addin crash
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619950@UKDCX001.uk.int.atosorigin.com>

From: Moore, Paul
> Just got the following in the Outlook addin. No idea what caused it,
> but the "Exception in thread xxxx" messages are probably the relevant
> bits (I spent a while trying to get the "Filter Now" button to work
> before I thought of starting traceutil).

Got it, by a bit of binary searching through my new messages...

I had an appointment confirmation in my inbox, which was causing the addin
to crash. I'd suggest that anything other than a mail item be ignored
when filtering.

Paul.

From tim@fourstonesExpressions.com  Fri Nov 22 15:07:38 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 09:07:38 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <3DDDF19E.3000305@hooft.net>
Message-ID: <GFJDB0LKKJHGSPWSMIVQEDJE8465UO0.3dde483a@riven>

11/22/2002 2:58:06 AM, Rob Hooft <rob@hooft.net> wrote:

>T. Alexander Popiel wrote:
>> In message:  <w53y97nxxof.fsf@woozle.org>
>>              Neale Pickett <neale@woozle.org> writes:
>> 
>>>So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:
>>>
>>>
>>>>In message:  <w53d6ozzhyt.fsf@woozle.org>
>>>>             Neale Pickett <neale@woozle.org> writes:
>>>>
>>>>>I'm currently entwined with mucking the heck out of WordInfo.  I've got
>>>>>a neato scheme based on Alex's patch and comments where the WordInfo
>>>>>classes still compute their own probabilities, but also keep a revision
>>>>>number which is compared against a MetaInfo class.
>>>>
>>>>Eww, do we gotta?  I thought I was trying to make the DB smaller. ;-)
>>>
>>>Ah, but the only thing *stored* is (spamcount, hamcount).  The
>>>probability is calculated the first time you ask for it.  If you don't
>>>update nspam or nham, the next time you ask for it it gives the cached
>>>value.  So the database is small, but you still get the in-memory
>>>probability caching if you're using a pickle or ZODB.
>> 
>> 
>> Sounds like there is no caching benefit for one-message-per-invocation
>> situations like running out of procmail, then.  
>
>Is this calculation for the few words in one message really 
>time-determining? There is another way of caching: Make a dictionary 
>that maps count-tuples to spam probabilities.
>
>  (1,0) -> 0.155
>  (0,1) -> 0.844
>etc.
>
Yeah, this is an interesting idea.  Caching is the right way to do this, not 
pre-calculating, because the tuple count becomes combinatorially large and is 
open ended.  But... once you've calculated for a given tuple, you shouldn't 
have to do it again.  The tuple:prob cache *could* be persistent, but I doubt 
there's much to be gained by that.

- TimS

>I definitely wouldn't move the calculation into the wordinfo class. It 
>is a different task, so it "should" (design) be a separate class....
>
>Rob

*****module probability*****
class ProbabilityCache:
    def __init__(self):
        # nested cache: probcache[nham][nspam] -> probability
        self.probcache = {}

    def prob(self, nham, nspam):
        try:
            prob = self.probcache[nham][nspam]
        except KeyError:
            prob = calcprob(nham, nspam)
            # setdefault creates the inner dict the first time nham is seen
            self.probcache.setdefault(nham, {})[nspam] = prob

        return prob

def calcprob(nham, nspam):
    # code moved here from _update_probability in WordInfo class
    raise NotImplementedError  # placeholder until that code is moved
***************************

....or something of that nature.  Maybe Adam Huff's NumPy vectorization stuff 
might play well into something like this.

Incidentally, a dictionary of dictionaries has faster lookup than a dictionary 
keyed by a constructed tuple.

import time

x = {}
for i in range(500):
    x[i] = {}
    for j in range (500):
        x[i][j] = 1
t1s = time.time()
for k in range(5):
    for i in range(500):
        for j in range (500):
            a = x[i][j]
t1e = time.time()

x={}
for i in range(500):
    for j in range (500):
        x[(i,j)] = 1
t2s = time.time()
for k in range(5):
    for i in range(500):
        for j in range (500):
            a = x[(i,j)]
t2e = time.time()

print 'test 1 time =',t1e-t1s
print 'test 2 time =',t2e-t2s
*****
Four executions:
test 1 time = 3.41499996185
test 2 time = 4.41600000858

test 1 time = 3.375
test 2 time = 4.28600001335

test 1 time = 3.41500008106
test 2 time = 4.18599998951

test 1 time = 3.46500003338
test 2 time = 4.23699998856

- TimS

>
>
>-- 
>Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Fri Nov 22 16:42:03 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 10:42:03 -0600
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <w53adk2w2in.fsf@woozle.org>
Message-ID: <TQWU98MJFD43FALIMHPK4HEMH3ZC753.3dde5e5b@riven>

Well, I've gone and done it... I've touched classifier code.  Either my name 
is now mud, or I really am a part of the community... lol

I added result caching to the _update_probability method in WordInfo (in 
hammie-playground branch).  I suspect that this will save a lot of time, maybe 
commensurate with what Adam Huff demonstrated.  I don't have a large enough 
corpus to really benchmark this, though, and you'll definitely want to take a 
good look to make sure I haven't goofed anything up.  I certainly didn't 
change any calculations...

On a related note... There ought to be some safeguard against division by zero  
in the hamratio and spamratio calculations.  The system shouldn't blow up with 
a /0 exception, but just peacefully assume some default and go about its 
business.  That's because it's possible that this could be run when only spam 
has been trained on (for example).  Some (regular everyday) user may very well 
make this mistake, which is most likely to occur immediately after 
installation.  A blow up this early will probably just result in them not 
using it, assuming that it doesn't work.  I'd have fixed it, but I have no 
idea what the peaceful default should be...   
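
One conceivable guard, purely as a sketch: the helper name safe_ratios and
the choice of treating an untrained class as if it held a single message are
assumptions made for illustration, not something taken from classifier.py.
Whether a ratio of 0.0 is the right peaceful default is exactly the open
question.

```python
def safe_ratios(hamcount, spamcount, nham, nspam):
    """Return (hamratio, spamratio) without ever dividing by zero.

    Hypothetical helper: an untrained class (nham == 0 or nspam == 0) is
    treated as if it held one message, so its ratio comes out as 0.0.
    """
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    return hamratio, spamratio


# Only spam has been trained on so far: nham is 0, yet nothing blows up.
print(safe_ratios(0, 3, 0, 10))
```

Since a word cannot have a nonzero hamcount while nham is 0, the fallback
only ever applies to a count of 0.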
  
- TimS
www.fourstonesExpressions.com 


From neale@woozle.org  Fri Nov 22 18:43:40 2002
From: neale@woozle.org (Neale Pickett)
Date: 22 Nov 2002 10:43:40 -0800
Subject: [Spambayes] proposed changes to hammie & co.
In-Reply-To: <3DDDF19E.3000305@hooft.net>
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com>
	<w53y97nxxof.fsf@woozle.org>
	<20021121171801.03720F5AC@cashew.wolfskeep.com>
	<3DDDF19E.3000305@hooft.net>
Message-ID: <w531y5dwilf.fsf@woozle.org>

So then, Rob Hooft <rob@hooft.net> is all like:

> Is this calculation for the few words in one message really
> time-determining? There is another way of caching: Make a dictionary
> that maps count-tuples to spam probabilities.
> 
>   (1,0) -> 0.155
>   (0,1) -> 0.844
> etc.

Hmm!  I did a small test against 200 spam, 200 ham, to see what tuple
frequency is like.  I got 21833 unique words, but only 869 unique values
for (spamcount, hamcount).  I also got gnuplot to animate out a cool
spinning 3D graph of it just as my boss walked by :)

The 20 most frequently occurring (spamcount, hamcount) tuples were:

  (15, 0)    57
  (18, 0)    57
  (19, 0)    62
  (10, 5)    65
  (0, 20)    79
  (4, 10)    98
  (5, 10)    99
  (9, 5)    113
  (14, 0)   137
  (0, 15)   153
  (13, 0)   162
  (8, 0)    288
  (4, 5)    303
  (10, 0)   317
  (5, 5)    334
  (9, 0)    611
  (0, 10)   659
  (4, 0)   4814
  (5, 0)   4979
  (0, 5)   6045

The 20 most infrequently-occurring were:

  (0, 130)   1
  (0, 135)   1
  (0, 140)   1
  (0, 155)   1
  (0, 165)   1
  (0, 175)   1
  (0, 250)   1
  (0, 285)   1
  (0, 310)   1
  (0, 725)   1
  (0, 75)    1
  (10, 30)   1
  (10, 40)   1
  (10, 85)   1
  (100, 40)  1
  (101, 115) 1
  (101, 20)  1
  (101, 25)  1
  (102, 115) 1
  (102, 20)  1
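
A tally like the ones above could be produced along these lines in a current
Python.  This is only a sketch: the wordinfo dictionary and its toy contents
are made up for illustration, standing in for the real word database.

```python
from collections import Counter

# Hypothetical stand-in for the word database: word -> (spamcount, hamcount).
wordinfo = {
    "viagra": (5, 0),
    "mortgage": (5, 0),
    "python": (0, 5),
    "zodb": (0, 5),
    "pickle": (0, 5),
    "free": (4, 5),
}

# Count how many distinct words share each (spamcount, hamcount) tuple.
tuple_freq = Counter(wordinfo.values())

print("unique words: ", len(wordinfo))
print("unique tuples:", len(tuple_freq))
for counts, nwords in tuple_freq.most_common():
    print(counts, nwords)
```

The fewer unique tuples there are relative to unique words, the more often a
cache keyed on the tuple gets reused.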

A graph of frequencies looks a lot like a hyperbola:
<http://woozle.org/~neale/tmp/b.png>

The more I think about this caching scheme, the more I like it.  It
deals well with the fact that most of the words only occur a few times,
saves memory, and it will speed up pickles *and* databases.  It's going
in to the playground branch.

> I definitely wouldn't move the calculation into the wordinfo class. It
> is a different task, so it "should" (design) be a separate class....

Using this scheme, the calculation has to go back into the Bayes (or
Classifier) class.  WordInfo only stores counters now.

Neale

From neale@woozle.org  Fri Nov 22 18:51:38 2002
From: neale@woozle.org (Neale Pickett)
Date: 22 Nov 2002 10:51:38 -0800
Subject: [Spambayes] 
 Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7
In-Reply-To: <20021122182258.5CCA9F580@cashew.wolfskeep.com>
References: <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net>
	<20021122182258.5CCA9F580@cashew.wolfskeep.com>
Message-ID: <w53wun5v3np.fsf@woozle.org>

So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:

> In message:  <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net>
>              "Tim Stone" <timstone4@users.sourceforge.net> writes:
> >Update of /cvsroot/spambayes/spambayes
> >In directory sc8-pr-cvs1:/tmp/cvs-serv400
> >
> >Modified Files:
> >      Tag: hammie-playground
> >	classifier.py 
> >Log Message:
> >Added probability calculation result caching.  No benchmark available to see
> >how much, if any, performance gain is achieved, but it seems like it could
> >be significant, particularly in training large corpora, or with long running
> >processes.
> 
> You need to nuke the probcache when meta.revision changes. :-)
> 
> Also, wouldn't the cache implemented by this patch be more
> efficient if it indexed by hamcount and spamcount (both
> integers) instead of hamratio and spamratio (both floats)?

I should think so.  What do you think of this idea:

probcache is kept as a property of Classifier.  Make a
classifier.probability(self, word) method which looks up that word's
(spamcount, hamcount) tuple in probcache.  If it's not there, compute it
and add it.  Whenever Classifier.learn or Classifier.unlearn are called,
probcache is blown away.

This will effectively cache probabilities on demand, and make sure they
are current.  No need for a revision anymore.

Sound good?

From popiel@wolfskeep.com  Fri Nov 22 18:49:58 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 22 Nov 2002 10:49:58 -0800
Subject: [Spambayes] proposed changes to hammie & co. 
In-Reply-To: Message from Rob Hooft <rob@hooft.net> 
   of "Fri, 22 Nov 2002 09:58:06 +0100." <3DDDF19E.3000305@hooft.net> 
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com> <w53y97nxxof.fsf@woozle.org>
	<20021121171801.03720F5AC@cashew.wolfskeep.com>  <3DDDF19E.3000305@hooft.net> 
Message-ID: <20021122184958.7B9C4F598@cashew.wolfskeep.com>

In message:  <3DDDF19E.3000305@hooft.net>
             Rob Hooft <rob@hooft.net> writes:
>
>Is this [spamprob] calculation for the few words in one message
>really time-determining?

No, which I went on to admit in the stuff you snipped. ;-)

>There is another way of caching: Make a dictionary 
>that maps count-tuples to spam probabilities.
>
>  (1,0) -> 0.155
>  (0,1) -> 0.844
>etc.

I'm not sure this is better; it would definitely have a
higher cache hit rate, but the lookups are significantly
more expensive (fetch the wordinfo, extract the counts,
then fetch the probability).

Something to measure...

>I definitely wouldn't move the calculation into the wordinfo class. It 
>is a different task, so it "should" (design) be a separate class....

I moderately agree, but OOP folks tend to have an aversion
to pure data classes (as I think WordInfo should be). ;-)

- Alex

From popiel@wolfskeep.com  Fri Nov 22 19:16:02 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 22 Nov 2002 11:16:02 -0800
Subject: [Spambayes] 
 Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "22 Nov 2002 10:51:38 PST." <w53wun5v3np.fsf@woozle.org> 
References: <E18FGk5-00006z-00@sc8-pr-cvs1.sourceforge.net>
	<20021122182258.5CCA9F580@cashew.wolfskeep.com>  <w53wun5v3np.fsf@woozle.org> 
Message-ID: <20021122191603.1FFF4F598@cashew.wolfskeep.com>

In message:  <w53wun5v3np.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>
>What do you think of this idea:
>
>probcache is kept as a property of Classifier.  Make a
>classifier.probability(self, word) method which looks up that word's
>(spamcount, hamcount) tuple in probcache.  If it's not there, compute it
>and add it.  Whenever Classifier.learn or Classifier.unlearn are called,
>probcache is blown away.
>
>This will effectively cache probabilities on demand, and make sure they
>are current.  No need for a revision anymore.
>
>Sound good?

Sounds good to me.  If you split the probability computation itself
into a separate method from the cache management stuff, then it makes
it easier to subclass to replace just the counts->probability formula.
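
Roughly, as a sketch only: the class layout, the word -> [spamcount,
hamcount] storage, the method name compute_probability and both formulas
below are placeholders invented for illustration, not what classifier.py
actually does.

```python
class Classifier:
    """Toy classifier: probcache is thrown away on every learn/unlearn."""

    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}   # word -> [spamcount, hamcount]
        self.probcache = {}  # (spamcount, hamcount) -> probability

    def learn(self, words, is_spam):
        self._update(words, is_spam, +1)

    def unlearn(self, words, is_spam):
        self._update(words, is_spam, -1)

    def _update(self, words, is_spam, delta):
        if is_spam:
            self.nspam += delta
        else:
            self.nham += delta
        for word in set(words):
            counts = self.wordinfo.setdefault(word, [0, 0])
            counts[0 if is_spam else 1] += delta
        self.probcache = {}  # nspam or nham changed, so every entry is stale

    def probability(self, word):
        # Cache management only; the formula lives in compute_probability.
        key = tuple(self.wordinfo.get(word, (0, 0)))
        try:
            return self.probcache[key]
        except KeyError:
            prob = self.probcache[key] = self.compute_probability(*key)
            return prob

    def compute_probability(self, spamcount, hamcount):
        # Placeholder formula (an assumption): plain ratio of ratios,
        # with 0.5 for words that have never been seen.
        spamratio = spamcount / float(self.nspam or 1)
        hamratio = hamcount / float(self.nham or 1)
        if spamratio + hamratio == 0:
            return 0.5
        return spamratio / (spamratio + hamratio)


class CountingClassifier(Classifier):
    """Subclass that replaces only the counts->probability formula."""

    def compute_probability(self, spamcount, hamcount):
        n = spamcount + hamcount
        return 0.5 if n == 0 else spamcount / float(n)


c = Classifier()
c.learn(["cheap", "pills"], is_spam=True)
c.learn(["meeting", "pills"], is_spam=False)
print(c.probability("cheap"), c.probability("pills"))
```

The subclass never touches probcache; it inherits the cache handling and the
invalidation on learn/unlearn unchanged.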

- Alex

From tim@fourstonesExpressions.com  Fri Nov 22 20:26:54 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 14:26:54 -0600
Subject: [Spambayes] 
 Re: [Spambayes-checkins] spambayes classifier.py,1.53.2.6,1.53.2.7 
Message-ID: <31HB07872UNK821UJFYVUOE9JFQL75UO.3dde930e@riven>

11/22/2002 1:16:02 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <w53wun5v3np.fsf@woozle.org>
>             Neale Pickett <neale@woozle.org> writes:
>>
>>What do you think of this idea:
>>
>>probcache is kept as a property of Classifier.  Make a
>>classifier.probability(self, word) method which looks up that word's
>>(spamcount, hamcount) tuple in probcache.  If it's not there, compute it
>>and add it.  Whenever Classifier.learn or Classifier.unlearn are called,
>>probcache is blown away.
>>
>>This will effectively cache probabilities on demand, and make sure they
>>are current.  No need for a revision anymore.
>>
>>Sound good?
>
>Sounds good to me.  If you split the probability computation itself
>into a separate method from the cache management stuff, then it makes
>it easier to subclass to replace just the counts->probability formula.

From my careful and time consuming examination of the code <wink>, it appeared 
to me that meta revision only changed when nham or nspam changed.  Therefore, 
caching on the ratios rather than nham and nspam allowed the cache to be 
pertinent all the time.  Nuking a cache is expensive...

As for indexing on an integer vs a float.  Both are immutable types, so you're 
really indexing on an object reference, not the value.  I think python is 
smart enough to realize this, and not waste the time hashing on the value in 
this instance... correct me if I'm wrong.

- TimS
>
>- Alex
>
>
- Tim
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Fri Nov 22 20:50:55 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 22 Nov 2002 12:50:55 -0800
Subject: [Spambayes] Re: caching stuff
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	<20021122203837.4D2CDF580@cashew.wolfskeep.com> 
References: <B7WSUSLKKEKGJITQ3CBB9ECTPNHA6ZU.3dde92fe@riven>
	<20021122203837.4D2CDF580@cashew.wolfskeep.com> 
Message-ID: <20021122205055.A66B2F580@cashew.wolfskeep.com>

In message:  <B7WSUSLKKEKGJITQ3CBB9ECTPNHA6ZU.3dde92fe@riven>
             <tim@fourstonesExpressions.com> writes:
>
>From my careful and time consuming examination of the code <wink>, it
>appeared to me that meta revision only changed when nham or nspam changed.
>Therefore, caching on the ratios rather than nham and nspam allowed the
>cache to be pertinent all the time.  Nuking a cache is expensive...

Unfortunately, preserving the cache when nham or nspam changes is bad,
because the bayesian adjustment changes, even if the ham and spam
ratios don't.  :-(

Nuking a cache in toto is a lot less expensive than individually
invalidating or updating records (which was update_probabilities'
downfall).  Either is a lot less expensive than giving the wrong
answer.
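
A small worked example may make the point concrete.  The sketch assumes an
adjustment of the form (s*x + n*p) / (s + n) with n = spamcount + hamcount,
s = 0.45 and x = 0.5; the function name is made up, and the constants are
assumptions chosen because they reproduce the 0.155 and 0.844 figures quoted
earlier in the thread.

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Ratio-based probability followed by the assumed Bayesian adjustment."""
    spamratio = spamcount / float(nspam)
    hamratio = hamcount / float(nham)
    p = spamratio / (spamratio + hamratio)
    n = spamcount + hamcount
    return (s * x + n * p) / (s + n)


# Same spam and ham ratios (0.2 and 0.1), different training totals.
before = spamprob(spamcount=2, hamcount=1, nspam=10, nham=10)
after = spamprob(spamcount=4, hamcount=2, nspam=20, nham=20)
print(round(before, 4), round(after, 4))
```

Under these assumptions the two results come out near 0.645 and 0.655, so a
cache keyed on the ratios alone would hand back a stale value for the second
call.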

>As for indexing on an integer vs a float.  Both are immutable types, so
>you're really indexing on an object reference, not the value.

Eh, I don't think so... but I don't know enough python internals to
be sure.  (Sure, they are immutable types, but I strongly doubt that
they're hashed as objects; that would imply that all references to
a float value 3.0 were references to the same object... which means
some sort of search for the 3.0 object when you added 2.5 and 0.5...
which would be a severe performance loss.  It seems far more likely
that they're hashed by value instead (even if that value is currently
boxed in an object).)

Does anyone with more python mojo have a definitive answer?  Guido?
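
For what it is worth, a quick experiment in a current CPython suggests that
dictionary lookups go by hash and equality of the value rather than by object
identity.  This is only a behavioural check, not a statement about the
interpreter internals of the Python version under discussion here.

```python
a = 3.0
b = float("2.5") + float("0.5")  # computed at run time, also equals 3.0

cache = {a: "cached"}

print(a == b, hash(a) == hash(b))  # equal values hash alike
print(cache[b])                    # found via b, whichever object b is
print(cache[3])                    # the int 3 equals 3.0, so it hits too
```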

- Alex

From tim@fourstonesExpressions.com  Fri Nov 22 21:19:11 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 15:19:11 -0600
Subject: [Spambayes] Re: caching stuff
In-Reply-To: <20021122205055.A66B2F580@cashew.wolfskeep.com>
Message-ID: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven>

11/22/2002 2:50:55 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <B7WSUSLKKEKGJITQ3CBB9ECTPNHA6ZU.3dde92fe@riven>
>             <tim@fourstonesExpressions.com> writes:
>>
>>From my careful and time consuming examination of the code <wink>, it
>>appeared to me that meta revision only changed when nham or nspam changed.
>>Therefore, caching on the ratios rather than nham and nspam allowed the
>>cache to be pertinent all the time.  Nuking a cache is expensive...
>
>Unfortunately, preserving the cache when nham or nspam changes is bad,
>because the bayesian adjustment changes, even if the ham and spam
>ratios don't.  :-(
>
>Nuking a cache in toto is a lot less expensive than individually
>invalidating or updating records (which was update_probabilities
>downfall).  Either is a lot less expensive than giving the wrong
>answer.

Well, if the Bayesian prob changes even if the ham and spam ratios don't, then 
of course the caching scheme is bad.  But I certainly don't see that in the 
code that I changed.  Maybe I'm looking in the wrong place...

- TimS
>
>>As for indexing on an integer vs a float.  Both are immutable types, so
>>you're really indexing on an object reference, not the value.
>
>Eh, I don't think so... but I don't know enough python internals to
>be sure.  (Sure, they are immutable types, but I strongly doubt that
>they're hashed as objects; that would imply that all references to
>a float value 3.0 were references to the same object... which means
>some sort of search for the 3.0 object when you added 2.5 and 0.5...
>which would be a severe performance lose.  It seems far more likely
>that they're hashed by value instead (even if that value is currently
>boxed in an object).)
>
>Does anyone with more python mojo have a definitive answer?  Guido?
>
>- Alex
- Tim
www.fourstonesExpressions.com 


From lists@morpheus.demon.co.uk  Fri Nov 22 21:19:07 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Fri, 22 Nov 2002 21:19:07 +0000
Subject: [Spambayes] Outlook addin crash
References: <16E1010E4581B049ABC51D4975CEDB88619950@UKDCX001.uk.int.atosorigin.com>
Message-ID: <n2m-g.zns1xpys.fsf@morpheus.demon.co.uk>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> From: Moore, Paul 
>> Just got the following in the Outlook addin. No idea what caused it,
>> but the "Exception in thread xxxx" messages are probably the relevant
>> bits (I spent a while trying to get the "Filter Now" button to work
>> before I thought of starting traceutil).
>
> Got it, by a bit of binary search my new messages...
>
> I had an appointment confirmation in my inbox, which was causing the addin
> to crash. I'd suggest that anything other than a mail item be ignored
> when filtering.

Hmm, the fix isn't obvious. In manager.py, BayesManager.score() is
where it goes wrong - msg.GetEmailPackageObject() fails.

I suspect that the correct solution is to default the score to 0 (ham)
for non-mail objects (on the basis that spammers don't send
appointments...). However, we're too late at this point - the place
where we're interacting with the MAPI object is in msgstore.py,
_GetMessageText. We could check PR_MESSAGE_CLASS there for IPM.Note,
but what do we do with non-notes?

The best I can think of is to define an exception class,
NonMessageException, and raise it in _GetMessageText. We can then
catch it in score() and handle it appropriately there. But I'm not
100% convinced that this isn't an abuse of exceptions...

But I can't think of a better answer short of a fairly major
restructuring.
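
A minimal sketch of the exception-based approach described above. The
MAPI lookup is replaced by a plain dict, so only the names
NonMessageException, _GetMessageText, score and the IPM.Note message
class come from this thread; the dict keys and the classify callback are
assumptions made for illustration, not the addin's real interfaces.

```python
class NonMessageException(Exception):
    """Raised when an item is not a mail message (e.g. an appointment)."""


def _GetMessageText(item):
    # Stand-in for the msgstore.py lookup: `item` is a plain dict here,
    # and its "message_class" key mimics PR_MESSAGE_CLASS (assumption).
    message_class = item.get("message_class", "")
    if message_class != "IPM.Note":
        raise NonMessageException(message_class)
    return item["text"]


def score(item, classify):
    # Non-mail items default to 0 (ham) instead of crashing the filter.
    try:
        text = _GetMessageText(item)
    except NonMessageException:
        return 0.0
    return classify(text)


mail = {"message_class": "IPM.Note", "text": "hello"}
appointment = {"message_class": "IPM.Appointment", "text": "meeting at 3"}
mail_score = score(mail, lambda text: 0.9)
appointment_score = score(appointment, lambda text: 0.9)
```

The exception keeps the "what is this object" test next to the MAPI
access, while the "what score does it get" policy stays in score().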

Paul.
-- 
This signature intentionally left blank

From rob@hooft.net  Fri Nov 22 21:40:19 2002
From: rob@hooft.net (Rob Hooft)
Date: Fri, 22 Nov 2002 22:40:19 +0100
Subject: [Spambayes] proposed changes to hammie & co.
References: <F0XV61MI64TQRCBYSRQZTLK533QPWT.3ddc4703@riven>
	<20021121053628.DD924F5A7@cashew.wolfskeep.com> <w53y97nxxof.fsf@woozle.org>
	<20021121171801.03720F5AC@cashew.wolfskeep.com>  <3DDDF19E.3000305@hooft.net>
	<20021122184958.7B9C4F598@cashew.wolfskeep.com>
Message-ID: <3DDEA443.80300@hooft.net>

T. Alexander Popiel wrote:
> In message:  <3DDDF19E.3000305@hooft.net>
>              Rob Hooft <rob@hooft.net> writes:
> 
>>I definitely wouldn't move the calculation into the wordinfo class. It 
>>is a different task, so it "should" (design) be a separate class....
> 
> I moderately agree, but OOP folks tend to have an aversion
> to pure data classes (as I think WordInfo should be). ;-)

It doesn't have to be a pure data class. If you want to do pure OOP, 
indeed the WordInfo class should hide its implementation detail:

class WordInfo:
    def __init__(self, probcalc, ...):
        self.probcalc = probcalc

    def spamprob(self):
        return self.probcalc(self.hamcount, self.spamcount)

class CachingProbCalc(ProbCalc):
    # The caching calculator
    def __call__(self, hamcount, spamcount):
        ....

class Bayes:
    ....
        pc = ProbCalc()
        ....
        wi = WordInfo(pc)

I like object composition (as you can see as well from CostCounter.py).
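
A runnable version of that composition idea might look like the sketch
below. The class names follow the outline above, but the constructor
arguments and especially the formula inside ProbCalc are placeholders
chosen for illustration; they are not the project's actual calculation.

```python
class ProbCalc:
    """Turns raw counts into a probability (placeholder formula)."""

    def __call__(self, hamcount, spamcount):
        return (spamcount + 1.0) / (hamcount + spamcount + 2.0)


class CachingProbCalc(ProbCalc):
    """Same result as ProbCalc, but remembers the answer per count pair."""

    def __init__(self):
        self.cache = {}

    def __call__(self, hamcount, spamcount):
        key = (hamcount, spamcount)
        if key not in self.cache:
            self.cache[key] = ProbCalc.__call__(self, hamcount, spamcount)
        return self.cache[key]


class WordInfo:
    """Holds the counts and delegates the calculation to its calculator."""

    def __init__(self, probcalc, hamcount=0, spamcount=0):
        self.probcalc = probcalc
        self.hamcount = hamcount
        self.spamcount = spamcount

    def spamprob(self):
        return self.probcalc(self.hamcount, self.spamcount)


pc = CachingProbCalc()
wi = WordInfo(pc, hamcount=1, spamcount=3)
first_answer = wi.spamprob()
second_answer = wi.spamprob()  # served from pc.cache
```

A cache keyed only on the two counts is valid for this placeholder
because the formula uses nothing else; a formula that also depends on
corpus totals would need those totals in the key as well.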

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From popiel@wolfskeep.com  Fri Nov 22 21:51:16 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Fri, 22 Nov 2002 13:51:16 -0800
Subject: [Spambayes] Re: caching stuff 
In-Reply-To: Message from Tim Stone - Four Stones Expressions
	<tim@fourstonesExpressions.com> 
	<621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> 
References: <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven> 
Message-ID: <20021122215116.BA3E0F580@cashew.wolfskeep.com>

In message:  <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven>
             <tim@fourstonesExpressions.com> writes:
>
>Well, if the Bayesian prob changes even if the ham and spam ratios don't, then 
>of course the caching scheme is bad.  But I certainly don't see that in the 
>code that I changed.  Maybe I'm looking in the wrong place...

In the probability computation (which I'm reading from
update_probabilities in an old image):

        prob = spamratio / (hamratio + spamratio)
        n = hamcount + spamcount
        prob = (StimesX + n * prob) / (S + n)


Here we see that prob depends on both the ratios and the
raw counts; since each ratio is a raw count divided by nham
or nspam, prob also depends on nham & nspam (to get the same
non-zero ratio with a different total, you'd have to have a
different raw count).

There's normally a hulking huge comment in the middle of
the code snippet above - that may be making it harder to
spot.

- Alex


From tim@fourstonesExpressions.com  Fri Nov 22 22:13:01 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 16:13:01 -0600
Subject: [Spambayes] If this doesn't motivate us...
Message-ID: <JDOI1U082WGPO75YKHXSIGC0QPPJUQ.3ddeabed@riven>

http://www.freep.com/money/tech/mwend22_20021122.htm

ARGH!!!

- TimS
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Fri Nov 22 23:02:42 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 17:02:42 -0600
Subject: [Spambayes] Re: caching stuff 
In-Reply-To: <20021122215116.BA3E0F580@cashew.wolfskeep.com>
Message-ID: <JFVTDBNH06ZWT61JD61B8ITOHD742.3ddeb792@riven>

11/22/2002 3:51:16 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven>
>             <tim@fourstonesExpressions.com> writes:
>>
>>Well, if the Bayesian prob changes even if the ham and spam ratios don't, then 
>>of course the caching scheme is bad.  But I certainly don't see that in the 
>>code that I changed.  Maybe I'm looking in the wrong place...
>
>In the probability computation (which I'm reading from
>update_probabilities in an old image):
>
>        prob = spamratio / (hamratio + spamratio)
>        n = hamcount + spamcount
>        prob = (StimesX + n * prob) / (S + n)
>
>
>Here we see that prob is based on both the ratios and the
>raw counts; thus, they're also based on nham & nspam
>(because to get the same non-zero ratio, you'd have to
>have a different raw count).

I get it now... the larger the raw counts, the more weight is given to this 
word...

So my cache mechanism is fatally flawed.

- TimS
>
>There's normally a hulking huge comment in the middle of
>the code snippet above - that may be making it harder to
>spot.
>
>- Alex
>
>
>
- Tim
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Fri Nov 22 23:46:49 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 17:46:49 -0600
Subject: [Spambayes] Re: caching stuff 
In-Reply-To: <20021122232400.B57C9F580@cashew.wolfskeep.com>
Message-ID: <ZT32HEC8RPMJ21VPSQ5YGC653WURRNUS.3ddec1e9@riven>

11/22/2002 5:24:00 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <UR83D9KICAFEIERNXUMIUTFE0053X1Z.3ddea9ce@riven>
>             Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:
>>11/22/2002 3:51:16 PM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:
>>
>>>In message:  <621ZJEWSC7YV2YB6GAPJPJNLZX7454IG.3dde9f4f@riven>
>>>             <tim@fourstonesExpressions.com> writes:
>>>
>>>        prob = spamratio / (hamratio + spamratio)
>>>        n = hamcount + spamcount
>>>        prob = (StimesX + n * prob) / (S + n)
>
>>But I think what you're saying is that it's possible to come up with the
>>same ratio with different raw numbers, like 2:5 and 4:10.  The ratio is
>>the same, but the prob is different? 
>
>Exactly!  Let's work it through with hamratios of 2:5 and 4:10, spamcount
>always 0:
>
>The initial version of prob = 0 / (.4 + 0) remains the same for both.
>However, the value of n is 2 in the first case and 4 in the second case.
>That means that the adjustment prob = (StimesX + n * prob) / (S + n)
>is different; StimesX / (S + 2) in one case, and StimesX / (S + 4) in
>the second.  Given the default S and X, that's .0918 and .0506.  A fairly
>significant difference.
>
>Does that help?
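
The arithmetic above can be checked with a few lines of Python. This is
only a sketch of the three quoted statements wrapped in a function; the
function name and argument layout are made up here, and the defaults
S = 0.45 and X = 0.5 are assumed because they reproduce the .0918 and
.0506 figures quoted in the message.

```python
def spamprob(hamcount, spamcount, nham, nspam, S=0.45, X=0.5):
    # Ratios are raw counts divided by the number of trained messages.
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    # The adjustment pulls prob toward X; more evidence (larger n)
    # means the raw prob gets more weight.
    n = hamcount + spamcount
    StimesX = S * X
    return (StimesX + n * prob) / (S + n)


# Same ham ratio (2:5 and 4:10), spamcount always 0.
first = spamprob(2, 0, nham=5, nspam=5)
second = spamprob(4, 0, nham=10, nspam=10)
print(round(first, 4), round(second, 4))  # 0.0918 0.0506
```

Identical ratios therefore give different results, which is why a cache
keyed on the ratios alone returns wrong answers.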

I humbly thank my teachers for their patience.  :)

Check the new code I'm checking into hammie-playground.  I think this is more 
what you're looking for.  Rob gave us some good food for thought, huh?

- TimS
>
>- Alex
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Sat Nov 23 00:04:21 2002
From: neale@woozle.org (Neale Pickett)
Date: 22 Nov 2002 16:04:21 -0800
Subject: [Spambayes] anyone going to the spam conference?
Message-ID: <w53smxtup6i.fsf@woozle.org>

So, anyone planning on going to Paul Graham's spam conference in January in
Cambridge?  I just got the nod from $FIRM to attend.  An all-expenses
paid trip to New England in winter!  Woo!

I'd like to present a talk on behalf of the spambayes project.  If
anyone else was planning on doing this, please let me know.  There are
certainly folks who know more than I about how our classifier works, but
nobody from our project shows up on the speakers list.

If nobody else steps up to the plate, I'm going to have to ask a lot of
dumb questions about the thing.  That should be a pretty good motivator!
<0.9 wink>

Neale

From lists@webcrunchers.com  Sat Nov 23 00:27:27 2002
From: lists@webcrunchers.com (John D.)
Date: Fri, 22 Nov 2002 16:27:27 -0800
Subject: [Spambayes] Hourra for pop3proxy !
In-Reply-To: <BA026C0B.5D02E%francois.granger@free.fr>
Message-ID: <v03110747ba047782a756@[192.168.0.2]>

Francis writes:

>It is plug & play. First try with two pop server, everything was working as
>advertised. The training interface is really good. It just need some
>cosmetic improvements. I"ll try to come with some patches for this.

I'm already talking with Richie about adding a 'Spam management' system,
using an enhanced web GUI but allowing a number of "controls" for the
classifier, tokenizer, and other spambayes tweakage.  Other things I want
to include are spam management functions: making it easy to report spam,
send to use@ftc.gov, etc.

I want to start defining the GUI as early as next weekend.  If anyone
wants additional features along these lines, please let me know.

So far planned:

  * Simplifying reporting to spamcop
  * Database for "frequent offenders"
  * Tracking tools for tracking their origin.
  * Testing validity of "opt out" addresses and links
  * Easy to use tracking and data collection on spammers.
  * References to the top anti-spammer links,  and interfacing
    data to these other anti-spammers.
  * establish a test connection to the spammers opt in pop server
    to see if that address is valid.

The database will use PostgreSQL and the PyGreSQL modules.  It will work
with the Apache server, using the Python-ready "cgi" modules.

It will interface with the pop3proxy, and would pass the spam from the
proxy GUI to the database.

In tracking down spammers, I'm finding it really difficult to keep track
of them, because I have to spend time identifying previous spams they
sent and looking through a large list.  I want to use the database to
search for a spam to see if it occurred before, so I can tell if it's
from a "repeat offender".

I want to be able to click on a spam message, then click a query button
to see if I got this spam before.  Most of the spam I get from repeat
offenders is identical, so it should be easy to look up.  Once I find a
particular spam, I'll have the notes I took on it, reminding me of my
last correspondence with them.

Even though I spend a lot of time tracking them down, they STILL continue
to spam me, even after I talk to the owner.  I can then collect this
data, and use it for prosecution of "repeat violators".  I want to make
it easy for EVERYONE to do this.

I'll call this module "SMS" (Spam management system) for lack of a better
name.  I guess I would just put it in a separate directory in CVS when
I'm ready to release it.

John


From lists@webcrunchers.com  Sat Nov 23 00:45:57 2002
From: lists@webcrunchers.com (John D.)
Date: Fri, 22 Nov 2002 16:45:57 -0800
Subject: [Spambayes] If this doesn't motivate us...
In-Reply-To: <JDOI1U082WGPO75YKHXSIGC0QPPJUQ.3ddeabed@riven>
Message-ID: <v03110748ba047f788657@[192.168.0.2]>

Tim writes:

>http://www.freep.com/money/tech/mwend22_20021122.htm
>
>ARGH!!!

It quotes "Police promptly raided the business and confiscated Ralsky's servers. Although they were returned a few days later, Ralsky now tries to cover his tracks better, so opponents won't know what companies and servers he's using".

I didn't know it was possible to forge the IP address.   I would be most interested in seeing how that's done.

Just so you know,  my contact in China and I have been very influential in getting his Chinese servers shut down.   I have a reliable contact in China who I call upon that "opens a lot of doors" that no American can do from here.

He then translates things in Chinese for me,  then makes phone calls to the internet providers,   and speaks to them in Chinese terms.

It turns out the Chinese are getting sick of this,  and are soon planning to crack down on Foreign spammers using their networks within China...

John


From tim@fourstonesExpressions.com  Sat Nov 23 00:56:06 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Fri, 22 Nov 2002 18:56:06 -0600
Subject: [Spambayes] If this doesn't motivate us...
Message-ID: <424YZVSOVP1YFCB91WKG762ULKCA1T98.3dded226@riven>

11/22/2002 6:45:57 PM, "John D." <lists@webcrunchers.com> wrote:

>Tim writes:
>
>>http://www.freep.com/money/tech/mwend22_20021122.htm
>>
>>ARGH!!!
>
>It quotes "Police promptly raided the business and confiscated Ralsky's
>servers. Although they were returned a few days later, Ralsky now tries to
>cover his tracks better, so opponents won't know what companies and servers
>he's using".
>
>I didn't know it was possible to forge the IP address.   I would be most
>interested in seeing how that's done.

Forging an ip address is very very simple.  When you create a connection, you 
tell tcp what ip address you want it to say you're using.  Our friend Neale 
Pickett has written a great little treatment of sockets, that's at 
http://www.woozle.org/~neale/papers/sockets.html.  The problem with 
backtracing ip addresses is well documented.  When you spoof an address, you 
typically pick an address that you'd like people to think you are, like a 
yahoo address or something like that.  Then, if anybody gets mad at anybody, 
Yahoo gets the blame.  The solution is for our router/switch/gear 
manufacturing friends to make it impossible for people to spoof addresses, by 
looking at outgoing packets and rejecting those that don't match the ip 
address that the router knows it's coming from.  But for some reason, this 
isn't done, though DDOS and other spoof attacks are starting to raise an 
outcry that something be done.
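
The router-side check described above (reject outgoing packets whose
source address is not one the router knows to be behind that port) can
be illustrated with a toy model. This is not router configuration; it
only shows the membership test, using Python's standard ipaddress
module. The function name and the example prefixes (reserved
documentation ranges) are made up for illustration.

```python
import ipaddress


def allow_outgoing(source_ip, attached_prefix):
    """Return True if the packet's claimed source belongs to the network
    that is actually attached to this router port."""
    source = ipaddress.ip_address(source_ip)
    network = ipaddress.ip_network(attached_prefix)
    return source in network


# A host on 192.0.2.0/24 sending with its own address passes...
honest = allow_outgoing("192.0.2.17", "192.0.2.0/24")
# ...but the same host claiming someone else's address is dropped.
spoofed = allow_outgoing("198.51.100.9", "192.0.2.0/24")
print(honest, spoofed)  # True False
```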

>
>Just so you know,  my contact in China and I have been very influential in
>getting his Chinese servers shut down.   I have a reliable contact in China
>who I call upon that "opens a lot of doors" that no American can do from here.

You go, dude!

>
>He then translates things in Chinese for me,  then makes phone calls to the
>internet providers,   and speaks to them in Chinese terms.
>
>It turns out the Chinese are getting sick of this,  and are soon planning to
>crack down on Foreign spammers using their networks within China...

That's great news.
>
>John
>
>
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From dereks@itsite.com  Sat Nov 23 03:45:12 2002
From: dereks@itsite.com (Derek Simkowiak)
Date: Fri, 22 Nov 2002 22:45:12 -0500 (EST)
Subject: [Spambayes] If this doesn't motivate us...
In-Reply-To: <424YZVSOVP1YFCB91WKG762ULKCA1T98.3dded226@riven>
Message-ID: <Pine.LNX.4.33L2.0211222231370.13853-100000@dev.itsite.com>

> address that the router knows it's coming from.  But for some reason, this
> isn't done, though DDOS and other spoof attacks are starting to raise an
> outcry that something be done.

	Rules from history:

1. Spamming will continue until it's not profitable.

2. Routers won't come with anti-spoof features until corporations are
willing to pay extra money for them.


> >It turns out the Chinese are getting sick of this,  and are soon planning to
> crack down on Foreign spammers using their networks within China...
[...]
> That's great news.

	Not to people concerned about human rights and freedom of speech.


--Derek


From lists@morpheus.demon.co.uk  Sat Nov 23 15:49:16 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sat, 23 Nov 2002 15:49:16 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
Message-ID: <n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> (This is from memory, as it happened on my home setup and I'm at
> work now, so I apologise if it's a bit vague).

It's just happened again. So I can diagnose a bit better...

>> It's "no". 8-) Like I say, no-one's reported it locking before, and
>> I've never seen it. You usually get a traceback when something goes
>> wrong. So your console says something like:
>>
>> Loading database... Done.
>> BayesProxyListener listening on port 110.
>> UserInterfaceListener listening on port 8880.

Console screenshot:

Loading database... Done.
BayesProxyListener listening on port 8110.
UserInterfaceListener listening on port 8880.

>> and nothing else, and the process is still running, but you can't
>> get a page served to your browser? What error message do you get
>> from the browser? If it's one of those pointless IE error pages,
>> could you try telnetting to port 8880 and saying "GET / HTTP/1.0"?
>> Can you even connect with telnet? How about port 110?

The browser shows "Training..." and nothing more. The status bar shows
"Opening page http://localhost:8880/review..." and the progress bar is
part way across and stuck.

Python is running at 90%+ CPU. Looks like it's in a loop somewhere.

> I can't do a telnet at the moment to check.

Telnet isn't responding. The thing's almost certainly in a loop.

> It was a dbm file.
>
> The command line was pop3proxy.py -d -l 8110 localhost
>
> (proxying a local POP server on port 110 with the proxy on port 8110
> using a DBM file). Working directory was the directory of the program.
> No bayescustomize.ini file.

The UI showed the database as having 0 ham and 0 spam, but it was
doing this yesterday, and everything worked fine then. Looks like some
sort of database corruption, but a subtle one...

OK, I ran it with Corpus.Verbose = True, and it seems to be locking up
just after printing "training with" in Bayes.Trainer.train

Further checking... it's in self._add_msg(wordstream, is_spam) in
classifier.Bayes. Best I can locate, it's locking up trying to store a
None in self.wordinfo. Specifically,

    # Needed to tell a persistent DB that the content changed.
    wordinfo[word] = record

locks up with record = None (and word = electronics, but I doubt
that's relevant :-))

I can post my hammie.db, but it's 1.4M (360K zipped) so I won't bother
unless someone thinks it's going to help significantly... (BTW, some
sort of dumper of a spambayes database file might be helpful in
diagnosing problems like this - at least a structure validator. I
don't know how possible this is, and as this area is changing rapidly
right now, I'll just put it on the TODO list for the moment.)
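
As a rough idea of what such a structure validator could check, here is
a sketch that walks a word-to-record mapping and reports records that
are None or have bad counts. The hamcount and spamcount attribute names
appear elsewhere in this thread; everything else (the function name, the
use of a plain dict in place of the real database, the sample data) is
an assumption for illustration.

```python
from types import SimpleNamespace


def validate_wordinfo(wordinfo):
    """Return a list of (word, problem) pairs for suspicious records."""
    problems = []
    for word, record in wordinfo.items():
        if record is None:
            problems.append((word, "record is None"))
            continue
        for field in ("hamcount", "spamcount"):
            value = getattr(record, field, None)
            if not isinstance(value, int) or value < 0:
                problems.append((word, "bad %s: %r" % (field, value)))
    return problems


# A plain dict stands in for the persistent database here.
sample = {
    "good": SimpleNamespace(hamcount=1, spamcount=0),
    "electronics": None,
    "odd": SimpleNamespace(hamcount=-1, spamcount=2),
}
problems = validate_wordinfo(sample)
```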

Paul.
-- 
This signature intentionally left blank

From francois.granger@free.fr  Sat Nov 23 18:59:58 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Sat, 23 Nov 2002 19:59:58 +0100
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
References: 
 <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
 <n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
Message-ID: <a0510030eba057ff2c1fe@[192.168.1.11]>

At 15:49 +0000 23/11/02, in message Re: [Spambayes] New web training 
interface for pop3prox, Paul Moore wrote:
>I can post my hammie.db, but it's 1.4M (360K zipped) so I won't bother
>unless someone thinks it's going to help significantly... (BTW, some
>sort of dumper of a spambayes database file might be helpful in
>diagnosing problems like this - at least a structure validator. I
>don't know how possible this is, and as this area is changing rapidly
>right now, I'll just put it on the TODO list for the moment.)


The enclosed script did it for the recent pickle format.

I'll retest it and improve if necessary.


It is rather crude at the moment, but by changing format = csv to 
format = asctab on line 145, you get either pseudo-CSV or ASCII 
tab-separated values.
-- 
Email is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies.
For clean emails: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html
-------------- next part --------------
Skipped content of type multipart/appledouble

From noreply@sourceforge.net  Sat Nov 23 14:00:57 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sat, 23 Nov 2002 06:00:57 -0800
Subject: [Spambayes] [ spambayes-Bugs-642740 ] "Recover from Spam" wrong
	folder
Message-ID: <E18Faq9-0001wJ-00@sc8-sf-web2.sourceforge.net>

Bugs item #642740, was opened at 2002-11-24 01:00
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: "Recover from Spam" wrong folder

Initial Comment:
Outlook addin:

Selecting "Recover From Spam" recovers the selected
message to the Inbox folder - which is not necessarily
where it came from.  The filterer will need to save the
folder it came from before we can do this.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=642740&group_id=61702

From tim.one@comcast.net  Sat Nov 23 20:27:30 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 15:27:30 -0500
Subject: [Spambayes] If this doesn't motivate us...
In-Reply-To: <v03110748ba047f788657@[192.168.0.2]>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEOJCOAB.tim.one@comcast.net>

[John D.]
> ...
> I didn't know it was possible to forge the IP address.   I would
> be most interested in seeing how that's done.

It's just bits put together by software.  Some systems make it easier than
others.  There's a huge and ongoing flap about Windows XP Home edition,
which is the first consumer MS OS said to make it quite easy.  Once you get
the spoofing working, here's how to do really nasty stuff <wink>:

    http://rr.sans.org/threats/intro_spoofing.php


From tim.one@comcast.net  Sat Nov 23 21:49:10 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 16:49:10 -0500
Subject: [Spambayes] Outlook weirdness
In-Reply-To: <MJEHLHJKGINLONDMMKNECEGKHIAA.seant@iname.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEONCOAB.tim.one@comcast.net>

[Sean True, on using a database]
> ...
> Slower *training* would be an issue, however.

For bulk training, but one-at-a-time training would be much faster (no need
for update_probabilities() at the end, which computes a new value for every
word in the database).  Bulk training could be taught to use a new
classifier based on an in-memory dict.  When that's done, the in-memory
dict's ham and spam counts would be added into the persistent DB (rewriting
only those WordInfo records corresponding to words that appeared in the bulk
training data), and then the in-memory dict could be thrown away.
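
A sketch of that bulk-training scheme, with a plain dict standing in for
the persistent DB. Counting each word at most once per message, the
(hamcount, spamcount) tuple layout, and the function name are
assumptions made for this illustration rather than the project's code.

```python
from collections import defaultdict


def bulk_train(persistent, messages):
    """Train on (words, is_spam) pairs in memory, then merge once.

    Only records for words seen in this batch are rewritten.
    Returns the number of records written.
    """
    scratch = defaultdict(lambda: [0, 0])  # word -> [ham, spam]
    for words, is_spam in messages:
        for word in set(words):
            scratch[word][1 if is_spam else 0] += 1

    for word, (ham, spam) in scratch.items():
        old_ham, old_spam = persistent.get(word, (0, 0))
        persistent[word] = (old_ham + ham, old_spam + spam)
    return len(scratch)  # the scratch dict is thrown away on return


db = {"hello": (3, 0), "viagra": (0, 7), "untouched": (1, 1)}
written = bulk_train(db, [
    (["hello", "cheap"], False),
    (["viagra", "cheap", "cheap"], True),
])
```

Here three records are written and "untouched" is never rewritten, which
is the point of merging only the words that appeared in the batch.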


From tim.one@comcast.net  Sat Nov 23 21:54:15 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 16:54:15 -0500
Subject: [Spambayes] Documentation...
In-Reply-To: <ddtntuo5m5gddnp835hdohlj2rrtllu3kl@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEOOCOAB.tim.one@comcast.net>

[Richie Hindle]
> This may be premature, but as part of helping John Draper set up the
> spambayes software I've made a start on some user documentation.  It
> could go on the website, or maybe in with the source code - I'm not
> sure we're ready to give the impression that this stuff is ready for
> "normal people" to use yet.

First check it into the project, so other people can help update it too, and
so it doesn't get lost.  These docs are a great beginning!


From tim.one@comcast.net  Sat Nov 23 22:10:27 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 17:10:27 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <ju3otuofsrrr5gf5em75qs4bg45k3lug37@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEOPCOAB.tim.one@comcast.net>

[David Ascher]
> Make 'hovertips' that display the first few lines of the body

[Richie Hindle]
> This is done.  The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!

It works very well except when it doesn't <wink>.  The chief
damned-whether-you-do-or-don't problem:  I've seen several msgs with HTML
style sheets and/or HTML comments exceeding 2K characters.  The 2K limit
in the minimal matches serves two purposes:

1. Prevent the C stack from blowing up in the regexp engine.  But
   François Granger reported a C stack blowup anyway on Mac OS 9,
   and I still have no clue how small a limit would prevent that on
   his box.

2. Prevent it from consuming an arbitrary amount of text in case
   we matched a "begin long construct" character sequence by accident.
   It's *unlikely* that random text contains "<style" or "<!--"
   by accident, though, so I'm not much worried about that one.
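
To make the trade-off concrete, here is a sketch of a bounded minimal
match. It is not the regexp from tokenizer.py; the pattern, the LIMIT
value, and the variable names are assumptions. It shows how a cap on the
match length leaves an oversized style section untouched.

```python
import re

LIMIT = 2048  # illustrative cap on how much text one match may swallow

# Minimal (non-greedy) matches with an upper bound on their length.
strip_long = re.compile(
    r"<\s*style\b.{0,%d}?<\s*/\s*style\s*>|<!--.{0,%d}?-->" % (LIMIT, LIMIT),
    re.DOTALL | re.IGNORECASE,
)

short = "a<style>p { color: red }</style>b<!-- note -->c"
oversized = "a<style>" + "x" * 3000 + "</style>b"

short_result = strip_long.sub("", short)          # both constructs removed
oversized_result = strip_long.sub("", oversized)  # too long, left as-is
```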

> (Apologies to Tim - it seems to work extremely well.)

Yes, when it works at all <wink>.  Fixing it in all cases requires doing
real HTML parsing, and that's expensive, so the current "cheap-ass gimmick"
is accurate.

> Rest assured it's safe from HTML content leaking into the web
> interface - the worst that will happen is that you'll see HTML source
> in the hovertip.

A giant <style .. </style> section near the start seems the most likely
glitch here.  Are you using this regexp *from* Python, or from Javascript?
I have half a mind to replace the comment and style nuking with an
iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
crack_urls(), which only use regexps to help find the right places to poke
at -- they can't blow the C stack).  But if you're doing this from
Javascript, that wouldn't help you.
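
One way such an iterative scheme could look is sketched below: small
regexps only locate the start and end markers, and an ordinary loop does
the cutting, so the size of the section in between does not matter. The
function and pattern names are made up for this sketch and are not taken
from tokenizer.py.

```python
import re

STYLE_START = re.compile(r"<\s*style\b", re.IGNORECASE)
STYLE_END = re.compile(r"<\s*/\s*style\s*>", re.IGNORECASE)
COMMENT_START = re.compile(r"<!--")
COMMENT_END = re.compile(r"-->")


def strip_blocks(text, start_re, end_re):
    """Remove every start...end section; no nesting, no length limit."""
    pieces = []
    pos = 0
    while True:
        start = start_re.search(text, pos)
        if start is None:
            break
        end = end_re.search(text, start.end())
        if end is None:
            break  # unterminated construct: keep the rest untouched
        pieces.append(text[pos:start.start()])
        pos = end.end()
    pieces.append(text[pos:])
    return "".join(pieces)


big = "a<STYLE type='x'>" + "y" * 100000 + "</style>b<!-- c -->d"
cleaned = strip_blocks(big, STYLE_START, STYLE_END)
cleaned = strip_blocks(cleaned, COMMENT_START, COMMENT_END)
unterminated = strip_blocks("a<!-- b", COMMENT_START, COMMENT_END)
```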


From tim@fourstonesExpressions.com  Sat Nov 23 22:40:40 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat, 23 Nov 2002 16:40:40 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEOPCOAB.tim.one@comcast.net>
Message-ID: <NON5ZOM5Z832141EDZY83IMKFBNLNI.3de003e8@riven>

11/23/2002 4:10:27 PM, Tim Peters <tim.one@comcast.net> wrote:

>[David Ascher]
>> Make 'hovertips' that display the first few lines of the body
>
>[Richie Hindle]
>> This is done.  The code to strip HTML content uses a regular expression
>> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
>> interested to see how well people find it works!
>
>It works very well except when it doesn't <wink>.  The chief damned-
>whether-you-do-or-don't problem:  I've seen several msgs with HTML style
>sheets and/or HTML comments exceeding 2K characters.  The 2K limit in the
>minimal matches serves two purposes:
>
>1. Prevent the C stack from blowing up in the regexp engine.  But
>   François Granger reported a C stack blowup anyway on Mac OS 9,
>   and I still have no clue how small a limit would prevent that on
>   his box.
>
>2. Prevent it from consuming an arbitrary amount of text in case
>   we matched a "begin long construct" character sequence by accident.
>   It's *unlikely* that random text contains "<style" or "<!--"
>   by accident, though, so I'm not much worried about that one.
>
>> (Apologies to Tim - it seems to work extremely well.)
>
>Yes, when it works at all <wink>.  Fixing it in all cases requires doing
>real HTML parsing, and that's expensive, so the current "cheap-ass gimmick"
>is accurate.
>
>> Rest assured it's safe from HTML content leaking into the web
>> interface - the worst that will happen is that you'll see HTML source
>> in the hovertip.
>
>A giant <style .. </style> section near the start seems the most likely
>glitch here.  Are you using this regexp *from* Python, or from Javascript?
>I have half a mind to replace the comment and style nuking with an
>iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
>crack_urls(), which only use regexps to help find the right places to poke
>at -- they can't blow the C stack).  But if you're doing this from
>Javascript, that wouldn't help you.

A giant <style> or a giant comment, though those don't occur that often.  
Another tag that is probably huge and worthless is <script>...</script>, often 
couched in a huge comment.  (But do scripts even occur in emailed html?)  We 
should probably use another cheap ass gimmick to get rid of those tags, then 
use the cheap ass regex to get rid of the rest of the html.

One other problem with the regex that I see is that it doesn't seem to handle 
tags with ill placed whitespace very well... like < a href=...  A whitespace 
normalization substitution regex might be well advised.  Taking out whitespace 
after a < would change a < b to a <b, not altering its meaning from a clue 
perspective, and would change <  a href=... to <a href=..., making it 
recognizable to the cheap-ass gimmick regex.
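
The whitespace normalization suggested above is a one-line substitution.
A minimal sketch, with a made-up function name:

```python
import re

_LT_SPACE = re.compile(r"<\s+")


def normalize_tag_whitespace(text):
    """Drop whitespace that directly follows a '<'."""
    return _LT_SPACE.sub("<", text)


tag = normalize_tag_whitespace("<  a href='x'>link< /a>")
comparison = normalize_tag_whitespace("a < b")
print(tag)         # <a href='x'>link</a>
print(comparison)  # a <b
```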

There was some talk earlier about gleaning clues from some tags, like 
background, font, color, etc. kind of things... any more thought along those 
lines?
>
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From richie@entrian.com  Sat Nov 23 22:51:19 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 23 Nov 2002 22:51:19 +0000
Subject: [Spambayes] Spambayes articles for the Linux Journal
Message-ID: <0nsvtucqpudd8411bfel9cc8so6gjhs6oi@4ax.com>


Exciting news for Spambayes: Don Marti has asked Gary Robinson and me to
write two articles about the project for the Linux Journal.  Gary Robinson
is going to cover the math side of things, and I'm going to cover the
history of the project and how to use the software.  

So if anyone's thinking "He'd better mention X, or he won't be doing it
right" then now's the time to say so!

[Neale]
> I'd like to present a talk on behalf of the spambayes project.  If
> anyone else was planning on doing this, please let me know.

No, but we're probably going to be aiming to achieve the same things, in
your talk and my article - we should plagiarise each other as much as
possible!

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sat Nov 23 22:51:31 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 23 Nov 2002 22:51:31 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEOPCOAB.tim.one@comcast.net>
References: <ju3otuofsrrr5gf5em75qs4bg45k3lug37@4ax.com>
	<LNBBLJKPBEHFEDALKOLCGEOPCOAB.tim.one@comcast.net>
Message-ID: <r900uu8monvsidlbcfgg21371r6phknb92@4ax.com>


[Richie Hindle]
> The code to strip HTML content uses a regular expression
> from tokenizer.py which is commented "Cheap-ass gimmick", so I'm
> interested to see how well people find it works!

[Tim Peters]
> Are you using this regexp *from* Python, or from Javascript?
> I have half a mind to replace the comment and style nuking with an
> iterative, stack-friendly scheme (like, e.g., crack_uuencode() and
> crack_urls(), which only use regexps to help find the right places to poke
> at -- they can't blow the C stack).  But if you're doing this from
> Javascript, that wouldn't help you.

I'm using it from Python, but (currently) only in a relatively unimportant
feature.  I wouldn't call it worth changing for the sake of the hovertips,
but it's definitely worth changing to fix François' stack explosion.

Somewhere (I can't find it right now, but I'll have a proper look ASAP) I
have an attempt at rewriting Tom Christiansen's striphtml program
(http://www.perl.com/CPAN/authors/Tom_Christiansen/scripts/striphtml.gz)
using non-greedy Python regexps - it's mostly a mechanical rewrite,
changing things like "<.*?>" to "<[^>]*>", and so forth.  That might work -
I'll try to dig it out.
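
For anyone unfamiliar with the two forms mentioned, this small example
shows the minimal-match pattern and the negated-character-class pattern
side by side. The sample strings are invented; the behaviour shown is
that of Python's re module without the DOTALL flag.

```python
import re

minimal = re.compile(r"<.*?>")     # minimal match up to the next '>'
negated = re.compile(r"<[^>]*>")   # anything that is not '>'

html = '<p class="x">Hello <b>world</b></p>'
same_a = minimal.sub("", html)
same_b = negated.sub("", html)

# The forms differ when a tag spans lines: '.' stops at a newline
# unless re.DOTALL is given, while [^>] happily crosses it.
multiline = "<a\nhref='x'>link</a>"
minimal_result = minimal.sub("", multiline)
negated_result = negated.sub("", multiline)
```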

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sat Nov 23 22:51:24 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 23 Nov 2002 22:51:24 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
References: 
	<16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
	<n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
Message-ID: <u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>


[Paul on his pop3proxy hang]
> The browser shows "Training..." and nothing more. The status bar shows
> "Opening page http://localhost:8880/review..." and the progress bar is
> part way across and stuck.

I just had this myself, and in my case I'm convinced it's a bug in my virus
scanner, McAfee 6.02.  The POP3 proxy had cached a virus message to disk,
the virus checker had popped up, and I'd hit "Exclude".  A few seconds
later I hit Train and the browser locked.  Messages were still arriving
over the active POP3 link - I let that finish, tried to kill the hung
browser, and saw that the virus checker was flagged as "Not responding" in
the Close Program dialog.  I tried to kill it and the operating system
exploded (nothing unusual there with Windows 98).

Judging by the rest of what you say, this isn't what's happening to you,
but I mention it for two reasons - to ask whether anyone else has had a
similar problem, and to point out that caching emails to disk can have
unwanted side effects.  The POP3 proxy can already use Tim Stone's
GzipFileMessageFactory to store the messages in compressed form, by
enabling pop3proxy_cache_use_gzip, which is off by default because it slows
things down a bit - perhaps it should be enabled by default.  Or when (if?)
we switch to ZODB, we should probably store the messages in there.
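
As a rough illustration of the compressed-cache idea, here is a sketch
using only the standard gzip module.  It is not the
GzipFileMessageFactory code; store_message(), load_message(), the file
name and the temporary directory are all assumptions made for the
example.

```python
import gzip
import os
import tempfile


def store_message(path, raw_bytes):
    """Write one raw message to `path` as gzip data."""
    with gzip.open(path, "wb") as f:
        f.write(raw_bytes)


def load_message(path):
    """Read back and decompress a message written by store_message()."""
    with gzip.open(path, "rb") as f:
        return f.read()


cache_dir = tempfile.mkdtemp()                    # throwaway cache directory
path = os.path.join(cache_dir, "msg-0001.gz")     # hypothetical file name
message = b"Subject: test\r\n\r\nhello\r\n"
store_message(path, message)
print(load_message(path) == message)
```

The bytes on disk are then gzip data rather than the raw message text,
at the cost of compressing and decompressing on every access.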

> Specifically,
> 
>     # Needed to tell a persistent DB that the content changed.
>     wordinfo[word] = record
> 
> locks up with record = None (and word = electronics, but I doubt
> that's relevant :-))

Um?  That code says:

>             if record is None:
>                 record = self.WordInfoClass(now)
> 
>             if is_spam:
>                 record.spamcount += 1
>             else:
>                 record.hamcount += 1
>             # Needed to tell a persistent DB that the content changed.
>             wordinfo[word] = record

So by the time it gets to the offending line, record can't be None...
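
For anyone following along without the source, here is a stand-alone
sketch of that logic.  It assumes a plain dict in place of the real
wordinfo mapping, and the WordInfo class and learn_word() function below
are simplified stand-ins, not the classifier's actual code.

```python
import time


class WordInfo:
    """Simplified stand-in for the per-word record."""

    def __init__(self, atime):
        self.atime = atime
        self.spamcount = 0
        self.hamcount = 0


def learn_word(wordinfo, word, is_spam, now=None):
    """Update the counts for `word`, creating the record on first sight."""
    if now is None:
        now = time.time()
    record = wordinfo.get(word)
    if record is None:
        record = WordInfo(now)      # from here on, record is never None
    if is_spam:
        record.spamcount += 1
    else:
        record.hamcount += 1
    # Re-assign so that a persistent (DBM-like) mapping sees the change.
    wordinfo[word] = record
    return record


db = {}
learn_word(db, "electronics", is_spam=True)
learn_word(db, "electronics", is_spam=False)
print(db["electronics"].spamcount, db["electronics"].hamcount)
```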

Anyhow, I'm using a pickle rather than a DBM, which might explain why I
haven't seen your problem in the same way you have.  Does Paul's
description ring any bells with people who use/maintain the DBM code?  I'll
start running the code with a DBM and see what happens.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Sat Nov 23 22:52:46 2002
From: richie@entrian.com (Richie Hindle)
Date: Sat, 23 Nov 2002 22:52:46 +0000
Subject: [Spambayes] Documentation...
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEOOCOAB.tim.one@comcast.net>
References: <ddtntuo5m5gddnp835hdohlj2rrtllu3kl@4ax.com>
	<LNBBLJKPBEHFEDALKOLCKEOOCOAB.tim.one@comcast.net>
Message-ID: <2e10uuc5agh9nqki2b7rn973m7ofu9qguv@4ax.com>


[Richie Hindle]
> This may be premature, but as part of helping John Draper set up the
> spambayes software I've made a start on some user documentation.

[Tim Peters]
> First check it into the project, so other people can help update it too, and
> so it doesn't get lost.  These docs are a great beginning!

Will do, as soon as CVS will speak to me.  8-(

Some of those docs will form part of the Linux Journal article, so all
comments are gratefully received (preferably before our 5th December
deadline!)

-- 
Richie Hindle
richie@entrian.com


From tim@fourstonesExpressions.com  Sat Nov 23 22:56:08 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sat, 23 Nov 2002 16:56:08 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>
Message-ID: <KH5ZGKJIFRPGCIHF87LHQND2VC95.3de00788@riven>

I wouldn't spend a whole lot of time working this right now, as Neale, Alex, 
and I are about to merge our stuff in, and quite a bit of this code has at 
least changed places, if not been altered a bit.  When you see the merge, go 
ahead and test this stuff again.

- TimS

11/23/2002 4:51:24 PM, Richie Hindle <richie@entrian.com> wrote:

>
>[Paul on his pop3proxy hang]
>> The browser shows "Training..." and nothing more. The status bar shows
>> "Opening page http://localhost:8880/review..." and the progress bar is
>> part way across and stuck.
>
>I just had this myself, and in my case I'm convinced it's a bug in my virus
>scanner, McAfee 6.02.  The POP3 proxy had cached a virus message to disk,
>the virus checker had popped up, and I'd hit "Exclude".  A few seconds
>later I hit Train and the browser locked.  Messages were still arriving
>over the active POP3 link - I let that finish, tried to kill the hung
>browser, and saw that the virus checker was flagged as "Not responding" in
>the Close Program dialog.  I tried to kill it and the operating system
>exploded (nothing unusual there with Windows 98).
>
>Judging by the rest of what you say, this isn't what's happening to you,
>but I mention it for two reasons - to ask whether anyone else has had a
>similar problem, and to point out that caching emails to disk can have
>unwanted side effects.  The POP3 proxy can already use Tim Stone's
>GzipFileMessageFactory to store the messages in compressed form, by
>enabling pop3proxy_cache_use_gzip, which is off by default because it slows
>things down a bit - perhaps it should be enabled by default.  Or when (if?)
>we switch to ZODB, we should probably store the messages in there.
>
>> Specifically,
>> 
>>     # Needed to tell a persistent DB that the content changed.
>>     wordinfo[word] = record
>> 
>> locks up with record = None (and word = electronics, but I doubt
>> that's relevant :-))
>
>Um?  That code says:
>
>>             if record is None:
>>                 record = self.WordInfoClass(now)
>> 
>>             if is_spam:
>>                 record.spamcount += 1
>>             else:
>>                 record.hamcount += 1
>>             # Needed to tell a persistent DB that the content changed.
>>             wordinfo[word] = record
>
>So by the time it gets to the offending line, record can't be None...
>
>Anyhow, I'm using a pickle rather than a DBM, which might explain why I
>haven't seen your problem in the same way you have.  Does Paul's
>description ring any bells with people who use/maintain the DBM code?  I'll
>start running the code with a DBM and see what happens.
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From neale@woozle.org  Sun Nov 24 00:01:37 2002
From: neale@woozle.org (Neale Pickett)
Date: 23 Nov 2002 16:01:37 -0800
Subject: [Spambayes] merging tomorrow--big code changes
Message-ID: <w53y97ju97i.fsf@woozle.org>

Tomorrow I'm going to merge the hammie-playground branch back in to
HEAD.  A lot of stuff has changed and I'm late for a very important date
so I can't go over it now.  Tim Stone and I have tried to make sure
everything still works, but a few of the more noticeable changes will
be:

* You'll have to re-train your databases (pickles or dbms)
* some pop3proxy and hammie options have new names in the
  bayescustomize.ini file

I'll post a more detailed explanation of what will change later tonight
when I get back, but I wanted to give interested parties a heads up as
soon as possible.  If you want to check for yourself, you can use

  cvs update -r hammie-playground

to try all the new stuff (back up your trained databases first though).

Neale


From jeremy@alum.mit.edu  Sun Nov 24 00:16:25 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Sat, 23 Nov 2002 19:16:25 -0500
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <w53y97ju97i.fsf@woozle.org>
References: <w53y97ju97i.fsf@woozle.org>
Message-ID: <15840.6745.568509.850650@slothrop.zope.com>

May I suggest changing the name of the persistent module?  I'm just
trying to avoid confusion with the ZODB-based persistence approach.
In that world, Persistence and Persistent mean something quite
different than a class with load() and store() methods.  The training
stuff also seems independent of the load/store part.

Perhaps there should just be a storage module with PickledClassifier
and DBDictClassifier.  The classify() method could be moved into
classifier.Classifier.  (This seems like the natural place.)  Then the
abstract class isn't really necessary, since it only defines two
methods that must be overridden anyway.
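
A rough sketch of the shape being suggested, for the pickle case only.
The Classifier base class here is an empty stand-in, and the attribute
names, the pickled tuple layout and the file name are assumptions made
for illustration, not the branch's actual code.

```python
import os
import pickle
import tempfile


class Classifier:
    """Empty stand-in for classifier.Classifier."""

    def __init__(self):
        self.wordinfo = {}
        self.nspam = 0
        self.nham = 0


class PickledClassifier(Classifier):
    """Classifier whose state is loaded from and stored to a pickle file."""

    def __init__(self, path):
        super().__init__()
        self.path = path
        self.load()

    def load(self):
        # A missing file just means "start with an empty classifier".
        if os.path.exists(self.path):
            with open(self.path, "rb") as f:
                self.wordinfo, self.nspam, self.nham = pickle.load(f)

    def store(self):
        with open(self.path, "wb") as f:
            pickle.dump((self.wordinfo, self.nspam, self.nham), f)


path = os.path.join(tempfile.mkdtemp(), "hammie.pck")  # hypothetical name
first = PickledClassifier(path)
first.wordinfo["viagra"] = (3, 0)
first.nspam = 3
first.store()
second = PickledClassifier(path)
print(second.wordinfo, second.nspam)
```

With this layout no separate abstract class is needed: each storage
class simply supplies its own load() and store().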

Jeremy


From tim.one@comcast.net  Sun Nov 24 03:11:58 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sat, 23 Nov 2002 22:11:58 -0500
Subject: [Spambayes] Spambayes articles for the Linux Journal
In-Reply-To: <0nsvtucqpudd8411bfel9cc8so6gjhs6oi@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEPOCOAB.tim.one@comcast.net>

[Richie Hindle]
> Exciting news for Spambayes: Don Marti has asked Gary Robinson and I
> to write two articles about the project for the Linux Journal.

Excellent!  I'm very glad you're doing this!

> Gary Robinson is going to cover the math side of things, and I'm going
> to cover the history of the project and how to use the software.
>
> So if anyone's thinking "He'd better mention X, or he won't be doing it
> right" then now's the time to say so!

The most important algorithmic thing that's likely to get overlooked is
tokenization:  overall, we got more good out of tokenizing different things
in different ways than from anything else.  For example, the first thing I
added to Graham's scheme was special tokenization of embedded URLs, and that
instantly cut the FN rate in half (indeed, it's still the single biggest win
we ever got!).  The other odd part of tokenization is how hard it works to
*blind* the classifier to things that would create unhelpfully correlated
clues.  That's why we strip HTML decorations, and ignore most header lines
by default, although both have their downsides too.

The importance of a sound testing framework should go without saying, of
course <wink>.
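
To give a flavour of what special tokenization of embedded URLs can look
like, here is a small sketch.  It is not the code in tokenizer.py; the
URL_RE pattern, the "proto:" and "url:" token prefixes and the splitting
rules are assumptions chosen for the example.

```python
import re

# Protocol, host, and the rest of the URL up to whitespace or a quote.
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"'/]+)([^\s<>\"']*)",
                    re.IGNORECASE)


def tokenize(text):
    """Yield special tokens for embedded URLs, then plain word tokens."""
    for match in URL_RE.finditer(text):
        proto, host, path = match.groups()
        yield "proto:" + proto.lower()
        for piece in host.lower().split("."):
            yield "url:" + piece
        for piece in re.split(r"[/?&=:;.]+", path):
            if piece:
                yield "url:" + piece.lower()
    # Drop the URLs so their pieces are not also counted as ordinary words.
    for word in URL_RE.sub(" ", text).split():
        yield word.lower()


tokens = list(tokenize("Click http://www.Example.com/offer?id=7 now"))
print(tokens)
```

The point is that a URL's pieces become clues in their own right,
wherever the URL happens to appear in the message.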


From tim.one@comcast.net  Sun Nov 24 08:01:08 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 24 Nov 2002 03:01:08 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <NON5ZOM5Z832141EDZY83IMKFBNLNI.3de003e8@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEBACPAB.tim.one@comcast.net>

[Tim Stone]
> ...
> Another tag that is probably huge and worthless is
> <script>...</script>, often couched in a huge comment.  (But do
> scripts even occur in emailed html?)

Yes, and especially in spam.  The mere presence of <script and/or </script
generates tokens now (see virus_re).

> We should probably use another cheap ass gimmick to get rid of those
> tags,

I checked in code to get rid of <style and <!-- gimmicks in a different way.
Leaving <script> guts alone allows the classifier to see common bits of spam
script, so it's probably helpful to leave those bits alone.

> then use the cheap ass regex to get rid of the rest of the html.

Yup.

> One other problem with the regex that I see is that it doesn't
> seem to handle tags with ill placed whitespace very well... like < a
> href=...

It doesn't handle them at all, and intentionally not, because it has no idea
whether it's even looking at HTML.  Note that the most likely values
attached to href= are picked up anyway, though (scanning for embedded URLs
is done without regard to context -- whether attached to href= or src= or
just sitting in plain text or whatever, we tokenize 'em).  The special tags
we look for (like <script> and <iframe>) do allow for leading whitespace,
because there's scant chance those will match other kinds of text by
accident.

> A whitespace normalization substitution regex might be well advised.
> Taking out whitespace after a < would change a < b to a <b, not
> altering its meaning from a clue perspective, and would change <  a
> href=... to <a href=..., making it recognizable to the cheap-ass
> gimmick regex.

This very email shows why that's not advisable:  it would pick up accidental
instances of "<" and consider them to be "HTML tags" ending with one of the
">"s I'm using to quote your text; or, for example, deleting everything from
your

    like < a

thru my

    (like <script>

I'm afraid real parsing can't be done by a cheap-ass gimmick.
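
The effect is easy to reproduce.  The sketch below uses a stand-in tag
pattern that, like the behaviour described above, refuses to match when
"<" is followed by whitespace; TAG_RE and NORMALIZE_RE are assumptions
for the demonstration, not the patterns in tokenizer.py.

```python
import re

# Stand-in tag stripper: no match when "<" is followed by whitespace.
TAG_RE = re.compile(r"<(?!\s)[^>]*>")
# The proposed normalization: drop whitespace right after "<".
NORMALIZE_RE = re.compile(r"<\s+")

quoted_mail = (
    "> seem to handle tags with ill placed whitespace very well... like < a\n"
    "> href=...\n"
    "\n"
    "It doesn't handle them at all.\n"
)

plain = TAG_RE.sub(" ", quoted_mail)
normalized = TAG_RE.sub(" ", NORMALIZE_RE.sub("<", quoted_mail))

print(plain == quoted_mail)   # nothing is stripped without normalization
print(normalized)             # "< a" plus the next quote marker are gone
```

After normalization the "<" pairs up with the ">" that starts the next
quoted line, so ordinary text is swallowed as if it were a tag.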

> There was some talk earlier about gleaning clues from some tags, like
> background, font, color, etc. kind of things... any more thought along
> those lines?

Not here -- it wouldn't catch any spam in any of my test data that isn't
already getting caught.  The only white-on-white Unsure I've seen would have
been called spam instead then, but that would require more than just noting
which color and background values were being used (since I've only seen this
once, they would be hapaxes, unique to that single spam).

Real parsing is probably inevitable someday, though.  It's too easy to fool
a cheap-ass gimmick.  But for now, almost nothing does.  BTW, real parsing
is much harder than just using a real parser <0.9 wink>, because so much
HTML is ill-formed.


From tim.one@comcast.net  Sun Nov 24 08:06:33 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 24 Nov 2002 03:06:33 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <r900uu8monvsidlbcfgg21371r6phknb92@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEBBCPAB.tim.one@comcast.net>

[Richie Hindle]
> I'm using it from Python, but (currently) only in a relatively
> unimportant feature.  I wouldn't call it worth changing for the
> sake of the hovertips, but it's definitely worth changing to fix
> François' stack explosion.

That should be fixed now.


From richie@entrian.com  Sun Nov 24 09:47:26 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 24 Nov 2002 09:47:26 +0000
Subject: [Spambayes] Spambayes articles for the Linux Journal
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEPOCOAB.tim.one@comcast.net>
References: <0nsvtucqpudd8411bfel9cc8so6gjhs6oi@4ax.com>
	<LNBBLJKPBEHFEDALKOLCMEPOCOAB.tim.one@comcast.net>
Message-ID: <5t71uukc82pj8q4ukheniu2at093i2d8e7@4ax.com>


[Richie Hindle]
> So if anyone's thinking "He'd better mention X, or he won't be doing it
> right" then now's the time to say so!

[Tim Peters]
> The most important algorithmic thing that's likely to get overlooked is
> tokenization

[...and...]
> The importance of a sound testing framework should go without saying, of
> course <wink>.

These are the two things that differentiate this project most from all the
others that sprang up as a result of Paul Graham's article (with Python's
ability for rapid prototyping coming a close third), so I'll definitely
talk about them, yes.

-- 
Richie Hindle
richie@entrian.com


From neale@woozle.org  Sun Nov 24 09:48:01 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 01:48:01 -0800
Subject: [Spambayes] detailed differences between branches
Message-ID: <w53u1i7ti26.fsf@woozle.org>

In case anyone was wondering, the protagonist of the movie won.

So here's what I think will be different--what will change when I
merge.  I'm guessing on a few things here because I'm really sleepy and
I don't know what all these modules do.  When in doubt, I'm assuming
what's in HEAD is newer than what's in the playground branch.

Tim, care to comment on Corpus.py, FileCorpus.py, and CostCounter.py?

Options.py
 * No more show_best_discriminators
 * persistent_storage_file becomes pop3proxy_persistent_storage_file and
   hammiefilter_persistent_storage_file
 * ditto for persistent_use_database

Tester.py
 * Changed some calls to new classifier.Classifier arguments

classifier.py
 * PICKLE_VERSION is now 4
 * WordInfo stores only spamcount, hamcount
 * New MetaInfo class stores nspam, nham
 * class name change: Bayes -> Classifier
 * Classifier class computes spamprob on demand and caches result.  This
   means you don't need to call update_probabilities() anymore.  But
   it's still there if you really really want to call it.
 * As a result, the third argument to Classifier.learn() has been
   removed.
 * New Classifier.probability() method takes a record (WordInfo object)
   as an argument and returns the probability for that record.  This is
   what does the caching.
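
For readers who haven't looked at the branch, the on-demand idea can be
sketched as follows.  This is a simplified illustration, not the merged
code: the probability formula is a bare spam/ham ratio, and the
_probcache attribute and its (spamcount, hamcount) key are assumptions of
the sketch.

```python
class WordInfo:
    """Per-word record holding only the two counts."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


class Classifier:
    def __init__(self):
        self.nspam = 0
        self.nham = 0
        self.wordinfo = {}
        self._probcache = {}        # (spamcount, hamcount) -> probability

    def learn(self, words, is_spam):
        """Two arguments only: no 'update probabilities' flag is needed."""
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for word in set(words):
            record = self.wordinfo.setdefault(word, WordInfo())
            if is_spam:
                record.spamcount += 1
            else:
                record.hamcount += 1
        self._probcache.clear()     # totals changed, cached values are stale

    def probability(self, record):
        """Compute the spam probability for a record on demand, and cache it."""
        key = (record.spamcount, record.hamcount)
        if key not in self._probcache:
            spamratio = record.spamcount / self.nspam if self.nspam else 0.0
            hamratio = record.hamcount / self.nham if self.nham else 0.0
            total = spamratio + hamratio
            self._probcache[key] = spamratio / total if total else 0.5
        return self._probcache[key]


clf = Classifier()
clf.learn(["viagra", "hello"], is_spam=True)
clf.learn(["hello"], is_spam=False)
p_spammy = clf.probability(clf.wordinfo["viagra"])
p_neutral = clf.probability(clf.wordinfo["hello"])
print(p_spammy, p_neutral)
```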

dbdict.py
 * DBDict class moved out of hammie.py and into here
 * We know it's ugly right now.

hammie.py
 * Just contains the Hammie class now.  If you try to run this as
   __main__, it will run main() from hammiebulk.py for compatibility.
 * Also has a handy open() function to open a database or pickle and
   return a Hammie object.

hammiefilter.py:
 * functions moved into HammieFilter class
 * default db is "~/.hammiedb".  I don't know what's reasonable for the
   Windows camp, so I'm leaving it up to you folks to figure that out.

pop3proxy.py:
 * Uses Classifier class instead of Hammie class

hammiebulk.py:
 * The canonical executable Hammie front-end.  Now does dishes!
 * You must pass in -D or -d, there is no default.


Man.  I've got my work cut out for me.

'Night

Neale

From richie@entrian.com  Sun Nov 24 09:53:01 2002
From: richie@entrian.com (Richie Hindle)
Date: Sun, 24 Nov 2002 09:53:01 +0000
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <w53y97ju97i.fsf@woozle.org>
References: <w53y97ju97i.fsf@woozle.org>
Message-ID: <0281uuocck6494jibulp22s6m1vu704f12@4ax.com>


[Neale]
> A lot of stuff has changed

Maybe now would be a good time to change the default name of the
classification header?  Since we're going to be getting some publicity
soon, through the Linux Journal articles and the spam conference, we ought
to get it right.  As far as I can remember, the alternatives are:

1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
2. X-Spambayes-Classification: Spam/Ham/Unsure
3. X-Ham-Status: Yes/No/Unsure

Have I forgotten any?

Can we have a vote?
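
To make option 2 concrete, here is a sketch of tagging a message with the
standard email package.  The classify_label() and add_header() helpers
and the 0.20 / 0.90 cutoffs are assumptions for the example, not the
project's code or settings.

```python
from email import message_from_string

HEADER_NAME = "X-Spambayes-Classification"


def classify_label(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Map a 0..1 spam score onto Ham/Unsure/Spam (cutoffs are assumed)."""
    if score < ham_cutoff:
        return "Ham"
    if score >= spam_cutoff:
        return "Spam"
    return "Unsure"


def add_header(raw_message, score):
    """Parse a message and set the classification header on it."""
    msg = message_from_string(raw_message)
    del msg[HEADER_NAME]            # drop any stale copy; no error if absent
    msg[HEADER_NAME] = classify_label(score)
    return msg


tagged = add_header("Subject: hi\n\nbody\n", 0.97)
print(tagged[HEADER_NAME])
```

A mail filter then only has to match one header against three fixed
words.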

-- 
Richie Hindle
richie@entrian.com


From lists@morpheus.demon.co.uk  Sun Nov 24 15:25:31 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 24 Nov 2002 15:25:31 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: 
	<16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
	<n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
	<u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>
Message-ID: <n2m-g.8yzjxa50.fsf@morpheus.demon.co.uk>

Richie Hindle <richie@entrian.com> writes:

>> locks up with record = None (and word = electronics, but I doubt
>> that's relevant :-))
>
> Um?  That code says:
>
>>             if record is None:
>>                 record = self.WordInfoClass(now)
>> 
>>             if is_spam:
>>                 record.spamcount += 1
>>             else:
>>                 record.hamcount += 1
>>             # Needed to tell a persistent DB that the content changed.
>>             wordinfo[word] = record
>
> So by the time it gets to the offending line, record can't be None...

I know, but I did do "print record" just at that line (or so I thought
- I took out the tracing once I'd got as far as I did, so I don't have
evidence any more :-()

Tried it again, and it locked on

>>> print word, record
mac WordInfo'(1038151445.245, 1, 0, 0, 0.5)'

Don't know what this implies, though...

Paul.
-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Sun Nov 24 17:35:07 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 24 Nov 2002 17:35:07 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: 
	<16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
	<n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
	<u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>
Message-ID: <n2m-g.3cpqyipg.fsf@morpheus.demon.co.uk>

Richie Hindle <richie@entrian.com> writes:

> Um?  That code says:
>
>>             if record is None:
>>                 record = self.WordInfoClass(now)
>> 
>>             if is_spam:
>>                 record.spamcount += 1
>>             else:
>>                 record.hamcount += 1
>>             # Needed to tell a persistent DB that the content changed.
>>             wordinfo[word] = record
>
> So by the time it gets to the offending line, record can't be None...

Hmm. I just did some tracing by using Python's trace hooks (neat
trick, although it does produce a lot of output...) It stops in
hammie.py at line 130, which is in DBDict.__setitem__ doing
self.hash[key] = v. That's mildly worrying, as it sort of implies that
the lockup is in the DBM C extension. (If any more Python code was
being called, I'd expect to see trace output).

Any DBM experts in the house?

Paul.

PS The tracing code I used was as follows. It's quite a neat trick, as
   it doesn't need any changes to the source at all - and by playing
   with the trace hook, you can even dump local variable values if you
   need to...

import sys

def tracer(frame, event, arg):
    if event == 'call':
        return tracer
    if event == 'line':
        print frame.f_code.co_filename, frame.f_lineno
        return tracer

sys.settrace(tracer)
sys.argv = ["pop3proxy.py", "-d", "-l", "8110", "localhost"]
execfile(sys.argv[0])
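
The same trick in a self-contained form that also runs on Python 3 (where
print is a function and execfile() is gone).  This is a sketch only:
make_tracer() and demo() are made-up names, and it records lines in a
list instead of tracing pop3proxy.py.

```python
import sys


def make_tracer(log):
    """Build a trace function that records (function name, line number)."""
    def tracer(frame, event, arg):
        if event == "line":
            log.append((frame.f_code.co_name, frame.f_lineno))
        # Returning the tracer keeps line tracing on for this frame.
        return tracer
    return tracer


def demo():
    total = 0
    for i in range(3):
        total += i
    return total


log = []
sys.settrace(make_tracer(log))
try:
    result = demo()
finally:
    sys.settrace(None)              # always switch tracing off again

print(result, len(log))
```

If the last recorded line is followed by silence, the hang is in code the
hook cannot see, which is what points at the C extension above.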


-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Sun Nov 24 15:28:17 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 24 Nov 2002 15:28:17 +0000
Subject: [Spambayes] New web training interface for pop3proxy
References: <u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>
	<KH5ZGKJIFRPGCIHF87LHQND2VC95.3de00788@riven>
Message-ID: <n2m-g.65unxa0e.fsf@morpheus.demon.co.uk>

Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

> I wouldn't spend a whole lot of time working this right now, as Neale, Alex, 
> and I are about to merge our stuff in, and quite a bit of this code has at 
> least changed places, if not been altered a bit.  When you see the merge, go 
> ahead and test this stuff again.

You're probably right. As the changes require a database rebuild, and
that fixes the problem, I won't be able to check immediately. But I'll
keep an eye out, and if it doesn't reoccur, I'll assume it's been
fixed.

Paul.
-- 
This signature intentionally left blank

From neale@woozle.org  Sun Nov 24 20:04:32 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 12:04:32 -0800
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <15840.6745.568509.850650@slothrop.zope.com>
References: <w53y97ju97i.fsf@woozle.org>
	<15840.6745.568509.850650@slothrop.zope.com>
Message-ID: <w53hee6u433.fsf@woozle.org>

So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:

> Perhaps there should just be a storage module with PickledClassifier
> and DBDictClassifier.  The classify() method could be moved into
> classifier.Classifier.  (This seems like the natural place.)  Then the
> abstract class isn't really necessary, since it only defines two
> methods that must be overridden anyway.

That sounds reasonable to me, and it sounds like Tim S doesn't have a
problem with it either.  I'll go ahead and do that after my chores
today.

From neale@woozle.org  Sun Nov 24 20:05:29 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 12:05:29 -0800
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <0281uuocck6494jibulp22s6m1vu704f12@4ax.com>
References: <w53y97ju97i.fsf@woozle.org>
	<0281uuocck6494jibulp22s6m1vu704f12@4ax.com>
Message-ID: <w53d6ouu41i.fsf@woozle.org>

So then, Richie Hindle <richie@entrian.com> is all like:

> 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
> 2. X-Spambayes-Classification: Spam/Ham/Unsure
> 3. X-Ham-Status: Yes/No/Unsure
> 
> Can we have a vote?

I like 2.  I got another message in private expressing support for 2.
Anyone else?

From tim.one@comcast.net  Sun Nov 24 20:18:34 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 24 Nov 2002 15:18:34 -0500
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <w53d6ouu41i.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOECKCPAB.tim.one@comcast.net>

[Neale Pickett, voting on

 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
 2. X-Spambayes-Classification: Spam/Ham/Unsure
 3. X-Ham-Status: Yes/No/Unsure
]

> I like 2.  I got another message in private expressing support for 2.
> Anyone else?

I like #2 best too.  If we instead wanted to be real geeky, make it

    X-Fisher-Inverse-Chi-Combination: Tim/tIm/tiM

If you have to ask, you shouldn't be looking at it <wink>.

From tim@fourstonesExpressions.com  Sun Nov 24 21:03:09 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 24 Nov 2002 15:03:09 -0600
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <w53d6ouu41i.fsf@woozle.org>
Message-ID: <3WLKQOCBICVSRLQP3X6432HGKFMKHDUQ.3de13e8d@riven>

2 for me... - TimS
11/24/2002 2:05:29 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Richie Hindle <richie@entrian.com> is all like:
>
>> 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>> 2. X-Spambayes-Classification: Spam/Ham/Unsure
>> 3. X-Ham-Status: Yes/No/Unsure
>> 
>> Can we have a vote?
>
>I like 2.  I got another message in private expressing support for 2.
>Anyone else?
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From lists@morpheus.demon.co.uk  Sun Nov 24 20:24:09 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Sun, 24 Nov 2002 20:24:09 +0000
Subject: [Spambayes] merging tomorrow--big code changes
References: <w53y97ju97i.fsf@woozle.org>
	<0281uuocck6494jibulp22s6m1vu704f12@4ax.com>
	<w53d6ouu41i.fsf@woozle.org>
Message-ID: <n2m-g.wun2wwba.fsf@morpheus.demon.co.uk>

Neale Pickett <neale@woozle.org> writes:

> So then, Richie Hindle <richie@entrian.com> is all like:
>
>> 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>> 2. X-Spambayes-Classification: Spam/Ham/Unsure
>> 3. X-Ham-Status: Yes/No/Unsure
>> 
>> Can we have a vote?
>
> I like 2.  I got another message in private expressing support for 2.
> Anyone else?

(2) suits me.

Paul.
-- 
This signature intentionally left blank

From popiel@wolfskeep.com  Sun Nov 24 21:11:00 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Sun, 24 Nov 2002 13:11:00 -0800
Subject: [Spambayes] merging tomorrow--big code changes 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "24 Nov 2002 12:05:29 PST." <w53d6ouu41i.fsf@woozle.org> 
References: <w53y97ju97i.fsf@woozle.org>
	<0281uuocck6494jibulp22s6m1vu704f12@4ax.com>  <w53d6ouu41i.fsf@woozle.org> 
Message-ID: <20021124211100.68529F589@cashew.wolfskeep.com>

In message:  <w53d6ouu41i.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>So then, Richie Hindle <richie@entrian.com> is all like:
>
>> 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>> 2. X-Spambayes-Classification: Spam/Ham/Unsure
>> 3. X-Ham-Status: Yes/No/Unsure
>> 
>> Can we have a vote?
>
>I like 2.  I got another message in private expressing support for 2.
>Anyone else?

I'll add my voice to 2.

- Alex (rebuilding his computer and not writing code)

From francois.granger@free.fr  Sun Nov 24 22:07:46 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Sun, 24 Nov 2002 23:07:46 +0100
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOECKCPAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOECKCPAB.tim.one@comcast.net>
Message-ID: <a05100318ba06fda870a8@[192.168.1.11]>

At 15:18 -0500 24/11/02, in message RE: [Spambayes] merging 
tomorrow--big code changes, Tim Peters wrote:
>[Neale Pickett. voting on
>
>  1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>  2. X-Spambayes-Classification: Spam/Ham/Unsure
>  3. X-Ham-Status: Yes/No/Unsure
>]
>
>>  I like 2.  I got another message in private expressing support for 2.
>>  Anyone else?
>
>I like #2 best too.

So do I. Makes it real clear what to look for in your mail filter.

>   If we instead wanted to be real geeky, make it
>
>     X-Fisher-Inverse-Chi-Combination: Tim/tIm/tiM
>
>If you have to ask, you shouldn't be looking at it <wink>.

sorry, I was looking <wink>


I'll be back at testing as soon as the dust settles after the _Big_Change_.
-- 
Electronic mail is a means of communication. People should ask themselves
about the political implications of their choices (or non-choices) of
tools and technologies.
For clean mail: http://minilien.com/?IXZneLoID0 -
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim@fourstonesExpressions.com  Sun Nov 24 22:14:11 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Sun, 24 Nov 2002 16:14:11 -0600
Subject: [Spambayes] merging tomorrow--big code changes
Message-ID: <NHWSQKT387XMIA6E006TPDPNXRGB.3de14f33@riven>

11/24/2002 4:07:46 PM, François Granger <francois.granger@free.fr> wrote:

>At 15:18 -0500 24/11/02, in message RE: [Spambayes] merging 
>tomorrow--big code changes, Tim Peters wrote:
>>[Neale Pickett. voting on
>>
>>  1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>>  2. X-Spambayes-Classification: Spam/Ham/Unsure
>>  3. X-Ham-Status: Yes/No/Unsure
>>]
>>
>>>  I like 2.  I got another message in private expressing support for 2.
>>>  Anyone else?
>>
>>I like #2 best too.
>
>So do I. Makes it real clear what to look for in your mail filter.
>
>>   If we instead wanted to be real geeky, make it
>>
>>     X-Fisher-Inverse-Chi-Combination: Tim/tIm/tiM
>>
>>If you have to ask, you shouldn't be looking at it <wink>.
>
>sorry, I was looking <wink>

How about
X-Phreakish-Fisher-Inverse-Chi-Combination-Translation: Ti|\/|.T1m.T1|\/|

c'est moi... - T1|\/|S
>
>
>i'll be back at testing as soon as dust settle after the _Big_Change_.
>-- 
>Electronic mail is a means of communication. People should ask themselves
>about the political implications of their choices (or non-choices) of
>tools and technologies.
>For clean mail: http://minilien.com/?IXZneLoID0 -
>http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>
- Tim
www.fourstonesExpressions.com 


From mhammond@skippinet.com.au  Sun Nov 24 22:45:44 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 25 Nov 2002 09:45:44 +1100
Subject: [Spambayes] Important information for Outlook users
Message-ID: <LCEPIIGDJPKCOIHOBJEPKECCHOAA.mhammond@skippinet.com.au>

I am about to check in some changes to the way the Spam field is used in the
Outlook add-in.

Specifically, the new code reverts back to writing a floating point number
from 0-1 in the property (rather than the integer used now).  The "Percent"
format in Outlook is used, giving us the same result as now, except we get a
"%" sign next to the Spam score.

This change was made to prevent the user from having to manually add the
field to Outlook - and bizarrely - we simply can't make it work the way we
want with integers.  Users will still need to add the field to each folder's
view, but the field will already exist in the Field Chooser ready to be
dropped in place.

Unfortunately, unless the following procedure is performed, all new email
will show a blank "Spam" score.  The filtering etc will all work correctly,
but the new percentage spam value will not show up in the existing "Spam"
field.

To get everything working as expected:

+ Delete the old spam field from all messages by executing the following
  command:

  C:\> Outlook2000\sandbox\delete_mapi_field.py -f "Folder name" -d Spam

  Execute this for your Inbox, Spam, and Maybe Spam folders.  For example:

    delete_mapi_field.py -f "Inbox" -f "SomeFolder\Spam" -f "Maybe" -d Spam

  Will delete the field from the 3 named folders.

+ Delete the old Spam field from Outlook.  For each of your folders you care
  about:
  * Go to the Outlook field chooser, and select "User Defined Fields".
  * If the "Spam" field exists as a column in your Outlook view, drag it
    back to the Field Chooser window.
  * Select "Delete" in the Field Chooser to delete the field.

+ Restart Outlook.

+ Re-Add the Spam field to your folder.  For each folder:
  * Re-Open the Field Chooser.
  * Select "User Defined Fields".  The "Spam" field should already exist.
  * Drag the field to wherever you want!

+ Re-Filter: If you want to see the scores for your existing mail,
  just use the "Filter Now" option (with "Score only" selected)

And that, finally, is it.  There is no need to re-train.  NOTE: If any of
the folders have no messages, the field will *not* be created as Outlook
starts.  *sigh* - I just thought of that, and don't care enough to fix it
now :(

Until you perform the above procedure, you will also see additional messages
in the trace window as Outlook starts:

...
Warning: failed to create the Outlook user-property in folder 'Spam'
 (-2147352567, 'Exception occurred.', (4096, 'Microsoft Outlook', 'A custom
field with this name but a different data type already exists. Enter a
different name.', None, 0, -2147467259), None)
 This is probably because the code has recently been changed, but it will
 have no effect on the filtering or scoring.
...

This is simply a reflection of the property type change, and again, has no
effect on the operation of the plugin, just on the display of the field.

Let me know if there are any problems, and/or just create bugs at
source-forge and assign them to me!

Mark.


From papaDoc@videotron.ca  Sun Nov 24 22:57:06 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Sun, 24 Nov 2002 17:57:06 -0500
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <NHWSQKT387XMIA6E006TPDPNXRGB.3de14f33@riven>
References: <NHWSQKT387XMIA6E006TPDPNXRGB.3de14f33@riven>
Message-ID: <1038178626.1007.3.camel@porsche>

Hi,

>  1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
>  2. X-Spambayes-Classification: Spam/Ham/Unsure
>  3. X-Ham-Status: Yes/No/Unsure
>

I also prefer #2 because last week when I started using pop3proxy at my
job I was confused by the "X-Hammie-Disposition: Yes" I was not sure
if this was meaning the message was Spam or Ham. Maybe it is because
my mother thong is not English.

So for me the #2 or #3 is obvious but I prefer #2 since this gives a
hint about the software used.


From neale@woozle.org  Mon Nov 25 02:48:17 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 18:48:17 -0800
Subject: [Spambayes] branches merged, flame away ;)
Message-ID: <w53isymfjpq.fsf@woozle.org>

No, seriously, I think everything will continue to work the way it was
before, except that you'll have to rebuild all your database.  I may
have missed something though.  Please let me know if something starts
breaking after your next cvs update.

Also, it might not be a bad idea to back up your database before you
update, too.  Just in case I did something terribly, terribly wrong.
There's a reason I don't write the control code for guided missiles ;)

Neale

From papaDoc@videotron.ca  Mon Nov 25 00:01:19 2002
From: papaDoc@videotron.ca (Remi Ricard)
Date: Sun, 24 Nov 2002 19:01:19 -0500
Subject: [Spambayes] merging tomorrow--big code changes
Message-ID: <1038182479.1040.11.camel@porsche>

Hi Neale,

>
> > I also prefer #2 because last week when I started using pop3proxy at
> > my job I was confused by the "X-Hammie-Disposition: Yes" I was not
> > sure if this was meaning the message was Spam or Ham. Maybe it is
> > because my mother thong is not English.
>
> Hi Remi.
>
> I'm trying to teach myself French.  From one student to another, I
> have to tell you that the word "thong", in English, is a type of very,
> very small women's underwear.  I think you meant to say "mother
> tongue" :)

You know now why I want to get rid of Spam. This garbage confuses me ;-)

>
> In any case, I'm going to change it to #2: "X-Spambayes-Classification:
> Spam/Ham/Unsure".  Thanks for using spambayes!
>
> Neale

spambayes@python.org


From noreply@sourceforge.net  Mon Nov 25 00:41:38 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Sun, 24 Nov 2002 16:41:38 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-643306 ] tests on fixed training and test set
Message-ID: <E18G7Ji-0001nR-00@sc8-sf-web4.sourceforge.net>

Patches item #643306, was opened at 2002-11-25 00:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=643306&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Daniel Etzold (etzi)
Assigned to: Nobody/Anonymous (nobody)
Summary: tests on fixed training and test set

Initial Comment:
A new argument for runtest.sh "fixedsets" and a file
"fixedsets.py" are introduced to create just one
classifier from Set1 and test this classifier on the
eMails of Set2.
This is useful when comparing the results of this
project with other methods (SVM, Rocchio, k-Nearest
Neighbour, Naive Bayes) for equal training and test sets.
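
A minimal sketch of the fixed-set evaluation this patch describes: train one
classifier on Set1, then test it once on Set2.  The toy word-vote classifier
and all names below are assumptions made for illustration; they are not the
contents of fixedsets.py.

```python
class WordVoteClassifier:
    """Toy classifier: a word votes for the class it was seen in."""

    def __init__(self):
        self.spam_words = set()
        self.ham_words = set()

    def learn(self, text, is_spam):
        target = self.spam_words if is_spam else self.ham_words
        target.update(text.lower().split())

    def is_spam(self, text):
        words = text.lower().split()
        spam_votes = sum(w in self.spam_words for w in words)
        ham_votes = sum(w in self.ham_words for w in words)
        return spam_votes > ham_votes


def evaluate_fixed_sets(train, test, make_classifier):
    """Train once on `train`, test once on `test`, and count the errors."""
    clf = make_classifier()
    for text, is_spam in train:
        clf.learn(text, is_spam)
    false_pos = false_neg = 0
    for text, is_spam in test:
        predicted = clf.is_spam(text)
        if predicted and not is_spam:
            false_pos += 1
        elif is_spam and not predicted:
            false_neg += 1
    return {"false_positives": false_pos,
            "false_negatives": false_neg,
            "total": len(test)}


set1 = [("buy cheap pills", True), ("meeting at noon", False)]
set2 = [("cheap pills now", True), ("noon meeting moved", False),
        ("pills meeting", True)]
print(evaluate_fixed_sets(set1, set2, WordVoteClassifier))
```

Because both sets are fixed, any other method handed the same two lists can
be compared on the same error counts.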

----------------------------------------------------------------------


From mhammond@skippinet.com.au  Mon Nov 25 03:48:32 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 25 Nov 2002 14:48:32 +1100
Subject: [Spambayes] branches merged, flame away ;)
In-Reply-To: <w53isymfjpq.fsf@woozle.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIECPHOAA.mhammond@skippinet.com.au>

> No, seriously, I think everything will continue to work the way it was
> before, except that you'll have to rebuild all your database.  I may
> have missed something though.  Please let me know if something starts
> breaking after your next cvs update.
>
> Also, it might not be a bad idea to back up your database before you
> update, too.  Just in case I did something terribly, terribly wrong.
> There's a reason I don't write the control code for guided missiles ;)

This has broken the Outlook client in a number of ways.  Some are obvious,
but this one is not:

Traceback (most recent call last):
...
  File "F:\src\spambayes\Outlook2000\manager.py", line 274, in score
    result = self.bayes.spamprob(bayes_tokenize(email), evidence)
  File "F:\src\spambayes\classifier.py", line 271, in chi2_spamprob
    clues = self._getclues(wordstream)
  File "F:\src\spambayes\classifier.py", line 525, in _getclues
    prob = self.probability(record)
  File "F:\src\spambayes\classifier.py", line 356, in probability
    return self.probcache[spamcount][hamcount]
exceptions.AttributeError: Classifier instance has no attribute 'probcache'

Any clues?

Mark.


From neale@woozle.org  Mon Nov 25 04:13:18 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 20:13:18 -0800
Subject: [Spambayes] branches merged, flame away ;)
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPIECPHOAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPIECPHOAA.mhammond@skippinet.com.au>
Message-ID: <w53el9affs1.fsf@woozle.org>

So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:

> This has broken the Outlook client in a number of ways.  Some are obvious,
> but this one is not:

> exceptions.AttributeError: Classifier instance has no attribute 'probcache'

Ah, I wasn't initializing probcache when the Classifier method was
created.  I've just checked something in that should fix that--let me
know how it goes for you.
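
The kind of fix described above can be sketched as follows.  This is only an
assumption about its shape, since the actual check-in is not shown in this
thread: the cache is created when the instance is, so later lookups never hit
a missing attribute.  The probability formula is a placeholder, not the
project's maths.

```python
class Classifier:
    """Toy stand-in for a classifier with a per-count probability cache."""

    def __init__(self):
        # Creating the cache here guarantees every instance has the
        # attribute that the traceback reported as missing.
        self.probcache = {}

    def probability(self, spamcount, hamcount):
        """Return the cached value, computing and storing it on first use."""
        try:
            return self.probcache[spamcount][hamcount]
        except KeyError:
            # Placeholder formula (an assumption for illustration only).
            prob = (spamcount + 1.0) / (spamcount + hamcount + 2.0)
            self.probcache.setdefault(spamcount, {})[hamcount] = prob
            return prob


c = Classifier()
print(c.probability(3, 1))
```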

Sorry about the breakage, Mark.  I don't have a Windows box to test with
and arrogantly assumed that if hammie and pop3proxy were okay, the Outlook
client would be too :-\

Neale

From neale@woozle.org  Mon Nov 25 04:25:46 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 20:25:46 -0800
Subject: [Spambayes] header name has changed
Message-ID: <w5365umff79.fsf@woozle.org>

So I went ahead and changed it to:

  X-Spambayes-Classification: spam/ham/unsure

Because 100% of the votes were for that :)
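
For anyone filtering on the new header, here is a small sketch of how a score
might be turned into that header value.  The cutoff values and function names
are hypothetical (the thread does not state the real thresholds); only the
header name and its three values come from the message above.

```python
from email.message import Message

# Hypothetical cutoffs, chosen only for illustration.
HAM_CUTOFF = 0.20
SPAM_CUTOFF = 0.90

HEADER_NAME = "X-Spambayes-Classification"


def classify(score, ham_cutoff=HAM_CUTOFF, spam_cutoff=SPAM_CUTOFF):
    """Map a spam probability in [0, 1] to 'ham', 'unsure' or 'spam'."""
    if score < ham_cutoff:
        return "ham"
    if score > spam_cutoff:
        return "spam"
    return "unsure"


def tag_message(msg, score):
    """Set the classification header, replacing any previous value."""
    del msg[HEADER_NAME]  # no error if the header is absent
    msg[HEADER_NAME] = classify(score)
    return msg


msg = Message()
msg["Subject"] = "hello"
tag_message(msg, 0.97)
print(msg[HEADER_NAME])
```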


From neale@woozle.org  Mon Nov 25 04:34:16 2002
From: neale@woozle.org (Neale Pickett)
Date: 24 Nov 2002 20:34:16 -0800
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <n2m-g.3cpqyipg.fsf@morpheus.demon.co.uk>
References: <16E1010E4581B049ABC51D4975CEDB88619949@UKDCX001.uk.int.atosorigin.com>
	<n2m-g.wun4i8w3.fsf@morpheus.demon.co.uk>
	<u9uvtu4i39142m68an94a2fveef7g5tlli@4ax.com>
	<n2m-g.3cpqyipg.fsf@morpheus.demon.co.uk>
Message-ID: <w53wun2e08n.fsf@woozle.org>

So then, Paul Moore <lists@morpheus.demon.co.uk> is all like:

> Hmm. I just did some tracing by using Python's trace hooks (neat
> trick, although it does produce a lot of output...) It stops in
> hammie.py at line 130, which is in DBDict.__setitem__ doing
> self.hash[key] = v. That's mildly worrying, as it sort of implies that
> the lockup is in the DBM C extension. (If any more Python code was
> being called, I'd expect to see trace output).

That *is* distressing, and does sound like a problem in the dbm code.  I
know little about that code, but if it happens again let me know and
I'll start learning ;)
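
For reference, the trace-hook trick Paul mentions can be sketched with
sys.settrace.  The helper and the traced function below are made up for
illustration.  The point is that only Python-level calls and lines generate
events, so if the last event recorded is a line that calls into a C
extension, the hang is probably inside that extension.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func under a trace hook; return (result, recorded events)."""
    events = []

    def tracer(frame, event, arg):
        if event in ("call", "line"):
            events.append((event, frame.f_code.co_name, frame.f_lineno))
        return tracer  # keep tracing inside this frame

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)  # always remove the hook
    return result, events


def store(table, key, value):
    table[key] = value  # dict assignment runs in C: no further events
    return len(table)


result, events = trace_calls(store, {}, "word", 1)
for event in events:
    print(event)
```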

Neale

From tim.one@comcast.net  Mon Nov 25 04:51:45 2002
From: tim.one@comcast.net (Tim Peters)
Date: Sun, 24 Nov 2002 23:51:45 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <w53wun2e08n.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEEPCPAB.tim.one@comcast.net>

[Paul Moore]
>> Hmm. I just did some tracing by using Python's trace hooks (neat
>> trick, although it does produce a lot of output...) It stops in
>> hammie.py at line 130, which is in DBDict.__setitem__ doing
>> self.hash[key] = v. That's mildly worrying, as it sort of implies that
>> the lockup is in the DBM C extension. (If any more Python code was
>> being called, I'd expect to see trace output).

[Neale Pickett]
> That *is* distressing, and does sound like a problem in the dbm code.
> I know little about that code, but if it happens again let me know and
> I'll start learning ;)

IIRC, Paul is running on Windows.  If so, and he's picking up the ancient
Berkeley 1.85 bsddb that ships with Windows Python, that's got lots of
severe problems that will never be fixed:

    http://www.sleepycat.com/historic.html

Python 2.3 will ship with a much more modern version of bsddb.  An early
taste of that can be gotten now via

    http://sf.net/projects/pybsddb/

That project is essentially getting folded into 2.3.


From vanhorn@whidbey.com  Mon Nov 25 06:05:05 2002
From: vanhorn@whidbey.com (G. Armour Van Horn)
Date: Sun, 24 Nov 2002 22:05:05 -0800
Subject: [Spambayes] header name has changed
References: <w5365umff79.fsf@woozle.org>
Message-ID: <3DE1BD91.EE996549@whidbey.com>

Well, I almost wrote to vote for the spelling challenge we discussed a
while back,
    X-Spambayes-Judgement: spam|ham|unsure
but decided that all of us that were going to get the joke had already
gotten it.

Van

Neale Pickett wrote:

> So I went ahead and changed it to:
>
>   X-Spambayes-Classification: spam/ham/unsure
>
> Because 100% of the votes were for that :)

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------


From mhammond@skippinet.com.au  Mon Nov 25 06:09:20 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 25 Nov 2002 17:09:20 +1100
Subject: [Spambayes] branches merged, flame away ;)
In-Reply-To: <w53el9affs1.fsf@woozle.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEDGHOAA.mhammond@skippinet.com.au>

> From: Neale Pickett [mailto:neale@woozle.org]
> Ah, I wasn't initializing probcache when the Classifier method was
> created.  I've just checked something in that should fix that--let me
> know how it goes for you.

OK thanks - all seems to work now.

Mark.

From Paul.Moore@atosorigin.com  Mon Nov 25 11:00:23 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 11:00:23 -0000
Subject: [Spambayes] Important information for Outlook users
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619953@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> Let me know if there are any problems, and/or just create
> bugs at source-forge and assign them to me!

This isn't working properly for me. I'm getting strange,
inconsistent results. When I try to delete the field, the script
says it's done it, but then if I run it again, it says "deleting 1
field instance from Outlook" again. Sometimes, not always :-(

When I start Outlook, the field appears - with type "Combination".
Tried again, got type "Number". Even when I deleted and recreated
the field as a Percentage type, I looked again a little later and
it was "Number" again.

I've reregistered the addin, rebuilt the databases, refiltered
everything, deleted all pyc files. Everything I can think of.

I'm sorry this is such a vague report - I can't get consistent
behaviour from it (and I've spent far too long on this today, so
I'm going to have to give up now). I'll try again when I have some
more time.

Paul.

From Paul.Moore@atosorigin.com  Mon Nov 25 11:04:21 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 11:04:21 -0000
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E10@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
> IIRC, Paul is running on Windows.  If so, and he's picking up
> the ancient Berkeley 1.85 bsddb that ships with Windows Python,
> that's got lots of severe problems that will never be fixed:

That's true. In the meantime, can I upgrade in a way which replaces
the shipped DBM? If I install pybsddb, will it override the standard
library version? If not, can I make it? Or would it be worth
specifically looking for pybsddb, and using that in preference if it
is present?

Paul.

From francois.granger@free.fr  Mon Nov 25 11:21:38 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 25 Nov 2002 12:21:38 +0100
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <1038182479.1040.11.camel@porsche>
Message-ID: <BA07C652.5D2AD%francois.granger@free.fr>

on 25/11/02 1:01, Remi Ricard at papaDoc@videotron.ca wrote:

> Hi Neale,
>>
>>> Maybe it is
>>> because my mother thong is not English.
>>
>> I'm trying to teach myself French.
>
> You know now why I want to get rid of Spam. This garbage confuses me ;-)

La communauté francophone autour de SpamBayes s'agrandit. On va pouvoir
échanger nos spams...
;-)


[en] The French community is growing! We will be able to exchange spam....

-- 
Email is a means of communication. People should ask themselves
about the political implications of their choice (or non-choice)
of tools and technologies. For clean email:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From mhammond@skippinet.com.au  Mon Nov 25 11:31:11 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Mon, 25 Nov 2002 22:31:11 +1100
Subject: [Spambayes] Important information for Outlook users
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619953@UKDCX001.uk.int.atosorigin.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPIEEDHOAA.mhammond@skippinet.com.au>

> From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> > Let me know if there are any problems, and/or just create
> > bugs at source-forge and assign them to me!
>
> This isn't working properly for me. I'm getting strange,

When you say "isn't working", do you mean only the field display, or the
whole plugin?

> inconsistent results. When I try to delete the field, the script
> says it's done it, but then if I run it again, it says "deleting 1
> field instance from Outlook" again. Sometimes, not always :-(

The "1 field" created via Outlook is the field that the plugin creates in an
attempt to have all this work <wink>.  When the plugin starts watching a
folder, it selects the first message in the folder, and creates a
UserDefined field called "Spam" in this message.  Doing this should also
create the field in the "Field Chooser".

However, in the process of normal message scoring, Outlook itself is not
used to set the field value.  Hence you will see "1 field from Outlook"
after the plugin has restarted.

> When I start Outlook, the field appears - with type "Combination".
> Tried again, got type "Number". Even when I deleted and recreated
> the field as a Percentage type, I looked again a little later and
> it was "Number" again.
>
> I've reregistered the addin, rebuilt the databases, refiltered
> everything, deleted all pyc files. Everything I can think of.

In all folders you tried, or just the inbox?  Does it work in the Spam or
Maybe folders?

Any exceptions or unusual messages in the trace output?

Thanks,

Mark.


From francois.granger@free.fr  Mon Nov 25 11:34:25 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 25 Nov 2002 12:34:25 +0100
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E10@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BA07C951.5D2B1%francois.granger@free.fr>

on 25/11/02 12:04, Moore, Paul at Paul.Moore@atosorigin.com wrote:

>  Or would it be worth
> specifically looking for pybsddb, and using that in preference if it
> is present?

Since it uses anydbm, you can copy lib/anydbm.py into your spambayes folder
and modify line 51:

_names = ['dbhash', 'gdbm', 'dbm', 'dumbdbm']

and add your preferred dbm at the front of the list. It will be used if it
exists.

_names = ['pybsddb', 'dbhash', 'gdbm', 'dbm', 'dumbdbm']
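
The idea behind that list, trying candidate modules in order and using the
first one that imports, can be sketched as below.  The helper name is
hypothetical and the example uses stand-in module names so that it runs
anywhere; the name a given DBM package actually installs under should be
checked before adding it to such a list.

```python
import importlib


def first_available(names):
    """Import and return the first module in `names` that can be imported."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue  # not installed here, try the next candidate
    raise ImportError("no module available from %r" % (names,))


# Stand-in names: the first does not exist, so the second is used.
module = first_available(["no_such_dbm_module", "json"])
print(module.__name__)
```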

-- 
Email is a means of communication. People should ask themselves
about the political implications of their choice (or non-choice)
of tools and technologies. For clean email:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From Paul.Moore@atosorigin.com  Mon Nov 25 11:36:50 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 11:36:50 -0000
Subject: [Spambayes] Important information for Outlook users
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E12@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> When you say "isn't working", do you mean only the field
> display, or the whole plugin?

Basically the field display (as described) but I got an odd
message when I rescored, claiming (almost) everything was spam.
But the scores assigned didn't bear this out, so I ignored it
as a possible one-off.

> The "1 field" created via Outlook is the field that the plugin creates
> in an attempt to have all this work <wink>. When the plugin starts
> watching a folder, it selects the first message in the folder, and
> creates a UserDefined field called "Spam" in this message. Doing
> this should also create the field in the "Field Chooser"
>
> However, in the process of normal message scoring, Outlook itself is
> not used to set the field value. Hence you will see "1 field from
> Outlook" after the plugin has restarted.

That makes sense. But I'm not sure if it ties in with what I'm seeing.
If I run delete_outlook_field twice in succession, I get the "1 field"
both times. And Outlook wasn't running when I did this.

Wait a minute - delete_outlook_field will start Outlook via COM, which
I guess will start the addin, which...

OK, that's probably not a problem, then.

> > When I start Outlook, the field appears - with type "Combination".
> > Tried again, got type "Number". Even when I deleted and recreated
> > the field as a Percentage type, I looked again a little later and
> > it was "Number" again.
> >
> > I've reregistered the addin, rebuilt the databases, refiltered
> > everything, deleted all pyc files. Everything I can think of.
>
> In all folders you tried, or just the inbox? Does it work in the
> Spam or Maybe folders?

Same problem in all 3 folders.

> Any exceptions or unusual messages in the trace output?

None that I recall, beyond one that appeared because I forgot to
rebuild the database first time (DB format had changed, so the addin
sulked, IIRC).

As I said, no time to do much more now. I'll do a full retest of all
this tomorrow.

Paul.

PS If I could get the addin to work while Exchange was offline, it'd
   be possible to test at home. But given that *Outlook* doesn't work
   properly when offline, it's a bit unfair to expect the addin to do
   so :-)

From msergeant@startechgroup.co.uk  Mon Nov 25 11:55:39 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 25 Nov 2002 11:55:39 +0000
Subject: [Spambayes] anyone going to the spam conference?
References: <w53smxtup6i.fsf@woozle.org>
Message-ID: <3DE20FBB.50601@startechgroup.co.uk>

Neale Pickett said the following on 23/11/02 00:04:
> So, anyone planning on going to Paul Graham's spam conference January in
> Cambridge?  I just got the nod from $FIRM to attend.  An all-expenses
> paid trip to New England in winter!  Woo!
> 
> I'd like to present a talk on behalf of the spambayes project.  If
> anyone else was planning on doing this, please let me know.  There are
> certainly folks who know more than I about how our classifier works, but
> nobody from our project shows up on the speakers list.
> 
> If nobody else steps up to the plate, I'm going to have to ask a lot of
> dumb questions about the thing.  That should be a pretty good motivator!
> <0.9 wink>

I'm going, and I will be mentioning the spambayes project, and talking 
about the algorithms involved.

Look forward to seeing you there!

Matt.


From jm@jmason.org  Mon Nov 25 12:10:53 2002
From: jm@jmason.org (Justin Mason)
Date: Mon, 25 Nov 2002 12:10:53 +0000
Subject: [Spambayes] anyone going to the spam conference? 
In-Reply-To: Message from Matt Sergeant <msergeant@startechgroup.co.uk> 
   of "Mon, 25 Nov 2002 11:55:39 GMT." <3DE20FBB.50601@startechgroup.co.uk> 
Message-ID: <20021125121058.7A0EA16F16@jmason.org>


Matt Sergeant said:

> > So, anyone planning on going to Paul Graham's spam conference January in
> > Cambridge?  I just got the nod from $FIRM to attend.  An all-expenses
> > paid trip to New England in winter!  Woo!

Snow!  (I'm guessing here BTW ;)

I'd love to go, but I think I'm going to be very busy moving halfway
across the world around that date. :( 

Still, Matt can represent SpamAssassin, and there's always another few
SpamAssassin developers who might go...

--j.

From skip@pobox.com  Mon Nov 25 14:18:00 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 25 Nov 2002 08:18:00 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E10@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB885E2E10@UKDCX001.uk.int.atosorigin.com>
Message-ID: <15842.12568.441397.329398@montanaro.dyndns.org>


    Paul> In the meantime, can I upgrade in a way which replaces the shipped
    Paul> DBM?

The bugs in Berkeley DB 1.85 are (primarily?) with its hash file
implementation.  You could simply replace any instances of anydbm.open or
bsddb.hashopen with bsddb.btopen (the btree file implementation).  You'd
have to retrain of course, but that should be simpler than installing
pybsddb.

Skip

From fgranger@teleprosoft.com  Mon Nov 25 13:51:40 2002
From: fgranger@teleprosoft.com (François Granger)
Date: Mon, 25 Nov 2002 14:51:40 +0100
Subject: [Spambayes] Current version
Message-ID: <BA07E97C.5D2C3%fgranger@teleprosoft.com>

I get an error message:

Traceback (most recent call last):
  File "HD:Dev:spambayes:pop3proxy.py", line 121, in ?
    import storage, tokenizer, mboxutils
  File "HD:Dev:spambayes:storage.py", line 52, in ?
    import dbdict
  File "HD:Dev:spambayes:dbdict.py", line 59, in ?
    import dbhash
  File "HD:Python 2.2.2:Lib:dbhash.py", line 5, in ?
    import bsddb
ImportError: No module named bsddb

This module is not available on all platforms. It seems safer to use the
Python resource in the form of anydbm....

And this is inconsistent with the other scripts.
In hammiebulk.py, line 48:
import anydbm

I'll switch to Pickle to do my testing... ;-)

Salutations,
Francois Granger
-- 
fgranger@teleprosoft.com - <http://www.teleprosoft.com>
tel: +33 1 41 88 48 00 - Fax: + 33 1 41 88 48 48


From francois.granger@free.fr  Mon Nov 25 15:39:57 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 25 Nov 2002 16:39:57 +0100
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E13@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BA0802DD.5D2F3%francois.granger@free.fr>

on 25/11/02 12:38, Moore, Paul at Paul.Moore@atosorigin.com wrote:

> From: François Granger [mailto:francois.granger@free.fr]
>>>  Or would it be worth
>>> specifically looking for pybsddb, and using that in preference if it
>>> is present?
>>
>> Since it uses anydbm, you can copy lib/anydbm.py into your spambayes
>> folder and modify line 51:
>
> That looks reasonable, as a workaround at least. I'll try that.

Beware: since then, I have discovered that it uses dbhash explicitly.

I modified:

  dbdict.py lines 59 and 94 to force it to use anydbm.

And this seems to work.
-- 
Email is a means of communication. People should ask themselves
about the political implications of their choice (or non-choice)
of tools and technologies. For clean email:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From Paul.Moore@atosorigin.com  Mon Nov 25 15:42:09 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 15:42:09 -0000
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E1A@UKDCX001.uk.int.atosorigin.com>

From: François Granger [mailto:francois.granger@free.fr]
> Beware: since then, I have discovered that it uses dbhash explicitly.
>
> I modified:
>
>   dbdict.py lines 59 and 94 to force it to use anydbm.
>
> And this seems to work.

I suspect the best answer is to make the DBM implementation
configurable via bayescustomize.ini (with the default being
anydbm).
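
A rough sketch of what such an option could look like, using current Python
module names so it can be run today.  The "[Storage]" section and the
"dbm_module" option are invented for illustration and are not existing
bayescustomize.ini settings; "dbm" here stands in for what this thread calls
anydbm.

```python
import configparser
import importlib

DEFAULT_DBM = "dbm"  # stand-in for anydbm


def load_dbm_module(ini_text):
    """Return the DBM module named in the ini text, or the default."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    # Hypothetical section and option names.
    name = parser.get("Storage", "dbm_module", fallback=DEFAULT_DBM)
    return importlib.import_module(name)


custom = load_dbm_module("[Storage]\ndbm_module = dbm.dumb\n")
default = load_dbm_module("")
print(custom.__name__, default.__name__)
```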

Paul.

From tim.one@comcast.net  Mon Nov 25 15:59:57 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 25 Nov 2002 10:59:57 -0500
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: 
 <16E1010E4581B049ABC51D4975CEDB885E2E10@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHOEILEAAA.tim.one@comcast.net>

[Tim]
> IIRC, Paul is running on Windows.  If so, and he's picking up
> the ancient Berkeley 1.85 bsddb that ships with Windows Python,
> that's got lots of severe problems that will never be fixed:

[Moore, Paul]
> That's true. In the meantime, can I upgrade in a way which replaces
> the shipped DBM? If I install pybsddb, will it override the standard
> library version? If not, can I make it? Or would it be worth
> specifically looking for pybsddb, and using that in preference if it
> is present?

All fine questions.  I can't answer them (I'm not a bsddb user -- note that
I've stuck to dicts since day #1 here <0.9 wink>).  Asking on
comp.lang.python may get better answers, but before that I'd poke the
resources (mailing list, trackers) linked to from the project's home page:

    http://pybsddb.sourceforge.net/

"Download it and try it" is also a fine approach.


From neale@woozle.org  Mon Nov 25 16:19:23 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 08:19:23 -0800
Subject: [Spambayes] Current version
In-Reply-To: <BA07E97C.5D2C3%fgranger@teleprosoft.com>
References: <BA07E97C.5D2C3%fgranger@teleprosoft.com>
Message-ID: <w53fztpei5w.fsf@woozle.org>

So then, François Granger <fgranger@teleprosoft.com> is all like:

> This module is not available on all platforms. It seems safer to use the
> Python resource in the form of anydbm....

Weird.  I had it as anydbm initially, but after going through the
documentation on the various dbm modules, it didn't look like I was
guaranteed all the functionality I would need, so I made it explicit.
It looks as though the other dbm methods /do/ provide sufficient
functionality though, so I'll check anydbm back in.

Neale

From neale@woozle.org  Mon Nov 25 16:31:14 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 08:31:14 -0800
Subject: [Spambayes] anyone going to the spam conference?
In-Reply-To: <3DE20FBB.50601@startechgroup.co.uk>
References: <w53smxtup6i.fsf@woozle.org> <3DE20FBB.50601@startechgroup.co.uk>
Message-ID: <w53bs4dehm5.fsf@woozle.org>

So then, Matt Sergeant <msergeant@startechgroup.co.uk> is all like:

> I'm going, and I will be mentioning the spambayes project, and talking
> about the algorithms involved.

Oh cool.  Do you think your mention will be sufficient, or should I go
ahead with volunteering a presentation devoted to spambayes?  I'll be
giving one locally, so I'll be preparing a short talk in any case.

Looking forward to it (bring your PGP fingerprint!)

Neale

From Paul.Moore@atosorigin.com  Mon Nov 25 16:34:25 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 16:34:25 -0000
Subject: [Spambayes] Current version
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E1B@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]
> > This module is not available on all platforms. It seems safer to
> > use the Python resource in the form of anydbm....
>
> Weird. I had it as anydbm initially, but after going through the
> documentation on the various dbm modules, it didn't look like I
> was guaranteed all the functionality I would need, so I made it
> explicit. It looks as though the other dbm methods /do/ provide
> sufficient functionality though, so I'll check anydbm back in.

Better not.

Dumbdbm doesn't support first() and next() for key iteration. (And
dbhash doesn't support iterkeys()).

I'm writing a longer message at the moment, but just as a warning,
reverting to anydbm is *also* wrong :-(

Paul.

From Paul.Moore@atosorigin.com  Mon Nov 25 16:36:11 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 16:36:11 -0000
Subject: [Spambayes] New web training interface for pop3proxy
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619956@UKDCX001.uk.int.atosorigin.com>

From: Tim Peters [mailto:tim.one@comcast.net]
> > That's true. In the meantime, can I upgrade in a way which
> > replaces the shipped DBM? If I install pybsddb, will it override
> > the standard library version? If not, can I make it? Or would
> > it be worth specifically looking for pybsddb, and using that in
> > preference if it is present?
>
> All fine questions. I can't answer them (I'm not a bsddb user --
> note that I've stuck to dicts since day #1 here <0.9 wink>). Asking
> on comp.lang.python may get better answers, but before that I'd poke
> the resources (mailing list, trackers) linked to from the project's
> home page:
>
>     http://pybsddb.sourceforge.net/
>
> "Download it and try it" is also a fine approach.

Sorry - I wasn't clear. I have pybsddb installed, it looks like it
doesn't override the standard library version automatically. It
would be nice if anydbm offered hooks to allow user extension
with new DBM lookalikes. I naively assumed anydbm would be
complicated :-) When I looked at it, though, I found it was pretty
simple. But probably not extensible, due to its reliance on
whichdb, which contains nasty magic number testing...

Just making the thing user-configurable is probably simpler.
François pointed out that dbdict used dbhash explicitly, and the
only other relevant place is hammiebulk (which imports anydbm,
but then never uses it!)

So I think that just changing dbdict to allow user customisation
of the DBM implementation should be enough.

I did a quick hack implementation (including fixing the doctests
:-)) and it looks possible, except for the fact that dumbdbm
doesn't support the required interface - specifically first() and
next() methods. But dbhash doesn't support iterkeys() :-(

Bluntly, there doesn't seem to be a usable common subset of
functionality here. At the very least, dbdict.py shouldn't be
changed to use anydbm, as it will fail when used on a system with
nothing other than dumbdbm.
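
A minimal sketch of the portability point above, written for Python 3, where dumbdbm lives on as `dbm.dumb` (that choice of interpreter is an assumption; the thread itself is about Python 2.2). It relies only on `keys()`, which the anydbm documentation promises for every backend. The helper name `iter_keys_portably` and the file name `wordcounts` are invented for illustration.

```python
import dbm.dumb
import os
import tempfile


def iter_keys_portably(db):
    """Yield every key using only keys(), the one iteration-like
    call that anydbm documents for all of its backends."""
    for key in db.keys():
        yield key


with tempfile.TemporaryDirectory() as tmp:
    db = dbm.dumb.open(os.path.join(tmp, "wordcounts"), "c")
    db[b"spam"] = b"3"
    db[b"ham"] = b"1"
    # dbhash-style cursor methods are absent on the dumb backend.
    has_first = hasattr(db, "first")
    found = sorted(iter_keys_portably(db))
    db.close()
```

The cost of this approach is that `keys()` builds the whole key list in memory, which may matter for a large word database.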

And pybsddb looks even more messy. I'll think about this tonight,
and try to put together a generic interface.

Paul.

From francois.granger@free.fr  Mon Nov 25 16:40:02 2002
From: francois.granger@free.fr (François Granger)
Date: Mon, 25 Nov 2002 17:40:02 +0100
Subject: [Spambayes] Current version
In-Reply-To: <w53fztpei5w.fsf@woozle.org>
Message-ID: <BA0810F2.5D324%francois.granger@free.fr>

on 25/11/02 17:19, Neale Pickett at neale@woozle.org wrote:

> So then, François Granger <fgranger@teleprosoft.com> is all like:
>
>> This module is not available on all platforms. It seems safer to use
>> the Python resource in the form of anydbm....
>
> Weird.  I had it as anydbm initially, but after going through the
> documentation on the various dbm modules, it didn't look like I was
> guaranteed all the functionality I would need, so I made it explicit.
> It looks as though the other dbm methods /do/ provide sufficient
> functionality though, so I'll check anydbm back in.

Nice, from what I read in the code, it is ok for anydbm.

Only two changes are needed: dbdict.py lines 59 and 94, to force it to
use anydbm.
-- 
Mail is a means of communication. People should
ask themselves about the political implications of the choices (or
non-choices) of their tools and technologies. For clean mail:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From neale@woozle.org  Mon Nov 25 16:47:17 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 08:47:17 -0800
Subject: [Spambayes] Current version
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E1B@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB885E2E1B@UKDCX001.uk.int.atosorigin.com>
Message-ID: <w5365ulegve.fsf@woozle.org>

So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

> From: Neale Pickett [mailto:neale@woozle.org]
> > I'll check anydbm back in.
> 
> Better not.
> 
> Dumbdbm doesn't support first() and next() for key iteration. (And
> dbhash doesn't support iterkeys()).

Ya know, now that I think about it, we don't need key iteration
anymore.  Since we're now storing only the counters associated with a
word, there's no reason I can think of that anything would need to
iterate over the keys.

This is why François could use anydbm without problems--we're not using
the first() and next() constructs anymore.

Instead of going back to dbhash, I'm going to see if we can't dump
dbdict altogether and just use the built-in shelve module.

Neale

From noreply@sourceforge.net  Mon Nov 25 16:36:41 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Mon, 25 Nov 2002 08:36:41 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-639122 ] hammie: ignore emails older than n days
Message-ID: <E18GMDx-0001IW-00@sc8-sf-web2.sourceforge.net>

Patches item #639122, was opened at 2002-11-15 13:47
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639122&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Jason Hildebrand (jdhildeb)
>Assigned to: Neale Pickett (npickett)
Summary: hammie: ignore emails older than n days

Initial Comment:
Since your documentation stresses the importance of
training using only relatively recent emails, I thought
a good way to do this would be to have hammie do it for me.

So I added a new configuration option:

[Hammie]
# when training, hammie will ignore messages older than
# this number of days.
# i.e. set to 365 to ignore messages older than one year.
# Set to 0 to disable any filtering by date.
ignore_old_messages: 0

The patch also modifies Hammie to output the number of
messages it read/ignored for each mail file it processes.

This option might also prove useful for doing
incremental training (i.e. set up cron to train once a
week, and set ignore_old_messages to 7).
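
One way the date filter described above might look, as a hedged sketch rather than the patch itself (the patch text is not shown here). The function name `is_too_old` and its parameters are hypothetical; only the `ignore_old_messages` semantics (0 disables filtering) come from the comment above. Treating messages with a missing or unparseable Date header as "not too old" is an assumption.

```python
from datetime import datetime, timedelta, timezone
from email import message_from_string
from email.utils import parsedate_to_datetime


def is_too_old(msg, max_age_days, now=None):
    """Return True if msg's Date header is more than max_age_days old.

    max_age_days <= 0 disables filtering. Messages whose Date header is
    missing or unparseable are kept (assumption, not from the patch).
    """
    if max_age_days <= 0:
        return False
    date_header = msg.get("Date")
    if date_header is None:
        return False
    try:
        sent = parsedate_to_datetime(date_header)
    except (TypeError, ValueError):
        return False
    if sent.tzinfo is None:
        # A "-0000" zone yields a naive datetime; treat it as UTC.
        sent = sent.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now - sent > timedelta(days=max_age_days)
```

A weekly cron run would then call something like `is_too_old(msg, 7)` on each message and skip training on those that return True.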


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=639122&group_id=61702

From msergeant@startechgroup.co.uk  Mon Nov 25 16:48:32 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Mon, 25 Nov 2002 16:48:32 +0000
Subject: [Spambayes] anyone going to the spam conference?
References: <w53smxtup6i.fsf@woozle.org> <3DE20FBB.50601@startechgroup.co.uk>
	<w53bs4dehm5.fsf@woozle.org>
Message-ID: <3DE25460.80303@startechgroup.co.uk>

Neale Pickett said the following on 25/11/02 16:31:
> So then, Matt Sergeant <msergeant@startechgroup.co.uk> is all like:
> 
> 
>>I'm going, and I will be mentioning the spambayes project, and talking
>>about the algorithms involved.
> 
> 
> Oh cool.  Do you think your mention will be sufficient, or should I go
> ahead with volunteering a presentation devoted to spambayes?  I'll be
> giving one locally, so I'll be preparing a short talk in any case.

I don't know - there's not really much to the whole system if you think 
about it. While it's taken about two months to get here, the algorithms 
used (and the code involved) are still incredibly simple.

On the flip side my statistics skills aren't that hot, so if you grok 
the algorithms (like chi-squared) better than I do then you might be a 
better candidate to speak. Plus most of my talk focuses on the problems 
encountered when filtering at the network level for large companies, so 
you may want to hear more about the algorithms than the problems raised 
by the algorithms :-)

On the flip-flip side I think all the speaker slots are full :-)

> Looking forward to it (bring your PGP fingerprint!)

Keysigning's overrated. It wastes beer time :-)

Matt.


From Paul.Moore@atosorigin.com  Mon Nov 25 16:58:51 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Mon, 25 Nov 2002 16:58:51 -0000
Subject: [Spambayes] Current version
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]
> Ya know, now that I think about it, we don't need key
> iteration anymore.  Since we're now storing only the
> counters associated with a word, there's no reason I
> can think of that anything would need to iterate over
> the keys.
[...]
> Instead of going back to dbhash, I'm going to see if we
> can't dump dbdict altogether and just use the built-in
> shelve module.

I'd rather you didn't. I can't (immediately) see a simple way
to customize shelve to use (say) pybsddb. Let me have a play
tonight, and I'll see if I can make it customizable as it
stands.

I'm assuming from what you say that I can simply rip out
the __iter__ and iter* methods? In fact, if so, it's pretty
simple. Just change "import anydbm" to

    from Options import options
    # Whatever name seems appropriate - the default
    # value should be 'anydbm'
    DBM_METHOD = options.dbm_implementation_method
    dbm = __import__(DBM_METHOD)

and then use dbm in place of anydbm.

For pybsddb, we may need an adapter class to supply the
right set of methods, but that's not hard, and then it's
just a case of

    [dbdict]
    dbm_method=pybsd_wrapper

in bayescustomize.ini if the user wants to go this way.
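
A sketch of the same idea in Python 3 terms, under stated assumptions: `importlib.import_module` stands in for `__import__` (it also handles dotted names such as `dbm.dumb` directly), the default is `dbm`, which is the Python 3 name for anydbm, and the function name `load_dbm_module` is invented here. How the configured name reaches this function from the options file is left out.

```python
import importlib

# Python 3 name for what the thread calls anydbm.
DEFAULT_DBM_MODULE = "dbm"


def load_dbm_module(name=DEFAULT_DBM_MODULE):
    """Import the DBM module named in the configuration.

    Fails early, with a clear message, if the module is missing or
    does not expose an open() function.
    """
    try:
        module = importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            "configured DBM module %r is not importable" % name
        ) from exc
    if not callable(getattr(module, "open", None)):
        raise AttributeError("DBM module %r has no open() function" % name)
    return module
```

An adapter for pybsddb would then just be another importable module that exposes a compatible `open()`.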

Paul.

PS Just in case it's getting lost why I care about this - if
   my speculation about the hangs I've been getting is correct,
   and it's related to the old and buggy DBM implementation
   supplied as standard in the Windows distribution, this could
   hit a lot of Windows users. Making sure we can work around
   it is probably worth it. Totally automatic detection of
   whether pybsddb is installed would be even more idiot-proof,
   but let's walk before we run - after all, we may end up
   all using Zodb in any case :-)

From noreply@sourceforge.net  Mon Nov 25 16:36:43 2002
From: noreply@sourceforge.net (noreply@sourceforge.net)
Date: Mon, 25 Nov 2002 08:36:43 -0800
Subject: [Spambayes] 
 [ spambayes-Patches-643306 ] tests on fixed training and test set
Message-ID: <E18GMDz-0001Ih-00@sc8-sf-web2.sourceforge.net>

Patches item #643306, was opened at 2002-11-24 16:41
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=643306&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Daniel Etzold (etzi)
>Assigned to: Neale Pickett (npickett)
Summary: tests on fixed training and test set

Initial Comment:
A new argument for runtest.sh "fixedsets" and a file
"fixedsets.py" are introduced to create just one
classifier from Set1 and test this classifier on the
eMails of Set2.
This is useful when comparing the results of this
project with other methods (SVM, Rocchio, k-Nearest
Neighbour, Naive Bayes) for equal training and test sets.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=643306&group_id=61702

From fgranger@teleprosoft.com  Mon Nov 25 17:45:23 2002
From: fgranger@teleprosoft.com (François Granger)
Date: Mon, 25 Nov 2002 18:45:23 +0100
Subject: [Spambayes] Current version
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
Message-ID: <BA082042.5D337%fgranger@teleprosoft.com>

on 25/11/02 17:58, Moore, Paul at Paul.Moore@atosorigin.com wrote:

>  but let's walk before we run - after all, we may end up
>  all using Zodb in any case :-)

I hope not before it gets packaged with the standard Python distro....


From richie@entrian.com  Mon Nov 25 20:16:08 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 25 Nov 2002 20:16:08 +0000
Subject: [Spambayes] Current version
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
Message-ID: <luo4uugdble02ipu1keho1fmn9v50t71tn@4ax.com>


[Paul Moore]
>    but let's walk before we run - after all, we may end up
>    all using Zodb in any case :-)

I'd like this (despite the additional installation burden - we can ship
binaries for Windows and Mac) and not only for technical reasons.  As I
understand it, post-1.8x versions of the core bsddb code ship under the
Sleepycat license, which demands that projects using it must be
published-source.  This is a problem if we want Spambayes to be fully
PSF-licensed - if someone wants to take the Spambayes source and fund their
addictions by creating a commercial, closed-source spam-filter product, the
PSF license allows that but not if the code relies on bsddb.  Not that I'm
in favour of people making money from Spambayes (unless it's me 8-) but the
PSF license does allow for it - it should be all or nothing.  Or do I have
this all wrong?

Slightly OT: This has concerned me since PLabs announced that they were
integrating bsddb into Python 2.3 - it's going to make it very easy
(especially on Windows) for someone to write code that uses anydbm, wrap it
up with Py2exe and ship it under a commercial license, not knowing that
they're breaking the Sleepycat license.  They've never heard of Sleepycat
Software or even bsddb - as far as they're concerned, this "bsddb.pyd" file
that Py2exe tells them they need to ship is just another part of Python,
like _socket.pyd or select.pyd.

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Mon Nov 25 20:30:32 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 25 Nov 2002 14:30:32 -0600
Subject: [Spambayes] Current version
In-Reply-To: <luo4uugdble02ipu1keho1fmn9v50t71tn@4ax.com>
References: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
	<luo4uugdble02ipu1keho1fmn9v50t71tn@4ax.com>
Message-ID: <15842.34920.583131.72086@montanaro.dyndns.org>


    Richie> As I understand it, post-1.8x versions of the core bsddb code
    Richie> ship under the Sleepycat license, which demands that projects
    Richie> using it must be published-source.  

Don't use bsddb in a closed-source product.  Use dbm or dumbdbm or use
pickles or roll your own thang.  I doubt the presence of bsddb would be the
only barrier to creating a closed-source product based upon the spambayes
code.

Skip

From skip@pobox.com  Mon Nov 25 20:33:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Mon, 25 Nov 2002 14:33:47 -0600
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E1A@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB885E2E1A@UKDCX001.uk.int.atosorigin.com>
Message-ID: <15842.35115.185328.504724@montanaro.dyndns.org>


    Paul> I suspect the best answer is to make the DBM implementation
    Paul> configurable via bayescustomize.ini (with the default being
    Paul> anydbm).

I think you might want to specify the database open function instead of just
the module.  There are three ways to open db files with bsddb (btopen,
hashopen, rnopen).  That will require a little more trickery in Options.py,
but not an insane amount.
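
A rough sketch of that extra trickery, assuming the option holds a dotted string such as `bsddb.hashopen`. The helper name `resolve_open_function` is hypothetical, and the demonstration uses `dbm.dumb.open` (Python 3) only because it is always available; it is not what the thread proposes as a backend.

```python
import importlib


def resolve_open_function(dotted_name):
    """Turn 'package.module.function' into the callable it names."""
    module_name, _, func_name = dotted_name.rpartition(".")
    if not module_name:
        raise ValueError(
            "expected 'module.function', got %r" % dotted_name
        )
    module = importlib.import_module(module_name)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError("%r is not callable" % dotted_name)
    return func


# Stand-in for a value read from the options file.
open_func = resolve_open_function("dbm.dumb.open")
```

Storing the function name rather than the module name lets one module offer several open styles without needing a wrapper module for each.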

Skip

From neale@woozle.org  Mon Nov 25 20:37:43 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 12:37:43 -0800
Subject: [Spambayes] Current version
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
Message-ID: <w53k7j1crmw.fsf@woozle.org>

So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

> I'd rather you didn't. I can't (immediately) see a simple way
> to customize shelve to use (say) pybsddb. Let me have a play
> tonight, and I'll see if I can make it customizable as it
> stands.

Okay.  In the meantime, I've taken the iteritems stuff out of dbdict.  I
blew away Classifier.update_probabilities(), which was the only thing
left that needed it.  Beaujolais!


> I'm assuming from what you say that I can simply rip out
> the __iter__ and iter* methods? In fact, if so, it's pretty
> simple. Just change "import anydbm" to
> 
>     from Options import options
>     # Whatever name seems appropriate - the default
>     # value should be 'anydbm'
>     DBM_METHOD = options.dbm_implementation_method
>     dbm = __import__(DBM_METHOD)
> 
> and then use dbm in place of anydbm.
> 
> For pybsddb, we may need an adapter class to supply the
> right set of methods, but that's not hard, and then it's
> just a case of
> 
>     [dbdict]
>     dbm_method=pybsd_wrapper
> 
> in bayescustomize.ini if the user wants to go this way.

I like that.  You go, Paul!


From neale@woozle.org  Mon Nov 25 20:41:00 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 12:41:00 -0800
Subject: [Spambayes] anyone going to the spam conference?
In-Reply-To: <3DE25460.80303@startechgroup.co.uk>
References: <w53smxtup6i.fsf@woozle.org> <3DE20FBB.50601@startechgroup.co.uk>
	<w53bs4dehm5.fsf@woozle.org> <3DE25460.80303@startechgroup.co.uk>
Message-ID: <w53fztpcrhf.fsf@woozle.org>

So then, Matt Sergeant <msergeant@startechgroup.co.uk> is all like:

> On the flip side my statistics skills aren't that hot, so if you grok
> the algorithms (like chi-squared) better than I do then you might be a
> better candidate to speak. Plus most of my talk focuses on the
> problems encountered when filtering at the network level for large
> companies, so you may want to hear more about the algorithms than the
> problems raised by the algorithms :-)

> On the flip-flip side I think all the speaker slots are full :-)

I'll read up on the math, and if anyone has questions I'll make a lame
attempt to answer them.  I doubt many folks will be too interested in
the math, though.

Or if you'd rather, I'll just sit in the back and heckle :)

BTW Matt, you should chat with Richie Hindle, who's writing the LJ
article.

Neale

From tim@zope.com  Mon Nov 25 21:22:17 2002
From: tim@zope.com (Tim Peters)
Date: Mon, 25 Nov 2002 16:22:17 -0500
Subject: [Spambayes] Current version
In-Reply-To: <luo4uugdble02ipu1keho1fmn9v50t71tn@4ax.com>
Message-ID: <BIEJKCLHCIOIHAGOKOLHEEADEBAA.tim@zope.com>

[Richie Hindle]
> ...
> Slightly OT: This has concerned me since PLabs announced that they were
> integrating bsddb into Python 2.3 - it's going to make it very easy
> (especially on Windows) for someone to write code that uses
> anydbm, wrap it up with Py2exe and ship it under a commercial license,
> not knowing that they're breaking the Sleepycat license.  They've never
> heard of Sleepycat Software or even bsddb - as far as they're concerned,
> this "bsddb.pyd" file that Py2exe tells them they need to ship is just
> another part of Python, like _socket.pyd or select.pyd.

Barry Warsaw talked w/ the Sleepycat folks about the licensing issues, and
was satisfied they're tractable.  I won't presume to explain them, though.
My contribution was adding an XXX comment to the 2.3 NEWS file reminding us
that the licensing issues are going to remain clear as mud unless and until
Bar^H^H^Hsomeone writes up what they believe Sleepycat says is the truth.


From tim@fourstonesExpressions.com  Mon Nov 25 21:38:10 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 15:38:10 -0600
Subject: [Spambayes] Re: [Spambayes-checkins] spambayes sb0.5.exe,NONE,1.1.2.1
In-Reply-To: <20021125214014.GA11635@glacier.arctrix.com>
Message-ID: <GWQ63316132TR84RPWVOMZX2WPLJI94.3de29842@riven>

Cool... I'll remove it.

11/25/2002 3:40:14 PM, Neil Schemenauer <nas@python.ca> wrote:

>Tim Stone wrote:
>> Added Files:
>>       Tag: hammie-playground
>> 	sb0.5.exe 
>
>I don't think this belongs in CVS.  SF has a file distribution feature
>that would be more appropriate.
>
>  Neil
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From barry@wooz.org  Mon Nov 25 21:42:56 2002
From: barry@wooz.org (Barry A. Warsaw)
Date: Mon, 25 Nov 2002 16:42:56 -0500
Subject: [Spambayes] Current version
References: <luo4uugdble02ipu1keho1fmn9v50t71tn@4ax.com>
	<BIEJKCLHCIOIHAGOKOLHEEADEBAA.tim@zope.com>
Message-ID: <15842.39264.237047.682498@gargle.gargle.HOWL>


>>>>> "TP" == Tim Peters <tim@zope.com> writes:

    TP> [Richie Hindle]
    >> ...  Slightly OT: This has concerned me since PLabs announced
    >> that they were integrating bsddb into Python 2.3 - it's going
    >> to make it very easy (especially on Windows) for someone to
    >> write code that uses anydbm, wrap it up with Py2exe and ship it
    >> under a commercial license, not knowing that they're breaking
    >> the Sleepycat license.  They've never heard of Sleepycat
    >> Software or even bsddb - as far as they're concerned, this
    >> "bsddb.pyd" file that Py2exe tells them they need to ship is
    >> just another part of Python, like _socket.pyd or select.pyd.

    TP> Barry Warsaw talked w/ the Sleepycat folks about the licensing
    TP> issues, and was satisfied they're tractable.  I won't presume
    TP> to explain them, though.  My contribution was adding an XXX
    TP> comment to the 2.3 NEWS file reminding us that the licensing
    TP> issues are going to remain clear as mud unless and until
    TP> Bar^H^H^Hsomeone writes up what they believe Sleepycat says is
    TP> the truth.

Here's what I believe Sleepycat says is the truth.

Because Python is open source, we are allowed to link BerkeleyDB with
Python in a binary and the resulting binary can be used in commercial
applications.  The reason this is okay (IIU Sleepycat correctly), is
that Sleepycat considers Python to be the "application using
BerkeleyDB" and because you can get the source for Python.  Your
closed source commercial application isn't the "application using
BerkeleyDB".

Sleepycat wants to promote scripting language use of BerkeleyDB so
they allow this, even though they admit it's a somewhat arbitrary
decision.

If I was shipping a commercial application though, I'm not sure how
much I would trust the opinion of an under-rested, non-lawyer, Unix
hacking bass player, but YMMV.

aren't-you-glad-i'm-not-a-drummer-ly y'rs,
-Barry

From tim@fourstonesExpressions.com  Mon Nov 25 22:51:48 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 16:51:48 -0600
Subject: [Spambayes] Packaging
Message-ID: <ON3X782ON5451KESOGDE0IFCAUQ3VBA.3de2a984@riven>

I believe it's time to start thinking about how to package this thing.  At 
least that time is approaching.  Right now, the project has a bunch of files 
that will not need to be included in a distribution.  I gave it a go my own 
way, and was duly reproached by Neil... <wink>  I'd like to start thinking 
along these lines, but I clearly don't know what's involved from the 
sourceforge perspective, and I'd like to hear the thoughts of the rest of the 
team.  Our algorithmic work seems to be winding down, and we're wrapping up a 
code refactoring effort. The user interfaces (hammie, pop3proxy, outlook) seem 
to be stabilizing nicely.  With a bit more tweaking, I think at least the 
pop3proxy stuff is ready for alpha release.

Maybe I'm stepping a bit out of line... but what the hey.  You guys will tell 
me if I am, and I'll still think it's time to push an alpha release out the 
door.  :)

c'est moi - TimS
www.fourstonesExpressions.com 


From lists@morpheus.demon.co.uk  Mon Nov 25 22:37:53 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Mon, 25 Nov 2002 22:37:53 +0000
Subject: [Spambayes] Current version
References: 
	<16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
	<w53k7j1crmw.fsf@woozle.org>
Message-ID: <n2m-g.of8d2s3i.fsf@morpheus.demon.co.uk>

(I wish I'd read this before I posted my last message :-))

Neale Pickett <neale@woozle.org> writes:

> So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:
>
>> I'd rather you didn't. I can't (immediately) see a simple way
>> to customize shelve to use (say) pybsddb. Let me have a play
>> tonight, and I'll see if I can make it customizable as it
>> stands.
>
> Okay.  In the meantime, I've taken the iteritems stuff out of dbdict.  I
> blew away Classifier.update_probabilities(), which was the only thing
> left that needed it.  Beaujolais!

Ah. As long as Classifier.update_probabilities() is history, iterators
aren't needed. Neat.

>> I'm assuming from what you say that I can simply rip out
>> the __iter__ and iter* methods? In fact, if so, it's pretty
>> simple. Just change "import anydbm" to
>> 
>>     from Options import options
>>     # Whatever name seems appropriate - the default
>>     # value should be 'anydbm'
>>     DBM_METHOD = options.dbm_implementation_method
>>     dbm = __import__(DBM_METHOD)
>> 
>> and then use dbm in place of anydbm.
>> 
>> For pybsddb, we may need an adapter class to supply the
>> right set of methods, but that's not hard, and then it's
>> just a case of
>> 
>>     [dbdict]
>>     dbm_method=pybsd_wrapper
>> 
>> in bayescustomize.ini if the user wants to go this way.
>
> I like that.  You go, Paul!

OK, ignore my previous message (well, laugh at me a bit for using the
wrong database format, if you like :-)). I'll implement this (probably
tomorrow).

Patch to follow. Is a posting to the list OK, or should I upload it to
SF? (No CVS commit ability, and I wouldn't know what to do if I had it
:-))

Paul.
-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Mon Nov 25 22:32:48 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Mon, 25 Nov 2002 22:32:48 +0000
Subject: [Spambayes] Current version
References: 
	<16E1010E4581B049ABC51D4975CEDB885E2E1B@UKDCX001.uk.int.atosorigin.com>
	<w5365ulegve.fsf@woozle.org>
Message-ID: <n2m-g.r8d92sbz.fsf@morpheus.demon.co.uk>

Neale Pickett <neale@woozle.org> writes:

> So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:
>
>> From: Neale Pickett [mailto:neale@woozle.org]
>> > I'll check anydbm back in.
>> 
>> Better not.
>> 
>> Dumbdbm doesn't support first() and next() for key iteration. (And
>> dbhash doesn't support iterkeys()).
>
> Ya know, now that I think about it, we don't need key iteration
> anymore.  Since we're now storing only the counters associated with
> a word, there's no reason I can think of that anything would need to
> iterate over the keys.
>
> This is why François could use anydbm without problems--we're not
> using the first() and next() constructs anymore.

Actually, iteritems() is used in update_probabilities(), which is
still called in pop3proxy. I'm not sure why François didn't see the
problem - maybe he hasn't trained any data with the change in place...

[BTW, __iter__ should be implemented as iterkeys, not as iteritems, if
it's to be compatible with "real" dictionaries...]
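
To illustrate the dictionary-compatibility point, here is a small sketch (not the actual dbdict code) of a wrapper whose `__iter__` yields keys. The class name `DBDictSketch` is invented, and it assumes the wrapped object offers item access plus `keys()`, which is the subset anydbm documents.

```python
from collections.abc import MutableMapping


class DBDictSketch(MutableMapping):
    """Dict-like view over a DBM-style object.

    Only item get/set/delete and keys() are required of the backend.
    """

    def __init__(self, db):
        self._db = db

    def __getitem__(self, key):
        return self._db[key]

    def __setitem__(self, key, value):
        self._db[key] = value

    def __delitem__(self, key):
        del self._db[key]

    def __iter__(self):
        # Like a real dict, iterating yields keys, not (key, value) pairs.
        return iter(self._db.keys())

    def __len__(self):
        return len(self._db.keys())
```

Because `MutableMapping` derives `items()`, `values()`, `get()` and friends from these five methods, code that wants pairs can still ask for `items()` explicitly.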

Annoyingly, as far as I can see, anydbm doesn't actually offer any
decent guarantees. It says that it will use one of dbhash, gdbm, dbm,
or dumbdbm. And it says "The object returned by open() supports most
of the same functionality as dictionaries; keys and their
corresponding values can be stored, retrieved, and deleted, and the
has_key() and keys() methods are available. Keys and values must
always be strings." Nothing about iteration.

And the individual dbm modules are no help:

dbhash: documents first(), last(), next(), previous() and sync()
dbm: "the items() and values() methods are not supported". keys() is
     slow, and iterkeys() isn't supported.
gdbm: firstkey(), nextkey(), reorganize() and sync()
dumbdbm: says nothing (but supports iterkeys, not itervalues or
         iteritems - __iter__ is iterkeys())

Also, whichdb on a pybsddb3 hash database reports it as a hashdb, so
that reopening an existing db file will (on Windows) use the broken
built-in DB implementation, rather than the pybsddb3 one.
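
The sniffing behaviour can be seen in miniature with a sketch like the one below. It uses Python 3, where whichdb is `dbm.whichdb`, and the dumb backend rather than pybsddb3 (both are substitutions for illustration); the function name `create_and_sniff` is made up.

```python
import dbm
import dbm.dumb
import os
import tempfile


def create_and_sniff(directory):
    """Create a small database with one backend, then ask whichdb
    which module it believes should reopen the files."""
    path = os.path.join(directory, "wordcounts")
    db = dbm.dumb.open(path, "c")
    db[b"spam"] = b"3"
    db.close()
    return dbm.whichdb(path)


with tempfile.TemporaryDirectory() as tmp:
    detected = create_and_sniff(tmp)
```

The point is that the generic open decides from the files on disk, not from whatever module created them, so a file format shared by two implementations will always go to the one the sniffing code knows about.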

[later]

Doh. I've been spending a lot of time looking at this now, trying out
implementations, and I just read the help for hammiebulk.py - which
points out that for pop3proxy, the pickle store is recommended over
DBM. I was using DBM at the stage when I was using a custom fetcher
which ran the classifier as a filter. I didn't think to switch when I
changed to pop3proxy :-(

So all of this, while of theoretical interest, is not in fact of
practical value to me...

I still feel that anydbm is not well suited to user customisation, and
that DBDict could do with the ability for the user to specify a
particular DBM implementation. But I don't have a real need any more,
so while I'm happy to help, I'm no longer driven by the need :-)

Paul.
-- 
This signature intentionally left blank

From lists@morpheus.demon.co.uk  Mon Nov 25 22:45:45 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Mon, 25 Nov 2002 22:45:45 +0000
Subject: [Spambayes] Pop3proxy - doesn't save pickle when reviewing?
Message-ID: <n2m-g.lm3h2rqe.fsf@morpheus.demon.co.uk>

Just switched pop3proxy over to pickle format. One thing I noticed,
which surprises me, is that when I review messages and train, the
pickle is *not* saved. I have to manually save.

Is this intentional? With my way of working, I imagine it could result
in me forgetting to save. Surely a save isn't too much of an overhead
- the review screen shows "Training... Trained on X messages...
Updating probabilities... Done" messages already, so adding
"Saving..." would fit into the UI fine.

It should just be a change to onReview(), changing

        self.push("Done.</b></p>")

to

        self.doSave()

Paul.
-- 
This signature intentionally left blank

From richie@entrian.com  Mon Nov 25 23:10:29 2002
From: richie@entrian.com (Richie Hindle)
Date: Mon, 25 Nov 2002 23:10:29 +0000
Subject: [Spambayes] Packaging
In-Reply-To: <ON3X782ON5451KESOGDE0IFCAUQ3VBA.3de2a984@riven>
References: <ON3X782ON5451KESOGDE0IFCAUQ3VBA.3de2a984@riven>
Message-ID: <sbb5uukr2e68v34qvqj6a5coc9d269c9uk@4ax.com>


[Tim Stone]
> I believe it's time to start thinking about how to package this thing.

I agree.  With upcoming publicity from the spam conference and the Linux
Journal articles, we need a better story than "check it out from
SourceForge CVS".  Just a tarball would probably be enough for starters,
but we should wait for the dust to settle after the recent refactorings.

-- 
Richie Hindle
richie@entrian.com


From lists@morpheus.demon.co.uk  Mon Nov 25 23:22:40 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Mon, 25 Nov 2002 23:22:40 +0000
Subject: [Spambayes] Exception in pop3proxy UI - classify word
Message-ID: <n2m-g.3cpp44lb.fsf@morpheus.demon.co.uk>

When I try to classify a word in the pop3proxy UI, I get an exception
raised:

error: uncaptured python exception,
 closing channel <__main__.UserInterface connected at 0xe42170>
 (exceptions.AttributeError:'WordInfo' object has no attribute
 '__slots__'
 [C:\Python22\lib\asyncore.py|poll|99]
 [C:\Python22\lib\asyncore.py|handle_read_event|394]
 [C:\Python22\lib\asynchat.py|handle_read|130]
 [C:\Applications\Spambayes\pop3proxy.py|found_terminator|736]
 [C:\Applications\Spambayes\pop3proxy.py|onRequest|761]
 [C:\Applications\Spambayes\pop3proxy.py|onWordquery|1105])

Hmm, classifier.WordInfo lost its __slots__ just today.

Paul.
-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Mon Nov 25 23:45:01 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 26 Nov 2002 10:45:01 +1100
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>

Hi everyone (and tim1 <wink>)

  I've been thinking about the "database" to use for the Outlook plugin.  I
see two reasonable choices today: pickles and whatever anydbm picks up on
Windows.

My understanding is that the main trade-offs are that pickles are slow to
load, but lightning to use, whereas a database is fast(er) to load, but
slow to use.  IIRC, updating the probabilities was a real killer for a DB,
but this has recently died.

To be honest, my main motivation in even thinking about this is the terrible
things we are doing to Outlook's startup time.  My decent machine is taking
quite a few seconds longer to get outlook started - and this cost is worn
every time *any* application uses Outlook for anything at all.  If we do any
sort of training, we also pay this penalty shutting down, saving the pickle.
If we crash, we lose all recent training data.

So, I see two basic routes I can take:

* Move to a DB, but stick with a fully synchronous model.  We still wear the
DB load time at startup, but this should be reduced significantly.  We wear
the performance costs at runtime associated with the scoring, and do all
such scoring in the "foreground", and saving of the DB as necessary.

* Stick with pickles, but move to a threaded asynchronous model.  Messages
can be "queued" for scoring/training.  At startup, we spin a new thread to
load the pickle.  Any "missed" messages at startup, and all messages as they
arrive are queued for scoring and filtering.  If the pickle is loaded, then
it will generally appear synchronous, otherwise new messages may sit in your
inbox for a few seconds before they are removed.  When the pickle is
modified, a background thread copies the data, and starts writing.  We do
some smarts with renaming the previous versions, as Tim1 implied.  There
would be support for synchronous calls too (eg, "show spam clues"), but in
general, asynch could be used.
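
A minimal sketch of the queued model described above, assuming Python 3's
queue and threading modules.  AsyncScorer, toy_score and the pickled dict of
word -> spamprob are hypothetical stand-ins for illustration, not the Outlook
addin's actual code:

```python
import pickle
import queue
import threading


def toy_score(wordprobs, text):
    """Stand-in for the real classifier: mean spamprob of the known words."""
    probs = [wordprobs[w] for w in text.lower().split() if w in wordprobs]
    return sum(probs) / len(probs) if probs else 0.5


class AsyncScorer:
    """Hypothetical: accept messages at once, score them once the pickle loads."""

    def __init__(self, pickle_path):
        self.pickle_path = pickle_path
        self.pending = queue.Queue()   # messages waiting to be scored
        self.results = {}              # msg_id -> score
        self.worker = threading.Thread(target=self._run, daemon=True)
        self.worker.start()            # startup returns immediately

    def submit(self, msg_id, text):
        # Safe to call before the pickle has finished loading.
        self.pending.put((msg_id, text))

    def _run(self):
        # The slow load happens here, off the main thread.
        with open(self.pickle_path, "rb") as f:
            wordprobs = pickle.load(f)
        while True:
            item = self.pending.get()
            try:
                if item is None:       # sentinel: shut down
                    return
                msg_id, text = item
                self.results[msg_id] = toy_score(wordprobs, text)
            finally:
                self.pending.task_done()

    def close(self):
        self.pending.put(None)
        self.worker.join()
```

Messages submitted before the load completes simply wait in the queue, which
is the "may sit in your inbox for a few seconds" behaviour.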

I would appreciate some comments on this.  I am leaning towards the asynch
model, but it is clearly more complicated.  However, if moving to a DB
simply means we will have perf issues, just not at startup, then the
complexity would be warranted.

Any thoughts?  Fairy god-mothers? Magic answers?

Thanks,

Mark.


From tim@fourstonesExpressions.com  Tue Nov 26 02:15:41 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 20:15:41 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
Message-ID: <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>

Ok, I'm glad you've put this out here. IMO, DBM is too unreliable to be 
anything but a test database.  In real life, bad stuff happens... the database 
has to be resilient, or at least recoverable.  DBM doesn't seem to be either, 
really.  (are the perl dbm implementations better?)  In the absence of a real 
database, which may be out of reach here, we should stick with pickles, which 
have a rather short 'indoubt' window that exists only as the pickle is being 
written.  Pickles are slow to load, slow to store, and fast to access, 
primarily because the entire object model is being materialized into memory.  
This makes 'em honkin memory hogs, with the memory consumption being a 
potential show-stopper.  But that won't happen except in huge database cases, 
and we can perhaps deal with that by placing some artificial limit on the 
pickle size.  When it exceeds that size, prune the least important stuff out.
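
One way to keep that 'indoubt' window small is to write the new pickle to a
temporary file and only then swap it into place.  A sketch, assuming Python 3
(os.replace); the function name and the ".bak" suffix are made up for
illustration:

```python
import os
import pickle
import shutil
import tempfile


def save_pickle_atomically(obj, path, keep_backup=True):
    """Write obj to path so that path always holds a complete pickle."""
    directory = os.path.dirname(os.path.abspath(path))
    # Temp file in the same directory, so the final rename stays on one volume.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
            f.flush()
            os.fsync(f.fileno())       # push the bytes to disk before renaming
        if keep_backup and os.path.exists(path):
            shutil.copyfile(path, path + ".bak")   # previous version survives
        os.replace(tmp_path, path)     # swap the finished file into place
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # never leave a half-written temp behind
        raise
```

A crash during the dump leaves the old pickle untouched; only training done
since the last successful save is lost.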

So, as far as async goes, wow... that adds a huge amount of complexity.  Is it 
really worth it?  I really doubt it.  It makes for really neat architectures, 
and it certainly isn't out of the question, but it makes a rigorous test of a 
system all but impossible, makes the code really hard to understand, modify, 
maintain, and seriously violates the stupid is good principle.

So, to deal with the outlook startup times, I wonder if there are any 
partitioning schemes we can implement.  Perhaps we could split the pickled 
stuff into partitions, based on spamprob (perhaps), alphabetically, nham, 
whatever.  We could load a small subset by default, and then load the whole 
thing later at a user's request, or ... I don't know, I'm just thinking out 
loud.

- TimS


11/25/2002 5:45:01 PM, "Mark Hammond" <mhammond@skippinet.com.au> wrote:

>Hi everyone (and tim1 <wink>)
>
>  I've been thinking about the "database" to use for the Outlook plugin.  I
>see two reasonable choices today: pickles and whatever anydbm picks up on
>Windows.
>
>My understanding is that the main trade-offs are that pickles are slow to
>load, but lightning to use, whereas a database is fast(er) to load, but
>slow to use.  IIRC, updating the probabilities was a real killer for a DB,
>but this has recently died.
>
>To be honest, my main motivation in even thinking about this is the terrible
>things we are doing to Outlook's startup time.  My decent machine is taking
>quite a few seconds longer to get outlook started - and this cost is worn
>every time *any* application uses Outlook for anything at all.  If we do any
>sort of training, we also pay this penalty shutting down, saving the pickle.
>If we crash, we lose all recent training data.
>
>So, I see two basic routes I can take:
>
>* Move to a DB, but stick with a fully synchronous model.  We still wear the
>DB load time at startup, but this should be reduced significantly.  We wear
>the performance costs at runtime associated with the scoring, and do all
>such scoring in the "foreground", and saving of the DB as necessary.
>
>* Stick with pickles, but move to a threaded asynchronous model.  Messages
>can be "queued" for scoring/training.  At startup, we spin a new thread to
>load the pickle.  Any "missed" messages at startup, and all messages as they
>arrive are queued for scoring and filtering.  If the pickle is loaded, then
>it will generally appear synchronous, otherwise new messages may sit in your
>inbox for a few seconds before they are removed.  When the pickle is
>modified, a background thread copies the data, and starts writing.  We do
>some smarts with renaming the previous versions, as Tim1 implicated.  There
>would be support for synchronous calls too (eg, "show spam clues"), but in
>general, asynch could be used.
>
>I would appreciate some comments on this.  I am leaning towards the asynch
>model, but it is clearly more complicated.  However, if moving to a DB
>simply means we will have perf issues, just not at startup, then the
>complexity would be warranted.
>
>Any thoughts?  Fairy god-mothers? Magic answers?
>
>Thanks,
>
>Mark.
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From anthony@interlink.com.au  Tue Nov 26 02:22:50 2002
From: anthony@interlink.com.au (Anthony Baxter)
Date: Tue, 26 Nov 2002 13:22:50 +1100
Subject: [Spambayes] david mertz article on his trigram-based graham-ish
	software.
Message-ID: <200211260222.gAQ2Moo12197@localhost.localdomain>


http://www-106.ibm.com/developerworks/linux/library/l-spamf.html

short summary: his trigram based approach gets 
2 fp from 1851 ham, 142 fn from 1916 spam
while the non-trigram model gets
4 fp and 97 fn. 

This (to me) suggests his trigram approach is dragging
down both ham and spam scores. But if he is using the
Graham scoring/combining approach, who knows what results
he's getting ;-)

Anthony 

From tim.one@comcast.net  Tue Nov 26 02:52:25 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 25 Nov 2002 21:52:25 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEGGCPAB.tim.one@comcast.net>

[Mark Hammond]
>   I've been thinking about the "database" to use for the Outlook
> plugin.  I see two reasonable choices today: pickles and whatever
> anydbm picks up on Windows.

Then I think we're stuck with pickles for now.  On Windows, anydbm picks up
the ancient 1.85 bsddb we (PLabs) ship with the Windows installer, and
that's got nasty bugs no matter how you drive it:

    http://www.sleepycat.com/historic.html

SourceForge is littered with reports of "mysterious failures" of the bsddb
code on Windows; it just isn't reliable.

ZODB is, and that's what Jeremy is using, while the neil*.py code in the
project is Neil Schemenauer's implementation of a CDB-based approach.
Windows Python 2.3 will ship with a modern bsddb, but that's no help right
now.  (BTW, as long as you're sitting idle <wink>, follow the instructions
in CVS PCbuild\readme.txt for building the new bsddb code, and let me know
what you think about the 4 linker warnings we get -- I don't know whether to
be worried or not, and I don't know how to get rid of them either short of
giving up on the static-link version of the Berkeley code, + building &
linking distinct Release and Debug versions of the latter)

I'm not really worried about the scoring time with a DB -- "a real" DB has
its own caching schemes to speed frequently accessed items, our project
appears to have grown some form of dict-based spamprob cache of its own, and
scoring has always been a minor part of the total time burden anyway.

> ...
> To be honest,

I'm not sure that's allowed ... let me ask ... OK, you're cleared!

> my main motivation in even thinking about this is the terrible
> things we are doing to Outlook's startup time.  My decent machine
> is taking quite a few seconds longer to get outlook started - and
> this cost is worn every time *any* application uses Outlook for
> anything at all.  If we do any sort of training, we also pay this
> penalty shutting down, saving the pickle.  If we crash, we lose all
> recent training data.

Yup, those are all things a real DB avoids.

> So, I see two basic routes I can take:
>
> * Move to a DB, but stick with a fully synchronous model.  We
> still wear the DB load time at startup, but this should be reduced
> significantly.

Oh yes.

> We wear the performance costs at runtime associated with the scoring,

Not worried.

> and do all such scoring in the "foreground", and saving of the DB as
> necessary.

The classifier internals have been fiddled (by others -- and thanks!) so
that only words whose counts have changed need to be updated, and updating
100-or-so records after training is cheap.

> * Stick with pickles, but move to a threaded asynchronous model.
>   Messages can be "queued" for scoring/training.  At startup, we spin
>   a new thread to load the pickle.  Any "missed" messages at startup,
>   and all messages as they arrive are queued for scoring and filtering.
>   If the pickle is loaded, then it will generally appear synchronous,
>   otherwise new messages may sit in your inbox for a few seconds
>   before they are removed.  When the pickle is modified, a background
>   thread copies the data, and starts writing.  We do some smarts with
>   renaming the previous versions, as Tim1 implicated.  There
>   would be support for synchronous calls too (eg, "show spam clues"),
>   but in general, asynch could be used.

That should work fine, and I'll sign up for *anything* that gets you to use
the lovely Queue module for real <wink>.  Over the long haul I'm not sure it
will fly, because we still have no way to prune the database over time, and
indeed got rid of the WordInfo fields that were intended to make this
possible in an effective way.  So the dict keeps growing, and saving it away
keeps taking longer.  But spinning that off to a thread should hide the pain
for a long time, and we'll solve the pruning problem in the meantime <heh>.

> ...
> Any thoughts?  Fairy god-mothers? Magic answers?

Today it's pickled dicts, ZODB, or roll-our-own on Windows.  In 2.3, bsddb
becomes a real possibility.


From jeremy@alum.mit.edu  Tue Nov 26 04:13:29 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 25 Nov 2002 23:13:29 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
	<KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
Message-ID: <15842.62697.829412.348546@slothrop.zope.com>

>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:

  TS> So, to deal with the outlook startup times, I wonder if there
  TS> are any partitioning schemes we can implement.  Perhaps we could
  TS> split the pickled stuff into partitions, based on spamprob
  TS> (perhaps), alphabetically, nham, whatever.  We could load a
  TS> small subset by default, and then load the whole thing later at
  TS> a user's request, or ... I don't know, I'm just thinking out
  TS> loud.

I wonder if there are any databases for Python that store objects as
pickles, but allow individual objects to be loaded on demand.  If the
application runs for a while, perhaps it could cache the recently used
objects to make the application fast without leaving all the objects
in memory.

Put another way, I'd be interested to hear why you don't want to use
ZODB.
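
For a rough feel of "individual pickles loaded on demand, with a cache", here
is a sketch built on the standard library's shelve module (Python 3 assumed).
It is only an illustration of the idea: shelve sits on top of whatever dbm is
available and so inherits its reliability, and ZODB does a great deal more.
LazyWordStore and its methods are hypothetical names:

```python
import shelve
from functools import lru_cache


class LazyWordStore:
    """Hypothetical: per-word records unpickled on demand, recent ones cached."""

    def __init__(self, path, cache_size=1024):
        self._db = shelve.open(path)   # values are stored as individual pickles
        # Bounded in-memory cache of the most recently used records.
        self._cached_get = lru_cache(maxsize=cache_size)(self._load)

    def _load(self, word):
        return self._db.get(word)      # unpickles just this one record

    def get(self, word):
        return self._cached_get(word)

    def put(self, word, info):
        self._db[word] = info
        self._cached_get.cache_clear() # crude but safe invalidation

    def close(self):
        self._db.close()
```

Opening the store costs next to nothing, and memory use is bounded by
cache_size rather than by the size of the whole database.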

Jeremy


From tim@fourstonesExpressions.com  Tue Nov 26 04:17:37 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 22:17:37 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15842.62697.829412.348546@slothrop.zope.com>
Message-ID: <1V6VTQL084W1UXTOJ85URMHB461GF.3de2f5e1@riven>

ZODB sounds ideal to me, and I'm all for it, but Tim1 is kinda like "it 
doesn't ship till 2.3, so we're hosed till then..."

-TimS

11/25/2002 10:13:29 PM, Jeremy Hylton <jeremy@alum.mit.edu> wrote:

>>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:
>
>  TS> So, to deal with the outlook startup times, I wonder if there
>  TS> are any partitioning schemes we can implement.  Perhaps we could
>  TS> split the pickled stuff into partitions, based on spamprob
>  TS> (perhaps), alphabetically, nham, whatever.  We could load a
>  TS> small subset by default, and then load the whole thing later at
>  TS> a user's request, or ... I don't know, I'm just thinking out
>  TS> loud.
>
>I wonder if there are any databases for Python that store objects as
>pickles, but allow individual objects to be loaded on demand.  If the
>application runs for a while, perhaps it could cache the recently used
>objects to make the application fast without leaving all the objects
>in memory.
>
>Put another way, I'd be interested to hear why you don't want to use
>ZODB.
>
>Jeremy
>
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From tim.one@comcast.net  Tue Nov 26 04:37:29 2002
From: tim.one@comcast.net (Tim Peters)
Date: Mon, 25 Nov 2002 23:37:29 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <1V6VTQL084W1UXTOJ85URMHB461GF.3de2f5e1@riven>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>

[Tim Stone]
> ZODB sounds ideal to me, and I'm all for it, but Tim1 is kinda like "it
> doesn't ship till 2.3, so we're hosed till then..."

It's a modern bsddb that doesn't ship on Windows before 2.3.  ZODB is a
different beast entirely, and won't ship with the PLabs Windows distro even
in 2.3.  In several ways, ZODB is ideal for this app, and I had it in mind
when first writing this stuff.  It's a big pile of additional code, though.


From tim@fourstonesExpressions.com  Tue Nov 26 04:41:41 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 22:41:41 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
Message-ID: <C864TN07MLQL2WSOE91VZUFE9RO2WV.3de2fb85@riven>

11/25/2002 10:37:29 PM, Tim Peters <tim.one@comcast.net> wrote:

>[Tim Stone]
>> ZODB sounds ideal to me, and I'm all for it, but Tim1 is kinda like "it
>> doesn't ship till 2.3, so we're hosed till then..."
>
>It's a modern bsddb that doesn't ship on Windows before 2.3.  ZODB is a
>different beast entirely,

Ah, got that wrong... so what are the issues around using ZODB?  distribution 
issues?  licensing a problem?

> and won't ship with the PLabs Windows distro even
>in 2.3.  In several ways, ZODB is ideal for this app, and I had it in mind
>when first writing this stuff.  It's a big pile of additional code, though.


c'est moi - TimS
www.fourstonesExpressions.com 


From mhammond@skippinet.com.au  Tue Nov 26 04:51:39 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 26 Nov 2002 15:51:39 +1100
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEIBHOAA.mhammond@skippinet.com.au>

[TimP]

> It's a modern bsddb that doesn't ship on Windows before 2.3.  ZODB is a
> different beast entirely, and won't ship with the PLabs Windows
> distro even
> in 2.3.  In several ways, ZODB is ideal for this app, and I had it in mind
> when first writing this stuff.  It's a big pile of additional
> code, though.

Would that then make this project qualify as Zope work again? <wink>

Mark.


From jeremy@alum.mit.edu  Tue Nov 26 04:56:20 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Mon, 25 Nov 2002 23:56:20 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <C864TN07MLQL2WSOE91VZUFE9RO2WV.3de2fb85@riven>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<C864TN07MLQL2WSOE91VZUFE9RO2WV.3de2fb85@riven>
Message-ID: <15842.65268.491388.44806@slothrop.zope.com>

>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:

  TS> 11/25/2002 10:37:29 PM, Tim Peters <tim.one@comcast.net> wrote:
  >> [Tim Stone]
  >>> ZODB sounds ideal to me, and I'm all for it, but Tim1 is kinda
  >>> like "it doesn't ship till 2.3, so we're hosed till then..."
  >>
  >> It's a modern bsddb that doesn't ship on Windows before 2.3.
  >> ZODB is a different beast entirely,

  TS> Ah, got that wrong... so what are the issues around using ZODB?
  TS> distribution issues?  licensing a problem?

It's a big distutils package that users have to download and install.
License is open source -- Zope Public License.

http://www.zope.org/Products/StandaloneZODB

The Windows binary installer is 750KB.  The source release is 435KB.

Jeremy


From tim@fourstonesExpressions.com  Tue Nov 26 04:59:04 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 22:59:04 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15842.65268.491388.44806@slothrop.zope.com>
Message-ID: <2WVC9JEZX61PIFYXKH93XWTOMURC.3de2ff98@riven>

11/25/2002 10:56:20 PM, Jeremy Hylton <jeremy@alum.mit.edu> wrote:

>>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:
>
>  TS> 11/25/2002 10:37:29 PM, Tim Peters <tim.one@comcast.net> wrote:
>  >> [Tim Stone]
>  >>> ZODB sounds ideal to me, and I'm all for it, but Tim1 is kinda
>  >>> like "it doesn't ship till 2.3, so we're hosed till then..."
>  >>
>  >> It's a modern bsddb that doesn't ship on Windows before 2.3.
>  >> ZODB is a different beast entirely,
>
>  TS> Ah, got that wrong... so what are the issues around using ZODB?
>  TS> distribution issues?  licensing a problem?
>
>It's a big distutils package that users have to download and install.
>License is open source -- Zope Public License.
>
>http://www.zope.org/Products/StandaloneZODB
>
>The Windows binary installer is 750KB.  The source release is 435KB.

Couldn't we just package it with spambayes?  Seems reasonable to me...

- TimS

>
>Jeremy
>
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From jeremy@alum.mit.edu  Tue Nov 26 05:11:01 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 00:11:01 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <2WVC9JEZX61PIFYXKH93XWTOMURC.3de2ff98@riven>
References: <15842.65268.491388.44806@slothrop.zope.com>
	<2WVC9JEZX61PIFYXKH93XWTOMURC.3de2ff98@riven>
Message-ID: <15843.613.206451.90831@slothrop.zope.com>

>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:

  >> http://www.zope.org/Products/StandaloneZODB
  >>
  >> The Windows binary installer is 750KB.  The source release is
  >> 435KB.

  TS> Couldn't we just package it with spambayes?  Seems reasonable to
  TS> me...

There's certainly no license problem with that.  I don't know what size
people would consider reasonable for a download.

There are also some operational problems that need to be solved for
ZODB.  On my machine, I'm running a ZODB server listening on a Unix
domain socket that all the pspam tools connect to.  I have to get the
server and the POP proxy started when my machine boots.  I also have
to do some database management.  The database grows by appending to the
end; occasionally you need to pack (delete) old copies of the data.

Jeremy


From neale@woozle.org  Tue Nov 26 05:25:54 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 21:25:54 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
Message-ID: <w538yzgc36l.fsf@woozle.org>

So then, Tim Peters <tim.one@comcast.net> is all like:

> [Tim Stone]
> > ZODB sounds ideal to me

> It's a big pile of additional code, though.

That's the issue I have with it.  It *is* ideal for this project.  But
it's a good deal of additional code to install.  For instance, I
couldn't package spambayes for Debian at all, because Debian's ZODB is
for Python 2.1, but spambayes uses generators from Python >= 2.2.

If ZODB were part of Python, that'd be something else.  But I don't
think that's a good idea either.  You can only include so many batteries
in a thing before the batteries outweigh the thing itself.  And anyway,
I hardly expect spambayes to have any weight on what to do with the core
Python distribution.

Neale

From jeremy@alum.mit.edu  Tue Nov 26 05:28:49 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 00:28:49 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w538yzgc36l.fsf@woozle.org>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<w538yzgc36l.fsf@woozle.org>
Message-ID: <15843.1681.44172.263500@slothrop.zope.com>

>>>>> "NP" == Neale Pickett <neale@woozle.org> writes:

  NP> That's the issue I have with it.  It *is* ideal for this
  NP> project.  But it's a good deal of additional code to install.
  NP> For instance, I couldn't package spambayes for Debian at all,
  NP> because Debian's ZODB is for Python 2.1, but spambayes uses
  NP> generators from Python >= 2.2.

So Debian can't manage to compile the code for 2.1 and 2.2?  I'm
honestly puzzled why this is an issue.

Can't Debian users install software some other way?  They're going to
have to download and install spambayes.

Jeremy


From tim@fourstonesExpressions.com  Tue Nov 26 05:39:04 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Mon, 25 Nov 2002 23:39:04 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.2098.301270.712174@slothrop.zope.com>
Message-ID: <NI2XMGFEIE07H4Y7XRNMMIE05ZJG2.3de308f8@riven>

11/25/2002 11:35:46 PM, jeremy@alum.mit.edu (Jeremy Hylton) wrote:

>>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:
>
>  TS> Well, I think I'm gonna fool with it, at least from a research
>  TS> perspective.  I think we can overcome the objections raised thus
>  TS> far, and we really do need robustness in the database.  I'm
>  TS> forever having my database corrupted when the pop3proxy does
>  TS> something unexpected, and this is a VERY controlled environment.
>  TS> God knows what the masses will do to this thing...  Double
>  TS> clicking browser buttons, killing processes, etc. etc.  We just
>  TS> gotta not have databases head south.
>
>(I didn't mean to take this conversation off the list.  Did you?

Nope... sorry

>It
>would be helpful for people who want to roll their own DB to think
>about database corruption.)
>
>  TS> The only real impediment I see thus far is that
>  TS> classifier.Classifier implements __getstate__ and __setstate__.
>  TS> I think this is wrong anyhow...
>
>  TS> Anything else that you see?
>
>I've been using the checked in pspam code for weeks now.  So I don't
>think there's any problem with the existing code.  __getstate__ and
>__setstate__ are supported by ZODB, and I've been using the ones
>provided by the default classifier.
>
>Jeremy
>
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From neale@woozle.org  Tue Nov 26 06:28:13 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 22:28:13 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
Message-ID: <w533cpoc0aq.fsf@woozle.org>

Mark, I'm going to have to go with everyone else, and I'm the guy who
wrote the DBM back-end.  Until a reasonable dbm implementation is
available in a default Windows install, it's pretty much a no-brainer
due to potential database corruption.

However, I'm still going to address a few of your points :)

So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:

> * Move to a DB, but stick with a fully synchronous model.  We still
>   wear the DB load time at startup, but this should be reduced
>   significantly.  We wear the performance costs at runtime associated
>   with the scoring, and do all such scoring in the "foreground", and
>   saving of the DB as necessary.

The startup time for loading a DB is virtually nonexistent.  I can't
say for sure what the DBM back-ends do, but I imagine it's something
like "open file, check a magic number, do a sanity check or two, build a
few structures in memory, return".  The only thing a DBDict does when
you start it is read in the MetaInfo class, which is stored as a
2-tuple.  So I don't think this is going to be very slow at all.

It's very easy to implement a hybrid dict/pickle method, which caches
DBM writes and only writes them out when you call the store() method.
I've been meaning to implement the write cache for a while now, because
training a dbdict on a large corpus is so abysmally slow right now, and
I have to do that a lot.
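
A sketch of what such a write cache might look like, assuming Python 3 and
using dbm.dumb only so the example runs anywhere.  WriteCachedDB is a
hypothetical name and this is not the actual DBDict code:

```python
import dbm.dumb
import pickle


class WriteCachedDB:
    """Hypothetical: buffer writes in a dict, flush to the DBM on store()."""

    def __init__(self, path):
        self._db = dbm.dumb.open(path, "c")
        self._dirty = {}               # word -> value not yet written to disk

    def __getitem__(self, word):
        if word in self._dirty:        # pending writes win over the disk copy
            return self._dirty[word]
        return pickle.loads(self._db[word.encode("utf-8")])

    def __setitem__(self, word, value):
        self._dirty[word] = value      # cheap: no disk access while training

    def store(self):
        # One pass over only the records that actually changed.
        for word, value in self._dirty.items():
            self._db[word.encode("utf-8")] = pickle.dumps(value)
        self._db.sync()
        self._dirty.clear()

    def close(self):
        # Anything still in the cache is dropped: closing without store()
        # loses that training.
        self._db.close()
```

Training a large corpus then touches the disk once per changed word at
store() time, instead of once per word per message.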

For small training batches though (1 or 2 messages), I don't think
you'll notice much difference.

> I would appreciate some comments on this.  I am leaning towards the
> asynch model, but it is clearly more complicated.  However, if moving
> to a DB simply means we will have perf issues, just not at startup,
> then the complexity would be warranted.

The DBM method is currently about 10 times slower than the pickle for
training, but it's a lot faster when you look at the whole picture, at
least if you are constantly opening and closing your persistent store.
I trained a new database with 50 messages using both methods:

pickle:
    min: 0.00468504428864
    max: 0.0303419828415
    avg: 0.00997757434845
    tot: 1.799s

dbm:
    min: 0.0343930721283
    max: 0.35492503643
    avg: 0.102057716846
    tot: 5.976s


Here's that same run with an existing database trained on a full 1088
messages.  You can see that the dbm method scales much better with a
large dataset:

pickle:
    min: 0.0046820640564
    max: 0.128466010094
    avg: 0.0142578792572
    tot: 8.874s

dbm:
    min: 0.0369809865952
    max: 0.546903014183
    avg: 0.11867954731
    tot: 6.749s


This is why the procmail crowd prefers the dbm though, here's a train on
*one* message:

pickle:
    min: 0.011234998703
    max: 0.011234998703
    avg: 0.011234998703
    tot: 7.908s

dbm:
    min: 0.0912280082703
    max: 0.0912280082703
    avg: 0.0912280082703
    tot: 0.574s


But in *getting* it trained, the pickle smoked the DBM:

pickle:
    min: 0.00426197052002
    max: 0.302268981934
    avg: 0.0123967074734
    tot: 26.991s

dbm:
    min: 0.0284680128098
    max: 1.34902703762
    avg: 0.139986485572
    tot: 2m46.591s
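
Figures of this shape could be collected with a small harness like the one
below.  This is a guess at the method, not the script that produced the
numbers above: it assumes min/max/avg are per-message training times and that
tot is wall-clock time for the whole run, including opening and closing the
store.  All names are hypothetical:

```python
import time


def time_run(open_store, train_one, close_store, messages):
    """Return per-message min/max/avg and the total wall-clock time."""
    start = time.perf_counter()
    store = open_store()               # pickle load or DBM open counts in tot
    per_msg = []
    for msg in messages:
        t0 = time.perf_counter()
        train_one(store, msg)
        per_msg.append(time.perf_counter() - t0)
    close_store(store)                 # pickle dump or DBM close counts in tot
    total = time.perf_counter() - start
    return {
        "min": min(per_msg),
        "max": max(per_msg),
        "avg": sum(per_msg) / len(per_msg),
        "tot": total,
    }
```

Under that reading, the gap between tot and (number of messages * avg) is the
cost of loading and saving the store, which is what dominates the one-message
pickle case.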

This performance loss can be mitigated pretty well by caching DBM
writes.  It would also fix the "problem" Tim S has with closing the DBM
before writing out the MetaData.  To me, that's the same as crashing,
but in any case, it'll fix it.

So if the DBM support on Windows were any good, I wouldn't know which
one you should use for the Outlook stuff.  But I suspect that a DBM with
write-caching could pound the vinegar-flavored snot out of a pickle.  :)

Things being what they are, though, it sounds like you should stay away
from DBM until Python 2.3.

Neale

From neale@woozle.org  Tue Nov 26 06:29:33 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 22:29:33 -0800
Subject: [Spambayes] Current version
In-Reply-To: <n2m-g.of8d2s3i.fsf@morpheus.demon.co.uk>
References: <16E1010E4581B049ABC51D4975CEDB88619957@UKDCX001.uk.int.atosorigin.com>
	<n2m-g.of8d2s3i.fsf@morpheus.demon.co.uk>
Message-ID: <w53y97galo2.fsf@woozle.org>

So then, Paul Moore <lists@morpheus.demon.co.uk> is all like:

> Patch to follow. Is a posting to the list OK, or should I upload it to
> SF? (No CVS commit ability, and I wouldn't know what to do if I had it
> :-))

Yep, patches on the list is fine.  Post away!

Neale

From neale@woozle.org  Tue Nov 26 06:35:44 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 22:35:44 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.1681.44172.263500@slothrop.zope.com>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<w538yzgc36l.fsf@woozle.org>
	<15843.1681.44172.263500@slothrop.zope.com>
Message-ID: <w53u1i4aldr.fsf@woozle.org>

So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:

> So Debian can't manage to compile the code for 2.1 and 2.2?  I'm
> honestly puzzled why this is an issue.

> Can't Debian users install software some other way?  They're going to
> have to download and install spambayes.

I was just trying to illustrate that for Joe Random User, installing
Python is easy, and installing ZODB may be easy, but there's a good
chance that installing ZODB with Python2.2 will involve a compile.  And
I'm guessing that Tim1's sisters wouldn't be comfortable doing that.  I
know my sister wouldn't.

I also suspect that the average Windows or Mac user may tolerate having
to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
that said user will probably not tolerate installing Python and ZODB.

I could be way, way off on this, though.  I don't even know the mind of
the average Windows user well enough to understand why they want to run
Windows in the first place <wink>

Neale

From neale@woozle.org  Tue Nov 26 06:39:16 2002
From: neale@woozle.org (Neale Pickett)
Date: 25 Nov 2002 22:39:16 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <NI2XMGFEIE07H4Y7XRNMMIE05ZJG2.3de308f8@riven>
References: <NI2XMGFEIE07H4Y7XRNMMIE05ZJG2.3de308f8@riven>
Message-ID: <w53ptssal7v.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> 11/25/2002 11:35:46 PM, jeremy@alum.mit.edu (Jeremy Hylton) wrote:
> 
> >>>>>> "TS" == Tim Stone <- Four Stones Expressions <tim@fourstonesExpressions.com>> writes:
> >
> >  TS> Well, I think I'm gonna fool with it, at least from a research
> >  TS> perspective.  I think we can overcome the objections raised thus
> >  TS> far, and we really do need robustness in the database.  I'm
> >  TS> forever having my database corrupted when the pop3proxy does
> >  TS> something unexpected, and this is a VERY controlled environment.
> >  TS> God knows what the masses will do to this thing...  Double
> >  TS> clicking browser buttons, killing processes, etc. etc.  We just
> >  TS> gotta not have databases head south.

AFAICR, the only south-heading any databases have done, aside from known
bugs with the Windows dbm module, has been when you train a DBM and then
don't run store() for some reason.

Perhaps it is time to take another look at DBDictClassifier.__del__
calling store().

From mhammond@skippinet.com.au  Tue Nov 26 07:30:18 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Tue, 26 Nov 2002 18:30:18 +1100
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w53u1i4aldr.fsf@woozle.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>

[Neale]
> I also suspect that the average Windows or Mac user may tolerate having
> to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
> that said user will probably not tolerate installing Python and ZODB.

Actually, I doubt this will fly.  We will need to create a stand-alone DLL,
and a "one click" installer.  This should be interesting <wink>.  Good
excuse to look at how these Python distribution tools have progressed
recently.

So, given that, ZODB is sounding attractive.  I would package it up, so a
few hundred extra K is probably no big deal.  The code could fall back to
the slow-loading pickles, so people running from source still work, just
possibly not as well.

> I could be way, way off on this, though.  I don't even know the mind of
> the average Windows user well enough to understand why they want to run
> Windows in the first place <wink>

Well, you don't have to look too far.  I tend to find they are people who
don't take operating systems quite so seriously <wink>

Mark.


From Paul.Moore@atosorigin.com  Tue Nov 26 09:25:35 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 26 Nov 2002 09:25:35 -0000
Subject: [Spambayes] Important information for Outlook users
Message-ID: <16E1010E4581B049ABC51D4975CEDB88619958@UKDCX001.uk.int.atosorigin.com>

From: Moore, Paul
> As I said, no time to do much more now. I'll do a full retest of all
> this tomorrow.

Right, I did a complete rebuild of everything - effectively, clearing
out all traces of the addin as best I could, and then installing from
scratch like a new user. (I trained on existing spam and ham, rather
than starting from an empty database).

Here's my log of what I did...

---- Start here ----

Unregister addin
Delete fields twice, to ensure 0 fields deleted second time
Delete directory
Copy in new version
Register addin
Start Outlook
Move 51 new mails to Unsure folder (temporarily)
Train on Inbox and Spam folders
385 ham, 999 spam
Moved new mails back
Define filters, but don't enable yet
Outlook has Spam columns in Inbox, Spam, and Unsure
   - first in Spam shows "0%", first in Inbox shows "0" (?!)
   - the one in the inbox came in while I was training...
Filter now, all messages, Inbox and Spam, only score.
Inbox showing as numbers. Spam as %.
Filter unread (51 msgs) performing all actions.
42 spam, 9 good, 1 unsure. Look reasonable.
Unsure folder is showing Spam field as a number, not a %.
Delete unsure message as spam. Field shows as a % in spam folder.
Enable filtering, and wait and see how it goes...

Trace log:

Collecting Python Trace Output...
Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Created new configuration file 'C:\Applications\Spambayes\Outlook2000\default_configuration.pck'
Either bayes database or message database is missing - creating new
Bayes database initialized with 0 spam and 0 good messages
Checked 386 in folder Inbox - 385 new entries found.
Checked 999 in folder Spam - 999 new entries found.
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Training on message -  trained as spam
Training on message '' -  already was trained as spam

--------------------

The only real issue specific to me (Exchange) here that I can see
is that mail arrived *during* the training process. I can't stop
that happening (working offline breaks all sorts of things, and
Exchange delivers mail as it feels - there is no "get mail on
request only" facility).

Otherwise, I see no reason why the Inbox and Unsure folders should
show the Spam field as numbers rather than as percentages.

I've left it for now - I can, I assume, delete and recreate the field
as a percentage in the Field Chooser, but I don't want to change
anything that might provide evidence :-) Filtering seems to be OK,
working on the "real" values rather than the formatted ones (if you
see what I mean - the *100 factor isn't mucking things up).

One good thing, this time I got a much better result than yesterday.
I think that deregistering the addin before deleting the fields was
the key here, based on what Mark said about the field being created
when the addin starts up...

Hope this helps,
Paul.

From mwh@python.net  Tue Nov 26 10:18:34 2002
From: mwh@python.net (Michael Hudson)
Date: 26 Nov 2002 10:18:34 +0000
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<w538yzgc36l.fsf@woozle.org> <15843.1681.44172.263500@slothrop.zope.com>
	<w53u1i4aldr.fsf@woozle.org>
Message-ID: <2mn0nwiqh1.fsf@starship.python.net>

Neale Pickett <neale@woozle.org> writes:

> I was just trying to illustrate that for Joe Random User, installing
> Python is easy, and installing ZODB may be easy, but there's a good
> chance that installing ZODB with Python2.2 will involve a compile.  And
> I'm guessing that Tim1's sisters wouldn't be comfortable doing that.  I
> know my sister wouldn't.

Well, all sisters use Windows or Macs, so "we" can do the compile, surely?
(well, so long as the latter are using 10.2.2...).

If it's possible to leverage Jack's work on utilizing the installed
Python for Jaguar, that would get a few megs off the download, too.

Cheers,
M.

-- 
  surely, somewhere, somehow, in the history of computing, at least
  one manual has been written that you could at least remotely
  attempt to consider possibly glancing at.              -- Adam Rixey


From Paul.Moore@atosorigin.com  Tue Nov 26 10:22:40 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 26 Nov 2002 10:22:40 -0000
Subject: [Spambayes] Current version
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E21@UKDCX001.uk.int.atosorigin.com>

From: Neale Pickett [mailto:neale@woozle.org]

> So then, Paul Moore <lists@morpheus.demon.co.uk> is all like:
> 
> > Patch to follow. Is a posting to the list OK, or should I upload it to
> > SF? (No CVS commit ability, and I wouldn't know what to do if I had it
> > :-))
> 
> Yep, patches on the list is fine.  Post away!

OK. Here's a patch. I've taken a simple approach to errors - if the user
puts garbage into the option, the ImportError or AttributeError which gets
generated is just passed back up the call stack. Maybe longer term we
should look at error handling, but I can't find an example of recommended
practice to steal :-)

I tested it via the doctests and a bit of diagnostic printing. I'm taking
the view that if I can create a DBDict successfully, if the code that uses
it doesn't work, I never touched the interface :-) [I'll do some better
testing tonight, when I'm back to my POP3 system, rather than Exchange...]

Paul.

PS Actually, there's another issue with this - it opens up a massive security
   hole, as the user can arrange for an arbitrary module to be imported - and
   an arbitrary function within that module to be run - by changing the INI
   file. This is not good. But I can't think how we could avoid this without
   removing the ability to choose a DBM implementation...
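
   One possible middle ground, sketched under assumptions: keep the choice
   of DBM implementation, but map the INI value through a fixed allow-list
   instead of importing whatever name the file contains. ALLOWED_DBM and
   resolve_dbm_opener are hypothetical names, and the module names are the
   present-day standard-library ones rather than those in the patch.

```python
import importlib

# Hypothetical allow-list: option value -> (module name, function name).
# Anything not listed here is refused before any import happens.
ALLOWED_DBM = {
    "dumb": ("dbm.dumb", "open"),
    "gnu": ("dbm.gnu", "open"),
    "ndbm": ("dbm.ndbm", "open"),
}


def resolve_dbm_opener(option_value):
    """Map a config value to an opener function, rejecting unknown values."""
    try:
        module_name, func_name = ALLOWED_DBM[option_value]
    except KeyError:
        raise ValueError("unsupported DBM implementation: %r" % (option_value,))
    module = importlib.import_module(module_name)  # may raise ImportError
    return getattr(module, func_name)
```

   The cost is that adding a new backend means editing the table, which is
   the point: the INI file can select code but cannot name it.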

PPS The doctests in dbdict.py need fixing - I'll update them in a separate
    patch once this one has been sorted out. Also, the iterskip argument is
    no longer needed/used. Should it go?

-------------- next part --------------
A non-text attachment was scrubbed...
Name: dbdict.patch
Type: application/octet-stream
Size: 2586 bytes
Desc: dbdict.patch
Url : http://mail.python.org/pipermail/spambayes/attachments/20021126/454abef4/dbdict.exe
From Paul.Moore@atosorigin.com  Tue Nov 26 10:37:52 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Tue, 26 Nov 2002 10:37:52 -0000
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861995A@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> I've been thinking about the "database" to use for the Outlook plugin.  I
> see two reasonable choices today: pickles and whatever anydbm picks up on
> Windows.

I agree this needs some consideration. I've also noticed the Outlook startup
and shutdown slowness. If we ever move to a "train assuming the classifier
got it right" model, we get lots more training, and therefore even less
chance we can avoid the shutdown slowdown.

One potential problem with anydbm is that it picks up Berkeley DB on Windows,
and the version shipped with Python 2.2 is very old (1.85, I believe) and
has known bugs. I got some peculiar behaviour (hanging) with pop3proxy using
anydbm. There's no reason to believe that problems will be common, but any
problems that *do* occur will be awfully hard for the user to locate, let
alone diagnose.

> So, I see two basic routes I can take:
> * Move to a DB, but stick with a fully synchronous model.
> * Stick with pickles, but move to a threaded asynchronous model.

Pickles and async sounds a *lot* harder, but I suspect it's a more robust
option, at least until we get a newer DB shipped with Python.

We could use another persistence mechanism (bsddb3, or even ZODB) but that
would have to be shipped with the code, which starts to raise nasty
packaging issues...

Paul.

From seant@iname.com  Tue Nov 26 13:44:08 2002
From: seant@iname.com (Sean True)
Date: Tue, 26 Nov 2002 08:44:08 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
Message-ID: <MJEHLHJKGINLONDMMKNEMEPNIFAA.seant@iname.com>

> [Neale]
> > I also suspect that the average Windows or Mac user may tolerate having
> > to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
> > that said user will probably not tolerate installing Python and ZODB.
> 
> Actually, I doubt this will fly.  We will need to create a 
> stand-alone DLL,
> and a "one click" installer.  This should be interesting <wink>.  Good
> excuse to look at how these Python distribution tools have progressed
> recently.

I've done some limited work in this space already. Nothing usable but --
I'd recommend anyone pursuing this to use the McMillan Installer
(http://www.mcmillan-inc.com/install1.html). I have used py2exe for other 
projects, but COM support in py2exe is not as robust.

Not that I got Installer to actually build the Outlook addin, either,
but I suspect that it would be a morning's work for someone with the
inclination to tackle it.

-- Sean


From papaDoc@videotron.ca  Tue Nov 26 13:55:16 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Tue, 26 Nov 2002 08:55:16 -0500
Subject: [Spambayes] Packaging the installation
Message-ID: <3DE37D44.9020201@videotron.ca>

Hi,

>[Neale]
>  
>
>>I also suspect that the average Windows or Mac user may tolerate having
>>to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
>>that said user will probably not tolerate installing Python and ZODB.
>>    
>>
>Actually, I doubt this will fly.  We will need to create a 
>stand-alone DLL,
>and a "one click" installer.  This should be interesting <wink>.  Good
>excuse to look at how these Python distribution tools have progressed
>recently.
>  
>

> I've done some limited work in this space already. Nothing usable but --
> I'd recommend anyone pursuing this to use the McMillan Installer
> (http://www.mcmillan-inc.com/install1.html). I have used py2exe for other 
> projects, but COM support in py2exe is not as robust.

> Not that I got Installer to actually build the Outlook addin, either,
> but I suspect that it would be a morning's work for someone with the
> inclination to tackle it.


We use InstallShield at work to make the packaging for our software.

I might be able to squeeze in some time and do it for Spambayes for Windows.

But I won't be able to test it since I'm not using Outlook.

papaDoc


From tim@fourstonesExpressions.com  Tue Nov 26 13:59:39 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 07:59:39 -0600
Subject: [Spambayes] Packaging the installation
In-Reply-To: <3DE37D44.9020201@videotron.ca>
Message-ID: <ICSQSR64T2XMHGAHBHD85A85484XSWR.3de37e4b@riven>

11/26/2002 7:55:16 AM, papaDoc <papaDoc@videotron.ca> wrote:

>Hi,
>
>>[Neale]
>>  
>>
>>>I also suspect that the average Windows or Mac user may tolerate having
>>>to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
>>>that said user will probably not tolerate installing Python and ZODB.
>>>    
>>>
>>Actually, I doubt this will fly.  We will need to create a 
>>stand-alone DLL,
>>and a "one click" installer.  This should be interesting <wink>.  Good
>>excuse to look at how these Python distribution tools have progressed
>>recently.
>>  
>>
>
>> I've done some limited work in this space already. Nothing usable but --
>> I'd recommend anyone pursuing this to use the McMillan Installer
>> (http://www.mcmillan-inc.com/install1.html). I have used py2exe for other 
>> projects, but COM support in py2exe is not as robust.
>
>> Not that I got Installer to actually build the Outlook addin, either,
>> but I suspect that it would be a morning's work for someone with the
>> inclination to tackle it.
>
>
>We use InstallShield at work to make the packaging for our software.
>
>I might be able to squeeze in some time and do it for Spambayes for Windows.
>
>But I won't be able to test it since I'm not using Outlook.
>
>papaDoc

I already gave that one a spin, checked it in yesterday and got booed off the
planet... lol... seems that SourceForge has its own distribution mechanism.
So I made an attempt at personal redemption and removed the installable thingy
from CVS.

- TimS
>
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From tim@fourstonesExpressions.com  Tue Nov 26 14:03:34 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 08:03:34 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w533cpoc0aq.fsf@woozle.org>
Message-ID: <05631VIG1V51UTC4YEBLFMI4X2X07DB.3de37f36@riven>

Nice treatment of the issue, dude.  So how are you going to do the write 
caching thing?  I imagine you're not going to use a working copy model, like I 
had going... ;)
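
For what it's worth, a write cache of the kind being asked about might look
roughly like the sketch below: writes and deletions collect in an in-memory
dict and reach the DBM only when store() is called. WriteCachingDict and its
attribute names are hypothetical; this is not Neale's implementation.

```python
class WriteCachingDict:
    """Hypothetical write cache in front of a dict-like DBM object."""

    _DELETED = object()  # marker for keys deleted since the last store()

    def __init__(self, backing):
        self.backing = backing   # any mapping, e.g. an open dbm database
        self.pending = {}        # unsaved writes and deletions

    def __getitem__(self, key):
        # Pending changes win over whatever is in the backing store.
        if key in self.pending:
            value = self.pending[key]
            if value is self._DELETED:
                raise KeyError(key)
            return value
        return self.backing[key]

    def __setitem__(self, key, value):
        self.pending[key] = value

    def __delitem__(self, key):
        self[key]  # raises KeyError if the key is absent
        self.pending[key] = self._DELETED

    def store(self):
        """Flush all pending changes to the backing store in one pass."""
        for key, value in self.pending.items():
            if value is self._DELETED:
                if key in self.backing:
                    del self.backing[key]
            else:
                self.backing[key] = value
        self.pending.clear()
```

Reads see pending changes first, so training behaves the same before and
after a flush. Anything still in the cache when the process dies is lost,
which is the same exposure as an unsaved pickle.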

The problem Richie had with __del__ is that there's no guarantee that it will 
actually be called.

- TimS

11/26/2002 12:28:13 AM, Neale Pickett <neale@woozle.org> wrote:

>Mark, I'm going to have to go with everyone else, and I'm the guy who
>wrote the DBM back-end.  Until a reasonable dbm implementation is
>available in a default Windows install, it's pretty much a no-brainer
>due to potential database corruption.
>
>However, I'm still going to address a few of your points :)
>
>So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:
>
>> * Move to a DB, but stick with a fully synchronous model.  We still
>>   wear the DB load time at startup, but this should be reduced
>>   significantly.  We wear the performance costs at runtime associated
>>   with the scoring, and do all such scoring in the "foreground", and
>>   saving of the DB as necessary.
>
>The startup time for loading a DB is virtually non-existent.  I can't
>say for sure what the DBM back-ends do, but I imagine it's something
>like "open file, check a magic number, do a sanity check or two, build a
>few structures in memory, return".  The only thing a DBDict does when
>you start it is read in the MetaInfo class, which is stored as a
>2-tuple.  So I don't think this is going to be very slow at all.
>
>It's very easy to implement a hybrid dict/pickle method, which caches
>DBM writes and only writes them out when you call the store() method.
>I've been meaning to implement the write cache for a while now, because
>training a dbdict on a large corpus is so abysmally slow right now, and
>I have to do that a lot.
>
>For small training batches though (1 or 2 messages), I don't think
>you'll notice much difference.
>
>> I would appreciate some comments on this.  I am leaning towards the
>> asynch model, but it is clearly more complicated.  However, if moving
>> to a DB simply means we will have perf issues, just not at startup,
>> then the complexity would be warranted.
>
>The DBM method is currently about 10 times slower than the pickle for
>training, but it's a lot faster when you look at the whole picture, at
>least if you are constantly opening and closing your persistent store.
>I trained a new database with 50 messages using both methods:
>
>pickle:
>    min: 0.00468504428864
>    max: 0.0303419828415
>    avg: 0.00997757434845
>    tot: 1.799s
>
>dbm:
>    min: 0.0343930721283
>    max: 0.35492503643
>    avg: 0.102057716846
>    tot: 5.976s
>
>
>Here's that same run with an existing database trained on a full 1088
>messages.  You can see that the dbm method scales much better with a
>large dataset:
>
>pickle:
>    min: 0.0046820640564
>    max: 0.128466010094
>    avg: 0.0142578792572
>    tot: 8.874s
>
>dbm:
>    min: 0.0369809865952
>    max: 0.546903014183
>    avg: 0.11867954731
>    tot: 6.749s
>
>
>This is why the procmail crowd prefers the dbm though, here's a train on
>*one* message:
>
>pickle:
>    min: 0.011234998703
>    max: 0.011234998703
>    avg: 0.011234998703
>    tot: 7.908s
>
>dbm:
>    min: 0.0912280082703
>    max: 0.0912280082703
>    avg: 0.0912280082703
>    tot: 0.574s
>
>
>But in *getting* it trained, the pickle smoked the DBM:
>
>pickle:
>    min: 0.00426197052002
>    max: 0.302268981934
>    avg: 0.0123967074734
>    tot: 26.991s
>
>dbm:
>    min: 0.0284680128098
>    max: 1.34902703762
>    avg: 0.139986485572
>    tot: 2m46.591s
>
>This performance loss can be mitigated pretty well by caching DBM
>writes.  It would also fix the "problem" Tim S has with closing the DBM
>before writing out the MetaData.  To me, that's the same as crashing,
>but in any case, it'll fix it.
>
>So if the DBM support on Windows were any good, I wouldn't know which
>one you should use for the Outlook stuff.  But I suspect that a DBM with
>write-caching could pound the vinegar-flavored snot out of a pickle.  :)
>
>Things being what they are, though, it sounds like you should stay away
>from DBM until Python 2.3.
>
>Neale


c'est moi - TimS
www.fourstonesExpressions.com 


From skip@pobox.com  Tue Nov 26 15:36:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 09:36:47 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
        <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
Message-ID: <15843.38159.506024.37816@montanaro.dyndns.org>

>>>>> "Tim" == Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

    Tim> Ok, I'm glad you've put this out here. IMO, DBM is too unreliable
    Tim> to be anything but a test database.  In real life, bad stuff
    Tim> happens... the database has to be resilient, or at least
    Tim> recoverable.  DBM doesn't seem to be either, really.  (are the perl
    Tim> dbm implementations better?)

As I understand it, recent versions of Berkeley DB can be used as the
underpinnings of MySQL.  They do all the stuff MySQL needs, so I doubt the
actual implementation is a problem.  You probably have to take care at the
application level to make sure bits get all the way out to the disk.
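
As an illustration of that application-level care, here is a minimal sketch
for the pickle case (not for Berkeley DB itself): write to a temporary file,
flush and fsync it, then rename it over the old file. durable_save and the
".tmp" suffix are hypothetical choices, and os.replace is a present-day
Python call.

```python
import os
import pickle


def durable_save(obj, path):
    """Pickle obj to path so a crash leaves either the old or the new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(obj, f)
        f.flush()               # push Python's buffer to the OS
        os.fsync(f.fileno())    # ask the OS to push its cache to the disk
    os.replace(tmp_path, path)  # rename over the old file in one step
```

A reader then never sees a half-written database: it sees the previous file
until the rename, and the complete new one after it.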

Skip

From skip@pobox.com  Tue Nov 26 15:57:25 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 09:57:25 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15842.62697.829412.348546@slothrop.zope.com>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
        <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
        <15842.62697.829412.348546@slothrop.zope.com>
Message-ID: <15843.39397.770235.412408@montanaro.dyndns.org>


    Jeremy> Put another way, I'd be interested to hear why you don't want to
    Jeremy> use ZODB.

Disclaimer: I'm not saying I don't want to use ZODB.  I'm offering some
reasons why it might not be everyone's obvious choice.

For most of us who have *any* experience with ZODB it's probably all
indirect via Zope, so there are probably some inaccurate perceptions about
it.  These thoughts have come to my mind at one time or another:

    * How could a database from a company (Zope) whose sole business is not
      databases be more reliable than a database from organizations whose
      sole raison d'etre is databases (Sleepycat, Postgres, MySQL, ...)?

    * Dealing with Zope's monolithic system is frustrating to people (like
      me) who are used to having files reside in filesystems.  Some of that
      frustration probably carries over to ZODB, though it's almost
      certainly not ZODB's problem.

    * It seems to grow without bound, else why do I need to pack my Data.fs
      file every now and then?

It doesn't really matter if the perceptions are accurate or not.  They still
need to be addressed to some extent before people are going to be
comfortable with it.  ZODB is, for better or for worse, tied to Zope the
application.  Accordingly, perceived problems with Zope will rub off on
ZODB.

Also, there is the issue of availability.  While it's just an extra install,
its use requires more than the usual Python install.  Having it in the core
distribution would provide stronger assurances that it runs wherever Python
runs (e.g., does it run on MacOS 8 or 9, both of which I believe Jack still
supports with his Mac installer?).

Skip

From richie@entrian.com  Tue Nov 26 15:57:44 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 26 Nov 2002 15:57:44 +0000
Subject: [Spambayes] New web training interface for pop3proxy
In-Reply-To: <LNBBLJKPBEHFEDALKOLCEEBBCPAB.tim.one@comcast.net>
References: <r900uu8monvsidlbcfgg21371r6phknb92@4ax.com>
	<LNBBLJKPBEHFEDALKOLCEEBBCPAB.tim.one@comcast.net>
Message-ID: <7vv6uu8rm3uo8v6972jg2up1ear807dl89@4ax.com>


[Tim Peters]
> I have half a mind to replace the comment and style nuking with an
> iterative, stack-friendly scheme

[later]
> That should be fixed now.

Works for me - thanks very much.

-- 
Richie Hindle
richie@entrian.com


From skip@pobox.com  Tue Nov 26 16:07:48 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 10:07:48 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w533cpoc0aq.fsf@woozle.org>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
        <w533cpoc0aq.fsf@woozle.org>
Message-ID: <15843.40020.316238.815470@montanaro.dyndns.org>


    Neale> Things being what they are, though, it sounds like you should
    Neale> stay away from DBM until Python 2.3.

On Windows can't you simply rearrange anydbm._names to avoid finding dbhash
first, or are gdbm and dbm not available at all (leaving you with dumbdbm as
the only alternative)?  Does spambayes make use of enough of the 1.85
gotchas outlined on Sleepycat's historic.html page that you can't somehow
avoid its problems?

Skip

From skip@pobox.com  Tue Nov 26 16:11:17 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 10:11:17 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w53u1i4aldr.fsf@woozle.org>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
        <w538yzgc36l.fsf@woozle.org>
        <15843.1681.44172.263500@slothrop.zope.com>
        <w53u1i4aldr.fsf@woozle.org>
Message-ID: <15843.40229.462273.516022@montanaro.dyndns.org>


    Neale> I also suspect that the average Windows or Mac user may tolerate
    Neale> having to install Python before they can use the
    Neale> Whiz-Bam-Anti-Spam tool, but that said user will probably not
    Neale> tolerate installing Python and ZODB.

If the average Windows user wasn't averse to installing and running software
of unknown origins, we probably wouldn't have quite the problem we have
today with computer viruses and worms. ;-)

hey-honey-look-at-this-cool-new-solitaire-program-ly, y'rs,

Skip

From jeremy@alum.mit.edu  Tue Nov 26 16:14:49 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 11:14:49 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.39397.770235.412408@montanaro.dyndns.org>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
	<KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
	<15842.62697.829412.348546@slothrop.zope.com>
	<15843.39397.770235.412408@montanaro.dyndns.org>
Message-ID: <15843.40441.659922.991160@slothrop.zope.com>

>>>>> "SM" == Skip Montanaro <skip@pobox.com> writes:

  Jeremy> Put another way, I'd be interested to hear why you don't
  Jeremy> want to use ZODB.

  SM> Disclaimer: I'm not saying I don't want to use ZODB.  I'm
  SM> offering some reasons why it might not be everyone's obvious
  SM> choice.

But you're not saying you do want to use ZODB, so you're still part of
the problem <wink>.

  SM> For most of us who have *any* experience with ZODB it's probably
  SM> all indirect via Zope, so there are probably some inaccurate
  SM> perceptions about it.  These thoughts have come to my mind
  SM> at one time or another:

  SM> * How could a database from a company (Zope) whose sole business
  SM>       is not databases be more reliable than a database from
  SM>       organizations whose sole raison d'etre is databases
  SM>       (Sleepycat, Postgres, MySQL, ...)?

I don't think I could argue that ZODB is more reliable than
BerkeleyDB.  It's true that we have fewer database experts and expend
fewer resources working on database reliability.  On the other hand,
Barry is nearly finished with a BerkeleyDB-based storage for ZODB.

ZODB is an object persistence tool that uses a database behind it.
You can use our FileStorage or you can use someone else's database,
although BerkeleyDB is the best we can offer at the moment.  (It would
be really cool to do a Postgres storage...)

  SM> * Dealing with Zope's monolithic system is frustrating to people
  SM>       (like me) who are used to having files reside in
  SM>       filesystems.  Some of that frustration probably carries
  SM>       over to ZODB, though it's almost certainly not ZODB's
  SM>       problem.


This sounds like a Zope complaint that doesn't have anything to do
with ZODB, but maybe I misunderstand you.  You don't have to store
your code in the database, although that will be mostly possible in
ZODB4.

Seriously, ZODB stores object pickles in a database.  The storage
layer is free to manage those pickles however it likes.  FileStorage
uses a single file.  Toby Dickenson's DirectoryStorage represents each
pickle as a separate file.

  SM> * It seems to grow without bound, else why do I need to pack my
  SM>       Data.fs file every now and then?

It grows without bound unless you pack it.  Why is that a problem? 
BerkeleyDB log files grow without bound, too.  Databases require some
tending.  One possibility with FileStorage is to add an explicit
pack() call to the training update operation.  We'd need to think
carefully about the performance impact.

  SM> Also, there is the issue of availability.  While it's just an
  SM> extra install, its use requires more than the usual Python
  SM> install.  Having it in the core distribution would provide
  SM> stronger assurances that it runs wherever Python runs (e.g.,
  SM> does it run on MacOS 8 or 9, both of which I believe Jack still
  SM> supports with his Mac installer?).

I think we'd want a spambayes installer that packaged up spambayes,
python, and zodb.

Jeremy


From neale@woozle.org  Tue Nov 26 16:19:23 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 08:19:23 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
References: <LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
Message-ID: <w538yzg9ud0.fsf@woozle.org>

So then, "Mark Hammond" <mhammond@skippinet.com.au> is all like:

> So, given that, ZODB is sounding attractive.  I would package it up, so a
> few hundred extra K is probably no big deal.  The code could fall back to
> the slow-loading pickles, so people running from source still work, just
> possibly not as well.

I haven't used ZODB much really but it sure looks like the Right Way.  I
don't think it will fly with the unwashed Unix masses, but it sounds
like a piece of cake with the Windows crowd.  That makes sense; you only
have one OS vendor and one processor family to support.

I'm going to go sulk in /dev/corner now.

Neale

From wsy@merl.com  Tue Nov 26 16:19:23 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Tue, 26 Nov 2002 11:19:23 -0500
Subject: [Spambayes] Introduction to list: Bill Yerazunis
Message-ID: <200211261619.gAQGJN629857@localhost.localdomain>


I should post an introduction for myself.

I'm Bill Yerazunis, and I'm doing spamfiltering.  

Robert Woodhead and Paul Graham sent me.

I wrote CRM114 (which hashes phrases as "features" and does Bayesian
chain-rule evaluation), it seems to work well for me but I hear
that some folks here had big problems with it.

Anyway... good morning.

	  -Bill Yerazunis

From neale@woozle.org  Tue Nov 26 16:22:53 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 08:22:53 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.40020.316238.815470@montanaro.dyndns.org>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
	<w533cpoc0aq.fsf@woozle.org>
	<15843.40020.316238.815470@montanaro.dyndns.org>
Message-ID: <w534ra49u76.fsf@woozle.org>

So then, Skip Montanaro <skip@pobox.com> is all like:

>     Neale> Things being what they are, though, it sounds like you should
>     Neale> stay away from DBM until Python 2.3.
> 
> On Windows can't you simply rearrange anydbm._names to avoid finding dbhash
> first, or are gdbm and dbm not available at all (leaving you with dumbdbm as
> the only alternative)?  Does spambayes make use of enough of the 1.85
> gotchas outlined on Sleepycat's historic.html page that you can't somehow
> avoid its problems?

I don't know for sure.  I've certainly had my share of corruption when
using 1.85 DBs, but that was *years* ago.  Anyway it ticked me off so
much that I am loath to suggest anyone else use it.  And in any case,
Paul Moore seems to have experienced some DBM suckage already.  I just
don't get a warm fuzzy feeling using known buggy software that's three
major revisions old and no longer maintained.

gdbm would probably be a good alternative though.  Rearranging
anydbm._names on Windows might be doable.  Any takers?
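
A hedged sketch of the same idea without touching anydbm's private list:
try a preferred sequence of backends explicitly and open with the first one
that imports. open_preferred is a hypothetical helper, and the module names
are the present-day ones (dbm.gnu, dbm.dumb), assumed here in place of the
2.2-era gdbm and dumbdbm.

```python
import importlib


def open_preferred(path, flag="c", preferred=("dbm.gnu", "dbm.dumb")):
    """Open a database with the first importable module in `preferred`."""
    for name in preferred:
        try:
            module = importlib.import_module(name)
        except ImportError:
            continue  # this backend is not installed; try the next one
        return module.open(path, flag)
    raise ImportError("none of %r is available" % (preferred,))
```

Because the list is spelled out by the caller, a known-bad backend is never
picked just because it happens to be installed. A file written by one backend
generally has to be reopened with the same backend.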

From richie@entrian.com  Tue Nov 26 16:23:28 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 26 Nov 2002 16:23:28 +0000
Subject: [Spambayes] Documentation...
In-Reply-To: <2e10uuc5agh9nqki2b7rn973m7ofu9qguv@4ax.com>
References: <ddtntuo5m5gddnp835hdohlj2rrtllu3kl@4ax.com>
	<LNBBLJKPBEHFEDALKOLCKEOOCOAB.tim.one@comcast.net>
	<2e10uuc5agh9nqki2b7rn973m7ofu9qguv@4ax.com>
Message-ID: <q207uu0k1j220mccmqd3ltodr2fton6gtc@4ax.com>


[Richie Hindle]
> I've made a start on some user documentation.

[Tim Peters]
> First check it into the project, so other people can help update it too, and
> so it doesn't get lost.  These docs are a great beginning!

This is now checked in - I've folded my stuff into INTEGRATION.txt.

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Tue Nov 26 16:23:31 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 26 Nov 2002 16:23:31 +0000
Subject: [Spambayes] Pop3proxy - doesn't save pickle when reviewing?
In-Reply-To: <n2m-g.lm3h2rqe.fsf@morpheus.demon.co.uk>
References: <n2m-g.lm3h2rqe.fsf@morpheus.demon.co.uk>
Message-ID: <ge37uuklta2t6hufa1fbaa24sbe5oemj9t@4ax.com>


[Paul Moore]
> Just switched pop3proxy over to pickle format. One thing I noticed,
> which surprises me, is that when I review messages and train, the
> pickle is *not* saved. I have to manually save.
> 
> Is this intentional? With my way of working, I imagine it could result
> in me forgetting to save. Surely a save isn't too much of an overhead
> - the review screen shows "Training... Trained on X messages...
> Updating probabilities... Done" messages already, so adding
> "Saving..." would fit into the UI fine.

I've made this change - it now auto-saves after any kind of training.

It did this to begin with, but someone asked that it didn't because it made
training through the "Upload" form of the web interface tedious - saving
can easily take 30 seconds on a slow machine with a well-trained database.

However, given that a) it keeps blowing up for people who haven't saved, b)
you can now train using the "official" POP3-proxy web training interface
rather than using upload, and c) you can now upload mbox files into the web
interface to do batch-training, I reckon you're right, it should auto-save.
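
For anyone following along, the auto-save behaviour described above can be
sketched roughly as below.  This is only an illustration: AutoSavingTrainer,
load_counts and the two-counter state are made-up names, not the proxy's real
code, and the write-then-rename step is an extra assumption added so that an
interrupted save does not leave a truncated pickle behind.

```python
import os
import pickle
import tempfile


class AutoSavingTrainer:
    """Hypothetical stand-in for the proxy's training state."""

    def __init__(self, path):
        self.path = path
        self.nham = 0
        self.nspam = 0

    def train(self, messages, is_spam):
        # Record the training, then persist straight away so that a
        # crash or a forgotten manual save cannot lose it.
        if is_spam:
            self.nspam += len(messages)
        else:
            self.nham += len(messages)
        self.save()

    def save(self):
        # Assumption: write to a temporary file and rename it into
        # place, so a half-written pickle never replaces a good one.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"nham": self.nham, "nspam": self.nspam}, f)
        os.replace(tmp_path, self.path)


def load_counts(path):
    """Read back the saved counters."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

The trade-off is the one mentioned above: every call to train() pays the cost
of a full save, which is cheap here but can be slow with a large database.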

-- 
Richie Hindle
richie@entrian.com


From richie@entrian.com  Tue Nov 26 16:23:33 2002
From: richie@entrian.com (Richie Hindle)
Date: Tue, 26 Nov 2002 16:23:33 +0000
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
References: <w53u1i4aldr.fsf@woozle.org>
	<LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
Message-ID: <2n37uucckrhn6q3aqubt49kag3gp8jo4in@4ax.com>


[Mark]
> We will need to create a stand-alone DLL, and a "one click" installer.
> [...]  So, given that, ZODB is sounding attractive.  I would package it
> up, so a few hundred extra K is probably no big deal.

I agree.  We've had a lot of hassles with the different database formats,
and we've been forced to make lots of compromises.  The fact that the
default database types are different for hammie and pop3proxy is a wart
(and a documentation headache - "how do you use hammie to pre-train the
POP3 proxy?" etc.).  The web interface now does a defensive auto-save after
training, which can be a painful delay.  BSDDB 1.8x explodes or hangs at
random points (allegedly 8-).  Neale's about to implement a caching layer
on top of the DBM stuff... I'm sure there are more examples.

As Mark says, we're going to have to package this thing up anyway, so why
not make ZODB a part of that package?  All this assumes (as Skip points
out) that ZODB is as portable as Spambayes.

On the subject of packaging: I've used InnoSetup before and been very
impressed.  Someone mentioned Install Shield - I don't believe there's a
credible free version of that, whereas InnoSetup is completely free.

-- 
Richie Hindle
richie@entrian.com


From tim@fourstonesExpressions.com  Tue Nov 26 16:33:42 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 10:33:42 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <2n37uucckrhn6q3aqubt49kag3gp8jo4in@4ax.com>
Message-ID: <953XURPOZY1YHELKHSPRHCLKJILIPK.3de3a266@riven>

11/26/2002 10:23:33 AM, Richie Hindle <richie@entrian.com> wrote:

>
>[Mark]
>> We will need to create a stand-alone DLL, and a "one click" installer.
>> [...]  So, given that, ZODB is sounding attractive.  I would package it
>> up, so a few hundred extra K is probably no big deal.
>
>I agree.  We've had a lot of hassles with the different database formats,
>and we've been forced to make lots of compromises.  The fact that the
>default database types are different for hammie and pop3proxy is a wart
>(and a documentation headache - "how do you use hammie to pre-train the
>POP3 proxy?" etc.).  The web interface now does a defensive auto-save after
>training, which can be a painful delay.  BSDDB 1.8x explodes or hangs at
>random points (allegedly 8-).  Neale's about to implement a caching layer
>on top of the DBM stuff... I'm sure there are more examples.
>
>As Mark says, we're going to have to package this thing up anyway, so why
>not make ZODB a part of that package?  All this assumes (as Skip points
>out) that ZODB is as portable as Spambayes.

Much discussion on this topic...

>
>On the subject of packaging: I've used InnoSetup before and been very
>impressed.  Someone mentioned Install Shield - I don't believe there's a
>credible free version of that, whereas InnoSetup is completely free.

Richie, I checked an installable version of spambayes into cvs yesterday,
using the Microsoft installer, and got booed off the planet.  Apparently
SourceForge has its own distribution mechanism?  - TimS
>
>-- 
>Richie Hindle
>richie@entrian.com
>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From papaDoc@videotron.ca  Tue Nov 26 16:36:42 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Tue, 26 Nov 2002 11:36:42 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
Message-ID: <3DE3A31A.3010008@videotron.ca>

Hi,

> On the subject of packaging: I've used InnoSetup before and been very
> impressed.  Someone mentioned Install Shield - I don't believe there's a
> credible free version of that, whereas InnoSetup is completely free.

You are right, Install Shield is not free, but I'm free to use it since we
(my company) paid the full price for it ;-)


> I already gave that one a spin.  I think TimS was almost booed off the
> planet by the SF staff yesterday for trying to use something they don't
> use... lol... it seems that SourceForge has their own distribution mechanism.

I can take a look at how SF packages things or how InnoSetup works.

papaDoc


From tim@fourstonesExpressions.com  Tue Nov 26 16:38:08 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 10:38:08 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w534ra49u76.fsf@woozle.org>
Message-ID: <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>

11/26/2002 10:22:53 AM, Neale Pickett <neale@woozle.org> wrote:

>So then, Skip Montanaro <skip@pobox.com> is all like:
>
>>     Neale> Things being what they are, though, it sounds like you should
>>     Neale> stay away from DBM until Python 2.3.
>> 
>> On Windows can't you simply rearrange anydbm._names to avoid finding dbhash
>> first, or are gdbm and dbm not available at all (leaving you with dumbdbm
>> as the only alternative)?  Does spambayes make use of enough of the 1.85
>> gotchas outlined on Sleepycat's historic.html page that you can't somehow
>> avoid its problems?
>
>I don't know for sure.  I've certainly had my share of corruption when
>using 1.85 DBs, but that was *years* ago.  Anyway it ticked me off so
>much that I am loath to suggest anyone else use it.  And in any case,
>Paul Moore seems to have experienced some DBM suckage already.  I just
>don't get a warm fuzzy feeling using known buggy software that's three
>major revisions old and no longer maintained.
>
>gdbm would probably be a good alternative though.  Rearranging
>anydbm._names on Windows might be doable.  Any takers?

Francois gave us a clue on that one yesterday (or so).  Looks like we can 
rearrange this, but it will require copying the module into spambayes... 
yuk... another solution is to clone the module... call it spambayesdbm.  Maybe 
that would have several advantages.
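
A rough sketch of what such a project-local chooser could look like, without
copying or editing the standard library module.  Everything here is an
assumption for illustration: first_available and PREFERRED are hypothetical
names, and the module names are simply the ones mentioned in this thread,
which may or may not be importable on a given Python.

```python
import importlib

# Preference order taken from the names discussed in this thread;
# treat it as a placeholder, not a recommendation.
PREFERRED = ["gdbm", "dbhash", "dbm", "dumbdbm"]


def first_available(names):
    """Return (name, module) for the first importable name, else None."""
    for name in names:
        try:
            return name, importlib.import_module(name)
        except ImportError:
            # Backend not installed on this machine; try the next one.
            continue
    return None
```

Calling first_available(PREFERRED) keeps the ordering decision inside
spambayes, so changing the preferred backend is a one-line edit rather than a
patched copy of a library file.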

Quoting Francois:

<quote>
on 25/11/02 12:04, Moore, Paul at Paul.Moore@atosorigin.com wrote:

>  Or would it be worth
> specifically looking for pybsddb, and using that in preference if it
> is present?

Since it uses anydbm, you can copy lib/anydbm.py into your spambayes folder
and modify line 51:

_names = ['dbhash', 'gdbm', 'dbm', 'dumbdbm']

and add your preferred dbm at the front of the list. It will use it if it exists.

_names = ['pybsddb', 'dbhash', 'gdbm', 'dbm', 'dumbdbm']

--
Email is a means of communication. People should ask themselves
about the political implications of their choices (or non-choices)
of tools and technologies. For clean email:
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>

</quote>

c'est moi - TimS
www.fourstonesExpressions.com 


From jeremy@alum.mit.edu  Tue Nov 26 16:58:06 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 11:58:06 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
References: <w534ra49u76.fsf@woozle.org>
	<WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
Message-ID: <15843.43038.341350.515691@slothrop.zope.com>

I just did a cvs update in spambayes and tried to restart my pspam
code, but nothing is working anymore :-(.  I'm sorry I haven't had
time to read every message on the proposed changes, but there's been a
flurry of activity and I've got a day job.

Anyway, here's a traceback.  Can anyone suggest quickly how I would
fix this?  The pspam code calls learn() for a bunch of messages and
then calls update_probabilities() at the end.  Is that the default
now?  Or is that a discontinued feature?  Are the APIs documented anywhere?

Traceback (most recent call last):
  File "update.py", line 61, in ?
    main(FORCE_REBUILD)
  File "update.py", line 49, in main
    profile.update()
  File "/home/jeremy/src/spambayes/pspam/pspam/profile.py", line 88, in update
    changed1 = self._update(self.hams, False)
  File "/home/jeremy/src/spambayes/pspam/pspam/profile.py", line 113, in _update
    self.classifier.learn(tokenize(msg), is_spam, False)
TypeError: learn() takes exactly 3 arguments (4 given)

Jeremy


From tim@fourstonesExpressions.com  Tue Nov 26 17:02:36 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 11:02:36 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.43038.341350.515691@slothrop.zope.com>
Message-ID: <OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>

Update probabilities is no longer necessary, as WordInfo now calculates 
probabilities on demand.  Simply remove the call... - TimS
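
To illustrate what "on demand" means here, a minimal sketch: the per-token
record keeps only counts, and the probability is derived from the current
counts each time it is asked for, so there is no separate update pass to
forget.  The ratio used below is a simplified assumption for illustration,
not the formula the classifier actually uses, and spam_probability is a
hypothetical helper name.

```python
class WordInfo:
    """Per-token counts only; no stored probability."""

    def __init__(self):
        self.spamcount = 0
        self.hamcount = 0


def spam_probability(info, nham, nspam):
    """Derive a token's spam probability from the current counts.

    Simplified ratio, assumed for illustration only.
    """
    hamratio = info.hamcount / nham if nham else 0.0
    spamratio = info.spamcount / nspam if nspam else 0.0
    if hamratio + spamratio == 0:
        return 0.5  # token never seen: treat as neutral
    return spamratio / (hamratio + spamratio)
```

Because nothing is cached, changing info.hamcount or info.spamcount is
reflected in the very next call.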

11/26/2002 10:58:06 AM, Jeremy Hylton <jeremy@alum.mit.edu> wrote:

>I just did a cvs update in spambayes and tried to restart my pspam
>code, but nothing is working anymore :-(.  I'm sorry I haven't had
>time to read every message on the proposed changes, but there's been a
>flurry of activity and I've got a day job.
>
>Anyway, here's a traceback.  Can anyone suggest quickly how I would
>fix this?  The pspam code calls learn() for a bunch of messages and
>then calls update_probabilities() at the end.  Is that the default
>now?  Or is that a discontinued feature?  Are the APIs documented anywhere?
>
>Traceback (most recent call last):
>  File "update.py", line 61, in ?
>    main(FORCE_REBUILD)
>  File "update.py", line 49, in main
>    profile.update()
>  File "/home/jeremy/src/spambayes/pspam/pspam/profile.py", line 88, in update
>    changed1 = self._update(self.hams, False)
>  File "/home/jeremy/src/spambayes/pspam/pspam/profile.py", line 113, in _update
>    self.classifier.learn(tokenize(msg), is_spam, False)
>TypeError: learn() takes exactly 3 arguments (4 given)
>
>Jeremy
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From jeremy@alum.mit.edu  Tue Nov 26 17:10:42 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 12:10:42 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
Message-ID: <15843.43794.280220.481129@slothrop.zope.com>

>>>>> "TS" == Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> writes:

  TS> Update probabilities is no longer necessary, as WordInfo now
  TS> calculates probabilities on demand.  Simply remove the call... 

The problem I had was with the learn() method, which is documented to
take three arguments but actually only takes two.  It would be really
helpful if there were some developer docs or at least consistent
docstrings.

I changed the learn() call to pass only two arguments, modified my
WordInfo class, and removed the update_probabilities() call, which
eliminated the tracebacks.

But I also destroyed the spam filtering process.  Every message I've
received since making those changes scored a 1.000.  For example,
here's the scoring detail for your message:

[59102 refs]
Score: 0.99999745451

Clues
-----
*H* 5.07285667561e-06
*S* 0.999999981876
hylton 0.0223511806365
jeremy 0.0226462499923
>jeremy 0.06749672346
skip:" 50 0.155172413793
haven't 0.228515259315
subject:for 0.267754749015
header:Received:5 0.271990252442
header:In-Reply-To:1 0.30024968789
sorry 0.30024968789
header:Errors-To:1 0.30781519964
got 0.310423809387
update 0.333024118738
content-type:text/plain 0.387298828087
simply 0.604185271179
working 0.617911285794
how 0.630464812873
longer 0.638534738834
takes 0.646878824969
exactly 0.6540436457
skip:m 10 0.689618426887
url:org 0.691518467852
header:Reply-To:1 0.711509992491
skip:s 20 0.716962524655
read 0.73131168503
every 0.745634117591
skip:p 10 0.756747597975
skip:w 20 0.763982841115
now 0.809312390184
mailing 0.810022433084
arguments 0.844827586207
changes, 0.844827586207
default 0.844827586207
documented 0.844827586207
skip:> 40 0.844827586207
this? 0.844827586207
url:python 0.844827586207
email addr:python.org 0.908163265306
job. 0.908163265306
remove 0.921389083674
quickly 0.934782608696
restart 0.934782608696
skip:u 20 0.934782608696
list 0.941063400602
calls 0.949438202247
skip:s 30 0.949438202247
subject:: [ 0.949438202247
url:mail 0.949438202247
proposed 0.95871559633
url:mailman 0.95871559633
url:listinfo 0.96511627907
am, 0.973372781065
messages 0.973372781065
suggest 0.973372781065
skip:_ 40 0.97619047619
there's 0.97619047619
here's 0.978468899522
subject:]  0.983271375465
tried 0.984429065744
nothing 0.991493383743

Jeremy


From jeremy@alum.mit.edu  Tue Nov 26 17:22:54 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 12:22:54 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.43794.280220.481129@slothrop.zope.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
Message-ID: <15843.44526.791749.654112@slothrop.zope.com>

The problems here seem to be deep.  (Can anyone point me to a summary
of the key changes made over the weekend?)  The new MetaInfo object is
updated in a way that looks like it may not mark the persistent object
as changed, so I'll have to create a version of it that inherits from
Persistent.

So I'm trying to figure out what the MetaInfo object is for and why
it's separate from the Classifier.  It looks like it stores two
numbers (nham and nspam) and an unused revision count.  The revision
count is tossed every time the object is pickled and unpickled.  It
uses properties to update this unused revision count.  The whole class
seems excessive for keeping two counters.  It looks like classifier
grew 10+ lines of code to create properties to update these two
numbers by accessing properties on the MetaInfo object.

Why can't we store two counters as two integer attributes?

Sorry if I sound bitter, but I used to have a handy spam filtering
system, but it looks like it will take hours to get it working again
and I don't have time for that :-(.

Jeremy


From mwh@python.net  Tue Nov 26 17:29:41 2002
From: mwh@python.net (Michael Hudson)
Date: 26 Nov 2002 17:29:41 +0000
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
Message-ID: <2mwun0dyt6.fsf@starship.python.net>

Jeremy Hylton <jeremy@alum.mit.edu> writes:

> Sorry if I sound bitter, but I used to have a handy spam filtering
> system, but it looks like it will take hours to get it working again
> and I don't have time for that :-(.

Then "cvs up -D XXX" is your friend, surely?  I know *I* don't have
time to keep up with the code, so I'm not trying to...

Cheers,
M.

-- 
  Two things I learned for sure during a particularly intense acid
  trip in my own lost youth: (1) everything is a trivial special case
  of something else; and, (2) death is a bunch of blue spheres.
                                             -- Tim Peters, 1 May 1998


From jeremy@alum.mit.edu  Tue Nov 26 17:34:18 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 12:34:18 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <2mwun0dyt6.fsf@starship.python.net>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
Message-ID: <15843.45210.633174.813679@slothrop.zope.com>

>>>>> "MH" == Michael Hudson <mwh@python.net> writes:

  MH> Jeremy Hylton <jeremy@alum.mit.edu> writes:
  >> Sorry if I sound bitter, but I used to have a handy spam
  >> filtering system, but it looks like it will take hours to get it
  >> working again and I don't have time for that :-(.

  MH> Then "cvs up -D XXX" is your friend, surely?  I know *I* don't
  MH> have time to keep up with the code, so I'm not trying to...

Indeed. Too bad no one tagged the tree before integrating the changes.

Jeremy


From popiel@wolfskeep.com  Tue Nov 26 18:02:33 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Nov 2002 10:02:33 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook 
In-Reply-To: Message from Jeremy Hylton <jeremy@alum.mit.edu> 
	<15843.43038.341350.515691@slothrop.zope.com> 
References: <w534ra49u76.fsf@woozle.org>
	<WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
	<15843.43038.341350.515691@slothrop.zope.com> 
Message-ID: <20021126180233.27A75F589@cashew.wolfskeep.com>

In message:  <15843.43038.341350.515691@slothrop.zope.com>
             Jeremy Hylton <jeremy@alum.mit.edu> writes:

>I just did a cvs update in spambayes and tried to restart my pspam
>code, but nothing is working anymore :-(.  I'm sorry I haven't had
>time to read every message on the proposed changes, but there's been a
>flurry of activity and I've got a day job.

Yeah, there's been a bunch of changes, mostly revolving around the
removal of update_probabilities.

>Anyway, here's a traceback.  Can anyone suggest quickly how I would
>fix this?  The pspam code calls learn() for a bunch of messages and
>then calls update_probabilities() at the end.  Is that the default
>now?  Or is that a discontinued feature?  Are the APIs documented anywhere?

You need to remove the _last_ argument passed to learn(), dealing
with whether or not to run update_probabilities().  Judging by later
mail, it sounds like you removed the second-to-last argument (ham vs.
spam), and are now calling everything ham.

You should also remove the call to update_probabilities(), since it's
toast, but it sounds like you already did that.
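
In case it helps, here is a toy version of the calling convention.
ToyClassifier and call_learn_old_style are made-up names for illustration and
this is not the real classifier; it only shows the shape of the call.  Python
counts self as an argument, which is why the traceback talks about 3
arguments taken and 4 given.

```python
class ToyClassifier:
    """Hypothetical classifier with the two-argument learn()."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0
        self.counts = {}  # token -> [spamcount, hamcount]

    def learn(self, wordstream, is_spam):
        # No trailing "update probabilities" flag any more.
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1
        for token in set(wordstream):
            record = self.counts.setdefault(token, [0, 0])
            record[0 if is_spam else 1] += 1


def call_learn_old_style(classifier, tokens, is_spam):
    """Make the old-style call with the extra flag; return the error."""
    try:
        classifier.learn(tokens, is_spam, False)
    except TypeError as exc:
        return exc
    return None
```

The new-style call is simply classifier.learn(tokens, is_spam), with is_spam
still carrying the ham/spam distinction.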

The API is unfortunately not documented yet; until someone goes back
through and updates it, the code itself will be the best documentation.

Addressing a concern from a later mail, the MetaInfo class exists to
make it easier to derive subclasses of the classifier and its parts,
without having to touch everything.  For the two counts that it currently
holds, it is certainly overkill, but it provides a nice way to expand
that set without having to go mucking around in as many places.  You
are correct that the revision number it holds is useless; it is cruft
left over from one of the interim schemes for doing away with
update_probabilities(), and as such the revision number should be
removed.  Perhaps I'll patch that.

- Alex

From jeremy@alum.mit.edu  Tue Nov 26 18:28:09 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 13:28:09 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook 
In-Reply-To: <20021126180233.27A75F589@cashew.wolfskeep.com>
References: <w534ra49u76.fsf@woozle.org>
	<WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
	<15843.43038.341350.515691@slothrop.zope.com>
	<20021126180233.27A75F589@cashew.wolfskeep.com>
Message-ID: <15843.48441.507274.158825@slothrop.zope.com>

>>>>> "TAP" == T Alexander Popiel <popiel@wolfskeep.com> writes:

  TAP> In message: <15843.43038.341350.515691@slothrop.zope.com>
  TAP>              Jeremy Hylton <jeremy@alum.mit.edu> writes:

  >> I just did a cvs update in spambayes and tried to restart my
  >> pspam code, but nothing is working anymore :-(.  I'm sorry I
  >> haven't had time to read every message on the proposed changes,
  >> but there's been a flurry of activity and I've got a day job.

  TAP> Yeah, there's been a bunch of changes, mostly revolving around
  TAP> the removal of update_probabilities.

Except that it wasn't removed!  It's just a big decoy waiting to lure
people reading the code away from the interface.

  >> Anyway, here's a traceback.  Can anyone suggest quickly how I
  >> would fix this?  The pspam code calls learn() for a bunch of
  >> messages and then calls update_probabilities() at the end.  Is
  >> that the default now?  Or is that a discontinued feature?  Are
  >> the APIs documented anywhere?

  TAP> You need to remove the _last_ argument passed to learn(),
  TAP> dealing with whether or not to run update_probabilities().
  TAP> Judging by later mail, it sounds like you removed the
  TAP> second-to-last argument (ham vs.  spam), and are now calling
  TAP> everything ham.

Except that I didn't.  It's pretty easy to remove the last argument
:-). 

  TAP> You should also remove the call to update_probabilities(),
  TAP> since it's toast, but it sounds like you already did that.

  TAP> The API is unfortunately not documented yet; until someone goes
  TAP> back through and updates it, the code itself will be the best
  TAP> documentation.

Except that the code has docstrings that are wrong.

  TAP> Addressing a concern from a later mail, the MetaInfo class
  TAP> exists to make it easier to derive subclasses of the classifier
  TAP> and its parts, without having to touch everything.

This sounds like YAGNI to me.

Jeremy


From jm@jmason.org  Tue Nov 26 19:09:29 2002
From: jm@jmason.org (Justin Mason)
Date: Tue, 26 Nov 2002 19:09:29 +0000
Subject: packaging (was Re: [Spambayes] Guidance re pickles versus DB for
	Outlook )
In-Reply-To: Message from Richie Hindle <richie@entrian.com> 
	<2n37uucckrhn6q3aqubt49kag3gp8jo4in@4ax.com> 
Message-ID: <20021126190934.687A016F1B@jmason.org>


Richie Hindle said:

> On the subject of packaging: I've used InnoSetup before and been very
> impressed.  Someone mentioned Install Shield - I don't believe there's a
> credible free version of that, whereas InnoSetup is completely free.

Just to chime in on this one: I can second that.  InnoSetup is quite good,
and since it runs from 1 text format file, isn't too foreign for UNIXish
folks ;)

--j.

From steve@blighty.com  Tue Nov 26 19:53:52 2002
From: steve@blighty.com (Steve Atkins)
Date: Tue, 26 Nov 2002 11:53:52 -0800
Subject: packaging (was Re: [Spambayes] Guidance re pickles versus DB for
	Outlook )
In-Reply-To: <20021126190934.687A016F1B@jmason.org>;
	from jm@jmason.org on Tue, Nov 26, 2002 at 07:09:29PM +0000
References: <richie@entrian.com> <20021126190934.687A016F1B@jmason.org>
Message-ID: <20021126115351.A6396@blighty.com>

On Tue, Nov 26, 2002 at 07:09:29PM +0000, Justin Mason wrote:
> Richie Hindle said:
> 
> > On the subject of packaging: I've used InnoSetup before and been very
> > impressed.  Someone mentioned Install Shield - I don't believe there's a
> > credible free version of that, whereas InnoSetup is completely free.
> 
> Just to chime in on this one: I can second that.  InnoSetup is quite good,
> and since it runs from 1 text format file, isn't too foreign for UNIXish
> folks ;)

Yes. I've been using it for years with no problems. For some
production code at the moment I'm running some bourne shell scripts to
generate the setup files from within cygwin (so the checkout from CVS,
build, package, upload steps are fully automated). If you do that,
don't forget the unix2dos step...

I have InstallShield licenses. I choose to use Inno Setup.

(Hi, I'm Steve. I write software for abuse desks, develop mail filters and
 run samspade.org.)

Cheers,
  Steve


From francois.granger@free.fr  Tue Nov 26 20:09:07 2002
From: francois.granger@free.fr (François Granger)
Date: Tue, 26 Nov 2002 21:09:07 +0100
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w53u1i4aldr.fsf@woozle.org>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<15843.1681.44172.263500@slothrop.zope.com>
	<w53u1i4aldr.fsf@woozle.org>
Message-ID: <a05100305ba0984c5b55b@[192.168.1.11]>

At 22:35 -0800 25/11/02, in message Re: [Spambayes] Guidance re 
pickles versus DB for Outlo, Neale Pickett wrote:
>
>I also suspect that the average Windows or Mac user may tolerate having
>to install Python before they can use the Whiz-Bam-Anti-Spam tool, but
>that said user will probably not tolerate installing Python and ZODB.

I never saw ZODB for MacOS pre-X...

>
>I could be way, way off on this, though.  I don't even know the mind of
>the average Windows user well enough to understand why they want to run
>Windows in the first place <wink>

So do I, Mac is sooooo superior <double wink>
-- 
Email is a means of communication. People should ask themselves about
the political implications of their choices (or non-choices) of tools
and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From jeremy@alum.mit.edu  Tue Nov 26 20:11:07 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 15:11:07 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <a05100305ba0984c5b55b@[192.168.1.11]>
References: <LNBBLJKPBEHFEDALKOLCOEGKCPAB.tim.one@comcast.net>
	<w538yzgc36l.fsf@woozle.org>
	<15843.1681.44172.263500@slothrop.zope.com>
	<w53u1i4aldr.fsf@woozle.org>
	<a05100305ba0984c5b55b@[192.168.1.11]>
Message-ID: <15843.54619.158621.529525@slothrop.zope.com>

>>>>> "FG" == François Granger <francois.granger@free.fr> writes:


  FG> At 22:35 -0800 25/11/02, in message Re: [Spambayes] Guidance re
  FG> pickles versus DB for Outlo, Neale Pickett wrote:
  >>
  >> I also suspect that the average Windows or Mac user may tolerate
  >> having to install Python before they can use the
  >> Whiz-Bam-Anti-Spam tool, but that said user will probably not
  >> tolerate installing Python and ZODB.

  FG> I never saw ZODB for MacOS pre-X...

The source tarball has a distutils setup.py file.  I don't know off
the top of my head why it wouldn't work.  I know some people run Zope
on MacOS.

Jeremy


From popiel@wolfskeep.com  Tue Nov 26 20:15:34 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Nov 2002 12:15:34 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook 
In-Reply-To: Message from jeremy@alum.mit.edu (Jeremy Hylton) 
	<15843.48441.507274.158825@slothrop.zope.com> 
References: <w534ra49u76.fsf@woozle.org>
	<WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
	<15843.43038.341350.515691@slothrop.zope.com>
	<20021126180233.27A75F589@cashew.wolfskeep.com>
	<15843.48441.507274.158825@slothrop.zope.com> 
Message-ID: <20021126201534.2A3A0F589@cashew.wolfskeep.com>

In message:  <15843.48441.507274.158825@slothrop.zope.com>
             jeremy@alum.mit.edu (Jeremy Hylton) writes:
>>>>>> "TAP" == T Alexander Popiel <popiel@wolfskeep.com> writes:
>
>  TAP> Yeah, there's been a bunch of changes, mostly revolving around
>  TAP> the removal of update_probabilities.
>
>Except that it wasn't removed!  It's just a big decoy waiting to lure
>people reading the code away from the interface.

Err, I just did a cvs update, and I don't see update_probabilities()
in classifier.py.  Are you sure you're looking at the right version?

>  TAP> Judging by later mail, it sounds like you removed the
>  TAP> second-to-last argument (ham vs.  spam), and are now calling
>  TAP> everything ham.
>
>Except that I didn't.  It's pretty easy to remove the last argument
>:-). 

Okay, then I don't know what's wrong for you.

>  TAP> The API is unfortunately not documented yet; until someone goes
>  TAP> back through and updates it, the code itself will be the best
>  TAP> documentation.
>
>Except that the code has docstrings that are wrong.

Yeah, that sucks.  Glancing through the current version of classifier.py,
though, I don't see any real examples.  (I don't doubt that they exist,
I'm just being blind, I suspect.)

>  TAP> Addressing a concern from a later mail, the MetaInfo class
>  TAP> exists to make it easier to derive subclasses of the classifier
>  TAP> and its parts, without having to touch everything.
>
>This sounds like YAGNI to me.

Possibly.  During the evolution, it was handy.  Haven't gotten around
to ripping it back out.

- Alex

From neale@woozle.org  Tue Nov 26 20:22:35 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 12:22:35 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.44526.791749.654112@slothrop.zope.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
Message-ID: <w53r8d884j8.fsf@woozle.org>

So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:

> So I'm trying to figure out what the MetaInfo object is for and why
> it's separate from the Classifier.

I split it out so the DBM method could have some easy "state" object to
pickle.  Initially, it kept track of when nham or nspam changed, for
purposes of invalidating the cache.  That's no longer needed, so I've
taken it all out.  MetaInfo is now a container for two ints.
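
For illustration, a container of that shape might look like the sketch below.
The dump_state and load_state helpers are hypothetical names, and saving the
state as a plain tuple is an assumption made to keep the example small; it is
not a description of what the DBM code stores.

```python
import pickle


class MetaInfo:
    """Two plain integer attributes, no properties."""

    def __init__(self, nham=0, nspam=0):
        self.nham = nham
        self.nspam = nspam


def dump_state(meta):
    """Pickle just the two counters."""
    return pickle.dumps((meta.nham, meta.nspam))


def load_state(data):
    """Rebuild a MetaInfo from pickled counters."""
    nham, nspam = pickle.loads(data)
    return MetaInfo(nham, nspam)
```

With plain attributes, updating a counter is just meta.nham += 1, and the
pickled state is only the two numbers.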

From neale@woozle.org  Tue Nov 26 20:25:37 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 12:25:37 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.45210.633174.813679@slothrop.zope.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com>
Message-ID: <w53n0nw84e6.fsf@woozle.org>

So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:

> Indeed. Too bad no one tagged the tree before integrating the changes.

It didn't occur to me to do that.  I happened to have a pre-merge tree
lying around so I've tagged it "pre-playground-merge".

Sorry I've caused problems for you Jeremy.  I didn't see the pspam/pspam
directory or I would have fixed the calls for you.  Let me know if
anything else seems fishy.

Neale

From tim@fourstonesExpressions.com  Tue Nov 26 20:24:59 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 14:24:59 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w53r8d884j8.fsf@woozle.org>
Message-ID: <DAHBMHDZTPLPKA905D32VUCBONZYMH.3de3d89b@riven>

Do we really need this state object at this point?  I can't see any
benefit to keeping it...

- TimS

11/26/2002 2:22:35 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:
>
>> So I'm trying to figure out what the MetaInfo object is for and why
>> it's separate from the Classifier.
>
>I split it out so the DBM method could have some easy "state" object to
>pickle.  Initially, it kept track of when nham or nspam changed, for
>purposes of invalidating the cache.  That's no longer needed, so I've
>taken it all out.  MetaInfo is now a container for two ints.
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From jeremy@alum.mit.edu  Tue Nov 26 20:29:57 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 15:29:57 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <w53n0nw84e6.fsf@woozle.org>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com>
	<w53n0nw84e6.fsf@woozle.org>
Message-ID: <15843.55749.392683.507121@slothrop.zope.com>

>>>>> "NP" == Neale Pickett <neale@woozle.org> writes:

  NP> So then, Jeremy Hylton <jeremy@alum.mit.edu> is all like:
  >> Indeed. Too bad no one tagged the tree before integrating the
  >> changes.

  NP> It didn't occur to me to do that.  I happened to have a
  NP> pre-merge tree lying around so I've tagged it
  NP> "pre-playground-merge".

Thanks!  That's cool.

  NP> Sorry I've caused problems for you Jeremy.  I didn't see the
  NP> pspam/pspam directory or I would have fixed the calls for you.
  NP> Let me know if anything else seems fishy.

And sorry to complain so loudly today.  It's good that someone is
maintaining and improving the code.  The fault is probably more in our
development process.  I didn't say "Hey, check that you don't break
the pspam code" and there was no obvious way for you to test it.

I'm a big fan of unittests.  We should probably develop some.

Jeremy


From neale@woozle.org  Tue Nov 26 21:06:29 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 13:06:29 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.55749.392683.507121@slothrop.zope.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com>
	<w53n0nw84e6.fsf@woozle.org>
	<15843.55749.392683.507121@slothrop.zope.com>
Message-ID: <w53el9882i2.fsf@woozle.org>

So then, jeremy@alum.mit.edu (Jeremy Hylton) is all like:

> And sorry to complain so loudly today.  It's good that someone is
> maintaining and improving the code.  The fault is probably more in our
> development process.  I didn't say "Hey, check that you don't break
> the pspam code" and there was no obvious way for you to test it.

> I'm a big fan of unittests.  We should probably develop some.

I couldn't agree more.  I found myself wishing we had some when I was
making all the changes.  The only code that has something approaching a
test is the pop3proxy.  I'll bring that up on the list.

And please, keep complaining!  I want stuff to work too :)

Thanks

Neale

From tim@fourstonesExpressions.com  Tue Nov 26 21:09:51 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 15:09:51 -0600
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
Message-ID: <768543JF5ZVRE084RNC64FDTRCB515.3de3e31f@riven>

11/26/2002 3:06:29 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, jeremy@alum.mit.edu (Jeremy Hylton) is all like:
>
>> And sorry to complain so loudly today.  It's good that someone is
>> maintaining and improving the code.  The fault is probably more in our
>> development process.  I didn't say "Hey, check that you don't break
>> the pspam code" and there was no obvious way for you to test it.
>
>> I'm a big fan of unittests.  We should probably develop some.
>
>I couldn't agree more.  I found myself wishing we had some when I was
>making all the changes.  The only code that has something approaching a
>test is the pop3proxy.  I'll bring that up on the list.

Awwww... that's not quite right... FileCorpus has a fairly large test harness 
that exercises just about everything in that side of the universe...  - TimS

>
>And please, keep complaining!  I want stuff to work too :)
>
>Thanks
>
>Neale
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From neale@woozle.org  Tue Nov 26 21:13:19 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 13:13:19 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <DAHBMHDZTPLPKA905D32VUCBONZYMH.3de3d89b@riven>
References: <DAHBMHDZTPLPKA905D32VUCBONZYMH.3de3d89b@riven>
Message-ID: <w53adjw826o.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> Do we really need this state object at this point?  I can't see any 
> benefit to having this object at this point...

Well, we do get some use out of it.  Or we would if we used it the way I
thought we were :)

I had intended for this class to get pickled and stored like so in
DBDictClassifier.store():

  self.wordinfo[self.statekey] = self.meta

If we do this, then we get something that will break when the
PICKLE_VERSION changes.  Also, as has been pointed out by someone else,
it's a handy place to store things, without DBDictClassifier having to
know what's inside itself.

OTOH, since it's a subclass of classifier.Classifier, it'd be just as
easy to write getstate() and setstate() methods on Classifier to return
what would be in self.meta.  They don't have to be named that, of
course.

I don't know which one is uglier.  But right now the DBDict is storing
"meta" information differently than everything else, so something's
gotta give.  Anyone have an opinion on which is better?
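
For illustration only, here is a minimal sketch of the getstate() /
setstate() option.  The class bodies, the statekey value and the
PICKLE_VERSION value are assumptions made up for the example, and a
plain dict stands in for the DBM file; this is not the code in CVS.

```python
import pickle

PICKLE_VERSION = 1  # assumed value; only the name comes from the discussion


class Classifier:
    """Stand-in for classifier.Classifier; only the two counters are modelled."""

    def __init__(self):
        self.nham = 0
        self.nspam = 0

    def getstate(self):
        # The classifier decides what its own "meta" information is.
        return (PICKLE_VERSION, self.nham, self.nspam)

    def setstate(self, state):
        version, nham, nspam = state
        if version != PICKLE_VERSION:
            # Break loudly when the stored format no longer matches.
            raise ValueError("unsupported state version: %r" % (version,))
        self.nham, self.nspam = nham, nspam


class DBDictClassifier(Classifier):
    statekey = "saved state"  # hypothetical key name

    def __init__(self, db=None):
        Classifier.__init__(self)
        # A plain dict stands in for the DBM-backed mapping.
        self.wordinfo = {} if db is None else db

    def store(self):
        self.wordinfo[self.statekey] = pickle.dumps(self.getstate())

    def load(self):
        self.setstate(pickle.loads(self.wordinfo[self.statekey]))
```

With this shape DBDictClassifier never looks inside the state it
stores, which is the same property the separate meta object was
supposed to give us.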

Neale

From neale@woozle.org  Tue Nov 26 21:24:02 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 13:24:02 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <768543JF5ZVRE084RNC64FDTRCB515.3de3e31f@riven>
References: <768543JF5ZVRE084RNC64FDTRCB515.3de3e31f@riven>
Message-ID: <w53znrw6n4d.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> >I couldn't agree more.  I found myself wishing we had some when I was
> >making all the changes.  The only code that has something approaching
> >a test is the pop3proxy.  I'll bring that up on the list.
> 
> Awwww... that's not quite right... FileCorpus has a fairly large test
> harness that exercises just about everything in that side of the
> universe...  - TimS

Sorry Tim, I meant "the only code I was working with", not "the only
code".  The FileCorpus does indeed have a nice looking unit test.

I also didn't realize I was mailing the whole list.  I'm losing it.

So, shall we create a new directory for unit tests?  I imagine there's a
lot of stuff to test, there being close to 100 .py files in the project
now.  A test/test.py should be able to run every test we write, yes?

I've only scratched the surface of the built-in unittest module, so if
someone has a lot of experience with that maybe they could offer up a
suggestion as to how best to structure a unit test.

Neale

From jeremy@alum.mit.edu  Tue Nov 26 21:27:50 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 16:27:50 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <w53znrw6n4d.fsf@woozle.org>
References: <768543JF5ZVRE084RNC64FDTRCB515.3de3e31f@riven>
	<w53znrw6n4d.fsf@woozle.org>
Message-ID: <15843.59222.103960.129526@slothrop.zope.com>

I might suggest adopting the directory structure we've been using for
ZODB, which is roughly this.

The top-level release directory contains two scripts, setup.py and
test.py.  The first is a distutils setup script.  The second is a
driver that runs the unittests.  The directory can also hold the
license, readme, etc.

The actual code is contained in a subdirectory that is a python
package, e.g. a directory named spambayes with an __init__.py.  The
unittests for a particular package live in a subpackage called tests,
e.g. spambayes.tests.

Jeremy


From francois.granger@free.fr  Tue Nov 26 21:38:33 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Tue, 26 Nov 2002 22:38:33 +0100
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
References: <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
Message-ID: <a05100308ba09971a0427@[192.168.1.11]>

At 10:38 -0600 26/11/02, in message Re: [Spambayes] Guidance re 
pickles versus DB for Outlo, Tim Stone - Four Stones Expressions 
wrote:
>Francois gave us a clue on that one yesterday (or so).  Looks like we can
>rearrange this, but it will require copying the module into spambayes...
>yuk... another solution is to clone the module... call it spambayesdbm.  Maybe
>that would have several advantages.
>
>Quoting Francois:
>
><quote>
>on 25/11/02 12:04, Moore, Paul at Paul.Moore@atosorigin.com wrote:
>
>>   Or would it be worth
>>  specifically looking for pybsddb, and using that in preference if it
>>  is present?
>
>Since it uses anydbm, you can copy lib/anydbm.py into your spambayes folder and
>modify line 51:
>
>_names = ['dbhash', 'gdbm', 'dbm', 'dumbdbm']
>
>and add your preferred dbm at the front of the list. It will be used if it exists.
>
>_names = ['pybsddb', 'dbhash', 'gdbm', 'dbm', 'dumbdbm']

We can even preselect the dbm in use, even through anydbm. Doing it 
based on the platform would be a nice idea until the ZODB one is 
packaged with the Python distro. And this is 5 lines of code 
maximum.....
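
A small sketch of the idea, under assumptions: the helper names and
the per-platform orderings are invented for the example, and the
lists mix old and new spellings of the module names so that the
function has something to find on whichever Python runs it.  Each
candidate still has to offer the open() interface that anydbm
expects.

```python
import importlib
import sys


def first_importable(names):
    """Return the first module in names that imports cleanly, or None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def preferred_dbm_names(platform=sys.platform):
    """Hypothetical per-platform preference order for dbm back-ends."""
    if platform.startswith("win"):
        return ["dumbdbm", "dbm.dumb"]
    return ["dbhash", "gdbm", "dbm.gnu", "dumbdbm", "dbm.dumb"]
```

first_importable(preferred_dbm_names()) then gives the module to
call open() on, or None if nothing in the list is installed.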

It is even easier to customize the behavior of these 'standard' dbms. 
See the /Demo/classes/Dbm.py script.....

PS: I am sorry, I made several 'reply to sender' instead of 'Reply 
all' tonight. my apologies.....
-- 
Email is a means of communication. People should ask themselves
about the political implications of the choices (or non-choices)
of their tools and technologies.
For clean emails: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From tim@fourstonesExpressions.com  Tue Nov 26 21:42:03 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 15:42:03 -0600
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <w53znrw6n4d.fsf@woozle.org>
Message-ID: <WQQLPJW51531VZ3E873A9SLIE0.3de3eaab@riven>

11/26/2002 3:24:02 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> >I couldn't agree more.  I found myself wishing we had some when I was
>> >making all the changes.  The only code that has something approaching
>> >a test is the pop3proxy.  I'll bring that up on the list.
>> 
>> Awwww... that's not quite right... FileCorpus has a fairly large test
>> harness that exercises just about everything in that side of the
>> universe...  - TimS
>
>Sorry Tim, I meant "the only code I was working with", not "the only
>code".  The FileCorpus does indeed have a nice looking unit test.

No sweat.  I knew what you were sayin.
>
>I also didn't realize I was mailing the whole list.  I'm losing it.
>
>So, shall we create a new directory for unit tests?  I imagine there's a
>lot of stuff to test, there being close to 100 .py files in the project
>now.  A test/test.py should be able to run every test we write, yes?

I think we should really only test the stuff that's destined for 
"production."  That's a substantially smaller subset of the 100+ files.  Some 
of those non-production files are testers themselves, or useful for some 
aspect of algorithm research.  We could spend a LOT of time designing tests 
for code that will ultimately become part of the historical archive here and 
not much else.

Incidentally, I've identified the following list of files that are necessary 
to run pop3proxy and a procmail filter.  This doesn't include outlook stuff, 
or Jeremy's pspam stuff.

chi2.py
classifier.py
Corpus.py
dbdict.py
FileCorpus.py
hammie.py
hammiebulk.py
hammiecli.py
hammiefilter.py
hammiesrv.py
mboxutils.py
Options.py
pop3proxy.py
sets.py
storage.py
tokenizer.py

- TimS

>
>I've only scratched the surface of the built-in unittest module, so if
>someone has a lot of experience with that maybe they could offer up a
>suggestion as to how best to structure a unit test.
>
>Neale
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Tue Nov 26 21:42:29 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Nov 2002 13:42:29 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: Message from Neale Pickett <neale@woozle.org> 
   of "26 Nov 2002 13:06:29 PST." <w53el9882i2.fsf@woozle.org> 
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com> <w53n0nw84e6.fsf@woozle.org>
	<15843.55749.392683.507121@slothrop.zope.com>  <w53el9882i2.fsf@woozle.org> 
Message-ID: <20021126214229.29B0CF589@cashew.wolfskeep.com>

In message:  <w53el9882i2.fsf@woozle.org>
             Neale Pickett <neale@woozle.org> writes:
>So then, jeremy@alum.mit.edu (Jeremy Hylton) is all like:
>
>> I'm a big fan of unittests.  We should probably develop some.
>
>I couldn't agree more.

Err, before we start writing unit tests, shouldn't we have some
specifications on what everything is actually supposed to do?
While we're at it, it would probably be good to gather requirements
and lay out interfaces for the various elements...


Here's a first cut at what I see as requirements for the code:

R1. The code should be written in a uniform language, as much as
    is possible.  For historical reasons, this language is probably
    going to be python version 2.2.1.

R2. The code (in training mode) should accept email messages and
    their classifications as inputs, and record relevant data
    for later classification.

R3. The code (in classification mode) should accept email messages
    as input and offer trinary (ham/spam/unsure) classification as
    output.

R4. Raw RFC 822 messages should be acceptable as input email messages.

R5. For the MS users, the botch that Outlook turns messages into should
    be acceptable as input email messages.  Despite my opinion of it. ;-)

R6. Classifications should be output as a line of the form:
    X-Spambayes-Classification: ham/spam/unsure
    where only one of ham/spam/unsure should be present, without /.

R7. Classifications may be added to the headers of a raw rfc822
    message, in which case the whole (annotated) message should be
    echoed as output.

R8. In no case should more than one classification be present in
    a message.

R9. If a message provided as training input already has been trained
    with a classification, then it should be untrained from the old
    classification before training with the new classification.

R10. There should be a classifier front-end usable as a procmail filter.

R11. There should be a classifier front-end usable as a pop3 proxy.

R12. There should be a classifier front-end usable as an Outlook plugin.

R13. There should be training front-ends appropriate to each of the
     classifier front-ends.

R14. The classifier front-ends should use a common internal classifier
     module/class/whatnot to do all work not specifically related to
     managing input and output.

R15. The training front-ends should also use a common internal training
     module/class/whatnot to do all work not specifically related to
     managing input and output.

R16. There should not be any costly 'process all the data' operation
     associated with either training or classification.

R17. The internal database for the knowledge gleaned from training
     should be stored in persistent form between invocations.

R18. Changes to the internal database should be reflected in the
     persistent store in a timely manner.

R19. Changes to the persistent representation of the database should
     be done with an eye towards recoverability of the data in the
     case that a power outage (or similar catastrophic event)
     interrupts the update.

R20. The classification method should consist of some combination
     scheme applied to ham/spam probabilities associated with tokens
     derived from parsing the email messages.

R21. The chi-square combination method should be an allowed combination
     scheme.

R22. The gary-combining method may be an allowed combination scheme.

R23. The modified Graham probability computation (without biases,
     with bayesian adjustment, etc.) should be an allowed probability
     computation scheme.

R24. The tokenization scheme should be based on the recent spambayes
     tokenizer code, which I don't feel like describing in sufficient
     detail at this time.

etc...
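
To make R3 and R6 through R8 concrete, here is a minimal sketch.  The
two cutoff values and the function names are assumptions for the
example, not part of the requirements; the header name is the one
given in R6.

```python
import email

HEADER = "X-Spambayes-Classification"
HAM_CUTOFF = 0.20   # assumed threshold, not taken from the requirements
SPAM_CUTOFF = 0.90  # assumed threshold, not taken from the requirements


def classify(score):
    """Map a spam probability in [0, 1] to ham/spam/unsure (R3)."""
    if score < HAM_CUTOFF:
        return "ham"
    if score >= SPAM_CUTOFF:
        return "spam"
    return "unsure"


def annotate(raw_message, score):
    """Return the message text carrying exactly one classification header."""
    msg = email.message_from_string(raw_message)
    # Deleting removes every existing occurrence and is a no-op when
    # the header is absent, which keeps R8 true on re-classification.
    del msg[HEADER]
    msg[HEADER] = classify(score)
    return msg.as_string()
```

Running annotate() over a message that already carries a
classification replaces the old header rather than adding a second
one.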


- Alex

From jeremy@alum.mit.edu  Tue Nov 26 21:59:41 2002
From: jeremy@alum.mit.edu (Jeremy Hylton)
Date: Tue, 26 Nov 2002 16:59:41 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <20021126214229.29B0CF589@cashew.wolfskeep.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com>
	<w53n0nw84e6.fsf@woozle.org>
	<15843.55749.392683.507121@slothrop.zope.com>
	<w53el9882i2.fsf@woozle.org>
	<20021126214229.29B0CF589@cashew.wolfskeep.com>
Message-ID: <15843.61133.450511.743679@slothrop.zope.com>

>>>>> "TAP" == T Alexander Popiel <popiel@wolfskeep.com> writes:


  TAP> R1. The code should be written in a uniform language, as much
  TAP>     as is possible.  For historical reasons, this language is
  TAP>     probably going to be python version 2.2.1.

Em, how about 2.2.2?  Why would we write against a more-buggy version
of Python than 2.2.2?

Jeremy


From popiel@wolfskeep.com  Tue Nov 26 22:06:15 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Nov 2002 14:06:15 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: Message from jeremy@alum.mit.edu (Jeremy Hylton) 
	<15843.61133.450511.743679@slothrop.zope.com> 
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net>
	<15843.45210.633174.813679@slothrop.zope.com> <w53n0nw84e6.fsf@woozle.org>
	<15843.55749.392683.507121@slothrop.zope.com> <w53el9882i2.fsf@woozle.org>
	<20021126214229.29B0CF589@cashew.wolfskeep.com>
	<15843.61133.450511.743679@slothrop.zope.com> 
Message-ID: <20021126220615.D1732F589@cashew.wolfskeep.com>

In message:  <15843.61133.450511.743679@slothrop.zope.com>
             jeremy@alum.mit.edu (Jeremy Hylton) writes:
>>>>>> "TAP" == T Alexander Popiel <popiel@wolfskeep.com> writes:
>
>
>  TAP> R1. The code should be written in a uniform language, as much
>  TAP>     as is possible.  For historical reasons, this language is
>  TAP>     probably going to be python version 2.2.1.
>
>Em, how about 2.2.2?  Why would we write against a more-buggy version
>of Python than 2.2.2?

Unfortunately, python 2.2.1 is the most recent version of python
packaged for debian-stable.  Requiring a version more recent than
that will be a hassle for many people (myself included).

- Alex

From db3l@fitlinxx.com  Tue Nov 26 22:20:20 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 26 Nov 2002 17:20:20 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
References: <w53u1i4aldr.fsf@woozle.org>
	<LCEPIIGDJPKCOIHOBJEPAEIIHOAA.mhammond@skippinet.com.au>
Message-ID: <u8yzgc6sb.fsf@fitlinxx.com>

"Mark Hammond" <mhammond@skippinet.com.au> writes:

> Actually, I doubt this will fly.  We will need to create a stand-alone DLL,
> and a "one click" installer.  This should be interesting <wink>.  Good
> excuse to look at how these Python distribution tools have progressed
> recently.

It was back earlier in November, but I did package up the Outlook
addin at that point in time using installer (5b4) and it worked great.
I didn't bother finishing the step of putting it into an Inno
Setup installer, but the created distribution tree seemed to install
fine.  I'll try to re-verify with the latest CVS.

For a final build you do need to tweak the "drivexxxx" module that
installer creates, since by default it just does the registration and
not the additional step of inserting the registry key for the Outlook
addin that addin.py does, but that's about it.

-- David


From skip@pobox.com  Tue Nov 26 19:00:47 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 13:00:47 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
References: <w534ra49u76.fsf@woozle.org>
        <WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
Message-ID: <15843.50399.167526.290462@montanaro.dyndns.org>


    Tim> Francois gave us a clue on that one yesterday (or so).  Looks like
    Tim> we can rearrange this, but it will require copying the module into
    Tim> spambayes...  yuk... another solution is to clone the
    Tim> module... call it spambayesdbm.  Maybe that would have several
    Tim> advantages.

What's wrong with

    import anydbm
    anydbm._names.remove("dbhash")

?

    Tim> _names = ['pybsddb', 'dbhash', 'gdbm', 'dbm', 'dumbdbm']

Note that you probably can't just add "pybsddb" to the front of the
list.  Anydbm expects a certain API to be exported from the underlying
database modules, which is why you see "dbhash" and not "bsddb" in the list.

Skip

From skip@pobox.com  Tue Nov 26 18:53:25 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 12:53:25 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <15843.40441.659922.991160@slothrop.zope.com>
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
        <KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
        <15842.62697.829412.348546@slothrop.zope.com>
        <15843.39397.770235.412408@montanaro.dyndns.org>
        <15843.40441.659922.991160@slothrop.zope.com>
Message-ID: <15843.49957.663560.838946@montanaro.dyndns.org>


    SM> * Dealing with Zope's monolithic system is frustrating to people
    SM> (like me) who are used to having files reside in filesystems.  Some
    SM> of that frustration probably carries over to ZODB, though it's
    SM> almost certainly not ZODB's problem.

    Jeremy> This sounds like a Zope complaint ...

Like I said, for better or for worse, perceptions about Zope rub off on
ZODB. ;-)

    SM> * It seems to grow without bound, else why do I need to pack my
    SM> Data.fs file every now and then?

    Jeremy> It grows without bound unless you pack it.  Why is that a
    Jeremy> problem?  BerkeleyDB log files grow without bound, too.

Hmmm...  What log files?  If I do something like

    import bsddb
    db = bsddb.hashopen("foo", "c")
    while 1:
        db["1"] = "1"*100000

the underlying file doesn't grow without bound.  (I let the above run just
now for about five minutes.  Final size after I interrupted it and closed
the file was 114k.)  If db were a ZODB file would that still be true?  I
thought all writes in ZODB (or perhaps this is a property of FileStorage?)
were at the end of the file.  Space freed up isn't reused, hence the need
for the occasional pack, right?

Skip

From db3l@fitlinxx.com  Tue Nov 26 22:34:24 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 26 Nov 2002 17:34:24 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
References: <LCEPIIGDJPKCOIHOBJEPGEGPHOAA.mhammond@skippinet.com.au>
	<KFDCKI53GFCALK829664VTA5074OJYX.3de2d94d@riven>
	<15842.62697.829412.348546@slothrop.zope.com>
	<15843.39397.770235.412408@montanaro.dyndns.org>
Message-ID: <u4ra4c64v.fsf@fitlinxx.com>

Skip Montanaro <skip@pobox.com> writes:

> For most of us who have *any* experience with ZODB it's probably all
> indirect via Zope, so there are probably some inaccurate perceptions about
> it.  These thoughts that have come to my mind at one time or another:

Just so you know there are at least some other experiences out there,
in our case we've been using ZODB as the persistent storage for a
scheduling system of ours since late 2000, and have never used Zope
itself, other than for an installation from which to manually extract
ZODB (we were using it before there was a standalone package).

>     * How could a database from a company (Zope) whose sole business is not
>       databases be more reliable than a database from organizations whose
>       sole raison d'etre is databases (Sleepycat, Postgres, MySQL, ...)?

Since the default FileStorage back-end is really just a bunch of
concatenated object pickles with enough meta data to skip around the
file and mark transaction boundaries, it's not really like you need to
compare to a full relational system, nor build in all the capabilities
such a system requires.  And of course, that's just the FileStorage
back-end.

In our experience the default FileStorage has proven very resilient,
although we do have a regular task that backs up and packs the
database.  But I don't think we've ever had one fail to load -
probably lost a final uncommitted transaction a time or two but that's
to be expected, and things were still consistent.

We also pack because we're constantly modifying lots of the
persistent objects (as the scheduler runs) - I expect
things would grow more gradually with spambayes, much as the current
pickle tends to stabilize, and only during training.

>     * Dealing with Zope's monolithic system is frustrating to people (like
>       me) who are used to having files reside in filesystems.  Some of that
>       frustration probably carries over to ZODB, though it's almost
>       certainly not ZODB's problem.

That has little to do with ZODB - we've always used ZODB directly and
just consider it a way to persist our application objects (virtually)
transparently.

>     * It seems to grow without bound, else why do I need to pack my Data.fs
>       file every now and then?

That's specific to FileStorage - different back-ends can handle things
differently.  Most of the growth with FileStorage is to handle
transactions and rollbacks (it just keeps appending and ends up having
"deleted" copies of objects around).  But the appending also means
that it's rarely re-writing lots of older data, which helps with the
robustness.
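
As a toy illustration of that append-then-pack pattern (this is not
FileStorage's actual file format, and the class and method names are
made up for the example):

```python
import os
import pickle


class AppendOnlyStore:
    """Toy key/value store: every write appends a pickled (key, value)."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> most recent value
        if os.path.exists(path):
            self._replay()

    def _replay(self):
        # Read records in order; later records for a key win.
        with open(self.path, "rb") as f:
            while True:
                try:
                    key, value = pickle.load(f)
                except EOFError:
                    break
                self.index[key] = value

    def __setitem__(self, key, value):
        # Older records are never rewritten, only superseded.
        with open(self.path, "ab") as f:
            pickle.dump((key, value), f)
        self.index[key] = value

    def __getitem__(self, key):
        return self.index[key]

    def pack(self):
        # Write only the live records to a new file, then swap it in.
        tmp = self.path + ".pack"
        with open(tmp, "wb") as f:
            for item in self.index.items():
                pickle.dump(item, f)
        os.replace(tmp, self.path)
```

Assigning to the same key over and over makes the file grow with
every write; pack() brings it back down to one record per live key.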

> It doesn't really matter if the perceptions are accurate or not.  They still
> need to be addressed to some extent before people are going to be
> comfortable with it.  ZODB is, for better or for worse, tied to Zope the
> application.  Accordingly, perceived problems with Zope will rub off on
> ZODB.

I'm not totally sure I agree - we're talking about ZODB being a
behind-the-scenes back-end to spambayes.  It's quite possible that
many users (at least end users as opposed to developers) would never
even think about how the information is being stored behind the
scenes, nor care as long as it worked.

One of the interesting thoughts for ZODB that I haven't seen mentioned
here would be the possibility of using ZEO to permit multiple clients
to share the same database transparently.  For me I might use that to
ensure when I read with Outlook at work versus home that I have access
to the same training data, but maybe later enhancements could key the
database data (either ham only or everything) by the executing user,
and permit the data itself to be stored centrally for a community of
users (and thus backed up and packed if necessary by an administrator).

-- David


From skip@pobox.com  Tue Nov 26 22:35:26 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 16:35:26 -0600
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <20021126220615.D1732F589@cashew.wolfskeep.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
        <OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
        <15843.43794.280220.481129@slothrop.zope.com>
        <15843.44526.791749.654112@slothrop.zope.com>
        <2mwun0dyt6.fsf@starship.python.net>
        <w53n0nw84e6.fsf@woozle.org>
        <w53el9882i2.fsf@woozle.org>
        <20021126214229.29B0CF589@cashew.wolfskeep.com>
        <15843.61133.450511.743679@slothrop.zope.com>
        <20021126220615.D1732F589@cashew.wolfskeep.com>
Message-ID: <15843.63278.932180.923883@montanaro.dyndns.org>


    Alex> Unfortunately, python 2.2.1 is the most recent version of python
    Alex> packaged for debian-stable.  Requiring a version more recent than
    Alex> that will be a hassle for many people (myself included).

(I think we travelled this road before, but just in case...)

Isn't it possible for you to grab and expand the 2.2.2 tar file, then
execute

    configure --prefix=$HOME/local
    make install

and then put $HOME/local/bin in front of your PATH?

Skip

From db3l@fitlinxx.com  Tue Nov 26 22:46:22 2002
From: db3l@fitlinxx.com (David Bolen)
Date: 26 Nov 2002 17:46:22 -0500
Subject: [Spambayes] Re: Important information for Outlook users
References: <16E1010E4581B049ABC51D4975CEDB88619958@UKDCX001.uk.int.atosorigin.com>
Message-ID: <uznrwar0h.fsf@fitlinxx.com>

"Moore, Paul" <Paul.Moore@atosorigin.com> writes:

> Otherwise, I see no reason why the Inbox and Unsure folders should
> show the Spam field as numbers rather than as percentages.

In my case I found it necessary to do some manual cleanup and ensure
that I didn't have any other folders with that attribute (e.g., I
didn't catch that I had some scored messages in my Deleted folder).
That could be from not deregistering the addin first.

But by manually blowing away the "Outlook folder" level field in any
of the folders that had it (after removing it from any columns in my
visible windows), and then running the delete script, I finally
managed to get everything flushed, and would see the new Spam user
defined field showing up in the relevant folders.  Some of it was
trial and error, in that if I saw the trace window complain that it
couldn't create the Spam field due to it already existing, I'd use
that to key on a folder to manually remove it from again.

Once I then re-created the columns, it was showing up as a percent
without doing anything.  Note that while I was occasionally manually
deleting the user defined field in folders in Outlook, I never
manually created it, but left that to the addin whenever it first
accessed a folder.

-- David


From tim@fourstonesExpressions.com  Tue Nov 26 22:52:57 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 16:52:57 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w53adjw826o.fsf@woozle.org>
Message-ID: <ED2WC93ZQYSC0HD94QNDAM64GAGFA7.3de3fb49@riven>

Neale, if you want to give me a hint as to how you'd like to see the dbdict 
write cache implemented, I'm willing to give it a crack... - TimS

11/26/2002 3:13:19 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:
>
>> Do we really need this state object at this point?  I can't see any 
>> benefit to having this object at this point...
>
>Well, we do get some use out of it.  Or we would if we used it the way I
>thought we were :)
>
>I had intended for this class to get pickled and stored like so in
>DBDictClassifier.store():
>
>  self.wordinfo[self.statekey] = self.meta
>
>If we do this, then we get something that will break when the
>PICKLE_VERSION changes.  Also, as has been pointed out by someone else,
>it's a handy place to store things, without DBDictClassifier having to
>know what's inside itself.
>
>OTOH, since it's a subclass of classifier.Classifier, it'd be just as
>easy to write getstate() and setstate() methods on Classifier to return
>what would be in self.meta.  They don't have to be named that, of
>course.
>
>I don't know which one is uglier.  But right now the DBDict is storing
>"meta" information differently than everything else, so something's
>gotta give.  Anyone have an opinion on which is better?
>
>Neale
>
>
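A minimal sketch of the getstate()/setstate() alternative Neale describes, assuming the saved state is just a version number plus the nspam and nham counts. The method names follow his suggestion; STATE_KEY, the tuple layout, and the plain dict standing in for the database are assumptions made for illustration.

```python
PICKLE_VERSION = 4   # same name as in classifier.py; the value is illustrative


class Classifier:
    def __init__(self):
        self.nspam = 0
        self.nham = 0

    def getstate(self):
        """Return everything a backend must save besides the word records."""
        return (PICKLE_VERSION, self.nspam, self.nham)

    def setstate(self, state):
        """Restore saved counts, refusing state written by another version."""
        version, nspam, nham = state
        if version != PICKLE_VERSION:
            raise ValueError("can't load state version %r" % (version,))
        self.nspam, self.nham = nspam, nham


# A plain dict stands in for the database; STATE_KEY is a made-up name.
STATE_KEY = "saved state"
db = {}

trained = Classifier()
trained.nspam, trained.nham = 14, 20
db[STATE_KEY] = trained.getstate()

restored = Classifier()
restored.setstate(db[STATE_KEY])
```

With this shape the storage backend only ever sees an opaque tuple, so it does not need to know what the classifier keeps in it.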


c'est moi - TimS
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Tue Nov 26 22:55:19 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Nov 2002 14:55:19 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> 
	<15843.63278.932180.923883@montanaro.dyndns.org> 
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net> <w53n0nw84e6.fsf@woozle.org>
	<w53el9882i2.fsf@woozle.org> <20021126214229.29B0CF589@cashew.wolfskeep.com>
	<15843.61133.450511.743679@slothrop.zope.com>
	<20021126220615.D1732F589@cashew.wolfskeep.com>
	<15843.63278.932180.923883@montanaro.dyndns.org> 
Message-ID: <20021126225519.49EF8F589@cashew.wolfskeep.com>

In message:  <15843.63278.932180.923883@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Alex> Unfortunately, python 2.2.1 is the most recent version of python
>    Alex> packaged for debian-stable.  Requiring a version more recent than
>    Alex> that will be a hassle for many people (myself included).
>
>(I think we travelled this road before, but just in case...)

Yes, we have ( ;-) ), but just to rehash:

>Isn't it possible for you to grab and expand the 2.2.2 tar file, then
>execute
>
>    configure --prefix=$HOME/local
>    make install
>
>and then put $HOME/local/bin in front of your PATH?

Yes, it's possible, but then I have to pay more attention to upgrades
than my normal method of primarily listening to the debian security
announcement list.

For some people, the local-python-in-the-home-dir approach won't work
because they've got draconian disk quotas to contend with.

For yet other people, the local copy of python won't work because
the machine involved is not under their control and they don't
have permission to create executables.

- Alex

From francois.granger@free.fr  Tue Nov 26 22:16:41 2002
From: francois.granger@free.fr (francois.granger@free.fr)
Date: Tue, 26 Nov 2002 23:16:41 +0100 (CET)
Subject: [Spambayes] UnitTest (was: Re: Guidance re pickles versus DB for
	Outlook)
In-Reply-To: <w53znrw6n4d.fsf@woozle.org>
References: <768543JF5ZVRE084RNC64FDTRCB515.3de3e31f@riven>
	<w53znrw6n4d.fsf@woozle.org>
Message-ID: <1038349001.3de3f2c9ee7b1@imp.free.fr>

In reply to Neale Pickett <neale@woozle.org>:

> I've only scratched the surface of the built-in unittest module, so if
> someone has a lot of experience with that maybe they could offer up a
> suggestion as to how best to structure a unit test.

Did you have a chance to look at

http://diveintopython.org/

In my limited experience as a programmer, it is the best coverage of unit tests in Python that I have seen.

From msew@ev1.net  Tue Nov 26 23:28:29 2002
From: msew@ev1.net (msew)
Date: Tue, 26 Nov 2002 15:28:29 -0800
Subject: [Spambayes] merging tomorrow--big code changes
In-Reply-To: <w53d6ouu41i.fsf@woozle.org>
References: <0281uuocck6494jibulp22s6m1vu704f12@4ax.com>
 <w53y97ju97i.fsf@woozle.org>
 <0281uuocck6494jibulp22s6m1vu704f12@4ax.com>
Message-ID: <5.2.0.9.0.20021126152706.06650c18@mail.ev1.net>

At 12:05 02/11/24 -0800, Neale Pickett wrote:

>So then, Richie Hindle <richie@entrian.com> is all like:
>
> > 1. X-Hammie-Disposition: Yes/No/Unsure          (ie. do nothing)
> > 2. X-Spambayes-Classification: Spam/Ham/Unsure
> > 3. X-Ham-Status: Yes/No/Unsure
> >
> > Can we have a vote?
>
>I like 2.  I got another message in private expressing support for 2.
>Anyone else?

I vote 2 (as a lurker and someone who wants to use this).  Hammie and 
Ham don't strike me as good names at all.


>X-Spambayes-Classification


is a good name.  Get rid of the ham verbiage.
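For illustration only, a sketch of how option 2 could be stamped onto a message with the standard email package. The header name and the three values come from the vote above; the tag_message helper is hypothetical and is not the project's code.

```python
from email.message import EmailMessage

HEADER = "X-Spambayes-Classification"
VERDICTS = ("Spam", "Ham", "Unsure")


def tag_message(msg, verdict):
    """Set the classification header, replacing any earlier value."""
    if verdict not in VERDICTS:
        raise ValueError("unknown verdict: %r" % (verdict,))
    del msg[HEADER]          # no error if the header is absent
    msg[HEADER] = verdict
    return msg


msg = EmailMessage()
msg["Subject"] = "merging tomorrow"
tag_message(msg, "Unsure")
```

A mail client rule can then file messages by matching on that single header.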


~msew

From mhammond@skippinet.com.au  Tue Nov 26 23:34:31 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 27 Nov 2002 10:34:31 +1100
Subject: [Spambayes] Re: Important information for Outlook users
In-Reply-To: <uznrwar0h.fsf@fitlinxx.com>
Message-ID: <LCEPIIGDJPKCOIHOBJEPOELIHOAA.mhammond@skippinet.com.au>

> In my case I found it necessary to do some manual cleanup and ensure
> that I didn't have any other folders with that attribute (e.g., I
> didn't catch that I had some scored messages in my Deleted folder).
> That could be from not deregistering the addin first.

I'll try and get back to the meat of this later, but I saw Paul mention he
was fairly sure unregistering the addin had some effect.

I am fairly certain that unregistering the addin will have zero effect on
any of the field workings.

I'll try and summarize here my understanding of the field mechanism.

Terminology:

Extended MAPI is a complicated, low-level COM based API for messaging.  It
is generally only available via C++, as it has no type library, and uses
very complicated memory management.  Python bindings for this exist which
almost make it seem usable.  This is the cornerstone of Exchange (Server and
Client) and also the Outlook client.  It is built for speed and flexibility.
It achieves both at the cost of complexity.

CDO, which used to be known as MAPI (<frown> - hence "Extended" as added
above) is an IDispatch COM based interface to Extended MAPI.  It was created
so VB and other scripting languages could use this API.

Outlook now comes with its own COM, IDispatch based object model.  It is
richer than CDO in non-messaging related functions.  Outlook fully
implements Extended MAPI (both as its own message store, and connecting to
other MAPI providers such as Exchange).  Outlook supports CDO as a
by-product of supporting extended MAPI, but CDO is officially "deprecated"
for this.

SpamBayes:

The SpamBayes Outlook plugin does almost everything via Extended MAPI.  It
does not use CDO at all.  The Outlook object model is used only for this
field mess, and for User Interface and New Message event handling.

Properties:

Extended MAPI/CDO have a rich property system.  User defined "property
names" can be created with a fairly global name-to-ID mapping.  Any
individual item in the message store can have any set of properties
associated with it - including these user defined properties.

These properties have an interesting type system - although the name-to-ID
mapping is global, each instance of a property can potentially have a
different type.  In practice, the same type is used for a given ID.  MAPI
has no concept of a "property format" - just an ID and type.

Thus, for any and every message in your store, it is possible to have a
field called "Spam", all of which share the same ID, but having different
types.  And indeed, I recently changed the type that new properties of this
ID are written with.
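The property model described above can be pictured with a toy sketch. This is not MAPI and not the plugin's code: NameRegistry, Item, the starting ID, and the "int"/"double" type tags are all made up to show one name mapping to one ID while individual items carry values of different types.

```python
from dataclasses import dataclass, field


class NameRegistry:
    """Store-wide mapping from property names to numeric IDs (toy model)."""

    def __init__(self, first_id=0x8000):   # starting ID is illustrative
        self._ids = {}
        self._next = first_id

    def id_for(self, name):
        """Return the ID for a name, allocating one on first use."""
        if name not in self._ids:
            self._ids[name] = self._next
            self._next += 1
        return self._ids[name]


@dataclass
class Item:
    """One message: maps property ID -> (type tag, value)."""
    props: dict = field(default_factory=dict)

    def set_prop(self, prop_id, type_tag, value):
        self.props[prop_id] = (type_tag, value)


registry = NameRegistry()
spam_id = registry.id_for("Spam")

old_msg = Item()
old_msg.set_prop(spam_id, "int", 87)       # written with the earlier type
new_msg = Item()
new_msg.set_prop(spam_id, "double", 0.87)  # written with the newer type
```

Both items answer to the same ID, yet nothing in this model records a display format, which is the gap the Outlook-level "User Defined Field" concept fills.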

Outlook itself has a richer model for "User Defined Fields".  A User Defined
Field, as you can see from the UI, has more "user oriented" data types (eg,
formulas, both "number" and "percent", etc).  It also introduces the concept
of a "Format", and introduces the concept of "adding the field to the
folder".  Although Outlook does indeed store each individual message's
user-defined property value as a MAPI property, it is not documented (nor
yet discoverable by me) how MAPI properties and these other extended Outlook
concepts fit together.

So, what this means is:  The plugin successfully uses MAPI properties for
all filtering and scoring purposes.  This appears to always correctly set
the field for each individual message item processed by the plugin.
Unfortunately, for concepts not covered by MAPI, such as the fields in the
Field Chooser list, we must revert to Outlook, and this model doesn't work
too well.  To make matters worse, deleting a field from the Outlook UI does
*not* change *any* existing messages - it only changes this undocumented
Outlook specific extension to the model - ie, all existing MAPI properties
remain unchanged.

So, at the end of the day, the plugin has zero effect on any of this.  The
definition of MAPI properties, and of Outlook user-defined fields is quite
independent of the plugin.  Uninstalling the plugin will leave all
properties and fields in exactly the same state.  A stand-alone script can,
and does, create such fields, and these fields exist well after the script
has finished/been-deleted/whatever.

Hope this makes some sense - well, as much sense to you as it does to me
<wink>

Mark


From B-Morgan@concentric.net  Tue Nov 26 23:48:43 2002
From: B-Morgan@concentric.net (Brad Morgan)
Date: Tue, 26 Nov 2002 16:48:43 -0700
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <20021126220615.D1732F589@cashew.wolfskeep.com>
Message-ID: <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>

>>  TAP> R1. The code should be written in a uniform language, as much
>>  TAP>     as is possible.  For historical reasons, this language is
>>  TAP>     probably going to be python version 2.2.1.
>>
>>Em, how about 2.2.2?  Why would we write against a more-buggy version
>>of Python than 2.2.2?

> Unfortunately, python 2.2.1 is the most recent version of python
> packaged for debian-stable.  Requiring a version more recent than
> that will be a hassle for many people (myself included).

ActivePython which I believe is the preferred Windows version is also at
2.2.1.  Would the "Windows experts" comment on ActivePython vs. some other
Windows distribution?

Regards,

Brad Morgan


From mhammond@skippinet.com.au  Wed Nov 27 01:56:36 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Wed, 27 Nov 2002 12:56:36 +1100
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>
Message-ID: <LCEPIIGDJPKCOIHOBJEPKEMAHOAA.mhammond@skippinet.com.au>

> ActivePython which I believe is the preferred Windows version is also at
> 2.2.1.  Would the "Windows experts" comment on ActivePython vs. some other
> Windows distribution?

There is no ActivePython build that works with this plugin - and only the
very latest win32all builds do.

I am no longer involved with ActiveState, or in any way with ActivePython -
so for stuff I am involved in (like the Windows extensions), my releases are
always likely to be more up to date.

Mark.


From pje@telecommunity.com  Wed Nov 27 02:29:39 2002
From: pje@telecommunity.com (Phillip J. Eby)
Date: Tue, 26 Nov 2002 21:29:39 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>
References: <20021126220615.D1732F589@cashew.wolfskeep.com>
Message-ID: <5.1.0.14.0.20021126212859.0656ce20@mail.telecommunity.com>

At 04:48 PM 11/26/02 -0700, Brad Morgan wrote:

>ActivePython which I believe is the preferred Windows version is also at
>2.2.1.  Would the "Windows experts" comment on ActivePython vs. some other
>Windows distribution?

I've never used ActivePython, myself; I always just download the Windows 
installer from python.org.  Which means I'm on 2.2.2.  :)


From skip@pobox.com  Wed Nov 27 01:15:22 2002
From: skip@pobox.com (Skip Montanaro)
Date: Tue, 26 Nov 2002 19:15:22 -0600
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook 
In-Reply-To: <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>
References: <20021126220615.D1732F589@cashew.wolfskeep.com>
        <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>
Message-ID: <15844.7338.235986.880992@montanaro.dyndns.org>

Folks,

Another thing to remember is that there should be no functional difference
between 2.2.n and 2.2.n+1.  The people using 2.2.1 might run into a bug that
was fixed in 2.2.2 (check the 2.2.2 WhatsNew document), but should not
encounter changes to the API, language or library.

Skip

From neale@woozle.org  Wed Nov 27 03:02:38 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 19:02:38 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <20021126225519.49EF8F589@cashew.wolfskeep.com>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net> <w53n0nw84e6.fsf@woozle.org>
	<w53el9882i2.fsf@woozle.org>
	<20021126214229.29B0CF589@cashew.wolfskeep.com>
	<15843.61133.450511.743679@slothrop.zope.com>
	<20021126220615.D1732F589@cashew.wolfskeep.com>
	<15843.63278.932180.923883@montanaro.dyndns.org>
	<20021126225519.49EF8F589@cashew.wolfskeep.com>
Message-ID: <w53n0nv7m0h.fsf@woozle.org>

So then, "T. Alexander Popiel" <popiel@wolfskeep.com> is all like:

> >Isn't it possible for you to grab and expand the 2.2.2 tar file, then
> >execute
> >
> >    configure --prefix=$HOME/local
> >    make install
> >
> >and then put $HOME/local/bin in front of your PATH?
> 
> Yes, it's possible, but then I have to pay more attention to upgrades
> than my normal method of primarily listening to the debian security
> announcement list.

Alex, python2.2 is in stable.  Just "apt-get install python2.2" and
you're gold.

Neale

From neale@woozle.org  Wed Nov 27 03:04:07 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 19:04:07 -0800
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <w53n0nv7m0h.fsf@woozle.org>
References: <15843.43038.341350.515691@slothrop.zope.com>
	<OLB0OMYVROYWXR64HEROGKNH775L.3de3a92c@riven>
	<15843.43794.280220.481129@slothrop.zope.com>
	<15843.44526.791749.654112@slothrop.zope.com>
	<2mwun0dyt6.fsf@starship.python.net> <w53n0nw84e6.fsf@woozle.org>
	<w53el9882i2.fsf@woozle.org>
	<20021126214229.29B0CF589@cashew.wolfskeep.com>
	<15843.61133.450511.743679@slothrop.zope.com>
	<20021126220615.D1732F589@cashew.wolfskeep.com>
	<15843.63278.932180.923883@montanaro.dyndns.org>
	<20021126225519.49EF8F589@cashew.wolfskeep.com>
	<w53n0nv7m0h.fsf@woozle.org>
Message-ID: <w53isyj7ly0.fsf@woozle.org>

So then, Neale Pickett <neale@woozle.org> is all like:

> Alex, python2.2 is in stable.  Just "apt-get install python2.2" and
> you're gold.

I'm an idiot, you were talking about a micro revision, not a minor rev.
The python2.2 in stable is 2.2.1.

I need to do something about this itchy send-finger.

Neale

From tim@fourstonesExpressions.com  Wed Nov 27 03:22:03 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Tue, 26 Nov 2002 21:22:03 -0600
Subject: [Spambayes] Bouncing Spam
Message-ID: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>

This is a bit off topic, but perhaps someone knows... Is it possible to return 
a mail in such a manner that it appears to the sender to have bounced?  It'd 
be real nice to have a 'bounce' option for spam, so that the senders *might* 
take the address off their list, if they think it's a dead address...

c'est moi - TimS
www.fourstonesExpressions.com 


From wsy@merl.com  Wed Nov 27 03:40:15 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Tue, 26 Nov 2002 22:40:15 -0500
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven> (message from Tim
	Stone - Four Stones Expressions on Tue, 26 Nov 2002 21:22:03 -0600)
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
Message-ID: <200211270340.gAR3eF532476@localhost.localdomain>


   From: Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com>

   This is a bit off topic, but perhaps someone knows... Is it
   possible to return a mail in such a manner that it appears to the
   sender to have bounced?  It'd be real nice to have a 'bounce'
   option for spam, so that the senders *might* take the address off
   their list, if they think it's a dead address...

Yes, BUT you have to be the program sitting on TCP port 25, which
means a privved program, or at least a trusted one.

You certainly can fake a failed address by returning a code 550 in 
response to a "rcpt to" command from the other end.

Read the RFC (I think it's RFC 2821 but not sure) for the
permissible error bounces, and when you can send them.  All 5xx codes
are errors of one form or another, and I believe that code 550 is
"user unknown".  However, some implementations may get very
annoyed when you pop a 550 response any time other than after "rcpt".

There _are_ other 5xx codes you can use that generate bounces,
but heck, give it a try.  The worst thing is... the mail may 
bounce!  :)
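A minimal sketch of the decision Bill describes, assuming a server that already parses the SMTP dialogue and only needs the reply line for each "rcpt to". The rcpt_reply helper, the address set, and the reply texts are hypothetical; only the 250 and 550 codes come from SMTP.

```python
def rcpt_reply(recipient, pretend_dead):
    """Return the reply line for one RCPT TO argument.

    pretend_dead is a set of lowercase addresses that should look
    undeliverable to the sending side.
    """
    address = recipient.strip().strip("<>").lower()
    if address in pretend_dead:
        return "550 No such user here"
    return "250 OK"


dead = {"old-alias@example.org"}
replies = [rcpt_reply(r, dead)
           for r in ("<Old-Alias@example.org>", "<tim@example.org>")]
```

The sending MTA, not this code, turns the 550 into whatever bounce the spammer eventually sees.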

Maybe I should... code this up?  :)

	  -Bill Yerazunis

From steve@blighty.com  Wed Nov 27 04:17:56 2002
From: steve@blighty.com (Steve Atkins)
Date: Tue, 26 Nov 2002 20:17:56 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>;
	from tim@fourstonesExpressions.com on Tue, Nov 26, 2002 at 09:22:03PM -0600
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
Message-ID: <20021126201756.A20174@blighty.com>

On Tue, Nov 26, 2002 at 09:22:03PM -0600, Tim Stone - Four Stones Expressions wrote:

> This is a bit off topic, but perhaps someone knows... Is it possible to return 
> a mail in such a manner that it appears to the sender to have bounced?  It'd 
> be real nice to have a 'bounce' option for spam, so that the senders *might* 
> take the address off their list, if they think it's a dead address...

Yes, there are programs that'll do this by sending mail that looks
like a bounce message. They're pretty worthless because you have no
idea who the original sender was. Typically the faked bounce ends up
at an innocent third-party (if the faked from address exists) or a
doublebounce/postmaster mailbox (if it doesn't). Don't do this unless
you deeply understand what you're doing and why.

The only way to safely bounce spam is to drop it with a permanent
(5xx) rejection at an appropriate point in the SMTP transaction at the
first point it enters your network (i.e. if your secondary MX accepts
it, don't reject the delivery to your primary). Even then you need to
read RFC 2821 very carefully. A lot of people throw non-defined errors
or throw them at the wrong point in the transaction, which can make
things worse rather than better. Doing this with the MTA's support is
the only easy way to do it.

Cheers,
  Steve


From tim_one@email.msn.com  Wed Nov 27 05:19:19 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 00:19:19 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <u4ra4c64v.fsf@fitlinxx.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEINCPAB.tim_one@email.msn.com>

[David Bolen]
> ...
> One of the interesting thoughts for ZODB that I haven't seen mentioned
> here would be the possibility of using ZEO to permit multiple clients
> to share the same database transparently.

Jeremy (Hylton) has already done this; the code is in this project's pspam
subdirectory.  But if some people are frightened by ZODB ...

discretely y'rs  - tim


From tim_one@email.msn.com  Wed Nov 27 05:19:21 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 00:19:21 -0500
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <NABBJOOEOFODEALNMJAJEECFHEAA.B-Morgan@concentric.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEINCPAB.tim_one@email.msn.com>

[Brad Morgan]
> Would the "Windows experts" comment on ActivePython vs. some other
> Windows distribution?

If you want to redistribute, be sure to read the ActivePython license (short
course:  you probably can't redistribute AP).  Apart from that, AP's
installer is more modern than the PLabs installer will ever be.  The core
Python content (interpreter and libraries) is the same.  The spambayes
Outlook client requires a version of MarkH's win32 extensions more recent
than ActiveState yet ships.


From tim_one@email.msn.com  Wed Nov 27 05:31:27 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 00:31:27 -0500
Subject: [Spambayes] Re: Important information for Outlook users
In-Reply-To: <LCEPIIGDJPKCOIHOBJEPOELIHOAA.mhammond@skippinet.com.au>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIPCPAB.tim_one@email.msn.com>

[Mark Hammond, with an excellent intro to MAPI and Outlook models]

I could think of only one reason why deregistering the addin might have
helped:  email kept coming in *while* I was running delete_outlook_field.py,
so the old field kept getting created again while I was trying to delete it.
This took longer than you think <wink>, because about 50 folders across 3
different .pst files needed cleaning.

I don't care, though.  Repeated poke-and-hope solves everything in the end.
I just wish Outlook had a way to say:  this is my view named Spammie.  I
want the same friggin' view applied to *every* folder that uses Spammie.
But any modification to Spammie appears to affect only a version local to
the folder current at the time of modification.  So this dance has to be
repeated (in my case) about 50 times.

Now that we've got a percent sign, the natural question is "percentage of
what?" <0.9 wink>.


From tim_one@email.msn.com  Wed Nov 27 05:37:44 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 00:37:44 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <w533cpoc0aq.fsf@woozle.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEJACPAB.tim_one@email.msn.com>

[Neale Pickett]
> ...
> It's very easy to implement a hybrid dict/pickle method, which caches
> DBM writes and only writes them out when you call the store() method.
> I've been meaning to implement the write cache for a while now, because
> training a dbdict on a large corpus is so abysmally slow right now, and
> I have to do that a lot.

Suggestion:  train into a brand new all-dict all-in-memory classifier.  When
that's done, add the dict counts to the DB counts, then throw away the dict.
The advantage is code and conceptual simplicity.
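A rough sketch of that suggestion, assuming each database record is a (spamcount, hamcount) pair and that a word counts at most once per message. The function names are invented here, and a plain dict stands in for the on-disk database.

```python
from collections import Counter


def train_in_memory(messages):
    """Count words for a batch of (tokens, is_spam) pairs in plain dicts."""
    spam, ham = Counter(), Counter()
    for tokens, is_spam in messages:
        # set() so a word is counted once per message (an assumption)
        (spam if is_spam else ham).update(set(tokens))
    return spam, ham


def merge_into_db(db, spam, ham):
    """Add the batch counts to the persistent (spamcount, hamcount) records."""
    for word in spam.keys() | ham.keys():
        old_spam, old_ham = db.get(word, (0, 0))
        db[word] = (old_spam + spam[word], old_ham + ham[word])


db = {"free": (3, 1)}            # stand-in for the on-disk database
batch = [(["free", "money"], True), (["free", "lunch"], False)]
spam, ham = train_in_memory(batch)
merge_into_db(db, spam, ham)     # one database write per distinct word
del spam, ham                    # "throw away the dict"
```

The database is touched once per distinct word in the batch rather than once per token occurrence, which is where the speedup would come from. The per-corpus message totals would need the same add-and-store step.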


From tim_one@email.msn.com  Wed Nov 27 05:52:17 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 00:52:17 -0500
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <2n37uucckrhn6q3aqubt49kag3gp8jo4in@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEJBCPAB.tim_one@email.msn.com>

[Richie Hindle]
> As Mark says, we're going to have to package this thing up anyway, so why
> not make ZODB a part of that package?  All this assumes (as Skip points
> out) that ZODB is as portable as Spambayes.

Few things are as portable as pure Python code.  ZODB has some C code in it
that hasn't been ported as widely, but it should work fine except on
Platforms from Mars.  The bulk of ZODB is written in pure Python; the
underlying persistence and BTree machinery is coded in C.

> On the subject of packaging: I've used InnoSetup before and been very
> impressed.  Someone mentioned Install Shield - I don't believe there's a
> credible free version of that, whereas InnoSetup is completely free.

InnoSetup works great, and especially for straightforward installs.
Recommended here too.  The harder the install, the more valuable other
installers become.  The spambayes installer could probably consist of a
plain zip file -- except that one of my sisters doesn't know how to unzip
<wink/sigh>.


From neale@woozle.org  Wed Nov 27 07:01:09 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 23:01:09 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <20021126201756.A20174@blighty.com>
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<20021126201756.A20174@blighty.com>
Message-ID: <w531y577ayy.fsf@woozle.org>

So then, Steve Atkins <steve@blighty.com> is all like:

> The only way to safely bounce spam is to drop it with a permanent
> (5xx) rejection at an appropriate point in the SMTP transaction at the
> first point it enters your network (i.e. if your secondary MX accepts
> it, don't reject the delivery to your primary).

This appears to actually work.  I set up 5xx rejection for certain MAIL
FROM address patterns, with the hopes that spammers would think I had no
valid addresses on my box, and I actually do see the attempts fall off
over time.  Check out the red line:

  http://woozle.org/stats/spam.html

(sorry about the log scale, it doesn't look like much otherwise)

The spikes are when I add new address patterns to the blacklist--I just
added a bunch of them a few days ago, and probably around mid-October.

The other nice thing about doing it this way is that the message never
gets sent to you, so your bandwidth isn't being used up by it, nor is
your disk space.
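One way such a pattern blacklist could look, as a sketch only: the message does not say how the patterns are written, so shell-style globs via fnmatch are an assumption, and the patterns, helper name, and reply text are made up.

```python
from fnmatch import fnmatchcase

# Made-up patterns; real ones would come from the local spam logs.
BLACKLIST = ["*@bulkmail.example", "offers-*@*"]


def mail_from_reply(sender, patterns=BLACKLIST):
    """Return the reply line for one MAIL FROM argument."""
    address = sender.strip().strip("<>").lower()
    for pattern in patterns:
        if fnmatchcase(address, pattern):
            return "550 Sender rejected"
    return "250 OK"


checks = [mail_from_reply(s) for s in
          ("<Offers-42@shop.example>", "<neale@woozle.org>")]
```

Because the rejection happens before DATA, the message body is never transferred, which matches the bandwidth and disk savings mentioned above.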

It's sort of like the graph you'd expect to see of the roach deaths per
day in a house.  You fumigate, a lot of roaches die, and then slowly
rebuild their numbers.  Then you fumigate again, they die again, etc.

Neale


From tim_one@email.msn.com  Wed Nov 27 07:20:35 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 02:20:35 -0500
Subject: [Spambayes] Introduction to list: Bill Yerazunis
In-Reply-To: <200211261619.gAQGJN629857@localhost.localdomain>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEJDCPAB.tim_one@email.msn.com>

[Bill Yerazunis]
> I should post an introduction for myself.
>
> I'm Bill Yerazunis, and I'm doing spamfiltering.

Hi, Bill -- nice to see you!

> Robert Woodhead and Paul Graham sent me.
>
> I wrote CRM114 (which hashes phrases as "features" and does Bayesian
> chain-rule evaluation), it seems to work well for me but I hear
> that some folks here had big problems with it.

I ran a number of experiments inspired by CRM114 after Gary Robinson asked
me to take a look, but have not used your original software, and don't
recall any other reports about it on this list.

The experimental results weren't competitive with the code we've got now,
but there could be any number of reasons for that.  The chief suspect in my
eyes was that my main test trains on over 30K msgs per run, generating more
than 320K unique tokens, and multiplying that by 16 leaves one hell of a lot
of hash codes to slam into 1M buckets.  I expect, but don't know, that you
must train on a lot less data.

Other variants we tried included cutting back from subsets of 5-grams to
subsets of 3-grams; boosting the # of buckets to 2M; doing exact
(non-hashing) accounting for subsets of 3-grams and 2-grams; using less
training data; and using (what's called here) chi-combining of spamprobs
instead of Bayesianish combining.  Of those, the ones that helped most were
avoiding hashing, and using chi-combining.  They didn't help enough to
justify switching directions here (they didn't get as good as what we
already had).
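For readers who have not seen the phrase-feature idea, here is one reading of "subsets of 5-grams", sketched under assumptions: each token is joined with every subset of the up-to-four tokens before it, and skipped positions are marked with "_". The function name and the "_" marker are invented; CRM114's own scheme may differ in detail.

```python
from itertools import combinations


def window_features(tokens, width=5):
    """Yield phrase features for each token and its preceding window."""
    for i, current in enumerate(tokens):
        previous = tokens[max(0, i - (width - 1)):i]
        for r in range(len(previous) + 1):
            for keep in combinations(range(len(previous)), r):
                kept = [tok if j in keep else "_"
                        for j, tok in enumerate(previous)]
                yield " ".join(kept + [current])


tokens = "click here for free money".split()
five = list(window_features(tokens, width=5))
three = list(window_features(tokens, width=3))
```

A token with four predecessors yields 2**4 = 16 features, which is the "multiplying that by 16" above; cutting the window to 3 drops that to 2**2 = 4.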

A variant suggested by Gary appeared to work *as well* as what we do now,
focused on finding high-value non-overlapping multi-word phrases.  Not
enough people tested that to say for sure, and based on the test results we
got there wasn't a good case to be made that it was better or worse than our
default scheme -- it looked like a wash, based on error rates.  But it was
more expensive and required a bigger database, so nobody pursued it.

I expect it's impossible to compare schemes convincingly without a shared
test set.  We've got people here with very easy data, and with
excruciatingly difficult data, but for the most part only the spam is
sharable.  My main test turned out to be on the easy end, and the ham in
that test consists of 20,000 msgs taken at random from a public archive of
comp.lang.python mailing-list traffic.  In theory, anyone could use that
ham, but the spam has to be taken from a different source, and that creates
all sorts of problems of its own (there are too many clues in the headers
about the source of msgs to avoid getting great results for bad reasons,
unless great care is taken to blind the classifier to such clues).

The best things I saw in the CRM114-like approaches are that they learned
very quickly, and that the hashing versions had bounded database size.  The
worst thing I saw is that "naive Bayesian" prob combining relies on an
assumption of word independence, and generating ~16 "words" per input word
violates that assumption massively.  So when the scheme is wrong, it's
spectacularly wrong, giving "a probability" closer to 0 or 1 than the
chance that the universe will vanish within the next nanosecond <wink>.
chi-combining is a good way to sidestep that outcome, but the extreme
cross-word correlation violates its theoretical underpinnings too.  In the
hashing versions, unfortunate collisions caused some false positives that
were simply outrageous to human eyes.
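A small illustration of why correlated features push naive combining to extremes. This is a sketch assuming equal priors and the textbook independence formula, not the code that was actually tested; naive_combine is a name chosen here.

```python
import math


def naive_combine(probs):
    """Combine per-feature spam probabilities as if they were independent."""
    log_spam = sum(math.log(p) for p in probs)
    log_ham = sum(math.log(1.0 - p) for p in probs)
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


once = naive_combine([0.6])           # one mildly spammy clue
repeated = naive_combine([0.6] * 16)  # the same clue counted 16 times
```

The single clue combines to 0.6, while counting it 16 times gives a result above 0.998, even though no new evidence was added.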

What we've found so far is that unigrams produced by gonzo tokenization (we
tokenize different things in different ways) learn slower than some other
approaches, but that as the # of training msgs increases, it hasn't yet been
possible to beat them.

On my main test with 20,000 ham and 14,000 spam, our unigram scheme
currently has no FN, 3 FP, and 93 unsure.  The latter are msgs where
chi-combining can't decide whether a thing is ham or spam:  the amount of
evidence in each direction appears about the same.  One of the FP is a quote
of an entire Nigerian scam spam, with a one-line comment at the start like
"Ah, jeez, here's another Nigerian wire scam -- this one has been around for
20 years".  It would be an FP under CRM114 too, unless CRM114 is broken
<wink>.  Another consists of the one-word msg "subscribe", followed by an ad
for the web-based email system the poster used to send the msg.  The third
is a brief on-topic question followed by a long and obnoxious
employer-generated sig, talking about how they're a regulated investment
company, that the info therein is confidential, visit their website for more
info, etc.  All three are indeed ham to human eyes, but statistically
they're indistinguishable from spam (and I don't care what statistical
gimmick is used to analyze them -- the ham content in each is tiny compared
to the advertising/scam content).

"Unsures" are harder to characterize.  Things that often end up there
include:

+ Conference announcements (and it's often hard for people to decide
  which of these are ham and which spam!).

+ Long tech email in mixed languages (e.g., the Russian parts get scored
  as spammy because there's a lot more Russian in the spam than in the
  ham).

+ Long, chatty, "just folks" spam, written as if by a friend.  This is
  still blessedly rare.

Having stared at these for a couple of months now, I'm convinced no
statistical scheme is going to classify them reliably and correctly.  It's a
remarkable property of chi-combining that it's good at getting confused
about the msgs our human testers have found ambiguous.  When a msg gets
scored as unsure, people are usually sympathetic ("hmm, ya, that *is* an odd
msg!").  Porn spam and Korean spam never score unsure <wink>.
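A simplified sketch of chi-combining, to show how strong evidence in both directions lands in the middle rather than at an extreme. It follows the Fisher-style chi-squared approach as I understand it; the function names are chosen here and classifier.py may differ in detail (for instance in how it guards against underflow).

```python
import math


def chi2q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2); v must be even."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)   # underflows to 0.0 for very large x2
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi_combine(probs):
    """Return a score in [0, 1]; values near 0.5 mean 'unsure'."""
    n = len(probs)
    spam_evidence = 1.0 - chi2q(
        -2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    ham_evidence = 1.0 - chi2q(
        -2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (spam_evidence - ham_evidence + 1.0) / 2.0


mixed = chi_combine([0.99] * 10 + [0.01] * 10)   # strong clues both ways
spammy = chi_combine([0.99] * 20)
hammy = chi_combine([0.01] * 20)
```

The mixed message scores almost exactly 0.5, while the one-sided messages score near 1 and near 0.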

Rhetorical question:  are you able to share your test data?  There are a
number of sub-1% (error rate) schemes kicking around now, and no clear way
to compare them.  Indeed, even for a single scheme, when the error rates get
so low it's darned hard to say for sure whether a change is an improvement
or just a statistical glitch.  One thing that's helped this project a lot is
having multiple testers with different data, and a shared testing framework.
We can't share our data, but people aren't shy about sharing bad results
<wink>.


From neale@woozle.org  Wed Nov 27 07:29:33 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 23:29:33 -0800
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEJACPAB.tim_one@email.msn.com>
References: <LNBBLJKPBEHFEDALKOLCAEJACPAB.tim_one@email.msn.com>
Message-ID: <w53wumz5v36.fsf@woozle.org>

So then, "Tim Peters" <tim_one@email.msn.com> is all like:

> Suggestion: train into a brand new all-dict all-in-memory classifier.
> When that's done, add the dict counts to the DB counts, then throw
> away the dict.  The advantage is code and conceptual simplicity.

Gaw, Tim, do you ever run out of good ideas?
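
The suggestion can be sketched roughly as follows.  This is a modern-Python
illustration rather than the code in the diff: merge_counts, the
(spamcount, hamcount) tuple layout and the sample words are all made up for
the example; only shelve is a real stdlib module.

```python
import os
import shelve
import tempfile


def merge_counts(db, counts):
    """Add in-memory (spamcount, hamcount) pairs to the totals held in db."""
    for word, (spam, ham) in counts.items():
        old_spam, old_ham = db.get(word, (0, 0))
        db[word] = (old_spam + spam, old_ham + ham)


# Train into a plain dict first ...
batch = {"viagra": (3, 0), "python": (0, 5)}

# ... then fold the whole batch into the persistent store in one pass.
path = os.path.join(tempfile.mkdtemp(), "words")
with shelve.open(path) as db:
    db["python"] = (1, 2)        # pretend this was trained earlier
    merge_counts(db, batch)
    merged = dict(db)
```

The database is touched once per distinct word in the batch, not once per
word occurrence during training.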

So this /does/ make things simpler.  In fact, I've completely eliminated
the need for dbdict.py.  Results are good, training 1088 messages:

pickle:
  real    0m26.581s
  user    0m25.110s
  sys     0m0.620s

db:
  real    0m52.737s
  user    0m31.730s
  sys     0m10.260s

I can live with 2x slower.  Training and scoring single messages with
the db still blows the pickle's doors off, of course.

I want to know what people think of this diff.  I stopped short of doing
away with WordInfo altogether, though I was tempted >-]

To do this as simply as possible, I added three new methods to the
Classifier class:

    def _wordinfoget(self, word):
        return self.wordinfo.get(word)

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record

    def _wordinfodel(self, word):
        del self.wordinfo[word]

These are then overridden by the DBDictClassifier class.  I also got rid
of the lame MetaInfo class.
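
As a standalone illustration of the hook pattern (not the code in the diff):
the base class below keeps records in a dict, and a hypothetical
CachingClassifier overrides two of the hooks to put that dict in front of a
slower mapping.  The _DELETED sentinel, the backing attribute and flush()
are assumptions made for the sketch; one thing a caching subclass has to
settle is how to remember a deletion until the next flush, and a sentinel is
one way to do it.

```python
class Classifier:
    """Base class: word records live in a plain dict."""

    def __init__(self):
        self.wordinfo = {}

    def _wordinfoget(self, word):
        return self.wordinfo.get(word)

    def _wordinfoset(self, word, record):
        self.wordinfo[word] = record

    def _wordinfodel(self, word):
        del self.wordinfo[word]


_DELETED = object()  # marks "deleted in the cache, not yet in the store"


class CachingClassifier(Classifier):
    """Hypothetical subclass: dict cache in front of a slower mapping."""

    def __init__(self, backing):
        Classifier.__init__(self)
        self.backing = backing

    def _wordinfoget(self, word):
        record = self.wordinfo.get(word)
        if record is _DELETED:
            return None              # do not resurrect it from the store
        if record is None:
            record = self.backing.get(word)
            if record is not None:
                self.wordinfo[word] = record
        return record

    # _wordinfoset is inherited unchanged.

    def _wordinfodel(self, word):
        self.wordinfo[word] = _DELETED

    def flush(self):
        """Write cached changes through to the backing mapping."""
        for word, record in list(self.wordinfo.items()):
            if record is _DELETED:
                self.backing.pop(word, None)
                del self.wordinfo[word]
            else:
                self.backing[word] = record


store = {"cheap": 1}
c = CachingClassifier(store)
first = c._wordinfoget("cheap")     # pulled in from the backing store
c._wordinfoset("pills", 2)
c._wordinfodel("cheap")
after_delete = c._wordinfoget("cheap")
c.flush()
```

The training and scoring code only ever calls the three hooks, so it does
not need to know which kind of store sits underneath.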

Well, a picture's worth a thousand words:

Index: classifier.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/classifier.py,v
retrieving revision 1.60
diff -u -r1.60 classifier.py
--- classifier.py	26 Nov 2002 20:22:05 -0000	1.60
+++ classifier.py	27 Nov 2002 07:27:38 -0000
@@ -46,31 +46,7 @@
 
 LN2 = math.log(2)       # used frequently by chi-combining
 
-PICKLE_VERSION = 4
-
-class MetaInfo(object):
-    """Information about the corpora.
-
-    Contains nham and nspam, used for calculating probabilities.
-
-    """
-    def __init__(self):
-        self.__setstate__((PICKLE_VERSION, 0, 0))
-
-    def __repr__(self):
-        return "MetaInfo%r" % repr((self._nspam,
-                                    self._nham,
-                                    self.revision))
-
-    def __getstate__(self):
-        return (PICKLE_VERSION, self.nspam, self.nham)
-
-    def __setstate__(self, t):
-        if t[0] != PICKLE_VERSION:
-            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
-        self.nspam, self.nham = t[1:]
-        self.revision = 0
-
+PICKLE_VERSION = 5
 
 class WordInfo(object):
     # Invariant:  For use in a classifier database, at least one of
@@ -108,32 +84,18 @@
 
     def __init__(self):
         self.wordinfo = {}
-        self.meta = MetaInfo()
         self.probcache = {}
+        self.nspam = self.nham = 0
 
     def __getstate__(self):
-        return PICKLE_VERSION, self.wordinfo, self.meta
+        return (PICKLE_VERSION, self.wordinfo, self.nspam, self.nham)
 
     def __setstate__(self, t):
         if t[0] != PICKLE_VERSION:
             raise ValueError("Can't unpickle -- version %s unknown" % t[0])
-        self.wordinfo, self.meta = t[1:]
+        (self.wordinfo, self.nspam, self.nham) = t[1:]
         self.probcache = {}
 
-    # Slacker's way out--pass calls to nham/nspam up to the meta class
-
-    def get_nham(self):
-        return self.meta.nham
-    def set_nham(self, val):
-        self.meta.nham = val
-    nham = property(get_nham, set_nham)
-
-    def get_nspam(self):
-        return self.meta.nspam
-    def set_nspam(self, val):
-        self.meta.nspam = val
-    nspam = property(get_nspam, set_nspam)
-
     # spamprob() implementations.  One of the following is aliased to
     # spamprob, depending on option settings.
 
@@ -330,8 +292,8 @@
         except KeyError:
             pass
 
-        nham = float(self.meta.nham or 1)
-        nspam = float(self.meta.nspam or 1)
+        nham = float(self.nham or 1)
+        nspam = float(self.nspam or 1)
 
         assert hamcount <= nham
         hamratio = hamcount / nham
@@ -419,14 +381,12 @@
     def _add_msg(self, wordstream, is_spam):
         self.probcache = {}    # nuke the prob cache
         if is_spam:
-            self.meta.nspam += 1
+            self.nspam += 1
         else:
-            self.meta.nham += 1
+            self.nham += 1
 
-        wordinfo = self.wordinfo
-        wordinfoget = wordinfo.get
         for word in Set(wordstream):
-            record = wordinfoget(word)
+            record = self._wordinfoget(word)
             if record is None:
                 record = self.WordInfoClass()
 
@@ -435,25 +395,22 @@
             else:
                 record.hamcount += 1
 
-            # Needed to tell a persistent DB that the content changed.
-            wordinfo[word] = record
+            self._wordinfoset(word, record)
 
 
     def _remove_msg(self, wordstream, is_spam):
         self.probcache = {}    # nuke the prob cache
         if is_spam:
-            if self.meta.nspam <= 0:
+            if self.nspam <= 0:
                 raise ValueError("spam count would go negative!")
-            self.meta.nspam -= 1
+            self.nspam -= 1
         else:
-            if self.meta.nham <= 0:
+            if self.nham <= 0:
                 raise ValueError("non-spam count would go negative!")
-            self.meta.nham -= -1
+            self.nham -= 1
 
-        wordinfo = self.wordinfo
-        wordinfoget = wordinfo.get
         for word in Set(wordstream):
-            record = wordinfoget(word)
+            record = self._wordinfoget(word)
             if record is not None:
                 if is_spam:
                     if record.spamcount > 0:
@@ -462,11 +419,9 @@
                     if record.hamcount > 0:
                         record.hamcount -= 1
                 if record.hamcount == 0 == record.spamcount:
-                    del wordinfo[word]
+                    self._wordinfodel(word)
                 else:
-                    # Needed to tell a persistent DB that the content
-                    # changed.
-                    wordinfo[word] = record
+                    self._wordinfoset(word, record)
 
     def _getclues(self, wordstream):
         mindist = options.minimum_prob_strength
@@ -475,9 +430,8 @@
         clues = []  # (distance, prob, word, record) tuples
         pushclue = clues.append
 
-        wordinfoget = self.wordinfo.get
         for word in Set(wordstream):
-            record = wordinfoget(word)
+            record = self._wordinfoget(word)
             if record is None:
                 prob = unknown
             else:
@@ -491,6 +445,16 @@
             del clues[0 : -options.max_discriminators]
         # Return (prob, word, record).
         return [t[1:] for t in clues]
+
+    def _wordinfoget(self, word):
+        return self.wordinfo.get(word)
+
+    def _wordinfoset(self, word, record):
+        self.wordinfo[word] = record
+
+    def _wordinfodel(self, word):
+        del self.wordinfo[word]
+        
 
 
 Bayes = Classifier
Index: dbdict.py
===================================================================
RCS file: dbdict.py
diff -N dbdict.py
--- dbdict.py	25 Nov 2002 20:49:16 -0000	1.4
+++ /dev/null	1 Jan 1970 00:00:00 -0000
@@ -1,152 +0,0 @@
-#! /usr/bin/env python
-
-"""DBDict.py - Dictionary access to anydbm
-
-Classes:
-    DBDict - wraps an anydbm file
-
-Abstract:
-    DBDict class wraps an anydbm file with a reasonably complete set
-    of dictionary access methods.  DBDicts can be iterated like a dictionary.
-    
-    The constructor accepts a class name which is used specifically to
-    to pickle/unpickle an instance of that class.  When an instance of
-    that class is being pickled, the pickler (actually __getstate__) prepends
-    a 'W' to the pickled string, and when the unpickler (really __setstate__)
-    encounters that 'W', it constructs that class (with no constructor
-    arguments) and executes __setstate__ on the constructed instance.
-
-    DBDict accepts an iterskip operand on the constructor.  This is a tuple
-    of hash keys that will be skipped (not seen) during iteration.  This
-    is for iteration only.  Methods such as keys() will return the entire
-    complement of keys in the dbm hash, even if they're in iterskip.  An
-    iterkeys() method is provided for iterating with skipped keys, and
-    itervaluess() is provided for iterating values with skipped keys.
-
-        >>> d = DBDict('/tmp/goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
-        >>> d['skipme'] = 'booga'
-        >>> d['countme'] = 'wakka'
-        >>> print d.keys()
-        ['skipme', 'countme']
-        >>> for k in d.iterkeys():
-        ...     print k
-        countme
-        >>> for v in d.itervalues():
-        ...     print v
-        wakka
-        >>> for k,v in d.iteritems():
-        ...     print k,v
-        countme wakka
-
-To Do:
-    """
-
-# This module is part of the spambayes project, which is Copyright 2002
-# The Python Software Foundation and is covered by the Python Software
-# Foundation license.
-
-__author__ = "Neale Pickett <neale@woozle.org>, \
-              Tim Stone <tim@fourstonesExpressions.com>"
-__credits__ = "Tim Peters (author of DBDict class), \
-               all the spambayes contributors."
-
-try:
-    import cPickle as pickle
-except ImportError:
-    import pickle
-
-import anydbm
-import errno
-import copy
-import shutil
-import os
-
-MODE_CREATE = 'c'       # create file if necessary, open for readwrite
-MODE_NEW = 'n'          # always create new file, open for readwrite
-MODE_READWRITE = 'w'    # open existing file for readwrite
-MODE_READONLY = 'r'     # open existing file for read only
-
-
-class DBDict:
-    """Database Dictionary.
-
-    This wraps an anydbm database to make it look even more like a
-    dictionary, much like the built-in shelf class.  The difference is
-    that a DBDict supports all dict methods.
-
-    Call it with the database.  Optionally, you can specify a list of
-    keys to skip when iterating.  This only affects iterators; things
-    like .keys() still list everything.  For instance:
-
-    >>> d = DBDict('goober.db', MODE_CREATE, ('skipme', 'skipmetoo'))
-    >>> d['skipme'] = 'booga'
-    >>> d['countme'] = 'wakka'
-    >>> print d.keys()
-    ['skipme', 'countme']
-    >>> for k in d.iterkeys():
-    ...     print k
-    countme
-
-    """
-
-    def __init__(self, dbname, mode, wclass, iterskip=()):
-        self.hash = anydbm.open(dbname, mode)
-        if not iterskip:
-            self.iterskip = iterskip
-        else:
-            self.iterskip = ()
-        self.wclass=wclass
-
-    def __getitem__(self, key):
-        v = self.hash[key]
-        if v[0] == 'W':
-            val = pickle.loads(v[1:])
-            # We could be sneaky, like pickle.Unpickler.load_inst,
-            # but I think that's overly confusing.
-            obj = self.wclass()
-            obj.__setstate__(val)
-            return obj
-        else:
-            return pickle.loads(v)
-
-    def __setitem__(self, key, val):
-        if isinstance(val, self.wclass):
-            val = val.__getstate__()
-            v = 'W' + pickle.dumps(val, 1)
-        else:
-            v = pickle.dumps(val, 1)
-        self.hash[key] = v
-
-    def __getitem__(self, key):
-        return pickle.loads(self.hash[key])
-
-    def __setitem__(self, key, val):
-        self.hash[key] = pickle.dumps(val, 1)
-
-    def __delitem__(self, key, val):
-        del(self.hash[key])
-
-    def __contains__(self, name):
-        return self.has_key(name)
-
-    def __getattr__(self, name):
-        # Pass the buck
-        return getattr(self.hash, name)
-
-    def get(self, key, dfl=None):
-        if self.has_key(key):
-            return self[key]
-        else:
-            return dfl
-
-open = DBDict
-
-def _test():
-    import doctest
-    import dbdict
-
-    doctest.testmod(dbdict)
-
-if __name__ == '__main__':
-    _test()
-
Index: storage.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/storage.py,v
retrieving revision 1.2
diff -u -r1.2 storage.py
--- storage.py	26 Nov 2002 00:43:51 -0000	1.2
+++ storage.py	27 Nov 2002 07:27:38 -0000
@@ -4,7 +4,7 @@
 
 Classes:
     PickledClassifier - Classifier that uses a pickle db
-    DBDictClassifier - Classifier that uses a DBDict db
+    DBDictClassifier - Classifier that uses a DBM db
     Trainer - Classifier training observer
     SpamTrainer - Trainer for spam
     HamTrainer - Trainer for ham
@@ -17,8 +17,8 @@
     datastore.  This database is relatively small, but slower than other
     databases.
 
-    DBDictClassifier is a Classifier class that uses a DBDict
-    datastore.
+    DBDictClassifier is a Classifier class that uses a database
+    store.
 
     Trainer is concrete class that observes a Corpus and trains a
     Classifier object based upon movement of messages between corpora  When
@@ -49,8 +49,8 @@
 import classifier
 from Options import options
 import cPickle as pickle
-import dbdict
 import errno
+import shelve
 
 PICKLE_TYPE = 1
 NO_UPDATEPROBS = False   # Probabilities will not be autoupdated with training
@@ -83,10 +83,11 @@
             tempbayes = pickle.load(fp)
             fp.close()
 
+        # XXX: why not self.__setstate__(tempbayes.__getstate__())?
         if tempbayes:
             self.wordinfo = tempbayes.wordinfo
-            self.meta.nham = tempbayes.get_nham()
-            self.meta.nspam = tempbayes.get_nspam()
+            self.nham = tempbayes.nham
+            self.nspam = tempbayes.nspam
 
             if options.verbose:
                 print '%s is an existing pickle, with %d ham and %d spam' \
@@ -96,8 +97,8 @@
             if options.verbose:
                 print self.db_name,'is a new pickle'
             self.wordinfo = {}
-            self.meta.nham = 0
-            self.meta.nspam = 0
+            self.nham = 0
+            self.nspam = 0
 
     def store(self):
         '''Store self as a pickle'''
@@ -109,59 +110,78 @@
         pickle.dump(self, fp, PICKLE_TYPE)
         fp.close()
 
-    def __getstate__(self):
-        return PICKLE_TYPE, self.wordinfo, self.meta
-
-    def __setstate__(self, t):
-        if t[0] != PICKLE_TYPE:
-            raise ValueError("Can't unpickle -- version %s unknown" % t[0])
-        self.wordinfo, self.meta = t[1:]
-
 
 class DBDictClassifier(classifier.Classifier):
-    '''Classifier object persisted in a WIDict'''
+    '''Classifier object persisted in a caching database'''
 
     def __init__(self, db_name, mode='c'):
         '''Constructor(database name)'''
 
         classifier.Classifier.__init__(self)
+        self.wordcache = {}
         self.statekey = "saved state"
         self.mode = mode
         self.db_name = db_name
         self.load()
 
     def load(self):
-        '''Load state from WIDict'''
+        '''Load state from database'''
 
         if options.verbose:
-            print 'Loading state from',self.db_name,'WIDict'
+            print 'Loading state from',self.db_name,'database'
 
-        self.wordinfo = dbdict.DBDict(self.db_name, self.mode,
-                             classifier.WordInfo,iterskip=[self.statekey])
+        self.db = shelve.DbfilenameShelf(self.db_name, self.mode)
 
-        if self.wordinfo.has_key(self.statekey):
-            (nham, nspam) = self.wordinfo[self.statekey]
-            self.set_nham(nham)
-            self.set_nspam(nspam)
+        if self.db.has_key(self.statekey):
+            t = self.db[self.statekey]
+            if t[0] != classifier.PICKLE_VERSION:
+                raise ValueError("Can't unpickle -- version %s unknown" % t[0])
+            (self.nspam, self.nham) = t[1:]
 
             if options.verbose:
-                print '%s is an existing DBDict, with %d ham and %d spam' \
-                      % (self.db_name, self.nham, self.nspam)
+                print '%s is an existing database, with %d spam and %d ham' \
+                      % (self.db_name, self.nspam, self.nham)
         else:
             # new dbdict
             if options.verbose:
-                print self.db_name,'is a new DBDict'
-            self.set_nham(0)
-            self.set_nspam(0)
+                print self.db_name,'is a new database'
+            self.nspam = 0
+            self.nham = 0
+        self.wordinfo = {}
 
     def store(self):
         '''Place state into persistent store'''
 
         if options.verbose:
-            print 'Persisting',self.db_name,'state in WIDict'
+            print 'Persisting',self.db_name,'state in database'
+
+        for key, val in self.wordinfo.items():
+            if val is None:
+                del self.wordinfo[key]
+                try:
+                    del self.db[key]
+                except KeyError:
+                    pass
+            else:
+                self.db[key] = val.__getstate__()
+        self.db[self.statekey] = (classifier.PICKLE_VERSION,
+                                  self.nspam, self.nham)
+        self.db.sync()
+
+    def _wordinfoget(self, word):
+        ret = self.wordinfo.get(word)
+        if ret is None and word not in self.wordinfo:
+            r = self.db.get(word)
+            if r:
+                ret = self.WordInfoClass()
+                ret.__setstate__(r)
+                self.wordinfo[word] = ret
+        return ret
+
+    # _wordinfoset is the same
 
-        self.wordinfo[self.statekey] = (self.get_nham(), self.get_nspam())
-        self.wordinfo.sync()
+    def _wordinfodel(self, word):
+        self.wordinfo[word] = None
 
 
 class Trainer:

From neale@woozle.org  Wed Nov 27 07:38:09 2002
From: neale@woozle.org (Neale Pickett)
Date: 26 Nov 2002 23:38:09 -0800
Subject: [Spambayes] Current version
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E21@UKDCX001.uk.int.atosorigin.com>
References: <16E1010E4581B049ABC51D4975CEDB885E2E21@UKDCX001.uk.int.atosorigin.com>
Message-ID: <w53r8d75uou.fsf@woozle.org>

So then, "Moore, Paul" <Paul.Moore@atosorigin.com> is all like:

> OK. Here's a patch.

Paul, your patch looks super, but as I'm about to nuke dbdict.py I'm
holding off on integrating it :) I will use your patch, or a derivative
of it, after I've sifted through the burnt embers of dbdict's final
hours.

Thanks!

Neale

From Paul.Moore@atosorigin.com  Wed Nov 27 10:02:17 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 27 Nov 2002 10:02:17 -0000
Subject: [Spambayes] Re: Important information for Outlook users
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E27@UKDCX001.uk.int.atosorigin.com>

From: Mark Hammond [mailto:mhammond@skippinet.com.au]
> I'll try and get back to the meat of this later, but I saw Paul
> mention he was fairly sure unregistering the addin had some effect.
>
> I am fairly certain that unregistering the addin will have zero
> effect on any of the field workings.

I was working from your comment that the addin works by playing with
the field on the first message of a folder if it detects that the
field is not present on startup. When I run the deletion script, this
starts Outlook via COM. I was guessing that starting Outlook via COM
loads the addin, which sees no field, so does its magic to create it,
which the delete script then deletes. But I could very easily be wrong
about this (I'm having a hard time this week recalling what day it
is :-))

> Hope this makes some sense - well, as much sense to you as it does
> to me <wink>

Thanks. I think it does...

Anyway, as this is basically only an issue in upgrading from the "old"
field format to the "new" one, I decided to manually fix the Spam
fields' format in the folders where it was wrong. All is now OK.

Paul.

From tdickenson@devmail.geminidataloggers.co.uk  Wed Nov 27 10:43:04 2002
From: tdickenson@devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Wed, 27 Nov 2002 10:43:04 +0000
Subject: [Spambayes] Re: Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEINCPAB.tim_one@email.msn.com>
References: <LNBBLJKPBEHFEDALKOLCIEINCPAB.tim_one@email.msn.com>
Message-ID: <200211271043.04698.tdickenson@devmail.geminidataloggers.co.uk>

On Wednesday 27 November 2002 5:19 am, Tim Peters wrote:
> [David Bolen]
>
> > ...
> > One of the interesting thoughts for ZODB that I haven't seen mentioned
> > here would be the possibility of using ZEO to permit multiple clients
> > to share the same database transparently.
>
> Jeremy (Hylton) has already done this; the code is in this project's pspam
> subdirectory.  But if some people are frightened by ZODB ...

The ZEO protocol requires that the server trust all clients that are given
write access. If you use ZEO to allow multiple users to share a database,
then you need to find a different mechanism for training.


From msergeant@startechgroup.co.uk  Wed Nov 27 11:00:45 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: Wed, 27 Nov 2002 11:00:45 +0000
Subject: [Spambayes] Bouncing Spam
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<20021126201756.A20174@blighty.com>
Message-ID: <3DE4A5DD.9010904@startechgroup.co.uk>

Steve Atkins said the following on 27/11/02 04:17:
> The only way to safely bounce spam is to drop it with a permanent
> (5xx) rejection at an appropriate point in the SMTP transaction at the
> first point it enters your network (i.e. if your secondary MX accepts
> it, don't reject the delivery to your primary). Even then you need to
> read RFC 2821 very carefully. A lot of people throw non-defined errors
> or throw them at the wrong point in the transaction, which can make
> things worse rather than better. Doing this with the MTA's support is
> the only easy way to do it.

Another thing that came up discussing this on the SpamAssassin list is 
that this only works on the mail server that is your MX server. If you 
get mail through a third party (e.g. an ISP that might forward to your 
SMTP server) then it doesn't work.
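
For anyone following along, the distinction is between refusing the message
while the sender is still connected and sending a separate bounce
afterwards.  A minimal sketch of the first, assuming a hypothetical hook
that runs at end-of-DATA on the MX itself; the function names, the 0.9
cutoff and the reply texts are made up, and only the 2xx/3xx/4xx/5xx reply
classes come from RFC 2821:

```python
def reply_at_end_of_data(spam_score, spam_cutoff=0.9):
    """Pick the SMTP reply to give while the sender is still connected.

    Hypothetical hook returning (code, text).  A 5xx reply is a permanent
    failure, which leaves any bounce message to the sending side.
    """
    if spam_score >= spam_cutoff:
        return 550, "Message refused"
    return 250, "OK"


def reply_class(code):
    """Map a three-digit SMTP reply code to its RFC 2821 class."""
    classes = {
        2: "success",
        3: "intermediate",
        4: "transient failure",
        5: "permanent failure",
    }
    return classes.get(code // 100, "unknown")


code, text = reply_at_end_of_data(0.97)
```

A forwarded message has already been accepted by someone else by the time
it reaches a hook like this, which is why it only helps on the MX.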

Matt.


From jm@jmason.org  Wed Nov 27 12:21:02 2002
From: jm@jmason.org (Justin Mason)
Date: Wed, 27 Nov 2002 12:21:02 +0000
Subject: [Spambayes] Introduction to list: Bill Yerazunis 
In-Reply-To: Message from "Tim Peters" <tim_one@email.msn.com> 
	<LNBBLJKPBEHFEDALKOLCOEJDCPAB.tim_one@email.msn.com> 
Message-ID: <20021127122107.215F616F17@jmason.org>


(Hi Bill!  Welcome to the de-facto-std anti-spam
probability-combining-algorithm discussion forum ;)

Tim Peters said:
> I expect it's impossible to compare schemes convincingly without a shared
> test set.  We've got people here with very easy data, and with
> excruciatingly difficult data, but for the most part only the spam is
> sharable.  My main test turned out to be on the easy end, and the ham in
> that test consists of 20,000 msgs taken at random from a public archive of
> comp.lang.python mailing-list traffic.  In theory, anyone could use that
> ham, but the spam has to be taken from a different source, and that creates
> all sorts of problems of its own (there are too many clues in the headers
> about the source of msgs to avoid getting great results for bad reasons,
> unless great care is taken to blind the classifier to such clues).

Reminder: http://SpamAssassin.org/publiccorpus/ has a public corpus of
about 6200 mixed spam, "easy" ham and "hard" ham messages, all from the
same source -- my own mail.

(It's a bit small right now, but I do add to it occasionally as time goes
on and I get a chance.  Still, a reference 6k message corpus is more
useful than a poke in the eye ;)

--j.

From tim@fourstonesExpressions.com  Wed Nov 27 13:36:25 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 27 Nov 2002 07:36:25 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEJBCPAB.tim_one@email.msn.com>
Message-ID: <3YZVA7DASO4W2NL1YGF82C0963NHKG.3de4ca59@riven>

11/26/2002 11:52:17 PM, "Tim Peters" <tim_one@email.msn.com> wrote:

>[Richie Hindle]
>> As Mark says, we're going to have to package this thing up anyway, so why
>> not make ZODB a part of that package?  All this assumes (as Skip points
>> out) that ZODB is as portable as Spambayes.
>
>Few things are as portable as pure Python code.  ZODB has some C code in it
>that hasn't been ported as widely, but it should work fine except on
>Platforms from Mars.  The bulk of ZODB is written in pure Python; the
>underlying persistence and BTree machinery is coded in C.
>
>> On the subject of packaging: I've used InnoSetup before and been very
>> impressed.  Someone mentioned Install Shield - I don't believe there's a
>> credible free version of that, whereas InnoSetup is completely free.
>
>InnoSetup works great, and especially for straightforward installs.
>Recommended here too.  The harder the install, the more valuable other
>installers become.  The spambayes installer could probably consist of a
>plain zip file -- except that one of my sisters doesn't know how to unzip
><wink/sigh>.

Well, with Tim1's blessing, I'll give it another crack.  Checking in a 
firstcut simple pop3proxy install.  - TimS

>
>
>_______________________________________________
>Spambayes mailing list
>Spambayes@python.org
>http://mail.python.org/mailman/listinfo/spambayes
>
>


c'est moi - TimS
www.fourstonesExpressions.com 


From Paul.Moore@atosorigin.com  Wed Nov 27 13:41:30 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 27 Nov 2002 13:41:30 -0000
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E2A@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> Well, with Tim1's blessing, I'll give it another crack.  Checking
> in a firstcut simple pop3proxy install.

If you're talking about a binary installer, you could upload it as
a separate file into the Sourceforge files area. Dunno how easy this
is (I've never used SF's developer facilities) but IIRC handling
binary files in CVS can be a bit of a nightmare (not least the fact
that changes result in a complete re-download as you can't send a
patch for a binary file, not good if the file is big).

Paul

From tim@fourstonesExpressions.com  Wed Nov 27 13:46:12 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 27 Nov 2002 07:46:12 -0600
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
In-Reply-To: <16E1010E4581B049ABC51D4975CEDB885E2E2A@UKDCX001.uk.int.atosorigin.com>
Message-ID: <MIAA8TRA95XTTQPJ31HFJD1UUSMI.3de4cca4@riven>

11/27/2002 7:41:30 AM, "Moore, Paul" <Paul.Moore@atosorigin.com> wrote:

>From: Tim Stone - Four Stones Expressions
>> Well, with Tim1's blessing, I'll give it another crack.  Checking
>> in a firstcut simple pop3proxy install.
>
>If you're talking about a binary installer, you could upload it as
>a separate file into the Sourceforge files area. Dunno how easy this
>is (I've never used SF's developer facilities) but IIRC handling
>binary files in CVS can be a bit of a nightmare (not least the fact
>that changes result in a complete re-download as you can't send a
>patch for a binary file, not good if the file is big).

Ya, good thinking.  It's a 350K binary, probably not a good candidate for cvs.  
Being a neophyte to sf, I'll do some learning about the sf files area.  Where 
do you suggest I start?

FYI, innosetup is really great, but it doesn't appear to allow you to have 
multiple 'main' executables... Stay tuned...

- TimS

>
>Paul


c'est moi - TimS
www.fourstonesExpressions.com 


From Paul.Moore@atosorigin.com  Wed Nov 27 13:56:48 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Wed, 27 Nov 2002 13:56:48 -0000
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861995B@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> Ya, good thinking.  It's a 350K binary, probably not a good candidate
> for cvs. Being a neophyte to sf, I'll do some learning about the sf
> files area.  Where do you suggest I start?

To be honest, I'm not sure. As I said, I'm not a SF developer for
anything, and a brief look couldn't find anything useful on SF.

Actually, going via www.python.org, and looking at the "Python Project"
link in the left hand sidebar, I ended up finding
http://sfdocs.sourceforge.net/. That looks like a useful place to start
browsing.

Hope this helps,
Paul

From skip@pobox.com  Wed Nov 27 14:05:38 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 27 Nov 2002 08:05:38 -0600
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <w531y577ayy.fsf@woozle.org>
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
        <20021126201756.A20174@blighty.com>
        <w531y577ayy.fsf@woozle.org>
Message-ID: <15844.53554.34595.267094@montanaro.dyndns.org>

    Neale> Check out the red line:

    Neale>   http://woozle.org/stats/spam.html

Kinda hard to see the trend.  Any chance you can normalize the "Total
analyzed messages" line to 1.0 then plot everything else as a fraction
between 0.0 and 1.0?
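
Something along these lines, say.  The series names and numbers here are
invented for the example, not read off the graph, and normalize_by_total is
just a name picked for the sketch:

```python
def normalize_by_total(total, series):
    """Divide each point by the matching total; use 0.0 where the total is 0."""
    return {
        name: [v / float(t) if t else 0.0 for v, t in zip(values, total)]
        for name, values in series.items()
    }


total = [200, 400, 500]                    # "Total analyzed messages"
series = {
    "spam": [50, 100, 250],
    "ham": [150, 300, 250],
}
fractions = normalize_by_total(total, series)
```

The total line then sits at 1.0 throughout, and a rising spam share shows up
as a rising line whatever the overall volume does.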

Skip


From steffen.siebert@logware.de  Wed Nov 27 12:49:49 2002
From: steffen.siebert@logware.de (Steffen Siebert)
Date: Wed, 27 Nov 2002 13:49:49 +0100
Subject: [Spambayes] Outlook Plugin crashes
Message-ID: <71B37AB6576DD211AB7E008048CD4D310279881B@EXCHANGE>

Hello,

I'm trying to install the spambayes outlook plugin, but it crashes all the
time (see traceback below).

I'm using python 2.2.2 with win32all-150 under windows 2000.

I've made some tests and to my astonishment it is possible to start
manager.py from the commandline and it works fine! I suspected that outlook
starts a different python, but I've deleted all other python.exe and
python*.dll files from my machine and also modified manager.py to print the
value of sys.version; it says it's using the right version.

I hope someone can help me here.

Thanks in advance,
  Steffen

Traceback:

Outlook Spam Addin module loading
SpamAddin - Connecting to Outlook
Created new configuration file 'D:\spambayes\spambayes\Outlook2000\default_configuration.pck'
Traceback (most recent call last):
  File "D:\Python22\lib\site-packages\win32com\universal.py", line 150, in dispatch
    retVal = ob._InvokeEx_(meth.dispid, 0, pythoncom.DISPATCH_METHOD, args, None, None)
  File "D:\Python22\lib\site-packages\win32com\server\policy.py", line 322, in _InvokeEx_
    return self._invokeex_(dispid, lcid, wFlags, args, kwargs, serviceProvider)
  File "D:\Python22\lib\site-packages\win32com\server\policy.py", line 562, in _invokeex_
    return DesignatedWrapPolicy._invokeex_( self, dispid, lcid, wFlags, args, kwArgs, serviceProvider)
  File "D:\Python22\lib\site-packages\win32com\server\policy.py", line 510, in _invokeex_
    return apply(func, args)
  File "D:\spambayes\spambayes\Outlook2000\addin.py", line 538, in OnConnection
    self.manager = manager.GetManager(application)
  File "D:\spambayes\spambayes\Outlook2000\manager.py", line 317, in GetManager
    _mgr = BayesManager(outlook=outlook, verbose=verbose)
  File "D:\spambayes\spambayes\Outlook2000\manager.py", line 67, in __init__
    import_core_spambayes_stuff(self.ini_filename)
  File "D:\spambayes\spambayes\Outlook2000\manager.py", line 41, in import_core_spambayes_stuff
    import classifier
  File "D:\spambayes\spambayes\classifier.py", line 37, in ?
    from Options import options
  File "D:\spambayes\spambayes\Options.py", line 512, in ?
    options.mergefilelike(d)
  File "D:\spambayes\spambayes\Options.py", line 479, in mergefilelike
    self._update()
  File "D:\spambayes\spambayes\Options.py", line 497, in _update
    value = getattr(c, fetcher)(section, option)
  File "D:\Python22\Lib\ConfigParser.py", line 306, in getfloat
    return self.__get(section, float, option)
  File "D:\Python22\Lib\ConfigParser.py", line 300, in __get
    return conv(self.get(section, option))
exceptions.ValueError: invalid literal for float(): 10.00

From popiel@wolfskeep.com  Wed Nov 27 16:55:04 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 27 Nov 2002 08:55:04 -0800
Subject: [Spambayes] Bouncing Spam 
In-Reply-To: Message from Matt Sergeant <msergeant@startechgroup.co.uk> 
   of "Wed, 27 Nov 2002 11:00:45 GMT." <3DE4A5DD.9010904@startechgroup.co.uk> 
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<20021126201756.A20174@blighty.com>  <3DE4A5DD.9010904@startechgroup.co.uk> 
Message-ID: <20021127165504.918FCF589@cashew.wolfskeep.com>

In message:  <3DE4A5DD.9010904@startechgroup.co.uk>
             Matt Sergeant <msergeant@startechgroup.co.uk> writes:
>Steve Atkins said the following on 27/11/02 04:17:
>> The only way to safely bounce spam is to drop it with a permanent
>> (5xx) rejection at an appropriate point in the SMTP transaction at the
>> first point it enters your network (i.e. if your secondary MX accepts
>> it, don't reject the delivery to your primary).
>
>Another thing that came up discussing this on the SpamAssassin list is 
>that this only works on the mail server that is your MX server. If you 
>get mail through a third party (e.g. an ISP that might forward to your 
>SMTP server) then it doesn't work.

As an aside, I'll note that it's perfectly reasonable behaviour to
have a bounce response late in the forwarding chain (as in a primary MX
rejecting it when a secondary MX accepted it).  The problem is not with
a violation of SMTP, but in the fact that the spammers routinely ignore
foreign-generated bounce messages (since that would require them to run
an SMTP server of their own (and actually process data from it)) and
only pay attention to errors in outgoing conversations that their mail-
blasting tools have.

- Alex

From tim@fourstonesExpressions.com  Wed Nov 27 17:01:18 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 27 Nov 2002 11:01:18 -0600
Subject: [Spambayes] Bouncing Spam 
In-Reply-To: <20021127165504.918FCF589@cashew.wolfskeep.com>
Message-ID: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>

Well, ok then, the question arises as to whether or not spambayes should offer 
some functionality that is integratable into MX agents, or an MX agent proxy, 
or something like that, that is spambayes enabled, for webmasters or others 
who administer such things...  I certainly wouldn't have the slightest idea 
how to make such a thing, but it seems reasonable.

- TimS

11/27/2002 10:55:04 AM, "T. Alexander Popiel" <popiel@wolfskeep.com> wrote:

>In message:  <3DE4A5DD.9010904@startechgroup.co.uk>
>             Matt Sergeant <msergeant@startechgroup.co.uk> writes:
>>Steve Atkins said the following on 27/11/02 04:17:
>>> The only way to safely bounce spam is to drop it with a permanent
>>> (5xx) rejection at an appropriate point in the SMTP transaction at the
>>> first point it enters your network (i.e. if your secondary MX accepts
>>> it, don't reject the delivery to your primary).
>>
>>Another thing that came up discussing this on the SpamAssassin list is 
>>that this only works on the mail server that is your MX server. If you 
>>get mail through a third party (e.g. an ISP that might forward to your 
>>SMTP server) then it doesn't work.
>
>As an aside, I'll note that it's perfectly reasonable behaviour to
>have a bounce response late in the forwarding chain (as in a primary MX
>rejecting it when a secondary MX accepted it).  The problem is not with
>a violation of SMTP, but in the fact that the spammers routinely ignore
>foreign-generated bounce messages (since that would require them to run
>an SMTP server of their own (and actually process data from it)) and
>only pay attention to errors in outgoing conversations that their mail-
>blasting tools have.
>
>- Alex
>


c'est moi - TimS
www.fourstonesExpressions.com 


From popiel@wolfskeep.com  Wed Nov 27 17:25:44 2002
From: popiel@wolfskeep.com (T. Alexander Popiel)
Date: Wed, 27 Nov 2002 09:25:44 -0800
Subject: [Spambayes] Bouncing Spam 
In-Reply-To: Message from Tim Stone - Four Stones Expressions
	<tim@fourstonesExpressions.com> 
	<CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven> 
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven> 
Message-ID: <20021127172544.B924EF589@cashew.wolfskeep.com>

In message:  <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
             <tim@fourstonesExpressions.com> writes:

>Well, ok then, the question arises as to whether or not spambayes should
>offer some functionality that is integratable into MX agents, or an MX
>agent proxy, or something like that, that is spambayes enabled, for
>webmasters or others who administer such things...  I certainly wouldn't
>have the slightest idea how to make such a thing, but it seems reasonable.

I personally wouldn't bother trying to integrate with existing mail
transport agents (MTAs).  They all have wildly different internal
architectures, and several don't really have any 'plug-in' type
interfaces.  (This is speaking from my experience with sendmail,
postfix, qmail, and exim.)  Integrating with each MTA would be a
separate task (just like building any of the client front-ends).

Making a proxy MTA would be relatively easy (certainly easier than
trying to integrate with other MTAs directly), but then you end up
having much higher reliability standards to meet, and you have to
deal with... umm... creative interpretations of SMTP.  It's a can
of worms that I wouldn't want to open.

Finally, there are enough barriers against it actually being useful
against the spammers (usually needing to decide to reject just based
on the envelope from and to addresses, with no look at the full
message (and thus no chance to correct false positives)) that it's
IMHO not worth it.

If you want to pursue this path despite the hazards and the low
return, more power to you...

- Alex

From skip@pobox.com  Wed Nov 27 17:35:28 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 27 Nov 2002 11:35:28 -0600
Subject: [Spambayes] how to kill process on Windows started with os.spawn?
Message-ID: <15845.608.895210.155241@montanaro.dyndns.org>


I'm trying to coax pop3proxy into running an ssh command in the background,
so my pop sessions can tunnel through an encrypted channel.  On Unix (and
MacOSX) this looks like it will be a breeze.

If I start a process in the background like so:

    pid = os.spawnvp(os.P_NOWAIT, cmd, args)

on Unix systems I can later execute

    os.kill(pid, signal.SIGHUP)

According to the docs, on Windows systems, a process handle is returned from
os.spawn*.  How would I kill that process on Windows, since (once again,
according to the docs), os.kill is only available on Unix systems?

Thx,

Skip

From neale@woozle.org  Wed Nov 27 17:38:43 2002
From: neale@woozle.org (Neale Pickett)
Date: 27 Nov 2002 09:38:43 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <15844.53554.34595.267094@montanaro.dyndns.org>
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<20021126201756.A20174@blighty.com> <w531y577ayy.fsf@woozle.org>
	<15844.53554.34595.267094@montanaro.dyndns.org>
Message-ID: <w53isyi6hgc.fsf@woozle.org>

So then, Skip Montanaro <skip@pobox.com> is all like:

>     Neale> Check out the red line:
> 
>     Neale>   http://woozle.org/stats/spam.html
> 
> Kinda hard to see the trend.  Any chance you can normalize the "Total
> analyzed messages" line to 1.0 then plot everything else as a fraction
> between 0.0 and 1.0?

Yeah, but that wouldn't do you any good, since any rejected messages
will never be analyzed.  On November 16, for example, I rejected more
than I analyzed.  So those numbers aren't really related aside from
certain days being busier than others.

I can do something to clean up the graph legibility, though.  It's on my
todo list.

Neale

From neale@woozle.org  Wed Nov 27 17:41:53 2002
From: neale@woozle.org (Neale Pickett)
Date: 27 Nov 2002 09:41:53 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
Message-ID: <w53el966hb2.fsf@woozle.org>

So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

> Well, ok then, the question arises as to whether or not spambayes
> should offer some functionality that is integratable into MX agents,
> or an MX agent proxy, or something like that, that is spambayes
> enabled, for webmasters or others who administer such things...  I
> certainly wouldn't have the slightest idea how to make such a thing,
> but it seems reasonable.

Can't.  You have to send the 5xx response before any message data,
including headers, is even sent.  By the time they've sent the whole
message and you get to tokenize it, the spammer has moved on to the next
victim.  If they don't care about bounce mail they're not going to care
about 500 errors after they've sent the entire message.


From skip@pobox.com  Wed Nov 27 17:43:38 2002
From: skip@pobox.com (Skip Montanaro)
Date: Wed, 27 Nov 2002 11:43:38 -0600
Subject: [Spambayes] can pop3proxy train on uploaded mbox?
Message-ID: <15845.1098.994309.871138@montanaro.dyndns.org>

Can pop3proxy train on an mbox uploaded via the web interface or is that
restricted to individual messages?

Skip

From msergeant@startechgroup.co.uk  Wed Nov 27 17:52:55 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: 27 Nov 2002 17:52:55 +0000
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <20021127165504.918FCF589@cashew.wolfskeep.com>
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<3DE4A5DD.9010904@startechgroup.co.uk>
	<20021127165504.918FCF589@cashew.wolfskeep.com>
Message-ID: <1038419575.10222.1.camel@felony.int.star.co.uk>

On Wed, 2002-11-27 at 16:55, T. Alexander Popiel wrote:
> As an aside, I'll note that it's perfectly reasonable behaviour to
> have a bounce response late in the forwarding chain (as in a primary MX
> rejecting it when a secondary MX accepted it).  The problem is not with
> a violation of SMTP, but in the fact that the spammers routinely ignore
> foreign-generated bounce messages (since that would require them to run
> an SMTP server of their own (and actually process data from it)) and
> only pay attention to errors in outgoing conversations that their mail-
> blasting tools have.

It's not that they ignore it, but if you 550 back to your ISP's MTA, all
they can do is generate a bounce message to the MAIL FROM user. And
that's likely to be a forgery. So you end up spamming an innocent user.

Matt.
From wsy@merl.com  Wed Nov 27 18:10:20 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Wed, 27 Nov 2002 13:10:20 -0500
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <w53el966hb2.fsf@woozle.org> (message from Neale Pickett on 27
	Nov 2002 09:41:53 -0800)
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
	<w53el966hb2.fsf@woozle.org>
Message-ID: <200211271810.gARIAK002800@localhost.localdomain>


   From: Neale Pickett <neale@woozle.org>
   Date: 27 Nov 2002 09:41:53 -0800

   So then, Tim Stone - Four Stones Expressions <tim@fourstonesExpressions.com> is all like:

   > Well, ok then, the question arises as to whether or not spambayes
   > should offer some functionality that is integratable into MX agents,
   > or an MX agent proxy, or something like that, that is spambayes
   > enabled, for webmasters or others who administer such things...  I
   > certainly wouldn't have the slightest idea how to make such a thing,
   > but it seems reasonable.

   Can't.  You have to send the 5xx response before any message data,
   including headers, is even sent.

I used to think that- in fact, it's _not_ necessary.  You can bounce after 
the "data" / 354 "message text, end with '.' on a line by itself" exchange.

    By the time they've sent the whole
   message and you get to tokenize it, the spammer has moved on to the next
   victim.  If they don't care about bounce mail they're not going to care
   about 500 errors after they've sent the entire message.

My understanding (having it beat into me by people who do RFC2821/2822
for a living, including some of the original implementors back at
the 821/822 level) is:

 1) the spammers do in fact "pipeline" and not wait for any error
     messages, but:

 2) it's a violation of the RFC to not wait for the 250 OK message
     from the recipient's server, and any mail sender that doesn't
     wait for the 250 OK cannot expect reliable delivery, hence it's
     OK for us to toss the spam rather than doing the
     otherwise-required deliver-or-bounce-without-fail.

 3) it _is_ legitimate to return a 55x error code after message data.

 4) it's actually desirable that we 5xx after the data, that way
    a legitimate sender (i.e. a false reject) will get a bounce 
    message from their MTA and will know to retry the message or
    use another medium, rather than expecting the nominal "deliver
    or bounce but never just drop" behavior.

At least, I've been beat up on another list enough about it, and
re-read the RFC with the bloody nose enough to believe them.  :-)

Read section 4.1.1.4 DATA (DATA) of RFC2821 at:

    http://www.faqs.org/rfcs/rfc2821.html

and also section 4.2.3, same URL, for the list of nominative reply
codes.  550 is reasonable to use for rejection, 554 is also 
reasonable.
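
As a rough illustration of points 3 and 4 above, here is a minimal sketch
of a server that accepts the envelope, reads the message data, and only
then chooses between 250 and 550.  Everything in it is an assumption made
for illustration: run_session, score_message and the 0.9 cutoff are
hypothetical names and values, not part of spambayes or of any MTA, and
the sketch ignores dot-stuffing, multi-line replies and real networking.

```python
def run_session(client_lines, score_message, spam_cutoff=0.9):
    """Return the server replies for a scripted SMTP client dialogue.

    score_message is a hypothetical callable that maps the message text
    to a spam probability between 0.0 and 1.0.
    """
    replies = []
    in_data = False
    body = []
    for line in client_lines:
        if in_data:
            if line == ".":
                # End of message data: only now can the content be scored.
                in_data = False
                score = score_message("\n".join(body))
                if score >= spam_cutoff:
                    replies.append("550 Message rejected as probable spam")
                else:
                    replies.append("250 OK")
                body = []
            else:
                body.append(line)
        elif line.upper() == "DATA":
            in_data = True
            replies.append("354 Start mail input; end with <CRLF>.<CRLF>")
        elif line.upper() == "QUIT":
            replies.append("221 Bye")
        else:
            # HELO, MAIL FROM and RCPT TO are accepted unconditionally here.
            replies.append("250 OK")
    return replies
```

Because the rejection is the reply to the end-of-data line, a legitimate
sender's own MTA sees the 550 and is responsible for telling the sender,
which is the false-positive benefit described in point 4.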

If you wanted, you could even split out Nigerians and other 
make.money.fast scams, and vector those as in:

  551 User not local; please try uce@fcc.gov		

which might have interesting consequences.  :-)

      -Bill Yerazunis

From samr@slriv.com  Wed Nov 27 18:18:21 2002
From: samr@slriv.com (Sam Robertson)
Date: Wed, 27 Nov 2002 10:18:21 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
Message-ID: <3DE50C6D.7040202@slriv.com>

Tim Stone - Four Stones Expressions wrote:

>Well, ok then, the question arises as to whether or not spambayes should offer 
>some functionality that is integratable into MX agents, or an MX agent proxy, 
>or something like that, that is spambayes enabled, for webmasters or others 
>who administer such things...  I certainly wouldn't have the slightest idea 
>how to make such a thing, but it seems reasonable.
>
>- TimS
>  
>
Hi, I'm Sam and have been lurking here for quite a while.  I wanted to 
chime in on this topic.  Any effort expended on notifying the bulk 
mailers as to the validity of an account at message accept isn't really 
going to net any benefit.  Bulk mailers spend more time farming new 
addresses than maintaining their list of old addresses.

If you are trying to reduce load, you might consider just blocking the 
offending MTAs with an RBL-like solution, or at the firewall.  For
example, I have pretty much all of Asia and Brazil blocked from one of
my servers.  Kind of sad really, but so goes it.

-Sam


From wsy@merl.com  Wed Nov 27 18:21:05 2002
From: wsy@merl.com (Bill Yerazunis)
Date: Wed, 27 Nov 2002 13:21:05 -0500
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <3DE50C6D.7040202@slriv.com> (message from Sam Robertson on Wed,
	27 Nov 2002 10:18:21 -0800)
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
	<3DE50C6D.7040202@slriv.com>
Message-ID: <200211271821.gARIL5o02831@localhost.localdomain>


   From: Sam Robertson <samr@slriv.com>

   >Well, ok then, the question arises as to whether or not spambayes
   >should offer some functionality that is integratable into MX
   >agents, or an MX agent proxy, or something like that, that is
   >spambayes enabled, for webmasters or others who administer such
   >things...  I certainly wouldn't have the slightest idea how to
   >make such a thing, but it seems reasonable.
   >
   >- TimS
   >  
   >
   Hi, I'm Sam and have been lurking here for quite a while.  I wanted to 
   chime in on this topic.  Any effort expended on notifying the bulk 
   mailers as to the validity of an account at message accept isn't really 
   going to net any benefit.  Bulk mailers spend more time farming new 
   addresses than maintaining their list of old addresses.

   If you are trying to reduce load, you might consider just blocking the 
   offending MTAs with an RBL-like solution, or at the firewall.  For
   example, I have pretty much all of Asia and Brazil blocked from one of
   my servers.  Kind of sad really, but so goes it.

The gain isn't in telling the _spammers_ they aren't welcome, the gain
is in telling _legitimate_ users whose mail was incorrectly rejected that the
mail _did_ bounce, and is not languishing in the spam-bucket waiting
to be deleted unread.

That's the gain- it makes a false rejection far less horrible a fate
than it would otherwise be.

     -Crash


From msergeant@startechgroup.co.uk  Wed Nov 27 18:17:05 2002
From: msergeant@startechgroup.co.uk (Matt Sergeant)
Date: 27 Nov 2002 18:17:05 +0000
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <3DE50C6D.7040202@slriv.com>
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven> 
	<3DE50C6D.7040202@slriv.com>
Message-ID: <1038421025.10258.4.camel@felony.int.star.co.uk>

On Wed, 2002-11-27 at 18:18, Sam Robertson wrote:
> Hi, I'm Sam and have been lurking here for quite a while.  I wanted to 
> chime in on this topic.  Any effort expended on notifying the bulk 
> mailers as to the validity of an account at message accept isn't really 
> going to net any benefit.  Bulk mailers spend more time farming new 
> addresses than maintaining their list of old addresses.

The benefit of 5xx'ing the spam is that FP's get informed right away. I
can't think of any other reason to do it.

Matt.
From neale@woozle.org  Wed Nov 27 18:24:13 2002
From: neale@woozle.org (Neale Pickett)
Date: 27 Nov 2002 10:24:13 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <200211271810.gARIAK002800@localhost.localdomain>
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
	<w53el966hb2.fsf@woozle.org>
	<200211271810.gARIAK002800@localhost.localdomain>
Message-ID: <w53wumy50s2.fsf@woozle.org>

So then, Bill Yerazunis <wsy@merl.com> is all like:

>    Can't.  You have to send the 5xx response before any message data,
>    including headers, is even sent.
> 
> I used to think that- in fact, it's _not_ necessary.  You can bounce
> after the "data" / 354 "message text, end with '.' on a line by
> itself" exchange.

Right.  What I meant was, it doesn't do any good to do so, from the
perspective of anti-spam software.  By the time we know enough to send
back an error message, the damage has already been done--we've devoted
resources to accepting and tokenizing the message.  Probably the disk
space to store it, too.  So there's really not much to be gained by
hooking into the MTA, when we can do just as good a job with a normal
mail filter.

>  2) it's a violation of the RFC to not wait for the 250 OK message
>      from the recipient's server, and any mail sender that doesn't
>      wait for the 250 OK cannot expect reliable delivery, hence it's
>      OK for us to toss the spam rather than doing the
>      otherwise-required deliver-or-bounce-without-fail.

I write SMTP proxies for a living, and it would be nice if everyone
played by the RFC.  Unfortunately, if you take a position of silently
dropping mail from senders who aren't 100% RFC compliant, you get a ton
of complaints from end-users.

I'm not saying that we *can't* hook into the MTA at receive time.  I'd
love it if we could, because then I could hook spambayes directly into
$FIRM's flagship product.  I am, however, saying that doing so would be
of limited value when compared to all the other more easily-implemented
options available (notably, server-side filtering such as hammiefilter
or an equivalent on a Windows box).

Neale

From tim@fourstonesExpressions.com  Wed Nov 27 18:30:49 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 27 Nov 2002 12:30:49 -0600
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <w53wumy50s2.fsf@woozle.org>
Message-ID: <B072BA2VA0ZFWR93WU09C3VD9B0G.3de50f59@riven>

11/27/2002 12:24:13 PM, Neale Pickett <neale@woozle.org> wrote:

>So then, Bill Yerazunis <wsy@merl.com> is all like:
>
>>    Can't.  You have to send the 5xx response before any message data,
>>    including headers, is even sent.
>> 
>> I used to think that- in fact, it's _not_ necessary.  You can bounce
>> after the "data" / 354 "message text, end with '.' on a line by
>> itself" exchange.
>
>Right.  What I meant was, it doesn't do any good to do so, from the
>perspective of anti-spam software.  By the time we know enough to send
>back an error message, the damage has already been done--we've devoted
>resources to accepting and tokenizing the message.  Probably the disk
>space to store it, too.  So there's really not much to be gained by
>hooking into the MTA, when we can do just as good a job with a normal
>mail filter.
>
>>  2) it's a violation of the RFC to not wait for the 250 OK message
>>      from the recipient's server, and any mail sender that doesn't
>>      wait for the 250 OK cannot expect reliable delivery, hence it's
>>      OK for us to toss the spam rather than doing the
>>      otherwise-required deliver-or-bounce-without-fail.
>
>I write SMTP proxies for a living, and it would be nice if everyone
>played by the RFC.  Unfortunately, if you take a position of silently
>dropping mail from senders who aren't 100% RFC compliant, you get a ton
>of complaints from end-users.
>
>I'm not saying that we *can't* hook into the MTA at receive time.  I'd
>love it if we could, because then I could hook spambayes directly into
>$FIRM's flagship product.  I am, however, saying that doing so would be
>of limited value when compared to all the other more easily-implemented
>options available (notably, server-side filtering such as hammiefilter
>or an equivalent on a Windows box).

What I was originally looking for was predicated on the belief that amongst 
spammers, there are degrees of evilness.  The truly evil ones couldn't care 
less about a rejection, but the less evil ones (while still quite evil) may 
actually cull rejections.  I would think eventually even the really evil ones 
would have to do some culling based on rejections.  It's too easy to change 
email addresses, and their lists would fill up with worthless addresses.  
While the cost of sending to a single worthless address is negligible, if 
their list is 50% bad addresses, then the costs add up, and they are incented 
to remove those bad addresses.

So notifying FP is nice, though I can do that from my own mailer if I wish.  
But the possibility of actually having my address removed from a spammer's 
list is infinitely more attractive to me.

- TimS


c'est moi - TimS
www.fourstonesExpressions.com 


From samr@slriv.com  Wed Nov 27 18:49:57 2002
From: samr@slriv.com (Sam Robertson)
Date: Wed, 27 Nov 2002 10:49:57 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
References: <CAGGD76ZULJTSIGKGOL64CAZVUQ4WJD.3de4fa5e@riven>
	<3DE50C6D.7040202@slriv.com> <200211271821.gARIL5o02831@localhost.localdomain>
Message-ID: <3DE513D5.4050305@slriv.com>

>
>
>The gain isn't in telling the _spammers_ they aren't welcome, the gain
>is in telling _legitimate_ users whose mail was incorrectly rejected that the
>mail _did_ bounce, and is not languishing in the spam-bucket waiting
>to be deleted unread.
>
>That's the gain- it makes a false rejection far less horrible a fate
>than it would otherwise be.
>
>     -Crash
>
So, then you would want some way for a user to 'bless' themselves.   
That moves into new territory and also complicates installation 
(deployment).  If the NDR is purely based on bayes, how do you tell the 
sender how to write a 'proper' message that won't be 'tainted'?  To me 
this is like just announcing the MTA is active to the more nefarious 
spammers.  Maybe not always a concern, but for me with 10 or so 
accounts, I would rather not be a blip on the radar, and just silently 
take my beatings. ;)

(I need more coffee...)

-Sam


From richie@entrian.com  Wed Nov 27 18:50:20 2002
From: richie@entrian.com (Richie Hindle)
Date: Wed, 27 Nov 2002 18:50:20 +0000
Subject: [Spambayes] can pop3proxy train on uploaded mbox?
In-Reply-To: <15845.1098.994309.871138@montanaro.dyndns.org>
References: <15845.1098.994309.871138@montanaro.dyndns.org>
Message-ID: <ht4auu8n6r8cfqo56o86npjlfudcc5p6h8@4ax.com>


[Skip]
> Can pop3proxy train on an mbox uploaded via the web interface or is that
> restricted to individual messages?

The bang-up-to-date one can, yes - I committed that feature yesterday.  Let
me know whether it works for you - it's very very new!

-- 
Richie Hindle
richie@entrian.com


From francois.granger@free.fr  Wed Nov 27 20:24:51 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Wed, 27 Nov 2002 21:24:51 +0100
Subject: [Spambayes] Outlook Plugin crashes
In-Reply-To: <71B37AB6576DD211AB7E008048CD4D310279881B@EXCHANGE>
References: <71B37AB6576DD211AB7E008048CD4D310279881B@EXCHANGE>
Message-ID: <a0510030eba0ada09e38b@[192.168.1.11]>

At 13:49 +0100 27/11/02, in message [Spambayes] Outlook Plugin 
crashes, Steffen Siebert wrote:
>Hello,
>
>I'm trying to install the spambayes outlook plugin, but it crashes all the
>time (see traceback below).

[...]

>   File "D:\Python22\Lib\ConfigParser.py", line 300, in __get
>     return conv(self.get(section, option))
>exceptions.ValueError: invalid literal for float(): 10.00

I guess that something is wrong in your bayescustomize.ini file.

Have a look at it to see if on one line you have:
something: "10.00"
or similar...


-- 
Email is a means of communication. People should think about the
political implications of the choices (or non-choices) they make
about their tools and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From francois.granger@free.fr  Wed Nov 27 21:03:10 2002
From: francois.granger@free.fr (=?iso-8859-1?Q?Fran=E7ois?= Granger)
Date: Wed, 27 Nov 2002 22:03:10 +0100
Subject: [Spambayes] Outlook Plugin crashes
In-Reply-To: <ODEOJDHACBEJMNHBEBHOEEBKCBAA.steffen.siebert@logware.de>
References: <ODEOJDHACBEJMNHBEBHOEEBKCBAA.steffen.siebert@logware.de>
Message-ID: <a05100314ba0ae35b1442@[192.168.1.11]>

At 21:38 +0100 27/11/02, in message RE: [Spambayes] Outlook Plugin 
crashes, Steffen Siebert wrote:
>>  >   File "D:\Python22\Lib\ConfigParser.py", line 300, in __get
>>  >     return conv(self.get(section, option))
>>  >exceptions.ValueError: invalid literal for float(): 10.00
>>
>>  I guess that something is wrong in your bayescustomize.ini file.
>>
>>  Have a look at it to see if on one line you have:
>>  something: "10.00"
>>  or similar...
>
>Since this happens on the first invocation, there isn't yet any
>bayescustomize.ini file. The offending line comes from Options.py which
>seems to contain the default values:
>
>"best_cutoff_fp_weight:     10.00"
>
>If I change the 10.00 in this line to 9.00, I'll get the same error saying
>9.00 is an invalid literal for float. This makes no sense to me :)


As far as I understand the current state of the software, this part 
is never used by the Outlook plug-in. Try commenting it out by 
starting the line with #

-- 
Email is a means of communication. People should think about the
political implications of the choices (or non-choices) they make
about their tools and technologies.
For clean email: http://minilien.com/?IXZneLoID0 - 
http://marc.herbert.free.fr/mail/ http://expita.com/nomime.html

From steffen.siebert@logware.de  Wed Nov 27 20:38:23 2002
From: steffen.siebert@logware.de (Steffen Siebert)
Date: Wed, 27 Nov 2002 21:38:23 +0100
Subject: [Spambayes] Outlook Plugin crashes
In-Reply-To: <a0510030eba0ada09e38b@[192.168.1.11]>
Message-ID: <ODEOJDHACBEJMNHBEBHOEEBKCBAA.steffen.siebert@logware.de>

> >   File "D:\Python22\Lib\ConfigParser.py", line 300, in __get
> >     return conv(self.get(section, option))
> >exceptions.ValueError: invalid literal for float(): 10.00
>
> I guess that something is wrong in your bayescustomize.ini file.
>
> Have a look at it to see if on one line you have:
> something: "10.00"
> or similar...

Since this happens on the first invocation, there isn't yet any
bayescustomize.ini file. The offending line comes from Options.py which
seems to contain the default values:

"best_cutoff_fp_weight:     10.00"

If I change the 10.00 in this line to 9.00, I'll get the same error saying
9.00 is an invalid literal for float. This makes no sense to me :)
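
For reference, here is a minimal sketch of what the failing call is meant
to do with that default line.  It uses the Python 3 configparser module
rather than the Python 2.2 ConfigParser from the traceback, and the
[TestDriver] section name and the load_fp_weight helper are assumptions
made for illustration, not a copy of Options.py.

```python
import configparser

# A cut-down stand-in for the defaults text; the section name is assumed.
DEFAULTS = """
[TestDriver]
best_cutoff_fp_weight:     10.00
"""


def load_fp_weight(text=DEFAULTS):
    """Parse the defaults text and return best_cutoff_fp_weight as a float."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # getfloat() fetches the raw string "10.00" and passes it to float().
    return parser.getfloat("TestDriver", "best_cutoff_fp_weight")
```

On a normally configured interpreter the string "10.00" converts without
complaint, so an error on this line points at the environment rather than
at the option text.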

Ciao,
  Steffen


From tim_one@email.msn.com  Wed Nov 27 21:14:36 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 16:14:36 -0500
Subject: [Spambayes] Outlook Plugin crashes
In-Reply-To: <71B37AB6576DD211AB7E008048CD4D310279881B@EXCHANGE>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEMNCPAB.tim_one@email.msn.com>

[Steffen Siebert]
> ...
> I'm using python 2.2.2 with win32all-150 under windows 2000.
>
> ...
>   File "D:\spambayes\spambayes\Options.py", line 479, in mergefilelike
>     self._update()
>   File "D:\spambayes\spambayes\Options.py", line 497, in _update
>     value = getattr(c, fetcher)(section, option)
>   File "D:\Python22\Lib\ConfigParser.py", line 306, in getfloat
>     return self.__get(section, float, option)
>   File "D:\Python22\Lib\ConfigParser.py", line 300, in __get
>     return conv(self.get(section, option))
> exceptions.ValueError: invalid literal for float(): 10.00

Makes no sense.  Try importing Options.py all by itself from an interactive
shell.  I've sometimes seen reports elsewhere that ill-behaved software sets
the user's locale to something where decimal points (".") aren't allowed in
float literals anymore.  Since Python uses the platform C library to convert
float literals to float numbers, that may be the cause.  If you change '.'
to ',' in Options.py, that may make the problem appear to go away (if so, it
would tell you it *is* a locale problem, but wouldn't tell you what's
causing it).
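
A quick way to test the locale theory is to ask the interpreter what
decimal separator the current numeric locale uses.  The sketch below is
only an illustration: decimal_point and parse_float_c_locale are
hypothetical helper names, and as far as I know recent Python versions
parse float literals independently of the locale, so the second helper
matters mainly for old interpreters like the one in this thread.

```python
import locale


def decimal_point():
    """Return the decimal separator of the current LC_NUMERIC locale."""
    return locale.localeconv()["decimal_point"]


def parse_float_c_locale(text):
    """Parse text as a float with LC_NUMERIC temporarily forced to "C"."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query, do not change
    try:
        locale.setlocale(locale.LC_NUMERIC, "C")
        return float(text)
    finally:
        # Put the previous numeric locale back no matter what happened.
        locale.setlocale(locale.LC_NUMERIC, saved)
```

If decimal_point() returns "," inside the hosting process, some other
component has changed the numeric locale, which would fit the symptom
described above.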


From lists@morpheus.demon.co.uk  Tue Nov 26 23:09:33 2002
From: lists@morpheus.demon.co.uk (Paul Moore)
Date: Tue, 26 Nov 2002 23:09:33 +0000
Subject: [Spambayes] Guidance re pickles versus DB for Outlook
References: <w534ra49u76.fsf@woozle.org>
	<WQMKF96OM4XJ2UGC2XH4Z6ZHCUQ1X.3de3a370@riven>
	<15843.50399.167526.290462@montanaro.dyndns.org>
Message-ID: <n2m-g.k7j0j5ci.fsf@morpheus.demon.co.uk>

Skip Montanaro <skip@pobox.com> writes:

>     Tim> Francois gave us a clue on that one yesterday (or so).  Looks like
>     Tim> we can rearrange this, but it will require copying the module into
>     Tim> spambayes...  yuk... another solution is to clone the
>     Tim> module... call it spambayesdbm.  Maybe that would have several
>     Tim> advantages.
>
> What's wrong with
>
>     import anydbm
>     anydbm._names.remove("dbhash")
>
> ?

Because anydbm does its magic at import time. After the import
statement has completed, _names is never used again :-(

And removing dbhash on a PythonLabs Windows distribution means you get
dumbdbm (no gdbm or dbm). Not good.
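
A small sketch of why editing the list afterwards has no effect. It does not use the real anydbm; fake_anydbm is a hypothetical stand-in that picks its backend while it is being "imported", which is the pattern described above.

```python
import types

# Hypothetical stand-in for a module that picks its backend at import time.
SOURCE = '''
_names = ["dbhash", "gdbm", "dumbdbm"]
_available = {"dbhash", "dumbdbm"}      # pretend gdbm is not installed
_default = None
for _name in _names:
    if _name in _available:
        _default = _name
        break
'''

fake_anydbm = types.ModuleType("fake_anydbm")
exec(SOURCE, fake_anydbm.__dict__)       # this plays the role of the import

fake_anydbm._names.remove("dbhash")      # too late: the choice is already made
print(fake_anydbm._default)
```

The stand-in still reports dbhash as its default, because the loop over _names ran before the list was changed.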

Paul.
-- 
This signature intentionally left blank

From mhammond@skippinet.com.au  Wed Nov 27 21:20:35 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Thu, 28 Nov 2002 08:20:35 +1100
Subject: [Spambayes] how to kill process on Windows started with os.spawn?
In-Reply-To: <15845.608.895210.155241@montanaro.dyndns.org>
Message-ID: <LCEPIIGDJPKCOIHOBJEPGEPDHOAA.mhammond@skippinet.com.au>

> I'm trying to coax pop3proxy into running an ssh command in the background,
> so my pop sessions can tunnel through an encrypted channel.  On Unix (and
> MacOSX) this looks like it will be a breeze.
> 
> If I start a process in the background like so:
> 
>     pid = os.spawnvp(os.P_NOWAIT, cmd, args)
> 
> on Unix systems I can later execute
> 
>     os.kill(pid, signal.SIGHUP)
> 
> According to the docs, on Windows systems, a process handle is returned from
> os.spawn*.  How would I kill that process on Windows, since (once again,
> according to the docs), os.kill is only available on Unix systems?

This worked for me just then:

>>> os.spawnl(os.P_NOWAIT, "f:\\windows\\notepad.exe")
548
>>> import win32api
>>> win32api.TerminateProcess(548,0)
>>> 

And sure enough, notepad was killed.

I don't know of a way without the win32api extensions.
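
For comparison, a minimal sketch of the same start-then-kill sequence with the subprocess module from later Python versions (it did not exist in 2.2). The sleeping child is only a stand-in for the ssh tunnel.

```python
import subprocess
import sys

# Start a long-running child in the background (a sleeping Python here,
# standing in for the ssh tunnel).
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# terminate() sends SIGTERM on Unix and calls TerminateProcess() on Windows.
child.terminate()
exit_code = child.wait(timeout=10)
print("child exited with", exit_code)
```

The exit code differs by platform, so code that cares should only check that the child has gone away.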

Mark.

From tim_one@email.msn.com  Wed Nov 27 21:25:09 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Wed, 27 Nov 2002 16:25:09 -0500
Subject: [Spambayes] how to kill process on Windows started with os.spawn?
In-Reply-To: <15845.608.895210.155241@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEMPCPAB.tim_one@email.msn.com>

[Skip Montanaro]
> ...
> According to the docs, on Windows systems, a process handle is
> returned from os.spawn*.  How would I kill that process on Windows,
> since (once again, according to the docs), os.kill is only available on
> Unix systems?

Before Python 2.3, you need the win32 extensions (which *this* project has
anyway -- for Outlook users).  For 2.3, I implemented something or other for
the Python core; probably killpid(); I'm on vacation and am not going to
waste it searching NEWS <wink>.


From neale@woozle.org  Wed Nov 27 22:41:38 2002
From: neale@woozle.org (Neale Pickett)
Date: 27 Nov 2002 14:41:38 -0800
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <w538yze4ov1.fsf@woozle.org>

I checked in the caching dbdict stuff.  Because of changes to how things
are stored, and the removal of the lame MetaInfo class, I had to up the
pickle version.  So don't update until you're ready to retrain.

This scheme uses much less cruft, which should make Jeremy happy.  It's
certainly made me happy :)

Neale

From tim@fourstonesExpressions.com  Wed Nov 27 23:07:04 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Wed, 27 Nov 2002 17:07:04 -0600
Subject: [Spambayes] don't update if you don't want to retrain
In-Reply-To: <w538yze4ov1.fsf@woozle.org>
Message-ID: <WRON1UQKXNMB7KH3OKJF5YQP8296B6.3de55018@riven>

So... does this lay to rest forever the pickle/dbm debate?  Is there any 
reason left to use a pickle?

- TimS

11/27/2002 4:41:38 PM, Neale Pickett <neale@woozle.org> wrote:

>I checked in the caching dbdict stuff.  Because of changes to how things
>are stored, and the removal of the lame MetaInfo class, I had to up the
>pickle version.  So don't update until you're ready to retrain.
>
>This scheme uses much less cruft, which should make Jeremy happy.  It's
>certainly made me happy :)
>
>Neale


c'est moi - TimS
www.fourstonesExpressions.com 


From steve@blighty.com  Thu Nov 28 01:28:40 2002
From: steve@blighty.com (Steve Atkins)
Date: Wed, 27 Nov 2002 17:28:40 -0800
Subject: [Spambayes] Bouncing Spam
In-Reply-To: <20021127165504.918FCF589@cashew.wolfskeep.com>;
	from popiel@wolfskeep.com on Wed, Nov 27, 2002 at 08:55:04AM -0800
References: <XV1VXNIPSQ76QRMWUAW1VYSXUVU.3de43a5b@riven>
	<20021126201756.A20174@blighty.com> <3DE4A5DD.9010904@startechgroup.co.uk>
	<msergeant@startechgroup.co.uk>
	<20021127165504.918FCF589@cashew.wolfskeep.com>
Message-ID: <20021127172840.B22133@blighty.com>

On Wed, Nov 27, 2002 at 08:55:04AM -0800, T. Alexander Popiel wrote:
> In message:  <3DE4A5DD.9010904@startechgroup.co.uk>
>              Matt Sergeant <msergeant@startechgroup.co.uk> writes:
> >Steve Atkins said the following on 27/11/02 04:17:
> >> The only way to safely bounce spam is to drop it with a permanent
> >> (5xx) rejection at an appropriate point in the SMTP transaction at the
> >> first point it enters your network (i.e. if your secondary MX accepts
> >> it, don't reject the delivery to your primary).
> >
> >Another thing that came up discussing this on the SpamAssassin list is 
> >that this only works on the mail server that is your MX server. If you 
> >get mail through a third party (e.g. an ISP that might forward to your 
> >SMTP server) then it doesn't work.
> 
> As an aside, I'll note that it's perfectly reasonable behaviour to
> have a bounce response late in the forwarding chain (as in a primary MX
> rejecting it when a secondary MX accepted it).

Yes, it is. It's the Right Way to do it if you're engineering huge
mail systems and want to keep them secure, too. Sucks when the
envelope-from is forged, though.

> The problem is not with
> a violation of SMTP, but in the fact that the spammers routinely ignore
> foreign-generated bounce messages (since that would require them to run
> an SMTP server of their own (and actually process data from it))

That's not the issue. It's that the envelope-from is routinely (pretty
much invariably) forged, so they're never going to see any bounce or
NDR anyway.

> and
> only pay attention to errors in outgoing conversations that their mail-
> blasting tools have.

They tend not to pay attention to that in low-level delivery
spamware. But some do, and dictionary attack and list verification
spamware do pay attention which'll gradually reduce delivery
attempts. That's not the important point, though.

The important thing is that if you are rejecting during the original
delivery you are not causing the spam to be sent to any innocent
third-party[1], which in all the other cases (faked bounces, SMTP
level bounces anywhere other than the MX) you will be doing most of
the time.

If you run software that causes the spam that was originally sent to
you to be bounced to an innocent third-party you're a part of the
problem. So it's a good thing to avoid doing that.
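
A rough sketch of the "reject during the original delivery" idea, assuming the receiving MX can score a message before it answers the end of DATA. The function name smtp_reply and the 0.90 cutoff are illustrative assumptions, not part of any real mail server.

```python
def smtp_reply(spam_score, spam_cutoff=0.90):
    """Pick the reply to give while the sending host is still connected.

    A 5xx reply leaves the message with the connecting host, so nothing is
    ever generated towards the (probably forged) envelope sender.
    """
    if spam_score >= spam_cutoff:
        return "550 5.7.1 Message refused: classified as spam"
    return "250 2.0.0 OK"


for score in (0.02, 0.97):
    print(score, "->", smtp_reply(score))
```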

Cheers,
  Steve

[1] The postmaster of an open-relay... I have a lot of sympathy for
    them, but they're not just a random innocent third party in this
    context.

From Paul.Moore@atosorigin.com  Thu Nov 28 09:35:59 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 28 Nov 2002 09:35:59 -0000
Subject: [Spambayes] don't update if you don't want to retrain
Message-ID: <16E1010E4581B049ABC51D4975CEDB8861995D@UKDCX001.uk.int.atosorigin.com>

From: Tim Stone - Four Stones Expressions
> So... does this lay to rest forever the pickle/dbm debate?  Is there any
> reason left to use a pickle?

Sorry, quite the opposite (IMHO). The patch switches to using shelve, which
uses anydbm, which (still) uses the buggy BerkeleyDB 1.85 on Windows. So
Windows users should probably still use pickles.

Basically, you're never going to avoid the fact that Windows users don't
have a reliable DBM implementation by default (unless you count dumbdbm).
So you either use pickles, or ship/require some 3rd party solution.

[Assuming that *in practice* the risk involved with the DBM 1.85 bugs is
high enough to be worth worrying about - it's only a Python 2.2 issue, as
2.3 will have a newer DBM implementation included].
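
A minimal sketch of the pickle route, which needs no DBM library at all. The token counts and the file name hammie.pck are made up for illustration; the real training data has its own layout.

```python
import os
import pickle
import tempfile

# Hypothetical training data: token -> (spam count, ham count).
counts = {"viagra": (12, 0), "python": (0, 31)}

path = os.path.join(tempfile.mkdtemp(), "hammie.pck")   # made-up file name
with open(path, "wb") as f:
    pickle.dump(counts, f, protocol=2)

with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == counts)
```

The trade-off is that the whole dictionary is read and written in one go, which is what the DBM approach tries to avoid.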

Paul.

From steffen.siebert@logware.de  Thu Nov 28 11:08:23 2002
From: steffen.siebert@logware.de (Steffen Siebert)
Date: Thu, 28 Nov 2002 12:08:23 +0100
Subject: [Spambayes] Outlook Plugin crashes
Message-ID: <71B37AB6576DD211AB7E008048CD4D310279881D@EXCHANGE>

> [Tim Peters]
> I've sometimes seen reports elsewhere that ill-behaved software sets
> the user's locale to something where decimal points (".") aren't allowed in
> float literals anymore.

> If you change '.' to ',' in Options.py, that may make the problem appear
> to go away (if so, it would tell you it *is* a locale problem, but wouldn't
> tell you what's causing it).

I've done that and the Outlook plugin seems to work now :) I'm using the
German version of Outlook 2000, and since the German locale uses a comma
instead of a point as the decimal separator, Outlook itself may be causing
the problem.

Ciao,
  Steffen

From jm@jmason.org  Thu Nov 28 11:52:55 2002
From: jm@jmason.org (Justin Mason)
Date: Thu, 28 Nov 2002 11:52:55 +0000
Subject: [Spambayes] fwd: robinson f(w) equation: X constant confusion
Message-ID: <20021128115300.6C18416F16@jmason.org>


hi folks --

just wondering about this.  I ran some tests which wandered across
the landscape of X and S values (as used in Gary Robinson's f(w)
equation), and computed a cost figure based on a corpus of 2000
spam v. 1000 ham, then graphed it.

Results are here:

   http://spamassassin.taint.org/qa/s_x_gary.png

note that X=0.53 S=0.05 and X=0.69 S=0.32 seem to
give the best results.

However, computing X, as per Gary's webpage, results in a value of 0.32.
But according to that graph, 0.32 is pretty much crap ;)
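
For readers who don't have the equation to hand, Robinson's adjustment is f(w) = (s*x + n*p(w)) / (s + n), where p(w) is the raw spam probability of word w, n is the number of messages the word was seen in, x is the assumed probability for an unknown word and s is the strength given to that assumption. A small sketch, using the X=0.53 S=0.05 pair from the graph as example values:

```python
def robinson_fw(p_w, n, s, x):
    """Robinson's f(w): pull the raw probability p_w towards x.

    n is how many messages contained the word; s says how many
    observations the prior guess x is worth.
    """
    return (s * x + n * p_w) / (s + n)


# A word never seen before gets x; a word seen often stays close to p_w.
for n in (0, 1, 10, 100):
    print(n, round(robinson_fw(0.99, n, s=0.05, x=0.53), 4))
```

With S this small, even a single observation almost completely overrides X.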

As Allen says below:

> I can't see any reason why that would cause this - it's the same corpus
> giving an 0.32 result, after all. I'm more thinking that, as per the above,
> that the optimal robinson_x almost certainly _isn't_ a simple average of the
> p-values - especially not of the p-values computed using the robinson
> equation in the first place and using ones that have less than 10 or so
> points of data each. Something to work on at some point...

Anyone thought about this?  How did you guys come up with your X and S
figures?

(BTW same thing for Chi-squared combining is at
http://spamassassin.taint.org/qa/s_x_chi.png, if you're interested.  Note
that optimal values seem to be quite different here!)

--j.

------- Forwarded Message

Date:    Thu, 28 Nov 2002 00:45:28 -0500
From:    Ed Allen Smith <easmith@beatrice.rutgers.edu>
To:      jm@jmason.org
cc:      SpamAssassin-devel@lists.sourceforge.net
Subject: Re: [SAdev] bayes 10pcv results, pass 8

In message <20021127113958.52AA916F89@jmason.org> (on 27 November 2002 11:39:53
 +0000), jm@jmason.org (Justin Mason) wrote:
>
>Ed Allen Smith said:
>> >- 85.80 robx30
>> >   In other words, using the computed value for robinson_x as suggested
>> >   by Allen; 0.32 on my test corpus.  Didn't work :(  there was more
>> >   spillage of scores across the middle-ground.
>> 
>> Ah, well. Wait - 0.32? Fascinating. Was
>> http://spamassassin.taint.org/qa/s_x_gary.png run versus that corpus? It's
>> showing the optimal robinson_x being slightly _above_ 0.5, which may
>> indicate a different means of computing the optimal robinson_x than the
>> current method.
>
>yep, it's all run on the same corpus.

Huh. 

>Strange, isn't it?
>
>Maybe it's just illustrating some overfitting to my corpus...

I can't see any reason why that would cause this - it's the same corpus
giving an 0.32 result, after all. I'm more thinking that, as per the above,
that the optimal robinson_x almost certainly _isn't_ a simple average of the
p-values - especially not of the p-values computed using the robinson
equation in the first place and using ones that have less than 10 or so
points of data each. Something to work on at some point...

       -Allen

-- 
Allen Smith			http://cesario.rutgers.edu/easmith/
September 11, 2001		A Day That Shall Live In Infamy II
"They that can give up essential liberty to obtain a little temporary
safety deserve neither liberty nor safety." - Benjamin Franklin


------- End of Forwarded Message


From papaDoc@videotron.ca  Thu Nov 28 14:11:16 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 28 Nov 2002 09:11:16 -0500
Subject: [Spambayes] pop3proxy with multiple pop server
Message-ID: <3DE62404.5070608@videotron.ca>

Hi,

   I was testing pop3proxy with only one server, pop.videotron.ca,
and everything was going smoothly. The ham and spam were filtered and
the number of unsures was very low.

Then I said, since it is working great, I will set this up for
all my accounts. he.. he...

pop3proxy lets you define several servers in Options.py, but
how can you define the username for each account?

Ex:
name1 for mail.gmc.ulaval.ca
name2 for pop.videotron.ca
name3 for pop.videotron.net

I looked at the code but found no clue (I need some practice with Python ;-6 )

If someone helps me with this I will write instructions on how to
use pop3proxy with Mozilla and submit my work to be included in the
INTEGRATION.txt file


papaDoc


From steffen.siebert@logware.de  Thu Nov 28 14:34:55 2002
From: steffen.siebert@logware.de (Steffen Siebert)
Date: Thu, 28 Nov 2002 15:34:55 +0100
Subject: [Spambayes] Outlook - Orphaned Buttons
Message-ID: <71B37AB6576DD211AB7E008048CD4D3102798823@EXCHANGE>

Hi,

The Outlook plugin works now, but I now face the problem that the toolbar
has two sets of Spambayes buttons: one set has no function and the
other works as expected.

Running "python addin.py --unregister" successfully unregisters the plugin,
leaving only one set of buttons (the non-functioning ones ;) I tried to
remove them from the toolbar myself, but the non-standard buttons are
greyed-out and can't be removed.

Can someone help me with this?

Ciao,
  Steffen


From Paul.Moore@atosorigin.com  Thu Nov 28 14:54:58 2002
From: Paul.Moore@atosorigin.com (Moore, Paul)
Date: Thu, 28 Nov 2002 14:54:58 -0000
Subject: [Spambayes] pop3proxy with multiple pop server
Message-ID: <16E1010E4581B049ABC51D4975CEDB885E2E30@UKDCX001.uk.int.atosorigin.com>

From: papaDoc [mailto:papaDoc@videotron.ca]
> pop3proxy lets you define several servers in Options.py, but
> how can you define the username for each account?

The proxy passes the username through to the ISP's server. As an
example, suppose you have pop3proxy, serving

    pop3.isp1.com on port 8110
    pop3.isp2.com on port 8111

And suppose your accounts are user1 on isp1 and user2 on isp2.

Then, in your client, you define 2 mail sources:

    mail1:  user ID user1 on localhost:8110
    mail2:  user ID user2 on localhost:8111

When the client connects, it passes the username through the proxy
to the ISP's server.

Did that make sense?

Paul.

From tim@fourstonesExpressions.com  Thu Nov 28 14:57:52 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 28 Nov 2002 08:57:52 -0600
Subject: [Spambayes] pop3proxy with multiple pop server
Message-ID: <GANJUPD0MG83YGD4WXRB7LG4WTN0B7.3de62ef0@riven>

papaDoc, the username and password are simply passed through by the proxy from
your email client.  For each email account you have, you configure your email 
client roughly as follows:

server name: localhost
server port: <a unique port number>
userid: <whatever userid is appropriate>
password: <whatever password is appropriate>

For each email account you have, you add configuration to the pop3proxy as 
follows:

pop3proxy_servers: <a pop3 server>
pop3proxy_ports: <a unique port number, matching above>

This ties the pop3server to the local port, allowing the proxy to intercept 
your messages and insert the spambayes header.  In your example, quoted below, 
it would kinda look like this:

***************************************************
Email client configured for three servers:

server name: localhost
server port: 6111
userid: name1
password: ********

server name: localhost
server port: 6112
userid: name2
password: *******

server name: localhost
server port: 6113
userid: name3
password: *******

Pop3proxy configured for three servers

pop3proxy_servers: mail.gmc.ulaval.ca,pop.videotron.ca,pop.videotron.net
pop3proxy_ports: 6111,6112,6113

*****************************************************
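
The pairing above is purely positional: the first port goes with the first server, and so on. A minimal sketch of that rule, assuming both options are plain comma-separated strings. The function pair_servers_with_ports is a made-up name, not the proxy's own code.

```python
def pair_servers_with_ports(servers, ports):
    """Map each local port to the remote POP3 server it should proxy."""
    names = [s.strip() for s in servers.split(",") if s.strip()]
    numbers = [int(p) for p in ports.split(",") if p.strip()]
    if len(names) != len(numbers):
        raise ValueError("need exactly one port per server")
    return dict(zip(numbers, names))


listeners = pair_servers_with_ports(
    "mail.gmc.ulaval.ca,pop.videotron.ca,pop.videotron.net",
    "6111,6112,6113",
)
for port, server in sorted(listeners.items()):
    print("localhost:%d -> %s" % (port, server))
```

The username never appears here because the proxy only forwards whatever the mail client sends on that port.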

All that said, I need to tell you that we're currently working on a problem with
this configuration, where occasionally the proxy will hang.  If you experience 
this problem, the only solution right now is to run three separate pop3proxy 
instances, one for each account.  You cannot use the configuration items in 
Options.py to support this configuration, so you'll have to use command line 
options.  The equivalent command lines would therefore be:

pop3proxy.py -l 6111 -u 8881 -d sbdb1 mail.gmc.ulaval.ca
pop3proxy.py -l 6112 -u 8882 -d sbdb2 pop.videotron.ca
pop3proxy.py -l 6113 -u 8883 -d sbdb3 pop.videotron.net

As you see, in this configuration it is not possible to share training 
databases.  You could make (for example) sbdb1 your master, and only do 
training to it, through a web browser pointed at http://localhost:8881.  The 
master could then be periodically copied to sbdb2 and sbdb3, so that those 
proxy processes make the same spam decisions.  The proxies will need to be 
terminated before this copy, as a hot copy of those files will probably kill 
the processes.
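
A minimal sketch of that periodic copy, to be run only while the proxies are stopped. It assumes each database is a single file named as on the command lines above; some DBM flavours write several files, in which case each of them would need copying. The name copy_master is made up for illustration.

```python
import os
import shutil
import tempfile


def copy_master(master, replicas):
    """Overwrite each replica database file with a copy of the master."""
    for replica in replicas:
        shutil.copyfile(master, replica)


# Demonstration in a scratch directory with a fake database file.
workdir = tempfile.mkdtemp()
master = os.path.join(workdir, "sbdb1")
replicas = [os.path.join(workdir, name) for name in ("sbdb2", "sbdb3")]
with open(master, "wb") as f:
    f.write(b"trained data")

copy_master(master, replicas)
print([os.path.basename(r) for r in replicas if os.path.exists(r)])
```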

As you can tell, this is far less than an ideal situation, and only temporary 
anyway.  We will certainly correct our hang problem, and then life will be 
good!

Hope this helps.  - Tim Stone

11/28/2002 8:11:16 AM, papaDoc <papaDoc@videotron.ca> wrote:

>Hi,
>
>   I was testing pop3proxy with only one server, pop.videotron.ca,
>and everything was going smoothly. The ham and spam were filtered and
>the number of unsures was very low.
>
>Then I said, since it is working great, I will set this up for
>all my accounts. he.. he...
>
>pop3proxy lets you define several servers in Options.py, but
>how can you define the username for each account?
>
>Ex:
>name1 for mail.gmc.ulaval.ca
>name2 for pop.videotron.ca
>name3 for pop.videotron.net
>
>I looked at the code but found no clue (I need some practice with Python ;-6 )
>
>If someone helps me with this I will write instructions on how to
>use pop3proxy with mozilla and submit my work to be included in the
>INTEGRATION.txt file
>
>
>papaDoc


c'est moi - TimS
www.fourstonesExpressions.com 


From richie@entrian.com  Thu Nov 28 16:13:05 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 28 Nov 2002 16:13:05 +0000
Subject: [Spambayes] pop3proxy with multiple pop server
In-Reply-To: <GANJUPD0MG83YGD4WXRB7LG4WTN0B7.3de62ef0@riven>
References: <GANJUPD0MG83YGD4WXRB7LG4WTN0B7.3de62ef0@riven>
Message-ID: <g9fcuusu24nrrr6uasvbvc823qd0ovhmdf@4ax.com>


[Tim Stone]
> All that said, I need to tell you that we're currently working on a problem with
> this configuration, where occasionally the proxy will hang.

This is now fixed (it wasn't a problem with multiple servers as it turns
out, but it's fixed anyway).

-- 
Richie Hindle
richie@entrian.com


From papaDoc@videotron.ca  Thu Nov 28 16:27:37 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 28 Nov 2002 11:27:37 -0500
Subject: [Spambayes] pop3proxy with multiple pop server
Message-ID: <3DE643F9.1040509@videotron.ca>

Hi,

   Thanks to you Richie, TimS, Paul,

Now it is working...... but now I have to write a how-to <wink>.


papaDoc


From francois.granger@free.fr  Thu Nov 28 17:13:53 2002
From: francois.granger@free.fr (François Granger)
Date: Thu, 28 Nov 2002 18:13:53 +0100
Subject: [Spambayes] pop3proxy with multiple pop server
In-Reply-To: <GANJUPD0MG83YGD4WXRB7LG4WTN0B7.3de62ef0@riven>
Message-ID: <BA0C0D60.5D678%francois.granger@free.fr>

The part of the mail below _has_ to go in the documentation. It is a really
clear and simple explanation of the pop3proxy configuration.

Thanks

on 28/11/02 15:57, Tim Stone - Four Stones Expressions at
tim@fourstonesExpressions.com wrote:

> papadoc, the username and password are simply passed through by the proxy from
> your email client.  For each email account you have, you configure your email
> client roughly as follows:
> 
> server name: localhost
> server port: <a unique port number>
> userid: <whatever userid is appropriate>
> password: <whatever password is appropriate>
> 
> For each email account you have, you add configuration to the pop3proxy as
> follows:
> 
> pop3proxy_servers: <a pop3 server>
> pop3proxy_ports: <a unique port number, matching above>
> 
> This ties the pop3server to the local port, allowing the proxy to intercept
> your messages and insert the spambayes header.  In your example, quoted below,
> it would kinda look like this:
> 
> ***************************************************
> Email client configured for three servers:
> 
> server name: localhost
> server port: 6111
> userid: name1
> password: ********
> 
> server name: localhost
> server port: 6112
> userid: name2
> password: *******
> 
> server name: localhost
> server port: 6113
> userid: name3
> password: *******
> 
> Pop3proxy configured for three servers
> 
> pop3proxy_servers: mail.gmc.ulaval.ca,pop.videotron.ca,pop.videotron.net
> pop3proxy_ports: 6111,6112,6113

-- 
Le courrier est un moyen de communication. Les gens devraient
se poser des questions sur les implications politiques des choix (ou non
choix) de leurs outils et technologies. Pour des courriers propres :
<http://marc.herbert.free.fr/mail/> -- <http://minilien.com/?IXZneLoID0>


From tim@fourstonesExpressions.com  Thu Nov 28 17:13:20 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 28 Nov 2002 11:13:20 -0600
Subject: [Spambayes] pop3proxy log now off by default
Message-ID: <HFVSC7SP31EC1ZC04B8SMA0GCQ76UQ.3de64eb0@riven>

I've changed the pop3proxy so that the log file is only written if 
options.verbose == True.  This is so that when the proxy is being used by the 
masses, this log file doesn't accumulate on their hard drive.
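
A minimal sketch of that kind of guard. The Options class and the log function here are simplified stand-ins, not the proxy's actual code.

```python
import io


class Options:
    """Simplified stand-in for the real options object."""
    verbose = False


def log(message, options, stream):
    """Write message to stream only when verbose mode is switched on."""
    if options.verbose:
        stream.write(message + "\n")


options = Options()
stream = io.StringIO()          # stands in for the log file
log("quiet", options, stream)   # dropped: verbose is off by default
options.verbose = True
log("noisy", options, stream)   # written
print(repr(stream.getvalue()))
```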

c'est moi - TimS
www.fourstonesExpressions.com 


From papaDoc@videotron.ca  Thu Nov 28 17:33:09 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 28 Nov 2002 12:33:09 -0500
Subject: [Spambayes] pop3proxy with multiple pop server
In-Reply-To: <BA0C0D60.5D678%francois.granger@free.fr>
References: <BA0C0D60.5D678%francois.granger@free.fr>
Message-ID: <3DE65355.1030605@videotron.ca>

 > François Granger wrote:

>The part of the mail below _has_ to go in the documentation. It is a really
>clear and simple explanation of the pop3proxy configuration.
>
I will include it since this helped me figure out what was going on.

I will do the howto during the weekend and have a first draft for Monday.


papaDoc


From tim_one@email.msn.com  Thu Nov 28 18:22:57 2002
From: tim_one@email.msn.com (Tim Peters)
Date: Thu, 28 Nov 2002 13:22:57 -0500
Subject: [Spambayes] fwd: robinson f(w) equation: X constant confusion
In-Reply-To: <20021128115300.6C18416F16@jmason.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEADDAAB.tim_one@email.msn.com>

[Justin Mason]
> just wondering about this.  I ran some tests which wandered across
> the landscape of X and S values (as used in Gary Robinson's f(w)
> equation), and computed a cost figure based on a corpus of 2000
> spam v. 1000 ham, then graphed it.
>
> Results are here:
>
>    http://spamassassin.taint.org/qa/s_x_gary.png
>
> note that X=0.53 S=0.05 and X=0.69 S=0.32 seem to
> give the best results.
>
> However, computing X, as per Gary's webpage, results in a value of 0.32.
> But according to that graph, 0.32 is pretty much crap ;)

Rob Hooft ran some downhill Simplex optimizations that also converged on X a
bit over 0.5, and S substantially smaller than we use by default (we use
0.45 by default).

On three different sets of test data, I measured "the average" spamprob to
be a bit over 0.5 too (it ranged from 0.52 to 0.56).

A difference is that the test data I used had about the same number of ham
as spam, while you've got a 1::2 ratio.  Are you sure you weren't using 1000
spam vs 2000 ham?  If you were, and "the true unknown word" spamprob were
about 0.5, I'd expect you to measure one near 1/3, since there would be (to
a 0th-order approximation <wink>) about twice as many ham-word spamprobs
feeding into the computed average than there were spam-word spamprobs
feeding into it, and that would drag the average below 0.5 simply due to
having more of one kind of word than the other.

IOW, Gary's suggestion for guessing x appears to me to be sensitive to the
ham::spam ratio, but the method used for guessing spamprobs tries (with
mixed results) not to be sensitive to that ratio.  Mismatching assumptions,
then.
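
A deliberately crude sketch of that 0th-order argument. It assumes every ham-only word has spamprob 0 and every spam-only word has spamprob 1, with twice as many of the former; real spamprobs are nowhere near that clean.

```python
# Twice as many ham-flavoured words as spam-flavoured ones.
ham_word_probs = [0.0] * 2000
spam_word_probs = [1.0] * 1000

all_probs = ham_word_probs + spam_word_probs
average = sum(all_probs) / len(all_probs)

# The plain average is dragged to about 1/3 by the word counts alone,
# even though neither kind of word says anything about unknown words.
print(round(average, 3))
```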

The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but
don't pay much attention to it".  I'm not sure what your cost measure is; as
we measure costs by default, an FP is charged 10, in which case the contour
lines ranging from 80 to 90 are showing the difference between one FP more
or less; this *can* make them supremely sensitive to just one or two oddball
msgs.
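
To make the "one FP more or less" point concrete, here is a sketch of a weighted cost of that shape. Only the FP weight of 10 comes from the text above; the FN and unsure weights used as defaults are assumptions for illustration.

```python
def total_cost(fp, fn, unsure, fp_weight=10.0, fn_weight=1.0, unsure_weight=0.2):
    """Weighted cost of a test run; fn_weight and unsure_weight are assumed."""
    return fp * fp_weight + fn * fn_weight + unsure * unsure_weight


# Moving from the 80 contour to the 90 contour can be a single extra FP.
print(total_cost(8, 0, 0), total_cost(9, 0, 0))
```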


From tim@fourstonesExpressions.com  Thu Nov 28 18:33:59 2002
From: tim@fourstonesExpressions.com (Tim Stone - Four Stones Expressions)
Date: Thu, 28 Nov 2002 12:33:59 -0600
Subject: [Spambayes] Options for the masses
Message-ID: <62ON5TOURXRQMTNKGYUOEDUPXS51YS.3de66197@riven>

Seeking input on the following two lists.

This set of files is the 'minimum' that is needed to execute either pop3proxy 
or hammie*.  I've not done an analysis on the outlook stuff yet.  (I'd like to 
see mboxutils be absorbed into Corpus).

chi2.py
classifier.py
Corpus.py
dbdict.py
FileCorpus.py
hammie.py
hammiebulk.py
hammiecli.py
hammiefilter.py
hammiesrv.py
mboxutils.py
Options.py
pop3proxy.py
sets.py
storage.py
tokenizer.py


I've scanned these files for options that are used.  Unfortunately, they were 
alphabetized by my efforts to isolate them...  Of these options, I've splatted 
the ones that I think should normally be customizable by an average user type 
person.

address_headers
basic_header_skip
basic_header_tokenize
basic_header_tokenize_only
check_octets
clue_mailheader_cutoff
count_all_header_lines
experimental_ham_spam_imbalance_adjustment
extract_dow
generate_long_skips
generate_time_buckets
*ham_cutoff
hammie_debug_header
hammie_debug_header_name
*hammie_header_name
*hammiefilter_persistent_storage_file
hammiefilter_persistent_use_database
*header_ham_string
header_score_digits
header_score_logarithm
*header_spam_string
*header_unsure_string
*html_ui_launch_browser
*html_ui_port
max_discriminators
mine_received_headers
minimum_prob_strength
octet_prefix_size
*pop3proxy_cache_expiry_days
pop3proxy_cache_use_gzip
*pop3proxy_ham_cache
*pop3proxy_persistent_storage_file
pop3proxy_persistent_use_database
pop3proxy_port
*pop3proxy_ports
pop3proxy_server_name
pop3proxy_server_port
*pop3proxy_servers
*pop3proxy_spam_cache
*pop3proxy_unknown_cache
record_header_absence
replace_nonascii_chars
retain_pure_html_tags
safe_headers
skip_max_word_size
*spam_cutoff
unknown_word_prob
unknown_word_strength
use_chi_squared_combining
use_gary_combining

Opinions?

c'est moi - TimS
www.fourstonesExpressions.com 


From papaDoc@videotron.ca  Thu Nov 28 19:32:48 2002
From: papaDoc@videotron.ca (papaDoc)
Date: Thu, 28 Nov 2002 14:32:48 -0500
Subject: [Spambayes] More explicit prinout ?
Message-ID: <3DE66F60.8010705@videotron.ca>

Hi,

I'm submitting this patch for pop3proxy.
Instead of having only
BayesProxyListener listening on port 6110.
BayesProxyListener listening on port 6111.
BayesProxyListener listening on port 6112.
the patch makes pop3proxy print
BayesProxyListener listening on port 6110 for mail.gmc.ulaval.ca:110.
BayesProxyListener listening on port 6111 for pop.videotron.ca:110.
BayesProxyListener listening on port 6112 for pop.videotron.net:110.

This helps you set things up if you don't remember the order of the POP
servers in Options.py.

Please be nice to me; this is my first Python code modification.
I had to look in the documentation to find out the type of factoryArgs and
then how to print a tuple.

Index: pop3proxy.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/pop3proxy.py,v
retrieving revision 1.26
diff -r1.26 pop3proxy.py
162c162,166
<         print "%s listening on port %d." % (self.__class__.__name__, port)
---
>         if len(factoryArgs) >= 2:
>             print "%s listening on port %d for %s:%d." % (self.__class__.__name__, port,
>                                                           factoryArgs[0], factoryArgs[1])
>         else:
>             print "%s listening on port %d." % (self.__class__.__name__, port)


papaDoc


From richie@entrian.com  Thu Nov 28 19:36:18 2002
From: richie@entrian.com (Richie Hindle)
Date: Thu, 28 Nov 2002 19:36:18 +0000
Subject: [Spambayes] More explicit prinout ?
In-Reply-To: <3DE66F60.8010705@videotron.ca>
References: <3DE66F60.8010705@videotron.ca>
Message-ID: <2urcuugf1u746177bnalqmgiiso0np0jfn@4ax.com>

Hi papaDoc,

> I'm submitting this patch for pop3proxy.
> Instead of having only
> BayesProxyListener listening on port 6110.
> BayesProxyListener listening on port 6111.
> BayesProxyListener listening on port 6112.
> the patch makes pop3proxy prints
> BayesProxyListener listening on port 6110 for mail.gmc.ulaval.ca:110.
> BayesProxyListener listening on port 6111 for pop.videotron.ca:110.
> BayesProxyListener listening on port 6112 for pop.videotron.net:110.

Many thanks, but Tim Stone has beaten you to it!  Update from CVS and
you'll get something like this:

Listener on port 110 is proxying pop3.demon.co.uk:110
Listener on port 111 is proxying pop3.freeserve.net:110
User interface url is http://localhost:8880

-- 
Richie Hindle
richie@entrian.com


From jm@jmason.org  Thu Nov 28 22:25:17 2002
From: jm@jmason.org (Justin Mason)
Date: Thu, 28 Nov 2002 22:25:17 +0000
Subject: [Spambayes] fwd: robinson f(w) equation: X constant confusion 
In-Reply-To: Message from "Tim Peters" <tim_one@email.msn.com> 
	<LNBBLJKPBEHFEDALKOLCMEADDAAB.tim_one@email.msn.com> 
Message-ID: <20021128222522.535AB16F16@jmason.org>


Tim Peters said:
> Rob Hooft ran some downhill Simplex optimizations that also converged on X a
> bit over 0.5, and S substantially smaller than we use by default (we use
> 0.45 by default).
> On three different sets of test data, I measured "the average" spamprob to
> be a bit over 0.5 too (it ranged from 0.52 to 0.56).

Interesting!

> A difference is that the test data I used had about the same number of ham
> as spam, while you've got a 1::2 ratio.  Are you sure you weren't using 1000
> spam vs 2000 ham?  If you were, and "the true unknown word" spamprob were
> about 0.5, I'd expect you to measure one near 1/3, since there would be (to
> a 0th-order approximation <wink>) about twice as many ham-word spamprobs
> feeding into the computed average than there were spam-word spamprobs
> feeding into it, and that would drag the average below 0.5 simply due to
> having more of one kind of word than the other.

Actually, I've just checked -- it's not 2k:1k, it's 2k:2k.  So it should
be even.

> IOW, Gary's suggestion for guessing x appears to me to be sensitive to the
> ham::spam ratio, but the method used for guessing spamprobs tries (with
> mixed results) not to be sensitive to that ratio.  Mismatching assumptions,
> then.

Interesting, BTW.  Do you guys use the estimated X instead of a constant?
Sounds like it could vary greatly depending on corpus ratios...

> The X=0.53 S=0.05 result is cute -- it roughly says "it's about 50-50, but
> don't pay much attention to it".

There's another "sweet spot" at X=.69 and S=.42, which mystifies me;
I would have thought that would cause more FPs, which is worst for
the cost (see below).
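
For readers following along, here is a minimal sketch of how X and S enter
the adjustment, assuming the usual form of Robinson's formula,
f(w) = (S*X + n*p(w)) / (S + n), where n is the number of training messages
containing the word and p(w) is the raw spamprob.  The function and
parameter names are mine, not the spambayes or SpamAssassin code.

```python
def robinson_f(p, n, s, x):
    """Adjusted word probability: a weighted blend of the prior guess x
    (with strength s) and the observed probability p (with weight n)."""
    return (s * x + n * p) / (s + n)


# A word seen once with raw spamprob 0.9, under the two settings discussed.
weak_prior = robinson_f(0.9, 1, s=0.05, x=0.53)    # small S: data dominates
strong_prior = robinson_f(0.9, 1, s=0.45, x=0.53)  # larger S: pulled toward x

# A never-seen word (n == 0) simply gets the prior x.
unseen = robinson_f(0.0, 0, s=0.05, x=0.53)
```

With a small S the prior X barely moves a word that has been seen even
once, which fits Tim's reading of the X=0.53 S=0.05 result as "don't pay
much attention to it".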

> I'm not sure what your cost measure is; as
> we measure costs by default, an FP is charged 10, in which case the contour
> lines ranging from 80 to 90 are showing the difference between one FP more
> or less; this *can* make them supremely sensitive to just one or two oddball
> msgs.

The cost measure is a direct copy of the spambayes one, so they can
be compared ;)  (I also use TCR, the cost measure used by Ion
Androutsopoulos' papers; but being able to see "unsures" helps
us pick a good scheme which maps well into SpamAssassin scores.)

BTW an interesting factor is that those scores are measured using
a high "min prob strength" factor; I used 0.27.  I'm running more tests
where this varies, and I think that'll be quite interesting too ;)

PS: while I'm here -- I'm also comparing chi2 with gary-combining.  I'm
finding chi2 to have quite a few more FPs in particular, right in the 0.00
spike.  Do you guys see much of this?  Or have I screwed up my code with
all this constant-tweaking? ;)

--j.

From mhammond@skippinet.com.au  Sat Nov 30 00:49:15 2002
From: mhammond@skippinet.com.au (Mark Hammond)
Date: Sat, 30 Nov 2002 11:49:15 +1100
Subject: [Spambayes] Easy task for Outlook
Message-ID: <LCEPIIGDJPKCOIHOBJEPAEHLHPAA.mhammond@skippinet.com.au>

Just in case someone is bored and would like to contribute to the Outlook
plugin, but doesn't think they can: well, do I have the job for you <wink>.

The plugin could do with some kind of "log file" strategy.  Currently all
"print" statements go to the win32traceutil package.  However, once we
package this up as a stand-alone DLL, this won't fly.

Therefore, what I would like is code that does something reasonable with a
log file, and redirects sys.stdout to this file.  "Something reasonable"
means somehow deleting old log messages.  I don't care if we keep a different
log per day, automatically deleting old ones, or use any other strategy that
is both reasonable and not overkill.

A new module would be perfect - then addin.py can import and use it when it
detects it is frozen.
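
One possible shape for such a module, as a rough sketch only: it keeps one
log file per day, deletes files older than a cutoff, and points sys.stdout
at today's file.  It is written for current Python rather than the Python
of this thread, and the function names, the "spambayes" file prefix and the
seven-day default are all hypothetical choices, not anything from the
plugin.

```python
import glob
import os
import sys
import time


def open_daily_log(log_dir, prefix="spambayes", keep_days=7, now=None):
    """Open today's log file for appending, first deleting any
    prefix-*.log files in log_dir not modified within keep_days days."""
    if now is None:
        now = time.time()
    os.makedirs(log_dir, exist_ok=True)
    cutoff = now - keep_days * 86400
    for path in glob.glob(os.path.join(log_dir, prefix + "-*.log")):
        try:
            if os.path.getmtime(path) < cutoff:
                os.remove(path)
        except OSError:
            pass  # a log we cannot stat or delete is not worth failing over
    stamp = time.strftime("%Y-%m-%d", time.localtime(now))
    name = "%s-%s.log" % (prefix, stamp)
    # buffering=1 gives line buffering, so messages reach disk promptly.
    return open(os.path.join(log_dir, name), "a", buffering=1)


def redirect_stdout(log_dir, **kwargs):
    """Send subsequent print output to today's log file; returns the file."""
    log = open_daily_log(log_dir, **kwargs)
    sys.stdout = log
    return log
```

The caller would presumably invoke redirect_stdout() only when running
frozen; how addin.py detects that (for example by checking for a
sys.frozen attribute) is an assumption on my part.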

No big deal, and not a huge job: it should be fairly simple for anyone
with reasonable Python skills, and it needs no Windows or Outlook skills.

Mark.