From sanjaydarisi at cox.net  Sat Nov  1 05:13:08 2003
From: sanjaydarisi at cox.net (sanjaydarisi@cox.net)
Date: Sat Nov  1 05:13:17 2003
Subject: [spambayes-dev] Question about binary installer!
Message-ID: <20031101101308.EFFQ24944.fed1mtao01.cox.net@smtp.west.cox.net>


I am curious about how the spambayes Outlook addin binary installer is built. Is py2exe used, or the McMillan installer? I tried both and got errors. With py2exe, I used the setup_all.py file in the windows/py2exe dir and got a few errors about py2exe options such as 'exclude-dll'. I have py2exe 0.4.2 and Python 2.3.
Then I went back to the McMillan installer. I used the Outlook2000\installer\spambayes_addin.py|spec|iss files and was able to get a .exe file at the end. It installs, but spambayes does not come up when I start Outlook. Are these the same files that are used to build the spambayes Outlook addin binary installer? I have a few COM errors in the log file complaining about the lack of a resource section for the image files, and an assertion error saying 'Should not yet have a toolbar'. Is there anything else that I have to do in addition to using those files? Could anyone let me know how the binary installer of the spambayes Outlook addin is built, and share the script used in this process?

I'd really appreciate your help.

Thank you,
Sanjay.


From theller at python.net  Sat Nov  1 06:44:19 2003
From: theller at python.net (Thomas Heller)
Date: Sat Nov  1 06:44:20 2003
Subject: [spambayes-dev] Re: Question about binary installer!
References: <20031101101308.EFFQ24944.fed1mtao01.cox.net@smtp.west.cox.net>
Message-ID: <65i4uxd8.fsf@python.net>

<sanjaydarisi@cox.net> writes:

> I am curious on how to make the spambayes outlook addin binary
> installer. Is py2exe used or McMillan installer? 'cos I tried with
> both and got errors. Like I tried using py2exe with setup_all.py file
> in the windows/py2exe dir and got a few errors regarding the options
> of py2exe like 'exclude-dll' etc. I have py2exe 0.4.2 and python 2.3

If you try py2exe, you should use the 0.5.0a prerelease in the files
section.  You need win32all build 161, however.

Thomas


From skip at pobox.com  Mon Nov  3 12:54:01 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov  3 12:54:10 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
Message-ID: <16294.38457.868547.794422@montanaro.dyndns.org>

We know some problems arise if grossly different numbers of ham or spam
exist in the training databases.  I wonder if there might be problems within
datasets if different numbers of particular hams or spams have been used in
the training.

That's probably not worded well.  Let me demonstrate with a concrete
example.  Suppose I've trained on exactly 1000 ham and 1000 spam, just to
eliminate that source of problems.  Within the 1000 hams, suppose I've
trained on 800 python messages, 100 messages about cars and 100 messages
about pop psychology.  We know that if I get a message about a subject which
I've never trained on before (say, woodworking) that there are likely to be
topic-specific clues I've never seen which won't contribute to scoring the
message as ham ("router", "lathe", "sawdust", ...).

Questions:

    * How many woodworking messages will I need to train as ham to get the
      system to properly recognize those messages as ham?  Would that large
      glut of python-related messages hamper the ability of the classifier
      to detect woodworking messages as ham?

    * Similarly, would the 8:1 ratio of python messages to messages about
      cars or pop psychology have an effect on scoring any of those messages
      accurately?

Skip


From tdickenson at geminidataloggers.com  Mon Nov  3 13:15:21 2003
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Mon Nov  3 13:15:25 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <16294.38457.868547.794422@montanaro.dyndns.org>
References: <16294.38457.868547.794422@montanaro.dyndns.org>
Message-ID: <200311031815.21719.tdickenson@geminidataloggers.com>

On Monday 03 November 2003 17:54, Skip Montanaro wrote:
> We know some problems arise if grossly different numbers of ham or spam
> exist in the training databases.  I wonder if there might be problems
> within datasets if different numbers of particular hams or spams have been
> used in the training.

Don't scare the new users with talk of problems.....

I train using *everything* in my kmail folders. That is 1 part spam, 4 parts 
python mailing lists, 6 parts other lists, 1 part personal email, and 4 parts 
automated log messages. No perceptible problems so far.

-- 
Toby Dickenson


From kennypitt at hotmail.com  Mon Nov  3 13:17:16 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Nov  3 13:18:18 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <16294.38457.868547.794422@montanaro.dyndns.org>
Message-ID: <E1AGjHM-0007ME-5X@mail.python.org>

Skip Montanaro wrote:
> Suppose I've trained on exactly 1000 ham and 1000 spam,
> just to eliminate that source of problems.  Within the 1000 hams,
> suppose I've trained on 800 python messages, 100 messages about cars
> and 100 messages about pop psychology.  We know that if I get a
> message about a subject which I've never trained on before (say,
> woodworking) that there are likely to be topic-specific clues I've
> never seen which won't contribute to scoring the message as ham
> ("router", "lathe", "sawdust", ...). 
> 
> Questions:
> 
>     * How many woodworking messages will I need to train as ham to
>       get the system to properly recognize those messages as ham? 
>       Would that large glut of python-related messages hamper the
>       ability of the classifier to detect woodworking messages as ham?

I would think one would be sufficient, assuming of course that none of
the words in your woodworking message already appear in your *spam*
training.  SpamBayes only considers tokens that are *in* the message
being classified, not tokens that are *not in* the message.  So,
regardless of how many times a token has appeared in the python
messages, it will not even be considered in the scoring if it does not
appear in the woodworking message.  On the other hand, if that token
*does* appear in the woodworking message then it will be solidly scored
as ham and therefore increase the probability of the message being
correctly classified.

>     * Similarly, would the 8:1 ratio of python messages to messages
>       about cars or pop psychology have an effect on scoring any of
>       those messages accurately?

I wouldn't think so.  Since all of these messages are considered ham,
the tokens from the python messages would at best reinforce the
*correct* classification of the other messages, and at worst would
contribute nothing one way or the other to the scoring.

Just my thoughts, totally unproven scientifically.

-- 
Kenny Pitt


From skip at pobox.com  Mon Nov  3 13:42:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov  3 13:42:47 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <200311031815.21719.tdickenson@geminidataloggers.com>
References: <16294.38457.868547.794422@montanaro.dyndns.org>
	<200311031815.21719.tdickenson@geminidataloggers.com>
Message-ID: <16294.41366.356102.723324@montanaro.dyndns.org>

    >> I wonder if there might be problems within datasets if different
    >> numbers of particular hams or spams have been used in the training.

    Toby> Don't scare the new users with talk of problems.....

Any new user who subscribes to spambayes-dev deserves to get scared every
once in a while.

Skip


From skip at pobox.com  Mon Nov  3 14:09:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov  3 14:09:20 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <20031103181814.F241FF58F3@orb.pobox.com>
References: <16294.38457.868547.794422@montanaro.dyndns.org>
	<20031103181814.F241FF58F3@orb.pobox.com>
Message-ID: <16294.42960.302363.849243@montanaro.dyndns.org>


    >> * How many woodworking messages will I need to train as ham to get
    >>   the system to properly recognize those messages as ham?  Would that
    >>   large glut of python-related messages hamper the ability of the
    >>   classifier to detect woodworking messages as ham?

    Kenny> I would think one would be sufficient, assuming of course that
    Kenny> none of the words in your woodworking message already appear in
    Kenny> your *spam* training.  SpamBayes only considers tokens that are
    Kenny> *in* the message being classified, not tokens that are *not in*
    Kenny> the message.  So, regardless of how many times a token has
    Kenny> appeared in the python messages, it will not even be considered
    Kenny> in the scoring if it does not appear in the woodworking message.
    Kenny> On the other hand, if that token *does* appear in the woodworking
    Kenny> message then it will be solidly scored as ham and therefore
    Kenny> increase the probability of the message being correctly
    Kenny> classified.

Let me rephrase the question again.  There's a discussion in Gary Robinson's
LJ article

    http://www.linuxjournal.com/article.php?sid=6467

about dealing with rare words which I didn't really follow.  If I've trained
on 1000 other ham messages and now encounter a woodworking message, some of
the words in there are likely to have not been seen before ("lathe", for
example).  Such words obviously can't contribute to scoring that message.
Let's assume I then train that message as ham.  "lathe" now has a hamcount
of 1 and a spamcount of 0.  It is a "rare word".  How many more messages
which contain "lathe" do I have to train on before it is no longer "rare"?
In particular, by training on 1000 other hams which don't contain that word,
have I somehow created an artificial barrier to getting woodworking-specific
words to have full effect as ham indicators?

If there is a problem, it might be fairly easy to fall into a trap which is
a bit difficult to get out of.  Suppose I'm starting from scratch and I know
I have several mailboxes:

    * python - 800 messages
    * cars - 100 messages
    * pop-psychology - 100 messages
    * spam - 1000 messages

As a new user, it might be very easy for me to ask SB to train on all messages
in the first three mailboxes as ham and all in the fourth as spam, thus
creating a problem (if one exists).  *If* such a problem exists (and it very
well may not), it might be better if I could tell the system to pick a
random sample of each of my collections such that the relative number of
hams and spams is about equal and so that the imbalance between mailboxes
classified as ham or spam is not too great either.
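
The sampling idea above can be sketched roughly as follows.  This is only
an illustration: balanced_sample, per_box_cap and the mailbox dicts are
made-up names, not anything SpamBayes provides, and messages are stood in
for by plain strings.

```python
import random

def balanced_sample(ham_boxes, spam_boxes, per_box_cap, seed=0):
    """Pick a random training subset (hypothetical helper, not SpamBayes API).

    ham_boxes / spam_boxes map a mailbox name to a list of messages.
    At most per_box_cap messages are drawn from each ham mailbox, then
    an equal number of spam is drawn from the pooled spam mailboxes.
    """
    rng = random.Random(seed)
    ham = []
    for name in sorted(ham_boxes):
        msgs = ham_boxes[name]
        ham.extend(rng.sample(msgs, min(per_box_cap, len(msgs))))
    spam_pool = [m for name in sorted(spam_boxes) for m in spam_boxes[name]]
    spam = rng.sample(spam_pool, min(len(ham), len(spam_pool)))
    return ham, spam

# The mailboxes from the example, with strings standing in for messages.
ham_boxes = {
    "python": ["python-%d" % i for i in range(800)],
    "cars": ["cars-%d" % i for i in range(100)],
    "pop-psychology": ["pop-%d" % i for i in range(100)],
}
spam_boxes = {"spam": ["spam-%d" % i for i in range(1000)]}

ham, spam = balanced_sample(ham_boxes, spam_boxes, per_box_cap=100)
print(len(ham), len(spam))
```

Capping each ham mailbox at the size of the smallest one evens out the
8:1 ratio; whether that helps at all is exactly the open question.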

Skip

From popiel at wolfskeep.com  Mon Nov  3 15:14:45 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov  3 15:14:50 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Mon,
	03 Nov 2003 13:09:04 CST."
	<16294.42960.302363.849243@montanaro.dyndns.org> 
References: <16294.38457.868547.794422@montanaro.dyndns.org>
	<20031103181814.F241FF58F3@orb.pobox.com>
	<16294.42960.302363.849243@montanaro.dyndns.org> 
Message-ID: <20031103201445.D5C852DF59@cashew.wolfskeep.com>

In message:  <16294.42960.302363.849243@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>Let me rephrase the question again.  There's a discussion in Gary Robinson's
>LJ article
>
>    http://www.linuxjournal.com/article.php?sid=6467
>
>about dealing with rare words which I didn't really follow.

It's talking about the math behind unknown_word_strength and
unknown_word_prob.

>If I've trained
>on 1000 other ham messages and now encounter a woodworking message, some of
>the words in there are likely to have not been seen before ("lathe", for
>example).  Such words obviously can't contribute to scoring that message.
>Let's assume I then train that message as ham.  "lathe" now has a hamcount
>of 1 and a spamcount of 0.  It is a "rare word".  How many more messages
>which contain "lathe" do I have to train on before it is no longer "rare".

A word is not "rare" or "not rare" according to the classifier... it's
not just a binary switch.  All words have their probabilities adjusted
towards unknown_word_prob by an amount determined by unknown_word_strength
and the number of trained messages in which the word has appeared.  The
more often the word has been seen (and trained), the smaller the adjustment.

The only way this could be a binary switch would be if the unknown word
adjustments were strong enough to pull the probability for a word inside
the .4-.6 range (assuming default settings) that the classifier outright
ignores... but the default settings for unknown_word_* aren't that strong.
I seem to recall that the hapax values (from only a single instance trained)
are around .31 and .69 for ham and spam respectively.

>In particular, by training on 1000 other hams which don't contain that word,
>have I somehow created an artificial barrier to getting woodworking-specific
>words to have full effect as ham indicators?

No.  Training on other mail which does not contain the word does not
affect the score for a word at all (unless you have the experimental
ham/spam imbalance adjustment enabled and it's actually doing something...
and you specifically engineered your question to make the imbalance
adjustment moot).

>If there is a problem, it might be fairly easy to fall into a trap which is
>a bit difficult to get out of.

Lucky for us, there is no problem here. ;-)

- Alex

From kennypitt at hotmail.com  Mon Nov  3 15:23:14 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Nov  3 15:23:35 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <16294.42960.302363.849243@montanaro.dyndns.org>
Message-ID: <Law11-OE54x69iGiGux00012559@hotmail.com>

Skip Montanaro wrote:
> Let me rephrase the question again.  There's a discussion in Gary
> Robinson's LJ article
> 
>     http://www.linuxjournal.com/article.php?sid=6467
> 
> about dealing with rare words which I didn't really follow.  If I've
> trained on 1000 other ham messages and now encounter a woodworking
> message, some of the words in there are likely to have not been seen
> before ("lathe", for example).  Such words obviously can't contribute
> to scoring that message. Let's assume I then train that message as
> ham.  "lathe" now has a hamcount of 1 and a spamcount of 0.  It is a
> "rare word".  How many more messages which contain "lathe" do I have
> to train on before it is no longer "rare". In particular, by training
> on 1000 other hams which don't contain that word, have I somehow
> created an artificial barrier to getting woodworking-specific words
> to have full effect as ham indicators? 

OK, I see where you're coming from.  I answered a related (albeit much
simpler <wink>) question for someone on the Spambayes list not long ago.

The "rare word" adjustment is a way of adjusting the contributed
probability for words that haven't been seen very often.  In your
example of "lathe" with ham=1 and spam=0, the straight probability of
spam [spam / (spam + ham)] would be 0.0, but one occurrence doesn't make
it the most reliable indicator.  SpamBayes adjusts this using the
"unknown_word_strength" (s in the Robinson article) and
"unknown_word_prob" (x in the article) options.  You can see the
adjustment calculation in the probability() function in classifier.py.

The defaults for these options in Options.py are s=0.45 and x=0.5.  Using
these defaults with the case of 1 ham and no spam, the actual
probability contributed to the chi2 combining is 0.155172.  As the total
number of occurrences of the token increases, the contributed
probability gets closer and closer to the straight probability.  So, for
ham=5 and spam=0, contributed probability is 0.041284; for ham=10 and
spam=0, contributed probability is 0.021531; and for ham=50 and spam=0,
contributed probability is 0.004460.  As you can see, the probability
moves back toward the straight probability fairly quickly.

The important thing to note with respect to your original concerns,
though, is that this "rare" word calculation is entirely independent of
any other tokens in the training data.  The calculation involves the
original straight probability, the fixed factors of s and x, and the
total number of occurrences of that token in both ham and spam.  There
is no fixed cutoff that says a word is no longer rare, but neither does
the definition of rare depend on the relative numbers compared to any
other token in the training data.
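
A minimal sketch of the adjustment as described above, for anyone who
wants to reproduce the numbers.  It is not the code from classifier.py:
the function name is made up, and it works from raw counts, which is
only equivalent to the straight probability as written above.

```python
def adjusted_spamprob(hamcount, spamcount, s=0.45, x=0.5):
    """Pull the straight spam probability toward x.

    s plays the role of unknown_word_strength and x of unknown_word_prob;
    n is the total number of trained messages containing the token.
    """
    n = hamcount + spamcount
    if n == 0:
        return x  # never-seen token: nothing but the prior
    straight = spamcount / float(n)
    return (s * x + n * straight) / (s + n)

for ham in (1, 5, 10, 50):
    print("ham=%d spam=0 -> %.6f" % (ham, adjusted_spamprob(ham, 0)))
# ham=1 spam=0 -> 0.155172
# ham=5 spam=0 -> 0.041284
# ham=10 spam=0 -> 0.021531
# ham=50 spam=0 -> 0.004460
```

Nothing in the function refers to any other token or to the overall
size of the training set, which is the point made above.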

-- 
Kenny Pitt

From skip at pobox.com  Mon Nov  3 15:47:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov  3 15:47:24 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: <20031103201445.D5C852DF59@cashew.wolfskeep.com>
References: <16294.38457.868547.794422@montanaro.dyndns.org>
	<20031103181814.F241FF58F3@orb.pobox.com>
	<16294.42960.302363.849243@montanaro.dyndns.org>
	<20031103201445.D5C852DF59@cashew.wolfskeep.com>
Message-ID: <16294.48840.459042.449590@montanaro.dyndns.org>


    >> Let me rephrase the question again.  There's a discussion in Gary
    >> Robinson's LJ article
    >> 
    >> http://www.linuxjournal.com/article.php?sid=6467
    >> 
    >> about dealing with rare words which I didn't really follow.

    alex> It's talking about the math behind unknown_word_strength and
    alex> unknown_word_prob.

    >> If I've trained on 1000 other ham messages and now encounter a
    >> woodworking message, some of the words in there are likely to have
    >> not been seen before ("lathe", for example).  Such words obviously
    >> can't contribute to scoring that message.  Let's assume I then train
    >> that message as ham.  "lathe" now has a hamcount of 1 and a spamcount
    >> of 0.  It is a "rare word".  How many more messages which contain
    >> "lathe" do I have to train on before it is no longer "rare".

    alex> A word is not "rare" or "not rare" according to the
    alex> classifier...

I understand that it's not a binary thing.  I used that term because Gary
used it in his article.

I seem to be having trouble making my ideas understood today...  Was my
exposition that vague?

    >> If there is a problem, it might be fairly easy to fall into a trap
    >> which is a bit difficult to get out of.

    alex> Lucky for us, there is no problem here. ;-)

That's all I was asking.

Skip

From skip at pobox.com  Mon Nov  3 16:08:03 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov  3 16:09:20 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets?
In-Reply-To: <Law11-OE54x69iGiGux00012559@hotmail.com>
References: <16294.42960.302363.849243@montanaro.dyndns.org>
	<Law11-OE54x69iGiGux00012559@hotmail.com>
Message-ID: <16294.50099.106403.122008@montanaro.dyndns.org>


    Kenny> The important thing to note with respect to your original
    Kenny> concerns, though, is that this "rare" word calculation is
    Kenny> entirely independent of any other tokens in the training data.
    Kenny> The calculation involves the original straight probability, the
    Kenny> fixed factors of s and x, and the total number of occurrences of
    Kenny> that token in both ham and spam.  There is no fixed cutoff that
    Kenny> says a word is no longer rare, but neither does the definition of
    Kenny> rare depend on the relative numbers compared to any other token
    Kenny> in the training data.

Thanks, this is what I was getting at.

Skip

From tim.one at comcast.net  Mon Nov  3 16:49:46 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov  3 16:49:55 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: <20031103201445.D5C852DF59@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net>

[T. Alexander Popiel]
> No.  Training on other mail which does not contain the word does not
> affect the score for a word at all ...

It's a bit curious that this is true only so long as the word has appeared
in only one kind of training data (only in spam, or only in ham).  As soon
as a word appears in at least one of each, training on msgs that don't
contain the word can change the word's score.

Example:  suppose we've trained on 100 ham and 100 spam, and "lathe"
appeared in exactly one ham.  Its by-counting spamprob is then

>>> h = 1./100
>>> s = 0./100
>>> s/(h+s)
0.0
>>>

So long as we never see "lathe" in spam, s's numerator is 0 no matter how
many additional ham and spam we train on, so s is 0, so the by-counting
spamprob remains 0/(h+0) = 0.

Change the example so we've seen "lathe" in one ham and one spam:

>>> h = 1./100
>>> s = 1./100
>>> s/(h+s)
0.5
>>>

The by-counting spamprob is then 0.5, which makes fine intuitive sense.  Now
suppose we train on 100 more ham, and don't see "lathe" again:

>>> h = 1./200
>>> s = 1./100
>>> s/(h+s)
0.66666666666666674
>>>

Now "lathe" seems spammy!  It should, since we've seen it in a greater
percentage of spam than ham.  I'm not sure we've got the best guess to 17
significant digits, though <wink>.  Make the imbalance wilder and the
by-counting spamprob gets wilder too:

>>> h = 1./20000
>>> s = 1./100
>>> s/(h+s)
0.99502487562189057
>>>

That offends my intuition -- the word is so rare (2 of 20100 msgs) that it's
hard to believe that 99.5% is a sane guess.  The Bayesian adjustment knocks
it down a lot based on how few times it's been seen in total:

>>> (.45*.5 + 2.0*_)/(.45 + 2.0)
0.90410193928317584
>>>

But that still seems like a high guess to me.  The experimental ham/spam
imbalance option knocked it down a lot more.  Unfortunately, that also moved
spamprobs a lot closer to 0.5 for words that appeared lots of times in the
over-represented category, and that made it a Bad Idea overall.

It's tempting to ignore words that haven't appeared in at least N messages
total (for some N).  Alas, Graham's original algorithm had a gimmick like
that, and testing said it worked better not to have such a cutoff.  And for
the mistake-based training many of us have fallen into, scoring hapaxes is
very important.

So we can't ignore rare words -- but in the presence of strong imbalance, I
think we're still missing a trick.
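
The sessions above can be folded into one small function.  This is a
sketch under assumptions, not the classifier's actual code: the names
are made up, and the defaults 0.45 and 0.5 are simply the constants
used in the adjustment line above.

```python
def ratio_spamprob(hamcount, spamcount, nham, nspam):
    """By-counting spamprob from per-category ratios, as in the sessions."""
    h = hamcount / float(nham)
    s = spamcount / float(nspam)
    if h + s == 0:
        return 0.5  # token never seen; assumed neutral for this sketch
    return s / (h + s)

def adjusted(hamcount, spamcount, nham, nspam, strength=0.45, prior=0.5):
    """Apply the Bayesian adjustment for how few times the token was seen."""
    n = hamcount + spamcount
    p = ratio_spamprob(hamcount, spamcount, nham, nspam)
    return (strength * prior + n * p) / (strength + n)

# Seen only in ham: extra ham that lacks the word changes nothing.
print(ratio_spamprob(1, 0, 100, 100), ratio_spamprob(1, 0, 20000, 100))

# Seen once in each: extra ham that lacks the word makes it look spammy.
for nham in (100, 200, 20000):
    print(nham, ratio_spamprob(1, 1, nham, 100), adjusted(1, 1, nham, 100))
```

The last row reproduces the 0.995 and 0.904 figures from the sessions.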


From popiel at wolfskeep.com  Mon Nov  3 17:07:57 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov  3 17:08:00 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Mon,
	03 Nov 2003 16:49:46 EST."
	<LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net> 
Message-ID: <20031103220757.323B72DF59@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> No.  Training on other mail which does not contain the word does not
>> affect the score for a word at all ...
>
>It's a bit curious that this is true only so long as the word has appeared
>in only one kind of training data (only in spam, or only in ham).  As soon
>as a word appears in at least one of each, training on msgs that don't
>contain the word can change the word's score.

Yarg.  I stand corrected.

Perhaps it's time to test a variation where the prob is based on
hamcount and spamcount instead of hamratio and spamratio.  Hrm.
*tap, tap, tap*  I'll be back in a few hours...

- Alex

From tim.one at comcast.net  Mon Nov  3 18:43:32 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov  3 18:43:40 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: <20031103220757.323B72DF59@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKIGOAB.tim.one@comcast.net>

>> [T. Alexander Popiel]
>>> No.  Training on other mail which does not contain the word does not
>>> affect the score for a word at all ...

[Tim]
>> It's a bit curious that this is true only so long as the word has
>> appeared in only one kind of training data (only in spam, or only in
>> ham).  As soon as a word appears in at least one of each, training
>> on msgs that don't contain the word can change the word's score.
>> ...

[Alex]
> Yarg.  I stand corrected.
>
> Perhaps it's time to test a variation where the prob is based on
> hamcount and spamcount instead of hamratio and spamratio.  Hrm.
> *tap, tap, tap*  I'll be back in a few hours...

Well, they're all the same if the # of training ham == the # of training
spam.  Computing spamprobs based on ratios is a first attempt at surviving
in the face of unbalanced training data.  For example, if a token appeared
in 99 of 100 spam, and 100 of 10,000 ham, a spamprob of about 0.5 (99/(99+100))
doesn't make intuitive sense.  In effect, computing based on ratios (s/(s+h)
where s = 99/100 and h=100/10000) answers what would happen *if* we had
trained on equal numbers of each, while keeping the percentages of ham and
spam containing the token fixed.  In the example, if 99 of 100 spam
contained a given token, then our best guess is that, if we had seen 10,000
spam instead, we would have seen the token in 9,900 of those.  Then
9900/(9900+100) gives the same result as the current s/(s+h).

IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount"
gives if we extrapolate our actual training data to what it would be if it
were balanced.  If it's already balanced, the computed spamprob is the same
whether computed by raw count or by ratio.  So if you try raw count, the
only interesting tests would be on unbalanced training data.
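
A small sketch of the comparison, using the numbers from the example
above.  The function names are invented for illustration and are not
taken from the SpamBayes source.

```python
def spamprob_by_count(spamcount, hamcount):
    """Raw-count variant: ignores how many messages were trained."""
    return spamcount / float(spamcount + hamcount)

def spamprob_by_ratio(spamcount, nspam, hamcount, nham):
    """Ratio variant: s/(s+h) with s and h as per-category fractions."""
    s = spamcount / float(nspam)
    h = hamcount / float(nham)
    return s / (s + h)

# Token seen in 99 of 100 spam and 100 of 10,000 ham.
by_count = spamprob_by_count(99, 100)
by_ratio = spamprob_by_ratio(99, 100, 100, 10000)

# Extrapolate the spam side to 10,000 messages (9,900 containing the
# token) and the raw-count answer matches the ratio answer.
extrapolated = spamprob_by_count(9900, 100)

print("by count:     %.4f" % by_count)
print("by ratio:     %.4f" % by_ratio)
print("extrapolated: %.4f" % extrapolated)
```

With balanced training data (say 99 of 100 spam and 1 of 100 ham) the
two functions return the same value, so only unbalanced data can tell
them apart.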


From popiel at wolfskeep.com  Mon Nov  3 19:34:58 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov  3 19:35:01 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Mon,
	03 Nov 2003 18:43:32 EST."
	<LNBBLJKPBEHFEDALKOLCGEKIGOAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCGEKIGOAB.tim.one@comcast.net> 
Message-ID: <20031104003458.142372DF59@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCGEKIGOAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>>
>> Perhaps it's time to test a variation where the prob is based on
>> hamcount and spamcount instead of hamratio and spamratio.  Hrm.
>> *tap, tap, tap*  I'll be back in a few hours...
>
>Well, they're all the same if the # of training ham == the # of training
>spam.  Computing spamprobs based on ratios is a first attempt at surviving
>in the face of unbalanced training data.

Hrm, yes.  I'm obviously not thinking all that well today.
This leads me to thoughts where the elements of the
probability are scaled nonlinearly by the ham/spam imbalance
before combining them into the prob, instead of scaling the
perceived number of messages (and thus effectively scaling
unknown_word_strength) afterward...

Time to cogitate on which continuous asymptotic functions
might be effective at this.

>IOW, s/(s+h) gives the result that "prob is based on hamcount and spamcount"
>gives if we extrapolate our actual training data to what it would be if it
>were balanced.  If it's already balanced, the computed spamprob is the same
>whether computed by raw count or by ratio.  So if you try raw count, the
>only interesting tests would be on unbalanced training data.

I'm currently testing against my RL data, which is between
60% and 70% spam overall (rising to about 90% spam in recent
weeks).

- Alex

From anthony at interlink.com.au  Tue Nov  4 06:27:26 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Nov  4 06:31:15 2003
Subject: [spambayes-dev] 1.0a7
Message-ID: <200311041127.hA4BRQ73005475@localhost.localdomain>


Well, it's the end of a very long weekend (I love living in a place
that gives us a public holiday for a horse race :-) and the 1.0a7 
release is available for now from 

http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz

Can people please check it out and make sure it's sane? I'm trying
to follow the process for getting a Windows checkout of the code to
get Windows-style line-endings on the file (to paraphrase Dilbert:
"Here's a nickel kid. Go buy a real line-ending") but WinCVS is 
being a snarky little sod. A zipfile will end up at the same place
if/when I get it to play nice. The release is tagged, so if someone
who's suffered windows long enough to get a working CVS checkout
wants to make the zip, this would be fine. Alternately, the windows
users can deal with correct line endings <wink>.

I'll push the release itself out first thing tomorrow if it looks 
good.

I've made a bunch of changes to the README.txt to remove old application
names from it - README-DEVEL.txt is, however, woefully out of date.
I've put a note in it mentioning this.

Anthony


From anthony at interlink.com.au  Tue Nov  4 06:42:15 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Nov  4 06:46:00 2003
Subject: [spambayes-dev] 1.0a7 
In-Reply-To: <200311041127.hA4BRQ73005475@localhost.localdomain> 
Message-ID: <200311041142.hA4BgGIK005815@localhost.localdomain>


>>> Anthony Baxter wrote
> http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz

There's also 
http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.zip
now.

I gave up on trying to make sense of the insanity that is Windows,
and just found the magic zip option to mangle line endings. A bit
of 'find' magic, and ta-da, only the .txt files are mangled, the
JPGs &c are still ok. I think <wink>
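
For anyone without the zip and find tools handy, roughly the same
effect can be sketched with Python's zipfile module.  This is an
illustration only, not the commands used for the release; the function
name and the choice of ".txt" as the only text extension are
assumptions.

```python
import os
import zipfile

def zip_with_crlf_text(src_dir, zip_path, text_exts=(".txt",)):
    """Zip src_dir, giving text files CRLF line endings.

    Files whose names end in one of text_exts are rewritten with CRLF;
    everything else (JPGs and so on) is stored byte for byte.
    zip_path should live outside src_dir.
    """
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in sorted(files):
                path = os.path.join(root, name)
                arcname = os.path.relpath(path, src_dir)
                with open(path, "rb") as f:
                    data = f.read()
                if name.lower().endswith(text_exts):
                    # Normalise first so existing CRLFs are not doubled.
                    data = data.replace(b"\r\n", b"\n")
                    data = data.replace(b"\n", b"\r\n")
                zf.writestr(arcname, data)
```

Usage would be something like
zip_with_crlf_text("spambayes-1.0a7", "spambayes-1.0a7.zip").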

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.

From theller at python.net  Tue Nov  4 07:07:46 2003
From: theller at python.net (Thomas Heller)
Date: Tue Nov  4 07:07:53 2003
Subject: [spambayes-dev] Re: 1.0a7
References: <200311041127.hA4BRQ73005475@localhost.localdomain>
	<200311041142.hA4BgGIK005815@localhost.localdomain>
Message-ID: <ad7c9w19.fsf@python.net>

Anthony Baxter <anthony@interlink.com.au> writes:

>>>> Anthony Baxter wrote
>> http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.tar.gz
>
> There's also 
> http://www.interlink.com.au/anthony/tmp/spambayes-1.0a7.zip
> now.
>
> I gave up on trying to make sense of the insanity that is Windows,
> and just found the magic zip option to mangle line endings. A bit
> of 'find' magic, and ta-da, only the .txt files are mangled, the
> JPGs &c are still ok. I think <wink>

I use command line cvs on Windows, and never have problems with line
endings.

Unless some crazy guy commits Windows line ending files with cygwin
tools.

Thomas


From anthony at interlink.com.au  Tue Nov  4 08:00:36 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Nov  4 08:04:43 2003
Subject: [spambayes-dev] Re: 1.0a7 
In-Reply-To: <ad7c9w19.fsf@python.net> 
Message-ID: <200311041300.hA4D0a5p007315@localhost.localdomain>


>>> Thomas Heller wrote
> I use command line cvs on Windows, and never have problems with line
> endings.
> Unless some crazy guy commits Windows line ending files with cygwin
> tools.

The idea here is to get a checked out spambayes with the txt files 
with windows line endings. After going down the path of WinCVS, 
TortoiseCVS, Putty, 14 different implementations of ssh for windows
each with different functionality, I gave up and just found the magic
zip flag on Unix. It would be nice to know a way to get cvs working on 
windows with ssh auth - it looks (to me) like the only way is to install 
most of cygwin, which seems to defeat the purpose <wink>

Anthony
-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From theller at python.net  Tue Nov  4 08:49:48 2003
From: theller at python.net (Thomas Heller)
Date: Tue Nov  4 08:49:58 2003
Subject: [spambayes-dev] Re: 1.0a7
References: <ad7c9w19.fsf@python.net>
	<200311041300.hA4D0a5p007315@localhost.localdomain>
Message-ID: <4qxk9rb7.fsf@python.net>

Anthony Baxter <anthony@interlink.com.au> writes:

>>>> Thomas Heller wrote
>> I use command line cvs on Windows, and never have problems with line
>> endings.
>> Unless some crazy guy commits Windows line ending files with cygwin
>> tools.
>
> The idea here is to get a checked out spambayes with the txt files 
> with windows line endings. After going down the path of WinCVS, 
> TortoiseCVS, Putty, 14 different implementations of ssh for windows
> each with different functionality, I gave up and just found the magic
> zip flag on Unix. It would be nice to know a way to get cvs working on 
> windows with ssh auth - it looks (to me) like the only way is to install 
> most of cygwin, which seems to defeat the purpose <wink>

I followed the instructions here <http://www.python.org/dev/winssh.txt>
a looong time ago, and it still works.

This is also linked from here:

<http://www.python.org/dev/devfaq.html#development-on-windows>

An alternative would be to use anon cvs, which doesn't require ssh auth.

Thomas


From anthony at interlink.com.au  Tue Nov  4 09:19:40 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Tue Nov  4 09:23:29 2003
Subject: [spambayes-dev] Re: 1.0a7 
In-Reply-To: <4qxk9rb7.fsf@python.net> 
Message-ID: <200311041419.hA4EJemW016562@localhost.localdomain>


>>> Thomas Heller wrote
> I followed the instructions here <http://www.python.org/dev/winssh.txt>
> a looong time ago, and it still works.

Aha. I was foolishly reading the docs on the wincvs &c websites. Thanks!

> An alternative would be to use anon cvs, which doesn't require ssh auth.

But for SF's 24-hour delay in anon cvs, that keeps it more-or-less 
useless.


-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From kennypitt at hotmail.com  Tue Nov  4 09:35:26 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov  4 09:35:54 2003
Subject: [spambayes-dev] Re: 1.0a7 
In-Reply-To: <200311041419.hA4EJemW016562@localhost.localdomain>
Message-ID: <E1AH2Hg-0006ur-JF@mail.python.org>

Anthony Baxter wrote:
> But for SF's 24-hour delay in anon cvs, that keeps it more-or-less
> useless.

Seems to be much less than 24 hours now, although still not perfect.  As
of 9:30am EST, I had pulled all your release_1_0_a7 tagged changes up
through MANIFEST.in 1.7.2.2.  The only update I haven't seen yet is the
1.4.2.3 update to README-DEVEL.txt.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Tue Nov  4 11:15:18 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov  4 11:15:48 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net>
Message-ID: <E1AH3qM-0007fE-Oq@mail.python.org>

Tim Peters wrote:
> I'm not sure we've got the best guess
> to 17 significant digits, though <wink>.  Make the imbalance wilder
> and the by-counting spamprob gets wilder too:
> 
>>>> h = 1./20000
>>>> s = 1./100
>>>> s/(h+s)
> 0.99502487562189057
>>>> 
> 
> That offends my intuition -- the word is so rare (2 of 20100 msgs)
> that it's hard to believe that 99.5% is a sane guess.  The Bayesian
> adjustment knocks it down a lot based on how few times it's been seen
> in total: 
> 
>>>> (.45*.5 + 2.0*_)/(.45 + 2.0)
> 0.90410193928317584
>>>> 

Wow, that's interesting.  I had always considered words that were either
ham or spam, but never a little of both.  In a way it makes sense
because 1/20000 ham is so close to zero that the word should be
considered spammy.

This seems even more scary, though.  Compare your last example to the
case where the token has only been seen in 1 spam and no ham:

>>> h = 0./20000
>>> s = 1./100
>>> s/(h+s)
1.0
>>> (.45*.5 + 1.*_)/(.45 + 1.)
0.84482758620689669
>>>

The spam prob here is less than the case of 1 ham and 1 spam because of
the "rare word" adjustment.  So, if the token has only been seen once in
spam and is later seen once in ham, it gets spammier?  Yikes!  If we go
to h=10:

>>> h = 10./20000
>>> s = 1./100
>>> s/(h+s)
0.95238095238095233
>>> (.45*.5 + 11.*_)/(.45 + 11.)
0.93460178831357876
>>>

And the spam prob is still going up!  So whenever we have an extreme
imbalance like this, the first n occurrences of a token added to the
larger corpus, where n depends on the size of the imbalance, actually
cause the probability of the *opposite* classification to *increase*.
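
For anyone who wants to reproduce the numbers above, here is a minimal
sketch of the two steps used in the interactive sessions: the by-counting
estimate s/(h+s), followed by the Bayesian adjustment.  The function name
spamprob and its parameter names are made up for illustration and are not
taken from the SpamBayes classifier; the constants 0.45 and 0.5 are the
ones used in the sessions above.

```python
def spamprob(hamcount, spamcount, nham, nspam, s=0.45, x=0.5):
    """Adjusted spam probability for one token (illustrative sketch).

    hamcount/spamcount: messages of each kind containing the token.
    nham/nspam: total messages of each kind that were trained on.
    s and x: the constants from the sessions above (assumed to be the
    adjustment strength and the prior guess for an unseen token).
    """
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    # By-counting estimate, identical to s/(h+s) in the sessions above.
    prob = spamratio / (hamratio + spamratio)
    # Pull the estimate toward x; the pull weakens as n grows.
    n = hamcount + spamcount
    return (s * x + n * prob) / (s + n)


# One spam sighting out of 100 spam, against 20000 ham, as above.
for hamcount in (0, 1, 10, 100):
    print(hamcount, spamprob(hamcount, 1, 20000, 100))
```

With these inputs the adjusted value rises for the first few ham
sightings and only falls later: by 100 ham sightings it is back below
the zero-ham figure.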

-- 
Kenny Pitt


From richie at entrian.com  Tue Nov  4 14:08:43 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov  4 14:08:57 2003
Subject: [spambayes-dev] Re: 1.0a7 
In-Reply-To: <200311041300.hA4D0a5p007315@localhost.localdomain>
References: <ad7c9w19.fsf@python.net>
	<200311041300.hA4D0a5p007315@localhost.localdomain>
Message-ID: <r1ufqvskb9ojbu6ulb62214cg41dmat4ov@4ax.com>


[Anthony]
> The idea here is to get a checked out spambayes with the txt files 
> with windows line endings. After going down the path of WinCVS, 
> TortoiseCVS, Putty, 14 different implementations of ssh for windows
> each with different functionality, I gave up and just found the magic
> zip flag on Unix. It would be nice to know a way to get cvs working on 
> windows with ssh auth - it looks (to me) like the only way is to install 
> most of cygwin, which seems to defeat the purpose <wink>

SourceForge have excellent instructions on how to set up WinCVS and PuTTY
here: http://sourceforge.net/docman/display_doc.php?docid=766&group_id=1

(Thanks for doing the release, BTW - I'll give it a bash later on
tonight.)

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Tue Nov  4 17:32:33 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov  4 17:32:13 2003
Subject: [spambayes-dev] More CVS branch/tags questions
Message-ID: <02b601c3a323$8b250d70$0500a8c0@eden>

I saw Skip raise this last week, but I think he was asking different
questions.

My understanding is that we are moving towards 1.0 on the release_1_0
branch.  Is that correct?

If so, I'm a little confused by this :)  If we look at an edited log from
sb_server, we see (please see my comments/questions inline with "****", and
at the end):

RCS file: /cvsroot/spambayes/spambayes/scripts/sb_server.py,v
Working file: sb_server.py
head: 1.11
...
symbolic names:
        release_1_0_a7: 1.6.2.1
        outlook-1-0-fork: 1.11
        release_1_0: 1.6.0.2
        release_1_0_a6: 1.6

**** My reading of this is that this file was branched for 1.0 at 1.6.
Correct?

revision 1.11
date: 2003/10/07 00:36:30;  author: anadelonbrin;  state: Exp;  lines: +2 -3
Fix [ spambayes-Bugs-818871 ] sb_server.py calls undefined variable
----------------------------
revision 1.10
date: 2003/09/29 04:43:09;  author: anadelonbrin;  state: Exp;  lines: +27 -0
...
----------------------------
revision 1.9
date: 2003/09/25 00:10:31;  author: mhammond;  state: Exp;  lines: +99 -15
Patch [ 809008 ] safe start/stop and exlusive execution on windows
...
----------------------------
revision 1.8
date: 2003/09/24 05:28:53;  author: anadelonbrin;  state: Exp;  lines: +3 -1
This should fix [ spambayes-Bugs-809769 ] TypeError when training 1.0a6
----------------------------
revision 1.7
date: 2003/09/19 23:38:10;  author: anadelonbrin;  state: Exp;  lines: +5 -6
...
Add the various interface improvements discussed on spambayes-dev.  In
particular, an advanced 'find token' query is available, the 'find message'
query is improved, and the review messages page is more customisable.
----------------------------
...
----------------------------
revision 1.6.2.1
date: 2003/09/24 03:54:14;  author: anadelonbrin;  state: Exp;  lines: +4 -1
Stupid global variables!

Thanks to a global variable not being updated, when we recreated everything,
the userinterface kept using the old classifier.
Since we now behave and close that one, this caused all sorts of problems.
Get rid of the damn global variable, and correctly update it, and all is well
in the world again.

In addition, don't save an empty database.  I think we make assumptions
about the db being non-empty in some places.

This should fix [ spambayes-Bugs-809769 ] TypeError when training 1.0a6

(I can't believe it took so long for me to find this!)
=============================================================================


**** From my reading of this, the "1.0" release is missing a number of
significant patches - all 1.7->1.11 checkins appear to *not* be on the 1.0
release.
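
The reading of the revision numbers above can be checked mechanically.
In CVS, a branch tag such as release_1_0 is recorded as a "magic" branch
number with a 0 in the second-to-last position: 1.6.0.2 stands for branch
1.6.2, which sprouts from revision 1.6 and holds revisions 1.6.2.1,
1.6.2.2 and so on.  The sketch below encodes just that rule; the function
names are made up for illustration.

```python
def parse(rev):
    """Split a dotted revision string into a tuple of ints."""
    return tuple(int(part) for part in rev.split("."))


def branch_of_magic(magic):
    """Turn a magic branch number like '1.6.0.2' into branch (1, 6, 2)."""
    parts = parse(magic)
    if len(parts) < 4 or len(parts) % 2 or parts[-2] != 0:
        raise ValueError("not a magic branch number: %s" % magic)
    return parts[:-2] + parts[-1:]


def branch_point(magic):
    """Revision the branch sprouts from: '1.6.0.2' -> '1.6'."""
    return ".".join(str(part) for part in parse(magic)[:-2])


def on_branch(rev, magic):
    """True if rev is a revision directly on the branch named by magic."""
    branch = branch_of_magic(magic)
    parts = parse(rev)
    return len(parts) == len(branch) + 1 and parts[:-1] == branch


print(branch_point("1.6.0.2"))
for rev in ("1.6.2.1", "1.7", "1.8", "1.9", "1.10", "1.11"):
    print(rev, on_branch(rev, "1.6.0.2"))
```

Only 1.6.2.1 is reported as being on the release_1_0 branch; 1.7 through
1.11 are trunk revisions.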

And very interestingly, note that 1.8 and 1.6.2.1 *both* claim to fix
[809769], and on the same day.

I doubt this is the intention - I can't recall anyone deciding to fix real,
verified bugs *after* the 1.0 release.  Can anyone shed any light?

Thanks,

Mark.


From popiel at wolfskeep.com  Tue Nov  4 17:41:47 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue Nov  4 17:41:51 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
In-Reply-To: Message from "T. Alexander Popiel" <popiel@wolfskeep.com> 
	of "Mon, 03 Nov 2003 14:07:57 PST."
	<20031103220757.323B72DF59@cashew.wolfskeep.com> 
References: <LNBBLJKPBEHFEDALKOLCMEJMGOAB.tim.one@comcast.net>
	<20031103220757.323B72DF59@cashew.wolfskeep.com> 
Message-ID: <20031104224147.8893E2DE36@cashew.wolfskeep.com>

In message:  <20031103220757.323B72DF59@cashew.wolfskeep.com>
             "T. Alexander Popiel" <popiel@wolfskeep.com> writes:
>
>Perhaps it's time to test a variation where the prob is based on
>hamcount and spamcount instead of hamratio and spamratio.  Hrm.
>*tap, tap, tap*  I'll be back in a few hours...

FWIW, basing the prob on the raw counts instead of the ratios is
an incredibly clearcut loss.  Only won twice on the false positives
(by relatively small margins), but lost EVERY time on the false
negatives by large amounts.
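
To make the difference concrete, here is a sketch of the two estimates
before any Bayesian adjustment is applied.  This is one plausible reading
of the variation, not the code that was actually tested, and the function
names are made up for illustration.

```python
def prob_from_ratios(hamcount, spamcount, nham, nspam):
    """Estimate based on hamratio and spamratio (fractions of each corpus)."""
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    return spamratio / (hamratio + spamratio)


def prob_from_counts(hamcount, spamcount):
    """Estimate based on the raw counts, ignoring the corpus sizes."""
    return spamcount / float(hamcount + spamcount)


# Tim's example: a token seen in 1 of 20000 ham and 1 of 100 spam.
print(prob_from_ratios(1, 1, 20000, 100))
print(prob_from_counts(1, 1))
```

For Tim's example the ratio version gives about 0.995 while the count
version gives 0.5, so the count version throws away the information that
the two corpora are very different in size.  Both functions assume the
token has been seen at least once.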

- Alex

From richie at entrian.com  Tue Nov  4 17:46:58 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov  4 17:47:11 2003
Subject: [spambayes-dev] Re: 1.0a7 
In-Reply-To: <r1ufqvskb9ojbu6ulb62214cg41dmat4ov@4ax.com>
References: <ad7c9w19.fsf@python.net>
	<200311041300.hA4D0a5p007315@localhost.localdomain>
	<r1ufqvskb9ojbu6ulb62214cg41dmat4ov@4ax.com>
Message-ID: <3vagqvsc0kv1m74f1in0n8567396agou5f@4ax.com>


[Me, earlier]
> (Thanks for doing the release, BTW - I'll give it a bash later on
> tonight.)

A quick smoke-test of the web interface and the POP3 server on Windows
failed to smoke, so all's fine with me.

-- 
Richie Hindle
richie@entrian.com


From kennypitt at hotmail.com  Tue Nov  4 17:54:48 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov  4 17:55:16 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <02b601c3a323$8b250d70$0500a8c0@eden>
Message-ID: <E1AHA4v-0005ru-7Y@mail.python.org>

Mark Hammond wrote:
> I saw Skip raise this last week, but I think he was asking different
> questions.
> 
> My understanding is that we are moving towards 1.0 on the release_1_0
> branch.  Is that correct?
> 
> If so, I'm a little confused by this :)  If we look at an edited log
> from sb_server, we see (please see my comments/questions inline with
> "****", and at the end):
> 
> RCS file: /cvsroot/spambayes/spambayes/scripts/sb_server.py,v
> Working file: sb_server.py
> head: 1.11
> ...
> symbolic names:
>         release_1_0_a7: 1.6.2.1
>         outlook-1-0-fork: 1.11
>         release_1_0: 1.6.0.2
>         release_1_0_a6: 1.6
> 
> **** My reading of this is that this file was branched for 1.0 at 1.6.
> Correct?
> 
> **** From my reading of this, the "1.0" release is missing a number of
> significant patches - all 1.7->1.11 checkins appear to *not* be on
> the 1.0 release.
> 
> And very interestingly, note that 1.8 and 1.6.2.1 *both* claim to fix
> [809769], and on the same day.
> 
> I doubt this is the intention - I can't recall anyone deciding to fix
> real, verified bugs *after* the 1.0 release.  Can anyone shed any
> light? 

After looking at the full log, here is how I see it.  Fixes 1.8 and
1.6.2.1 went in on the same day because that is the correct way to do
things at this stage.  Fixes that apply to both the 1.0 and 1.1 releases
need to be made on both the branch and the trunk (because the current
state of those versions could be different at the time the fix is made).
Revs 1.7 and 1.10 are enhancements to the UI that came after the feature
freeze for 1.0, so were not applied to the branch.  I'm not certain, but
I think 1.11 was a fix to a problem caused by the mods in rev 1.10.

Rev 1.9 is a bit gray.  The problem definitely applies to 1.0, so would
probably make a reasonable fix to add to the 1.0 branch.  On the other
hand, it is in a sense adding a "new" feature to prevent multiple
execution.

-- 
Kenny Pitt


From skip at pobox.com  Tue Nov  4 17:59:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Tue Nov  4 17:59:58 2003
Subject: [spambayes-dev] less is more?
Message-ID: <16296.12135.892469.753587@montanaro.dyndns.org>

I've been meaning to try restarting my training database from scratch for
quite awhile.  I finally broke down and did that this afternoon.  I'm quite
satisfied with the performance of the system with its current 24 spams and
11 hams.  As a bonus, the database is a lot smaller (currently 340k vs 20MB
for the old db) and things seem to run somewhat faster since it's a lot
easier to keep the entire db file in memory.

Overnight I'm sure I'll get a fair number of errors as I get a load of
email the current db hasn't seen, but so far, so good.

Skip


From richie at entrian.com  Tue Nov  4 18:31:24 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov  4 18:31:39 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <02b601c3a323$8b250d70$0500a8c0@eden>
References: <02b601c3a323$8b250d70$0500a8c0@eden>
Message-ID: <o9bgqv0g1b8jl4d5rkk5eoqnq4evfaitji@4ax.com>


[Mark]
> My understanding is that we are moving towards 1.0 on the release_1_0
> branch.  Is that correct?

I think so, yes.  release_1_0 was supposed to be bugfix-only, and the head
(moving towards version 1.1) was for enhancements.  Here's Tony's original
mail:

> As discussed earlier, I've created a cvs branch - 'release_1_0' - to move
> toward 1.0b1 and then 1.0.
> 
> If I understand things rightly (going by Jeremy and Richie's comments) the
> main branch is now for 1.1 work, so is un-feature frozen ;).  If people
> could check 1.0 bugfixes into the release_1_0 branch (and 1.1, as needed),
> that would be great.

Re-reading Tony's mail, I should have pointed out at the time that we
shouldn't commit edits to both places, but should use "cvs up [-j
moving-tag] -j release_1_0" to periodically merge the bugfix branch onto
the head.  Nuts.

From looking at the logs, it seems you're right, Mark - bugfixes have been
hitting the head instead of release_1_0.  Also, some fixes have been
committed to both the head and release_1_0, which will probably make
merging release_1_0 back onto the head a pain - you always get more
conflicts when you do that.  (I should have encouraged more discussion of
branch strategy when all this came up - we make heavy use of CVS branches
at work, and we know a bit about how best to manage them.)

One thing that CVS is spectacularly bad at is giving you an overview of
what's been happening, so it's hard to say where we should go from here.
How much enhancement work has gone onto the head since release_1_0 was
taken?  If it's not very much then maybe we should just give it a solid
testing then merge release_1_0 back onto the head as soon as 1.0a7 is out.
We then either take 1.0a8 from the head (bletch) or start again with a new
bugfix branch and a better-advertised branch management strategy...

-- 
Richie two-commits-in-two-months-like-he-has-room-to-talk Hindle
richie@entrian.com


From richie at entrian.com  Tue Nov  4 18:58:56 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov  4 18:59:09 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <E1AHA4v-0005ru-7Y@mail.python.org>
References: <02b601c3a323$8b250d70$0500a8c0@eden>
	<E1AHA4v-0005ru-7Y@mail.python.org>
Message-ID: <emdgqv48e3ctjk3hkjechkgv019a8al7cq@4ax.com>


[Kenny]
> Fixes that apply to both the 1.0 and 1.1 releases
> need to be made on both the branch and the trunk (because the current
> state of those versions could be different at the time the fix is made).

No, definitely not!  Fixes made on the bugfix branch should be
batch-merged onto the head once in a while, using "cvs up -j".  If you
have a policy of manually fixing both places then inevitably you sometimes
forget.  If you do lots of manual merging and only then try to merge the
branches you get hundreds of conflicts.  By doing a periodic merge, you
keep the code roughly in step as far as bugfixes go - close enough that
cvs can do meaningful merges - and you fix conflicts when you get them.

The fact that the head and the bugfix branch differ is rarely a problem
when it comes to merging - if you get conflicts, you fix them.  The fix is
usually obvious, and only needs to be applied once.

Here's how the system works over time:

Head: 1.1   1.2   1.3  1.4 ....... 1.9 ......... 1.20 ........ 1.30
                   |                ^             ^             ^
    bugfix branch  |                | merge 1     | merge 2     | merge 3
 (eg. release_1_0) |                |             |             |
                    --> 1.3.1.1  1.3.2.3 ..... 1.3.2.9 ..... 1.3.2.20

So you start with the head, then at some point you take your bugfix
branch.  1.4 is a feature, 1.3.1.1 is a bugfix (the specific numbers don't
matter).  At some point you decide to merge the bugfix branch onto the
head.  Why?

 o People working on the head are frustrated by the bugs
 o It's been a long time and the branches are getting out of step
 o Someone wants to start a major piece of work and wants to branch off
   the head to do it, and they want the bugfixes in place on their branch

So you get CVS to apply all the edits that have been made on the bugfix
branch to the head: you take a head checkout and do "cvs up -j bugfix".
You get some conflicts where bugfixes have been made to code that's
changed on the head, and you fix them (this is less of a problem than you
might think, once the code has stopped migrating wholesale from place to
place as it can do in the early stages of a project).

The next time you're going to want to do this merge, you'll want to take
all the edits made on the bugfix branch between this merge and the time
you do the next merge, and apply them to the head.  So after this first
merge, you mark the point on the bugfix branch at which you did your merge
by tagging it: "cvs tag bugfix_to_head" in a bugfix checkout.  That will
apply the bugfix_to_head tag to 1.3.2.3 on the bugfix branch (again, you
don't care about the numbers in the real world because you're operating on
entire branches).

Then at some later date comes merge 2: take the edits between
bugfix_to_head and the current state of the bugfix branch and apply them
to the head.  In a head checkout, "cvs up -j bugfix_to_head -j bugfix".
Fix your conflicts, and in a bugfix checkout, move your marker tag to the
new position: "cvs tag -F bugfix_to_head".  That moves bugfix_to_head to
1.3.2.9, ready for merge 3 at some later date.
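
The merge cycle described above can be summarised as a small helper that
returns the commands for each round.  This is only a sketch: the function
name merge_commands is made up, "bugfix" and "bugfix_to_head" are the
example branch and marker tag names used above, and the only cvs
invocations in it are the ones quoted in the text.

```python
def merge_commands(branch, marker, first_merge):
    """Return (checkout, command) pairs for one merge of branch onto head.

    checkout says which working copy the command is run in: "head" or
    the bugfix branch.  command is an argument list.
    """
    if first_merge:
        # Merge 1: apply every edit made on the branch so far.
        update = ["cvs", "up", "-j", branch]
        # Mark how far the merge got, for next time.
        tag = ["cvs", "tag", marker]
    else:
        # Later merges: apply only the edits made since the marker.
        update = ["cvs", "up", "-j", marker, "-j", branch]
        # Move the existing marker forward (-F) to the new position.
        tag = ["cvs", "tag", "-F", marker]
    return [("head", update), (branch, tag)]


for first in (True, False):
    for checkout, command in merge_commands("bugfix", "bugfix_to_head", first):
        print("in %s checkout: %s" % (checkout, " ".join(command)))
```

Between the two commands of each round, the conflicts in the head
checkout are fixed and the result is committed; that step is the same
for every merge, so it is left out of the sketch.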

All this takes longer to explain than to just do.  8-)  In the long run it
guarantees that bugfixes don't get lost, and that people can consistently
use each branch for its intended purpose.  Bugfix releases are always made
from the bugfix branch, which eventually comes to an end when a new
feature release goes out - and a new bugfix branch is taken for that
release.

-- 
Richie Hindle
richie@entrian.com


From anthony at interlink.com.au  Wed Nov  5 08:12:26 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed Nov  5 08:17:36 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <o9bgqv0g1b8jl4d5rkk5eoqnq4evfaitji@4ax.com> 
Message-ID: <200311051313.hA5DCQaE017601@localhost.localdomain>


>>> Richie Hindle wrote
> Re-reading Tony's mail, I should have pointed out at the time that we
> shouldn't commit edits to both places, but should use "cvs up [-j
> moving-tag] -j release_1_0" to periodically merge the bugfix branch onto
> the head.  Nuts.

Note that you can also just use cvs diff to apply the fix by hand to both
the trunk and the branch. This is likely to cause less pain, as you're
then relying on less cvs magic.

From my recent python-dev postings about the python maintenance branch,
I'd suggest the following:

  - checkin to the trunk. If the fix is a bugfix, and suitable for the
    branch, include "bugfix candidate" in the checkin message.

  - (preferably) check your bugfix into the branch as well. I suggest
    having two checkouts, one on the branch, one on the trunk.

  - (otherwise) someone else notices that the "bugfix" needs to be
    applied to the branch as well, and does so.

I need to apply the various doc changes I made in the last couple of
days to the trunk, I will try to do so soon.


-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From skip at pobox.com  Wed Nov  5 09:08:08 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov  5 09:08:35 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <o9bgqv0g1b8jl4d5rkk5eoqnq4evfaitji@4ax.com>
References: <02b601c3a323$8b250d70$0500a8c0@eden>
	<o9bgqv0g1b8jl4d5rkk5eoqnq4evfaitji@4ax.com>
Message-ID: <16297.1096.280851.95630@montanaro.dyndns.org>


    Richie> From looking at the logs, it seems you're right, Mark - bugfixes
    Richie> have been hitting the head instead of release_1_0.  Also, some
    Richie> fixes have been committed to both the head and release_1_0,
    Richie> which will probably make merging release_1_0 back onto the head
    Richie> a pain - you always get more conflicts when you do that.

My take on this is that at this point in time we should not be working on
any branches.  Everything should happen on the trunk until a release is
about to be cut, at which point a branch is made, then frozen except for
crucial bug fixes.  Once the release is complete, the branch dies (or at
best any changes it contains which are not on the trunk are merged back into
the trunk).  After we've actually had a 1.0 release, we create a branch
called something like release10_maint, to which bug fixes are backported
from the trunk.  At some point in time, that branch also dies (probably
fairly quickly, after a 1.1 or 1.2 release).

This is more-or-less how the Python development works.  The advantage from
my standpoint is that most developers can be content to only check in
changes to the trunk and occasionally backport their changes (if they are
bug fixes) to an obvious branch.  The only people who have to worry much
about branches are release managers.  Branches leading up to a release are
very short-lived.

Skip

From anthony at interlink.com.au  Wed Nov  5 09:28:59 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed Nov  5 09:32:23 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <16297.1096.280851.95630@montanaro.dyndns.org> 
Message-ID: <200311051429.hA5ESxa3019296@localhost.localdomain>


>>> Skip Montanaro wrote
> My take on this is that at this point in time we should not be working on
> any branches.  Everything should happen on the trunk until a release is
> about to be cut, at which point a branch is made, then frozen except for
> crucial bug fixes.  

This is how the Python release process works. At the moment we seem to be
following something more like the Mozilla process, where we cut a branch for
the upcoming release once we're past the point of adding new features. 

Having said that, I'd say the time to branch is at the point where we're about
to cut the first beta. So we've possibly done it too soon here. 

OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple of 
weeks, with a possible RC a couple of weeks after that. 

Anthony

-- 
Anthony Baxter     <anthony@interlink.com.au>   
It's never too late to have a happy childhood.


From kennypitt at hotmail.com  Wed Nov  5 09:41:33 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov  5 09:41:54 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <emdgqv48e3ctjk3hkjechkgv019a8al7cq@4ax.com>
Message-ID: <E1AHOr2-0005zl-HY@mail.python.org>

Richie Hindle wrote:
> [Kenny]
>> Fixes that apply to both the 1.0 and 1.1 releases
>> need to be made on both the branch and the trunk (because the current
>> state of those versions could be different at the time the fix is
>> made). 
> 
> No, definitely not!  Fixes made on the bugfix branch should be
> batch-merged onto the head once in a while, using "cvs up -j".

That's cool, I didn't know CVS could do that.  I was simply going off
the previous description of bug fixing that you referenced in your
previous message.

> So you get CVS to apply all the edits that have been made on the
> bugfix branch to the head: you take a head checkout and do "cvs up -j
> bugfix". You get some conflicts where bugfixes have been made to code
> that's changed on the head, and you fix them (this is less of a
> problem than you might think, once the code has stopped migrating
> wholesale from place to place as it can do in the early stages of a
> project). 

Have you ever seen this become an issue if the new line of development
does decide to do a significant refactoring of the code?

Thanks for the excellent description of the process.  That's good
information for anyone working on a CVS project, not just for SpamBayes.

-- 
Kenny Pitt


From richie at entrian.com  Wed Nov  5 13:37:18 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Nov  5 13:37:33 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <200311051429.hA5ESxa3019296@localhost.localdomain>
References: <16297.1096.280851.95630@montanaro.dyndns.org>
	<200311051429.hA5ESxa3019296@localhost.localdomain>
Message-ID: <d1giqv8pbeik6hm0bgu7l22vr2f2o93jmg@4ax.com>


[Kenny]
> Have you ever seen this become an issue if the new line of development
> does decide to do a significant refactoring of the code?

No, not really.  Either the code has moved so you get a conflict and have
to hand-merge, which is no harder than with any other branch management
scheme, or the bugfix is no longer meaningful because the code has gone
away, so you just take the version from the head.

[Skip]
> My take on this is that at this point in time we should not be working on
> any branches.  Everything should happen on the trunk until a release is
> about to be cut, at which point a branch is made, then frozen except for
> crucial bug fixes.  

That's fair enough as long as we're happy to release new features with every
release.  As I understood it, at the time we had lots of new features in
the pipeline *and* a need to release some bugfixes.  Perhaps the features
should have gone onto a branch and the trunk remained the place to do
bugfixes - we also do that, and it's no problem.  People developing
significant new features get all the benefits of source control without
stepping on the toes of the other developers or the release process.

[Anthony]
> Note that you can also just use cvs diff to apply the fix by hand to both
> the trunk and the branch. This is likely to cause less pain, as you're
> then relying on less cvs magic.

CVS is good at this kind of thing as long as you give it enough help
(managing tags and branches sensibly) - I'd rather let CVS do the heavy
lifting than do it by hand.

All that said, the branch strategy you use is far less important than
letting everyone know what it is!  I don't have a big axe to grind about
this - spambayes isn't (yet) a big enough project for the choice of branch
strategy to be critical.  Our failure this time, if there even was a
failure, was in not advertising the strategy loudly enough.

[Anthony]
> OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple of 
> weeks, with a possible RC a couple of weeks after that. 

The DBRunRecoveryError problem is stopping us, IMHO.  Stephen Harper
posted to the spambayes list yesterday with a possible method for
reproducing that
(http://mail.python.org/pipermail/spambayes/2003-November/009021.html)
which I need to find time to look into.

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Wed Nov  5 14:54:59 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov  5 14:55:11 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <d1giqv8pbeik6hm0bgu7l22vr2f2o93jmg@4ax.com>
References: <16297.1096.280851.95630@montanaro.dyndns.org>
	<200311051429.hA5ESxa3019296@localhost.localdomain>
	<d1giqv8pbeik6hm0bgu7l22vr2f2o93jmg@4ax.com>
Message-ID: <16297.21907.471869.85745@montanaro.dyndns.org>


    Richie> [Anthony]
    >> OTOH, I don't know what is stopping us from cutting 1.0b1 in a couple
    >> of weeks, with a possible RC a couple of weeks after that.

    Richie> The DBRunRecoveryError problem is stopping us, IMHO.  Stephen
    Richie> Harper posted to the spambayes list yesterday with a possible
    Richie> method for reproducing that
    Richie> (http://mail.python.org/pipermail/spambayes/2003-November/009021.html)
    Richie> which I need to find time to look into.

Greg Smith just checked in some changes to the bsddb package in the Python
CVS tree related to deadlocks and threading.  It might be worth seeing if
those changes help.

Skip

From richie at entrian.com  Wed Nov  5 14:56:00 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Nov  5 14:56:14 2003
Subject: [spambayes-dev] Re: [Spambayes] Lotus Notes filter error KeyError:
	('Hammie', 'header_spam_string')
In-Reply-To: <2384386.1068059505079.JavaMail.root@gonzo.psp.pas.earthlink.net>
References: <2384386.1068059505079.JavaMail.root@gonzo.psp.pas.earthlink.net>
Message-ID: <50liqvg29dnuk1s4k5hdvf1q5um1r01jop@4ax.com>


[Mike]
> File "C:\Program Files\Python23\Scripts\sb_notesfilter.py", line 237, in processAndTrain
>     str = options["Hammie", "header_spam_string"]

I don't know much about the Notes stuff, but that looks like a bug.  That
piece of code should probably be:

    if is_spam:
        str = options["Headers", "header_spam_string"]
    else:
        str = options["Headers", "header_ham_string"]

You can see from spambayes/Options.py that header_spam_string goes in the
Headers section, not the Hammie section.  There are a few other places in
sb_notesfilter.py where similar code ("options["Hammie",
"header_xxx_string"]") appears - those should be changed too.

Mike, does changing "Hammie" to "Headers" in those places fix your
problem?
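
For illustration, here is a self-contained sketch of the failure and of
the suggested fix.  It uses a plain dict keyed by (section, option) as a
stand-in for the real options object, and the option values are
placeholders; only the section and option names come from the traceback
and from spambayes/Options.py as described above.

```python
# Stand-in for the real options object (an assumption for this sketch):
# the header strings live in the "Headers" section only.
options = {
    ("Headers", "header_spam_string"): "spam",
    ("Headers", "header_ham_string"): "ham",
}


def header_string(is_spam):
    """Pick the header string from the Headers section, as suggested."""
    if is_spam:
        return options["Headers", "header_spam_string"]
    return options["Headers", "header_ham_string"]


# The lookup from the traceback fails because the section is wrong.
try:
    options["Hammie", "header_spam_string"]
except KeyError as exc:
    print("KeyError:", exc)

print(header_string(True))
print(header_string(False))
```

The helper also avoids binding the result to the name str, which would
hide the built-in of the same name.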

I'm forwarding this to spambayes-dev to see whether anyone there knows for
sure whether I'm right about this...?

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Thu Nov  6 02:58:40 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov  6 02:58:52 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <16297.21907.471869.85745@montanaro.dyndns.org>
References: <16297.1096.280851.95630@montanaro.dyndns.org>
	<200311051429.hA5ESxa3019296@localhost.localdomain>
	<d1giqv8pbeik6hm0bgu7l22vr2f2o93jmg@4ax.com>
	<16297.21907.471869.85745@montanaro.dyndns.org>
Message-ID: <qnvjqvsr8fm4jqomqk0hgfeothg15fq9oc@4ax.com>


[Skip]
> Greg Smith just checked in some changes to the bsddb package in the Python
> CVS tree related to deadlocks and threading.  It might be worth seeing if
> those changes help.

Thanks.  If I can reproduce the problem with Spambayes, I'll try it again
with CVS Python.

-- 
Richie Hindle
richie@entrian.com


From kennypitt at hotmail.com  Thu Nov  6 10:22:18 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Nov  6 10:22:44 2003
Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 - windows
	proxy_tray installation
In-Reply-To: <3FA9D6E0.3000409@swiftdsl.com.au>
Message-ID: <E1AHly3-0001r1-KB@mail.python.org>

Phil Pierotti wrote:
> [mebbe I didn't install something the way it was expecting]
> But what I see is that under the distribution, there's
>     windows\pop3proxy_*.py ("the scripts")
>     windows\resources\ (with all the .ico etc resources for the
> systray program)
> 
> The scripts are installed under
>     \Python23\Scripts
> by setup.py, but there's no corresponding
>     \Python23\windows\resources\
> with all the icon/resources
> 
> So:
> 
> (a) did I not install something properly
> (b) did the installer not install the resources properly
> (c) are the paths in the script wrong
> (d) all of the above
> (e) none of the above, I'm just smoking too much crack (as per usual)

Looks like you hit the nail right on the head.  Glad I finished reading
my inbox before replying to your previous message <wink>.

At some point in the not-too-distant past, a decision was made that the
Windows scripts pop3proxy_service.py and pop3proxy_tray.py should be
installed to the Python Scripts directory along with the other
command-line scripts.  It seems this was a bit premature, as
pop3proxy_tray obviously isn't designed to be run that way.  When run
from source, the icon resources are required to be present in a
directory structure that isn't appropriate for installing into the main
Python directory.

For now, you should be able to get around this problem by going to the
windows dir in your original source tree and running pop3proxy_tray.py
from there.  I've CC'd the spambayes-dev list in hopes that someone can
take a look at this.  At the very least, we should probably stop copying
it to the Python\Scripts directory until the problem is fixed.

-- 
Kenny Pitt


From richie at entrian.com  Thu Nov  6 16:36:20 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov  6 16:36:35 2003
Subject: [spambayes-dev] Hunter-killer drones
Message-ID: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>

Dev people,

Before I start digging through the CVS logs to see who committed the
following code to BrighterAsyncChat.handle_error in Dibbler.py:

        if type == socket.error and v[0] == 9:  # Why?  Who knows...
            pass

so that I know where to dispatch the hunter-killer drones, does anybody
want to confess to it?

Throwing away that exception is causing an infinite loop in sb_server.py
whenever something happens to a browser socket, like someone going to a
different page during training.  Rumour has it that this leads to the
infamous DBRunRecoveryError, but I haven't confirmed that yet.

Own up and explain yourself, or the next thing you hear will be the whine
of tiny nuclear-tipped turbines...

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Thu Nov  6 16:58:20 2003
From: tim.one at comcast.net (Tim Peters)
Date: Thu Nov  6 16:58:25 2003
Subject: [spambayes-dev] Hunter-killer drones
In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEHOGPAB.tim.one@comcast.net>

[Richie Hindle]
> Dev people,
>
> Before I start digging through the CVS logs to see who committed the
> following code to BrighterAsyncChat.handle_error in Dibbler.py:
>
>         if type == socket.error and v[0] == 9:  # Why?  Who knows...
>             pass
>
> so that I know where to dispatch the hunter-killer drones, does
> anybody want to confess to it?

I'd love to, except I didn't do it.  CVS annotate says it's been like that
since version 1.1, and, indeed, the oldest version in the repository already
had it:

    Revision 1.1
    Fri Jan 17 20:21:07 2003 UTC (9 months, 2 weeks ago) by richiehindle


> Throwing away that exception is causing an infinite loop in
> sb_server.py whenever something happens to a browser socket, like
> someone going to a different page during training.

That's probably not good <wink>.

> Rumour has it that this leads to the infamous DBRunRecoveryError,

Cool!  Doubly worth pursuing then.

> but I haven't confirmed that yet.
>
> Own up and explain yourself, or the next thing you hear will be the
> whine of tiny nuclear-tipped turbines...

It's far too late to stop that now -- I just saw the drones whiz by my
window!


From skip at pobox.com  Thu Nov  6 17:02:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu Nov  6 17:02:38 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
Message-ID: <16298.50417.896066.477124@montanaro.dyndns.org>

I run SpamBayes on a couple of machines, filtering and scoring mail for several
email addresses which eventually find their way to my mailbox.  I'd like to
stuff the hostname into the score somehow.  My first attempt was

    [Headers]
    classification_header_name: X-Spambayes-Classification: titan

This failed with this error:

    Attempted to set [Headers] classification_header_name with invalid value ...

Without considering it further, I then tried:

    [Headers]
    header_spam_string: titan: spam
    header_ham_string: titan: ham

This also failed.  Next, I tried

    [Headers]
    header_spam_string: titan:spam
    header_ham_string: titan:ham

then

    [Headers]
    header_spam_string: titan-spam
    header_ham_string: titan-ham

which finally worked.  It looks to me like OptionsClass.HEADER_VALUE is too
restrictive, but I'll leave it for the author of that code to decide whether
or not to loosen it up.
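
The behaviour above is what a pattern limited to word characters plus a
few punctuation marks would produce.  A minimal sketch, assuming the
validator is a regular expression of that shape; both patterns below are
guesses for illustration, not copied from OptionsClass:

```python
import re

# Assumed strict pattern: word characters plus '.', '-' and '*' only.
STRICT_HEADER_VALUE = re.compile(r"[\w.\-*]+\Z")

# Hypothetical looser pattern: also allows ':' and single interior spaces.
LOOSE_HEADER_VALUE = re.compile(r"[\w.\-*:]+(?: [\w.\-*:]+)*\Z")


def is_valid(pattern, value):
    """Return True if the whole value matches the given pattern."""
    return pattern.match(value) is not None
```

Under these assumptions "titan-spam" passes both patterns, while
"titan:spam" and "titan: spam" pass only the looser one.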

Skip

From nas-spambayes at python.ca  Thu Nov  6 17:04:46 2003
From: nas-spambayes at python.ca (Neil Schemenauer)
Date: Thu Nov  6 17:02:59 2003
Subject: [spambayes-dev] Hunter-killer drones
In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
Message-ID: <20031106220445.GA24610@mems-exchange.org>

On Thu, Nov 06, 2003 at 09:36:20PM +0000, Richie Hindle wrote:
> Before I start digging through the CVS logs to see who committed the
> following code to BrighterAsyncChat.handle_error in Dibbler.py:
> 
>         if type == socket.error and v[0] == 9:  # Why?  Who knows...
>             pass

On my system 9 == errno.EBADF (Bad file descriptor).  No idea why someone
would want to ignore it.
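
The numeric value can be checked against the errno module rather than
hard-coded.  A minimal sketch for a current Python; describe_errno and
should_close_channel are hypothetical helpers, not part of Dibbler, and
"close the channel instead of ignoring the error" is only an assumption
about what a fix might look like:

```python
import errno
import os


def describe_errno(code):
    """Return (symbolic_name, message) for a numeric errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


def should_close_channel(exc):
    """Hypothetical policy: a bad file descriptor means the socket is gone,
    so the channel should be closed rather than the error swallowed."""
    return isinstance(exc, OSError) and exc.errno == errno.EBADF
```

Comparing against errno.EBADF instead of the literal 9 also documents
the intent, which the original "Why?  Who knows..." comment did not.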

  Neil

From richie at entrian.com  Thu Nov  6 17:20:46 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov  6 17:21:01 2003
Subject: [spambayes-dev] Hunter-killer drones
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEHOGPAB.tim.one@comcast.net>
References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
	<LNBBLJKPBEHFEDALKOLCIEHOGPAB.tim.one@comcast.net>
Message-ID: <5vhlqvoif2857jl2lc675a7b9v0k9p0idp@4ax.com>


[Tim]
> CVS annotate says it's been like that since version 1.1

That's why I don't want to have to dig through CVS - it was introduced
before the code was moved from pop3proxy.py (or possibly the original web
configurator, whatever that was called) into Dibbler.py.  So I need to
hunt around in the attic... much easier to deploy the Drones.

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Thu Nov  6 17:34:07 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov  6 17:34:21 2003
Subject: [spambayes-dev] Hunter-killer drones
In-Reply-To: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
Message-ID: <mcilqvkf408pu2e2ja9uq59feftdtijqim@4ax.com>


[Me]
> Own up and explain yourself, or the next thing you hear will be the whine
> of tiny nuclear-tipped turbines...

Ha!  The Drones are now en route to Central Illinois.  Take a picture of
*this*!  Ka-BOOM!

Sadly I'm not sure the culprit reads this list any more, so he may never
know the cause of his demise (and that of the unfortunate West Central
Illinois Tractor Pullers Association - http://www.wcitpa.com - whose
headquarters will unfortunately be destroyed as well).  I'll send him a
personal email to apologise - perhaps he'll see it before the Drones reach
him.

-- 
Richie Hindle
richie@entrian.com


From papaDoc at videotron.ca  Thu Nov  6 17:38:39 2003
From: papaDoc at videotron.ca (papaDoc)
Date: Thu Nov  6 17:38:42 2003
Subject: [spambayes-dev] Hunter-killer drones
In-Reply-To: <mcilqvkf408pu2e2ja9uq59feftdtijqim@4ax.com>
References: <2eflqvc6ris6cgrd7aq8j6kmedvqpoj0cc@4ax.com>
	<mcilqvkf408pu2e2ja9uq59feftdtijqim@4ax.com>
Message-ID: <3FAACD6F.3040303@videotron.ca>

Hi,

We want a name, we want a name ........    ;-)


What a nice explosion, I was able to see it from Montreal !!!

Remi


From kennypitt at hotmail.com  Fri Nov  7 13:13:34 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Nov  7 13:13:59 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <IEEDKAHMEBPPLILCLFKFCEECCJAA.bob@jellyvision.com>
Message-ID: <E1AIB7K-0001Ie-C3@mail.python.org>

Bob Chojnacki wrote:
> Hi,
> 
> I really like SpamBayes and the Outlook plugin.  It is working much
> better than other spam filters, considering I get 85-95% spam. I am
> currently using version 008.1.  I read your FAQ about the problems
> with making the Outlook envelope tray icon go away.  (I am also not
> sure if this is the right email address to send this comment, so
> please bear with me if it isn't.) 
> 
> Is the following link helpful? (Keep in mind that I am not a Windows
> programmer):
> 
> 	http://www.slipstick.com/dev/code/clearenvicon.htm

Thanks for the link.  I created the following code to implement this in
the Outlook plugin and attached it to a menu item for testing.  It was,
in fact, successful in removing the new mail envelope from the taskbar.
Now, the *really* tricky part is figuring out when to remove the icon.

====================

import win32con
import win32gui

def RemoveNewMailIcon():
    win32gui.EnumWindows(_removeIconCallback, None)
    
def _removeIconCallback(hwnd, extra):
    # Check for Outlook window class.
    if win32gui.GetClassName(hwnd) == "rctrl_renwnd32":
        # Got the correct class, but we need to make sure window title
        # is empty because there may be other top-level Outlook windows.
        if win32gui.GetWindowText(hwnd) == "":
            return not _killNewMailIcon(hwnd)
        else:
            return True
    else:
        return True
        
WUM_RESETNOTIFICATION = win32con.WM_USER + 7
def _killNewMailIcon(hwnd):
    nid = (hwnd, 0)
    if not win32gui.Shell_NotifyIcon(win32gui.NIM_DELETE, nid):
        return False
    else:
        win32gui.SendMessage(hwnd, WUM_RESETNOTIFICATION, 0, 0)
        return True

====================

-- 
Kenny Pitt


From bob at jellyvision.com  Fri Nov  7 13:40:33 2003
From: bob at jellyvision.com (Bob Chojnacki)
Date: Fri Nov  7 13:37:11 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <20031107121504.7ea42d3ec3ef466293ceca3f8ae215f9.in@ansel.jellyvision.com>
Message-ID: <IEEDKAHMEBPPLILCLFKFMEEECJAA.bob@jellyvision.com>

> Now, the *really* tricky part is figuring out when to remove the icon.

I noticed right after I sent my email (blush) the comment in their code

  ' add some code to check whether the latest items are "interesting"

The comment is akin to the old Steve Martin comedy routine:

	"How to become a millionaire.  First, get a million dollars..."

Sorry about that.

Bob

> -----Original Message-----
> From: Kenny Pitt [mailto:kennypitt@hotmail.com]
> Sent: Friday, November 07, 2003 12:14 PM
> To: 'Bob Chojnacki'; spambayes@python.org
> Cc: spambayes-dev@python.org
> Subject: RE: [Spambayes] Outlook Envelope Tray Icon
> 
> 
> Bob Chojnacki wrote:
> > Hi,
> > 
> > I really like SpamBayes and the Outlook plugin.  It is working much
> > better than other spam filters, considering I get 85-95% spam. I am
> > currently using version 008.1.  I read your FAQ about the problems
> > with making the Outlook envelope tray icon go away.  (I am also not
> > sure if this is the right email address to send this comment, so
> > please bear with me if it isn't.) 
> > 
> > Is the following link helpful? (Keep in mind that I am not a Windows
> > programmer):
> > 
> > 	http://www.slipstick.com/dev/code/clearenvicon.htm
> 
> Thanks for the link.  I created the following code to implement this in
> the Outlook plugin and attached it to a menu item for testing.  It was,
> in fact, successful in removing the new mail envelope from the taskbar.
> Now, the *really* tricky part is figuring out when to remove the icon.
> 
> ====================
> 
> def RemoveNewMailIcon():
>     win32gui.EnumWindows(_removeIconCallback, None)
>     
> def _removeIconCallback(hwnd, extra):
>     # Check for Outlook window class.
>     if win32gui.GetClassName(hwnd) == "rctrl_renwnd32":
>         # Got the correct class, but we need to make sure window title
>         # is empty because there may be other top-level Outlook windows.
>         if win32gui.GetWindowText(hwnd) == "":
>             return not _killNewMailIcon(hwnd)
>         else:
>             return True
>     else:
>         return True
>         
> WUM_RESETNOTIFICATION = win32con.WM_USER + 7
> def _killNewMailIcon(hwnd):
>     nid = (hwnd, 0)
>     if not win32gui.Shell_NotifyIcon(win32gui.NIM_DELETE, nid):
>         return False
>     else:
>         win32gui.SendMessage(hwnd, WUM_RESETNOTIFICATION, 0, 0)
>         return True
> 
> ====================
> 
> -- 
> Kenny Pitt

From rmalayter at bai.org  Fri Nov  7 13:47:54 2003
From: rmalayter at bai.org (Ryan Malayter)
Date: Fri Nov  7 13:47:59 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
Message-ID: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org>

 
> From: Bob Chojnacki
> Subject: RE: [Spambayes] Outlook Envelope Tray Icon
> 
> I noticed right after I sent my email (blush) the comment in 
> their code
> 
>   ' add some code to check whether the latest items are "interesting"
> 
> The comment is akin to the old Steve Martin comedy routine:
> 
> 	"How to become a millionaire.  First, get a million dollars..."
> 
> Sorry about that.

Is it really that hard? 

Maybe I'm not thinking it through enough, but I suggest this simple
approach: 
	Check for unread messages in the SpamBayes "watched" folders. 
	Check the spam score on each of those unread messages.
	If any have a spam score below the "certain ham" threshold, show the icon.
	If not, everything new was spam, and you can remove the icon.

This might take a second or two, but it can happen right after every
SpamBayes scoring run gets triggered. So we'll see the new mail icon for
at most a few seconds.
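
The decision itself is small once the scores are in hand.  A minimal
sketch, assuming scores are in the range 0.0 to 1.0; the function name
and the default cutoff of 0.20 are placeholders, and collecting the
unread messages from the watched folders is left out entirely:

```python
def should_show_new_mail_icon(unread_scores, ham_cutoff=0.20):
    """Decide whether the new-mail icon should stay visible.

    unread_scores: spam scores of the unread messages found in the
    watched folders.  The icon stays only if at least one unread
    message scores below the ham cutoff.
    """
    return any(score < ham_cutoff for score in unread_scores)
```

An empty list gives False, so the icon would also be removed when there
is no unread mail at all in the watched folders.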

Regards,
	Ryan


From adam.walker at rbwconsulting.com  Fri Nov  7 14:07:48 2003
From: adam.walker at rbwconsulting.com (Adam Walker)
Date: Fri Nov  7 14:08:03 2003
Subject: [spambayes-dev] Re: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org>
References: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org>
Message-ID: <3FABED84.5050607@rbwconsulting.com>

What about mail delivered to unwatched folders? What about mail 
delivered to watched and unwatched folders in the same batch? Why do 
people feel they need to drop everything and read an email when it comes in?

Ryan Malayter wrote:

>
>Is it really that hard? 
>
>Maybe I'm not thinking it through enough, but I suggest this simple
>approach: 
>	Check for unread messages in the SpamBayes "watched" folders. 
>	Check the spam score on each of those unread messages.
>	If any exist where the Spam score is below the certain ham
>threshold, show the icon
>	if not, everything new was spam, and you can remove the icon.
>
>This might take a second or two, but it can happen right after every
>SpamBayes scoring run gets triggered. So we'll see the new mail icon for
>at most a few seconds.
>
>Regards,
>	Ryan
>
>  
>


From kennypitt at hotmail.com  Fri Nov  7 14:08:12 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Nov  7 14:08:36 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org>
Message-ID: <E1AIByD-00055C-R8@mail.python.org>

Ryan Malayter wrote:
>> From: Bob Chojnacki
>> Subject: RE: [Spambayes] Outlook Envelope Tray Icon
>> 
>> I noticed right after I sent my email (blush) the comment in
>> their code
>> 
>>   ' add some code to check whether the latest items are "interesting"
>> 
> Is it really that hard?

Maybe it isn't.  I just know I'm not the guy familiar enough with the
code to determine that. <wink>

> 
> Maybe I'm not thinking it through enough, but I suggest this simple
> approach:
> 	Check for unread messages in the SpamBayes "watched" folders.
> 	Check the spam score on each of those unread messages.
> 	If any exist where the Spam score is below the certain ham
> threshold, show the icon
> 	if not, everything new was spam, and you can remove the icon.
> 
> This might take a second or two, but it can happen right after
> every SpamBayes scoring run gets triggered. So we'll see the new mail
> icon for at most a few seconds.

Some possible issues I can think of:

First, depending on your SpamBayes configuration the processing can be
triggered each time a new message is added to the Inbox.  I'm not sure
if it would be a good idea to check all messages in all watched folders
every time a new message is received, as that might prove time-consuming
(especially for those of us who tend to let our Inboxes get cluttered
with old mail).

Second, we've seen in other cases that we can't always rely on Outlook
to do things in the order that we expect.  I'm not sure we can guarantee
that we are processing the message *after* Outlook has already created
the tray icon.  We could end up "removing" an icon that doesn't yet
exist, and then have Outlook add it after we've finished our processing.

P.S.  I've moved this discussion over to the spambayes-dev list, which
is probably a more appropriate venue for these implementation details.

-- 
Kenny Pitt


From cdellario at whatif-productions.com  Fri Nov  7 14:43:37 2003
From: cdellario at whatif-productions.com (Chris Dellario)
Date: Fri Nov  7 14:42:47 2003
Subject: [spambayes-dev] RE: [Spambayes] Outlook Envelope Tray Icon
Message-ID: <113EE4C6211B1D41A34E54A089F4795C0AFCAB@mailbox.whatif-productions.com>

Because, unfortunately, some of us have co-workers (boss, boss's boss,
any number of department heads, etc) who expect us to have read an email
a few minutes after they've sent it.  Those of us who have the privilege
of working offsite are often expected to reply sooner than others to
"prove" that we're working.

------------------------------------------------------------
Chris Dellario
Lead Engineer
Whatif Productions LLC
http://www.whatif.info
(617) 977-0115


-----Original Message-----
From: Adam Walker [mailto:adam.walker@rbwconsulting.com] 
Sent: Friday, November 07, 2003 2:08 PM
To: Ryan Malayter
Cc: spambayes-dev@python.org; spambayes@python.org; Bob Chojnacki
Subject: Re: [Spambayes] Outlook Envelope Tray Icon

What about mail delivered to unwatched folders? What about mail 
delivered to watched and unwatched folders in the same batch? Why do 
people feel they need to drop everything and read an email when it comes in?

Ryan Malayter wrote:

>
>Is it really that hard? 
>
>Maybe I'm not thinking it through enough, but I suggest this simple
>approach: 
>	Check for unread messages in the SpamBayes "watched" folders. 
>	Check the spam score on each of those unread messages.
>	If any exist where the Spam score is below the certain ham
>threshold, show the icon
>	if not, everything new was spam, and you can remove the icon.
>
>This might take a second or two, but it can happen right after every
>SpamBayes scoring run gets triggered. So we'll see the new mail icon for
>at most a few seconds.
>
>Regards,
>	Ryan
>
>  
>


_______________________________________________
Spambayes@python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html

From spambayes at whateley.com  Sat Nov  8 02:13:16 2003
From: spambayes at whateley.com (Brendon)
Date: Sun Nov  9 16:12:03 2003
Subject: [spambayes-dev] Re: [Spambayes] Outlook Envelope Tray Icon
In-Reply-To: <3FABED84.5050607@rbwconsulting.com>
References: <792DE28E91F6EA42B4663AE761C41C2A012C3765@cliff.bai.org>
	<3FABED84.5050607@rbwconsulting.com>
Message-ID: <200311071220.06215.spambayes@whateley.com>

On Friday 07 November 2003 11:07 am, Adam Walker wrote:
> What about mail delivered to unwatched folders? What about mail
> delivered to watched and unwatched folders in the same batch? Why do
> people feel they need to drop everything and read an email when it comes
> in?
>

Probably the same reason most people can't _not_ answer a ringing phone!

That, or just lonely?

Brendon.


From richie at entrian.com  Mon Nov 10 15:46:19 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Nov 10 15:46:43 2003
Subject: [spambayes-dev] Re: Offer to Help / Development Participation
In-Reply-To: <AD325213AB726740BC611F5955F6310A0F80A4@mxs001.lexxi.com>
References: <AD325213AB726740BC611F5955F6310A0F80A4@mxs001.lexxi.com>
Message-ID: <8rsvqv0vt1804s1akodgn77060evpgr83e@4ax.com>

David, Darrell,

[David]
> I'm very impressed with your work and would be glad to help [...]

[Darrell]
> [...] any development help you need don't hesitate to drop me a line.

Many thanks for the offers!  Maybe other developers would like to make
specific suggestions (hence I've forwarded this to the spambayes-dev
mailing list), but there's a whole bunch of things that you could do,
starting with the non-technical:

 o Try to reproduce bugs that we're having trouble reproducing; see the
   bug list at http://sourceforge.net/tracker/?group_id=61702&atid=498103
   (807217 is my personal hate figure).

 o Help with testing; we're only able to test within our own environments,
   and only the developers who are around at the time of a release are
   able to do even that.  Some "real people" who could help test in their
   environments would be a big help.

 o Help improve the website; there's a Wiki page about that at
   http://www.entrian.com/sbwiki/WebSiteDevelopment

 o Help improve the documentation, especially for the non-Outlook
   applications (POP3 proxy, IMAP filter, Notes filter, sb_filter).

 o Help out newbies on the mailing list.

 o Make contributions to the Wiki, http://www.entrian.com/sbwiki - any
   hints and tips, scripts, recipes etc.

 o Take part in discussions on the developers' mailing list at
   spambayes-dev@python.org.  You don't need to be a developer to
   participate, you just need to have a decent grasp of the project and
   have opinions about how it should be developed.

For those with programming skills, there's even more you could help with,
even without in-depth knowledge of the code.  The code's pretty
accessible, and developers are always glad to answer questions about how
it all works.  Here's a small list off the top of my head:

 o Test patches, tidy them up, and make them fit the coding standard
   (http://www.python.org/peps/pep-0008.html) if they don't already.
   See http://sourceforge.net/tracker/?group_id=61702&atid=498105

 o Fix bugs - turning a bug report into a patch makes it far more likely
   to be fixed!

 o Improve our unit tests, or help develop an acceptance test framework.

 o Once you've got a handle on how the code works, implement feature
   requests.

 o Backport bugfixes from the head onto the bugfix branch, although our
   branch strategy is a little up in the air at the moment, so that's one
   for the future.

 o Help with sailing our fleet of luxury yachts from the Caribbean to the
   Med for the Spring season... or am I dreaming again?  8-)

There are probably a dozen other things that I haven't thought of.

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Mon Nov 10 17:49:43 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 10 17:49:49 2003
Subject: [spambayes-dev] Another tweak to try - asciify_subject
Message-ID: <16304.5639.788824.89239@montanaro.dyndns.org>

We're all familiar with the recent attempts to foil spam filters by adding
Latin-1 accents to message subjects (and sometimes to message bodies):

    We cán maké it lönger now

The attached context diff maps subjects through a "latscii" codec I wrote
which does little more than strip accents.  (It also maps various symbols to
reasonable ASCII equivalents, like mapping '¡' -> '!'.)  This showed a small
improvement in false negatives for me (1 out of 10 on the timcv meter, n ==
10, 500 messages per bucket) and no change in false positives:

    false positive percentages
	0.600  0.600  tied          
	0.000  0.000  tied          
	0.200  0.200  tied          
	0.400  0.400  tied          
	0.000  0.000  tied          
	0.800  0.800  tied          
	0.200  0.200  tied          
	0.800  0.800  tied          
	0.200  0.200  tied          
	0.400  0.400  tied          

    won   0 times
    tied 10 times
    lost  0 times

    total unique fp went from 18 to 18 tied          
    mean fp % went from 0.36 to 0.36 tied          

    false negative percentages
	2.200  2.200  tied          
	1.000  1.000  tied          
	2.200  2.000  won     -9.09%
	3.000  3.000  tied          
	1.600  1.600  tied          
	2.000  2.000  tied          
	1.000  1.000  tied          
	2.000  2.000  tied          
	1.600  1.600  tied          
	1.400  1.400  tied          

    won   1 times
    tied  9 times
    lost  0 times

    total unique fn went from 90 to 89 won     -1.11%
    mean fn % went from 1.8 to 1.78 won     -1.11%

    ham mean                     ham sdev
       4.92    4.94   +0.41%       14.95   14.98   +0.20%
       5.14    5.16   +0.39%       15.47   15.48   +0.06%
       4.89    4.90   +0.20%       14.51   14.53   +0.14%
       5.31    5.34   +0.56%       15.80   15.85   +0.32%
       4.61    4.62   +0.22%       14.80   14.83   +0.20%
       5.71    5.75   +0.70%       17.21   17.28   +0.41%
       4.32    4.33   +0.23%       13.45   13.50   +0.37%
       4.83    4.85   +0.41%       14.83   14.87   +0.27%
       4.38    4.38   +0.00%       13.97   14.02   +0.36%
       5.96    5.97   +0.17%       17.38   17.40   +0.12%

    ham mean and sdev for all runs
       5.01    5.02   +0.20%       15.29   15.33   +0.26%

    spam mean                    spam sdev
      90.76   90.84   +0.09%       19.66   19.58   -0.41%
      91.16   91.23   +0.08%       17.64   17.57   -0.40%
      91.25   91.29   +0.04%       18.84   18.79   -0.27%
      88.31   88.36   +0.06%       22.55   22.49   -0.27%
      90.54   90.62   +0.09%       18.50   18.42   -0.43%
      91.64   91.68   +0.04%       17.75   17.69   -0.34%
      91.19   91.33   +0.15%       17.82   17.71   -0.62%
      91.66   91.69   +0.03%       18.76   18.74   -0.11%
      91.31   91.39   +0.09%       17.97   17.85   -0.67%
      91.87   91.96   +0.10%       17.07   16.97   -0.59%

    spam mean and sdev for all runs
      90.97   91.04   +0.08%       18.74   18.66   -0.43%

    ham/spam mean difference: 85.96 86.02 +0.06

If you test this out, it will have no effect if you don't have any messages
in your training databases which use this trick.  When I first ran it, I
hadn't factored in any recent messages and saw nothing.  After I ran
splitndirs.py over my current small (153 spam, 102 ham) training databases,
then ran rebal -n 300 followed by rebal -n 500 to stir the pot a bit, I saw
the above changes.
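
For anyone without the attached diff, the accent-stripping idea can be
approximated with the standard library.  This is not the latscii codec
itself: it is a sketch for a current Python that swaps in Unicode
decomposition via unicodedata, and the SYMBOLS table and asciify name
are illustrative assumptions.

```python
import unicodedata

# A few symbols with no useful Unicode decomposition; illustrative only.
SYMBOLS = {"\u00a1": "!", "\u00bf": "?"}


def asciify(text):
    """Strip accents and map a few symbols to ASCII look-alikes."""
    text = "".join(SYMBOLS.get(ch, ch) for ch in text)
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    # Drop anything that still is not plain ASCII.
    return stripped.encode("ascii", "ignore").decode("ascii")
```

A tokenizer could run the subject through such a function before
splitting it into words, so accented and unaccented spellings produce
the same tokens.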

While I was at it, I wrote a simple Makefile to run the cross validation
tests.  This should speed things up in the common case where your training
database and your base.ini file don't change (cutting processing time
approximately in half).  Use it like so:

    make BASE=std TRIAL=ascii

A plain 

    make

assumes your base and trial option files are std.ini and trial.ini,
respectively.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Makefile
Type: application/octet-stream
Size: 737 bytes
Desc: Makefile for running cross validations
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/Makefile.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diff
Type: application/octet-stream
Size: 6759 bytes
Desc: asciify_subject tweak
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031110/9d95bb33/sb.obj
From tim.one at comcast.net  Tue Nov 11 11:08:45 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 11 11:08:49 2003
Subject: [spambayes-dev] RE: Bug in UserInterface.py
In-Reply-To: <E1AJU46-0003xz-00@sc8-sf-web3.sourceforge.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCAENDHAAB.tim.one@comcast.net>

Would someone familiar with UserInterface.py please check in the attached
patch, or add it to the patch manager if you're unsure about it?  Thanks!
-------------- next part --------------
An embedded message was scrubbed...
From: "Mats Kindahl" <matkin@users.sourceforge.net>
Subject: Bug in UserInterface.py
Date: Tue, 11 Nov 2003 03:39:58 -0500
Size: 1983
Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031111/8a8f0679/attachment.mht
From kennypitt at hotmail.com  Tue Nov 11 12:18:09 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Nov 11 12:18:49 2003
Subject: [spambayes-dev] RE: Bug in UserInterface.py
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAENDHAAB.tim.one@comcast.net>
Message-ID: <E1AJcA7-0000bm-Bm@mail.python.org>

Tim Peters wrote:
> Would someone familiar with UserInterface.py please check in the
> attached patch, or add it to the patch manager if you're unsure about
> it?  Thanks! 

[from attached patch]
"""
diff -r1.32 UserInterface.py
274c274
<     sc_re = re.compile("%s:(.*)\n" % \
---
>     sc_re = re.compile("%s:\s*(\d*\.\d+|\d+\.\d*).*\n" % \
"""

This would probably work if and when the fix for bug #831388 is applied.
However, the current code inserts the probability in the
X-Spambayes-Spam-Probability: header using str(prob), which can go into
the "e" exponent notation for very small probs.  If this happens, the
patched regex will fail to properly identify the probability.  The regex
can be modified as follows if you want to account for this possibility:

sc_re = re.compile("%s:\s*((\d*\.\d+|\d+\.\d*)(e[-+]\d+)?).*\n" % \

In addition, I believe that there will always be at least one digit
before the decimal point as the leading zero is always included, so we
should be able to simplify the expression to:

sc_re = re.compile("%s:\s*(\d+\.\d*(e[-+]\d+)?).*\n" % \
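
One caveat worth testing: str() of a float can produce exponent form
with no decimal point at all (str(1e-05) gives '1e-05'), which a pattern
that requires a '.' would miss.  A minimal sketch for a current Python;
extract_prob and this particular pattern are suggestions for
experimenting, not the code in UserInterface.py:

```python
import re

# Header name as used in the message above.
HEADER = "X-Spambayes-Spam-Probability"

# Digits, optional fraction, optional exponent:
# accepts "0.5", "1.", "1e-05" and "2.5e-07".
sc_re = re.compile(r"%s:\s*(\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)"
                   % re.escape(HEADER))


def extract_prob(text):
    """Return the probability from the header as a float, or None."""
    m = sc_re.search(text)
    return float(m.group(1)) if m else None
```

Making the fraction optional is the only change in spirit; whether the
leading digit is always present still depends on how the header is
written.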

-- 
Kenny Pitt


From richie at entrian.com  Tue Nov 11 16:58:41 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov 11 16:59:01 2003
Subject: [spambayes-dev] Website bug and proposed fix
Message-ID: <htl2rvcagrvlce3nunegkcfn52o2ece9g9@4ax.com>

Hi,

Jens Rantil has kindly pointed out that we have some broken links on
our website, in particular the "SF Project Page" link that appears
throughout the site.

I've never looked at the website stuff before so I could be way off
base, but the problem seems to be that we're applying posixpath.normpath
to a URL, with results that look like this:

>>> import posixpath
>>> posixpath.normpath("http://sourceforge.net/projects/spambayes")
'http:/sourceforge.net/projects/spambayes'

I'd say the fix was to break apart the URL and only run the path
component through normpath.  Here's a patch - I don't want to commit it,
partly because I don't know the code, and partly because the website
build system doesn't fully work on my machine so I can't thoroughly test
it.  I've also removed a rather cryptic comment that seems to refer to
history rather than the current state of play.


Index: scripts/ht2html/LinkFixer.py
===================================================================
RCS file: /cvsroot/spambayes/website/scripts/ht2html/LinkFixer.py,v
retrieving revision 1.2
diff -c -r1.2 LinkFixer.py
*** scripts/ht2html/LinkFixer.py	28 Oct 2003 04:37:08 -0000	1.2
--- scripts/ht2html/LinkFixer.py	11 Nov 2003 21:57:40 -0000
***************
*** 8,13 ****
--- 8,15 ----
  """
  
  import sys
+ import urlparse
+ import posixpath # use posix semantics for urls
  from types import StringType
  
  SLASH = '/'
***************
*** 37,49 ****
              url = 'index.html'
          elif url[-1] == '/':
              url = url + 'index.html'
!         absurl = SLASH.join([self.__rootdir, self.__relthis, url])
          # normalize the path, kind of the way os.path.normpath() does.
!         # urlparse ought to have something like this...
!         # hrm - MarkH thinks this is broken, so it has been replaced
!         # with normpath - what is the problem with normpath?
!         import posixpath # use posix semantics for urls
!         absurl = posixpath.normpath(absurl)
          self.msg('absurl= %s', absurl)
          return absurl
  
--- 39,51 ----
              url = 'index.html'
          elif url[-1] == '/':
              url = url + 'index.html'
!         
          # normalize the path, kind of the way os.path.normpath() does.
!         # urlparse ought to have something like this built in...
!         scheme, addr, path, params, query, frag = urlparse.urlparse(url)
!         abspath = SLASH.join([self.__rootdir, self.__relthis, path])
!         path = posixpath.normpath(abspath)
!         absurl = urlparse.urlunparse((scheme, addr, path, params, query, frag))
          self.msg('absurl= %s', absurl)
          return absurl
  

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Tue Nov 11 17:22:28 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov 11 17:22:12 2003
Subject: [spambayes-dev] Website bug and proposed fix
In-Reply-To: <htl2rvcagrvlce3nunegkcfn52o2ece9g9@4ax.com>
Message-ID: <08f801c3a8a2$4b7011f0$0500a8c0@eden>

> Hi,
>
> Jens Rantil has kindly pointed out that we have some broken links on
> our website, in particular the "SF Project Page" link that appears
> throughout the site.
>
> I've never looked at the website stuff before so I could be way off
> base, but the problem seems to be that we're applying
> posixpath.normpath
> to a URL, with results that look like this:
>
> >>> import posixpath
> >>> posixpath.normpath("http://sourceforge.net/projects/spambayes")
> 'http:/sourceforge.net/projects/spambayes'
>
> I'd say the fix was to break apart the URL and only run the path
> component through normpath.  Here's a patch - I don't want to
> commit it,
> partly because I don't know the code, and partly because the website
> build system doesn't fully work on my machine so I can't
> thoroughly test
> it.

Mea culpa.  This was a hack I made when trying to get the
apps/outlook/bugs.html file working.  The code was breaking for me with
relative links, and I had the impression that the "link fixer" was only
fixing relative links.  I've checked your fix in (I still have a related
problem with bugs.html, but it exists before and after your patch.)

Mozilla appears to have done the right thing with those links!

> I've also removed a rather cryptic comment that seems to refer to
> history rather than the current state of play.

hehe - surely that comment helped you track the bug?  At least I put my name
next to my suspect code ;)

Mark.


From skip at pobox.com  Wed Nov 12 09:48:25 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 09:48:31 2003
Subject: [spambayes-dev] sb_filter change
Message-ID: <16306.18489.981478.992986@montanaro.dyndns.org>

I modified sb_filter.py to accept one or more file names on the command
line.  Existing behavior should be retained.  If a single message is read
from stdin, the output message will have a From_ line only if the input
message did.  When processing files from the command line, it uses
mboxutils.getmbox() to decipher their format.  In such cases, the output is
always a Unix-style mailbox on stdout.

This change probably doesn't have a lot of practical use, but I find it
helpful in one situation.  If I want to score a mailbox full of messages to
identify outliers (perhaps mistakes in my classification of a large body of
messages), I used to do this:

    formail -s sb_filter.py < somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '

which incurred sb_filter.py startup for each message.  Now I execute

    sb_filter.py somembox \
    | egrep -i '^(x-spambayes-classification|message-id): '    

which runs a lot faster.
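
For anyone who would rather do the same filtering step in Python, here is a minimal sketch that reads a Unix-style mailbox and pairs each Message-ID with its classification header.  It uses only the standard mailbox module, not mboxutils, and classification_summary is a hypothetical helper name.

```python
import mailbox


def classification_summary(mbox_path):
    """Return (Message-ID, X-Spambayes-Classification) for each message."""
    box = mailbox.mbox(mbox_path, create=False)
    try:
        return [
            (msg.get("Message-ID"), msg.get("X-Spambayes-Classification"))
            for msg in box
        ]
    finally:
        box.close()


if __name__ == "__main__":
    import sys

    # Usage sketch: sb_filter.py somembox > scored.mbox, then run this
    # script on scored.mbox.
    for msgid, verdict in classification_summary(sys.argv[1]):
        print(msgid, verdict)
```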

I should be able to figure out how to process my incoming mail that way as
well, then spit the result into

    formail -s procmail

to do the usual procmail processing.

This usage suggests an enhancement to mboxutils.getmbox().  Currently, it
doesn't recognize Tim-style training databases (e.g. Data/Ham/SetN, where all
files have numeric filenames).  mboxutils.DirOfTxtFileMailbox could be
extended to simply accept all plain files as messages and all subdirectories
as nested DirOfTxtFileMailboxes.  Would that change break anyone's usage?
(What are .lorien files anyway?)

Skip

From patterson at Tech2020.org  Wed Nov 12 12:28:30 2003
From: patterson at Tech2020.org (Kevin Patterson)
Date: Wed Nov 12 12:28:36 2003
Subject: [spambayes-dev] (no subject)
Message-ID: <E0A7B1E85CB75C44A0D087CD13420E386877@intranet.tech2020.org>

Will spambayes ever work in a Terminal server/Citrix environment? It
works fine when only one instance of outlook is running. Do you know of
anything I could do config wise on spambayes to fix this? Or on the
server side. Keep up the great work. 


Thank you!
Kevin Patterson
From skip at pobox.com  Wed Nov 12 12:40:45 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 12:44:03 2003
Subject: [spambayes-dev] Who can explain this?
Message-ID: <16306.28829.807690.155222@montanaro.dyndns.org>


Racking my brain trying to figure out just what persistent storage file I
was using (because sometimes it seemed to use ~/hammie.db and sometimes
~/.hammiedb), I came across this in sb_filter.py:

        # This is a bit of a hack to counter the default for
        # persistent_storage_file changing from ~/.hammiedb to hammie.db
        # This will work unless a user:
        #   * had hammie.db as their value for persistent_storage_file, and
        #   * their config file was loaded by Options.py.
        if options["Storage", "persistent_storage_file"] == \
           options.default("Storage", "persistent_storage_file"):
            options["Storage", "persistent_storage_file"] = \
                                    "~/.hammiedb"

Can we just rip this hack out and let the user's options file dictate
things?

Skip

From skip at pobox.com  Wed Nov 12 13:05:21 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 13:05:28 2003
Subject: [spambayes-dev] proposal for more uniform option setting from the
	command line
Message-ID: <16306.30305.187928.62359@montanaro.dyndns.org>


Our command lines still seem to be a mish mash of little hacks.  Everything
of interest can be set via the INI file, but there are only a few options
which can be set via the command line, and not (I don't believe) in a
consistent way across SB apps.

How about instead of only allowing specific options to be overridden on the
command line we use a consistent syntax for overriding *any* option from the
command line?  For example, to set the ["Storage",
"persistent_storage_file"] we could use something like

    -o Storage:persistent_storage_file:~/.hammiedb

or

    --option=Storage:persistent_storage_file:~/.hammiedb

The general syntax of an option setting command line arg would then be
section:field:value.  The post-getopt.getopt() code might look something
like:

    from spambayes.Options import options

    for opt, arg in opts:
        ...
        elif opt in ('-o', '--option'):
            options.set(*arg.split(':'))
        ...

We would then deprecate any command line args used to twiddle options using
any other syntax.  Use of those args would trigger a message to stderr like:

    Deprecated form: "-d ~/hammie.db" found.
    Use "-o Storage:persistent_storage_file:~/hammie.db" instead.

This could be extended further.  Should the user give an incomplete -o flag
such as "-o Storage" or "-o Storage:spam_cache", help about that section or
variable could be emitted:

    saw_help = False
    for opt, arg in opts:
        ...
        elif opt in ('-o', '--option'):
            # this would probably be folded into an OptionsClass method
            val = arg.split(':')
            if len(val) < 3:
                options.help(sys.stderr, *val)
                saw_help = True
            else:           
                options.set(*arg.split(':'))
        ...
    if saw_help:
        raise SystemExit

where OptionsClass.OptionsClass.help() would look something like:

    def help(self, stream, sect=None, opt=None):
        if sect is None:
            # dump help about all options
            pass
        elif opt is None:
            # dump help about sect
            pass
        else:
            # dump help about options[sect, opt]
            pass

Skip

From papaDoc at videotron.ca  Wed Nov 12 13:15:23 2003
From: papaDoc at videotron.ca (papaDoc)
Date: Wed Nov 12 13:15:26 2003
Subject: [spambayes-dev] proposal for more uniform option setting from
	the	command line
In-Reply-To: <16306.30305.187928.62359@montanaro.dyndns.org>
References: <16306.30305.187928.62359@montanaro.dyndns.org>
Message-ID: <3FB278BB.2040402@videotron.ca>

Hi,

> Our command lines still seem to be a mish mash of little hacks.
> Everything of interest can be set via the INI file, but there are only
> a few options which can be set via the command line, and not (I don't
> believe) in a consistent way across SB apps.

This is really true, and there were several complaints about that.


> The general syntax of an option setting command line arg would then be
> section:field:value.  The post-getopt.getopt() code might look
> something like:

This is interesting
+1

But we have to check whether it would still be possible to have an option
that will not be, or cannot be, included in the options file, like my
patch adding -t to sb_mboxtrain.py.


Remi


From kennypitt at hotmail.com  Wed Nov 12 13:43:06 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov 12 13:47:05 2003
Subject: [spambayes-dev] proposal for more uniform option setting from
	thecommand line
In-Reply-To: <16306.30305.187928.62359@montanaro.dyndns.org>
Message-ID: <E1AK017-00076h-Lb@mail.python.org>

Skip Montanaro wrote:
> How about instead of only allowing specific options to be overridden
> on the command line we use a consistent syntax for overriding *any*
> option from the command line?  For example, to set the ["Storage",
> "persistent_storage_file"] we could use something like
> 
>     -o Storage:persistent_storage_file:~/.hammiedb
> 
> or
> 
>     --option=Storage:persistent_storage_file:~/.hammiedb

This sounds useful for those doing testing with various options, and I'm
all for it from that standpoint.  However, I'm not sure how useful it
would be for the average user.

> We would then deprecate any command line args used to twiddle options
> using any other syntax.  Use of those args would trigger a message to
> stderr like: 
> 
>     Deprecated form: "-d ~/hammie.db" found.
>     Use "-o Storage:persistent_storage_file:~/hammie.db" instead.

I don't know if it's good to go that far.  The new syntax is rather
cumbersome, especially if I'm typing the command manually.  Also, some
command line flags can set several related option values to the correct
combination (e.g. set both the database filename and type with one
flag), and the new syntax would require knowing the correct combination
and providing all the correct values.

> This could be extended further.  Should the user give an incomplete
> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about
> that section or variable could be emitted:

What about options that have no effect on the application being run?
Would it be possible to detect them and show help in that case also?
How would we present a list of useful options to the end user without
overwhelming them with rarely changed settings and gory internal
details?

-- 
Kenny Pitt


From popiel at wolfskeep.com  Wed Nov 12 13:53:30 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Wed Nov 12 13:53:35 2003
Subject: [spambayes-dev] proposal for more uniform option setting from the
	command line 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Wed,
	12 Nov 2003 12:05:21 CST."
	<16306.30305.187928.62359@montanaro.dyndns.org> 
References: <16306.30305.187928.62359@montanaro.dyndns.org> 
Message-ID: <20031112185330.D4ADD2DDA2@cashew.wolfskeep.com>

In message:  <16306.30305.187928.62359@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>How about instead of only allowing specific options to be overridden on the
>command line we use a consistent syntax for overriding *any* option from the
>command line?

+1

>The general syntax of an option setting command line arg would then be
>section:field:value.  The post-getopt.getopt() code might look something
>like:
>
>    from spambayes.Options import options
>
>    for opt, arg in opts:
>        ...
>        elif opt in ('-o', '--option'):
>            options.set(*arg.split(':'))
>        ...

Only problem here is that this particular phrasing makes it impossible
to set an option value with a colon in it.  Better would be to use
options.set(*arg.split(':', 2)).
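
The difference is easy to see on a value that itself contains a colon.  The sketch below is only an illustration: parse_option_arg is a hypothetical helper, and the Windows-style path is made up for the example.

```python
def parse_option_arg(arg):
    """Split 'section:field:value' at the first two colons only."""
    parts = arg.split(":", 2)
    if len(parts) != 3:
        raise ValueError("expected section:field:value, got %r" % arg)
    return tuple(parts)


if __name__ == "__main__":
    arg = "Storage:persistent_storage_file:C:/mail/hammie.db"
    # An unlimited split breaks the value into two pieces.
    print(arg.split(":"))
    # Limiting the split keeps the value intact.
    print(parse_option_arg(arg))
```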

>We would then deprecate any command line args used to twiddle options using
>any other syntax.  Use of those args would trigger a message to stderr like:
>
>    Deprecated form: "-d ~/hammie.db" found.
>    Use "-o Storage:persistent_storage_file:~/hammie.db" instead.

+1

>This could be extended further.  Should the user give an incomplete -o flag
>such as "-o Storage" or "-o Storage:spam_cache", help about that section or
>variable could be emitted:

I would tend to put this in a separate syntax, and have an incomplete
specification just emit an error message (possibly saying something like
'use --help=Storage for more information').

- Alex

From skip at pobox.com  Wed Nov 12 14:48:31 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 14:48:40 2003
Subject: [spambayes-dev] proposal for more uniform option setting from
	the     command line
In-Reply-To: <3FB278BB.2040402@videotron.ca>
References: <16306.30305.187928.62359@montanaro.dyndns.org>
	<3FB278BB.2040402@videotron.ca>
Message-ID: <16306.36495.340289.654317@montanaro.dyndns.org>


    Remi> But we have to check whether it would still be possible to have
    Remi> an option that will not be, or cannot be, included in the options
    Remi> file, like my patch adding -t to sb_mboxtrain.py.

Sure.  Command line args which are specific to an application and don't
involve modifications to the options database would still be fine.  I'm more
after the "'-d file' means file is a dbhash and '-D file' means file is a
pickle" sort of arg.  These can be dispensed with if the user can set the
appropriate option(s) from the command line in a more general fashion.

This sort of thing might also be useful for at least casual testing.  I have
this asciify_subject option I'm playing with.  I could compare the output of
these two commands:

    sb_filter.py ~/Mail/unsure \
    | egrep -i 'x-spambayes-classification'

    sb_filter.py -o Tokenizer:asciify_subject:True ~/Mail/unsure \
    | egrep -i 'x-spambayes-classification'

to see if it helps push some of my current unsures in the right direction.

Skip

From skip at pobox.com  Wed Nov 12 14:54:04 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 14:55:27 2003
Subject: [spambayes-dev] proposal for more uniform option setting from
	thecommand line
In-Reply-To: <20031112184659.240962764A7@orb.pobox.com>
References: <16306.30305.187928.62359@montanaro.dyndns.org>
	<20031112184659.240962764A7@orb.pobox.com>
Message-ID: <16306.36828.190258.68060@montanaro.dyndns.org>


    >> -o Storage:persistent_storage_file:~/.hammiedb
    >> 
    >> or
    >> 
    >> --option=Storage:persistent_storage_file:~/.hammiedb

    Kenny> This sounds useful for those doing testing with various options,
    Kenny> and I'm all for it from that standpoint.  However, I'm not sure
    Kenny> how useful it would be for the average user.

Correct.  However, the average user probably shouldn't be giving any command
line options (or very few) anyway, but should be twiddling bits in the
options file to make them persistent.

    >> Deprecated form: "-d ~/hammie.db" found.
    >> Use "-o Storage:persistent_storage_file:~/hammie.db" instead.

    Kenny> I don't know if it's good to go that far.  The new syntax is
    Kenny> rather cumbersome, especially if I'm typing the command manually.

I can buy that, though if you're using a modern shell, command recall can
mitigate most of that.  (I don't think DOS shells or vanilla /bin/sh qualify
as "modern shells".  I'm talking tcsh, bash, ksh, etc.)

    Kenny> Also, some command line flags can set several related option
    Kenny> values to the correct combination (e.g. set both the database
    Kenny> filename and type with one flag), and the new syntax would
    Kenny> require knowing the correct combination and providing all the
    Kenny> correct values.

I think that's more confusing than it ought to be.  Having -d and -D
simultaneously set two options seems 

    >> This could be extended further.  Should the user give an incomplete
    >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about
    >> that section or variable could be emitted:

    Kenny> What about options that have no effect on the application being
    Kenny> run?  

I hadn't considered that.  

    Kenny> Would it be possible to detect them and show help in that case
    Kenny> also?

I suppose so, but the application would then have to register all the
options it's interested in.  How would the application author know what all
the storage options were without diving into storage.py and friends?

    Kenny> How would we present a list of useful options to the end user
    Kenny> without overwhelming them with rarely changed settings and gory
    Kenny> internal details?

Experiment, I suppose.

It appears the majority of users will use the Outlook plugin for which this
doesn't apply.  I suspect I'm appealing more to the propeller heads among
us.

Skip


From skip at pobox.com  Wed Nov 12 14:59:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 14:59:54 2003
Subject: [spambayes-dev] proposal for more uniform option setting from the
	command line 
In-Reply-To: <20031112185330.D4ADD2DDA2@cashew.wolfskeep.com>
References: <16306.30305.187928.62359@montanaro.dyndns.org>
	<20031112185330.D4ADD2DDA2@cashew.wolfskeep.com>
Message-ID: <16306.37168.627459.610349@montanaro.dyndns.org>


    >> This could be extended further.  Should the user give an incomplete
    >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about
    >> that section or variable could be emitted:

    Alex> I would tend to put this in a separate syntax, and have an
    Alex> incomplete specification just emit an error message (possibly
    Alex> saying something like 'use --help=Storage for more information').

I thought about this.  I suppose you're right about the incomplete flags.
It doesn't give you a way to ask about all option file sections either.  I
think it's best to leave --help/-h alone (no args) and have a pair of
standard options used like

    --help-section=Storage
    --help-section=Storage:spam_cache
    --help-all-sections

with obvious semantics.  Or maybe you can glob things:

    --help-section=*            # problematic - * can be special to shells

or

    --help-section=all          # relies on special "section" "all"

Skip

From kennypitt at hotmail.com  Wed Nov 12 15:09:54 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov 12 15:13:54 2003
Subject: [spambayes-dev] proposal for more uniform option setting from
	thecommand line
In-Reply-To: <16306.36828.190258.68060@montanaro.dyndns.org>
Message-ID: <E1AK1N9-0002pb-PD@mail.python.org>

Skip Montanaro wrote:
> 
>     Kenny> Also, some command line flags can set several related option
>     Kenny> values to the correct combination (e.g. set both the database
>     Kenny> filename and type with one flag), and the new syntax would
>     Kenny> require knowing the correct combination and providing all the
>     Kenny> correct values.
> 
> I think that's more confusing than it ought to be.  Having -d and -D
> simultaneously set two options seems

Bad example <wink>.  Should have known from past experience that those
were the ones you're gunning for.

> 
>     >> This could be extended further.  Should the user give an incomplete
>     >> -o flag such as "-o Storage" or "-o Storage:spam_cache", help about
>     >> that section or variable could be emitted:
> 
>     Kenny> What about options that have no effect on the application being
>     Kenny> run?
> 
> I hadn't considered that.
> 
>     Kenny> Would it be possible to detect them and show help in that case
>     Kenny> also?
> 
> I suppose so, but the application would then have to register all the
> options it's interested in.  How would the application author know
> what all the storage options were without diving into storage.py and
> friends? 

Good point.  There are quite a few layers to most operations, and
digging up an exhaustive list of what is actually used for a particular
case would be extremely difficult.

> 
>     Kenny> How would we present a list of useful options to the end user
>     Kenny> without overwhelming them with rarely changed settings and gory
>     Kenny> internal details?
> 
> Experiment, I suppose.
> 
> It appears the majority of users will use the Outlook plugin for
> which this doesn't apply.  I suspect I'm appealing more to the
> propeller heads among us.

If that is the intended audience then all of my comments above are
pretty much moot.  As I said initially, I'm all for it from the
standpoint of testing, and the propeller heads don't need no stinkin'
help, right? <wink>

So, my final vote: +1

-- 
Kenny Pitt


From mhammond at skippinet.com.au  Wed Nov 12 16:39:16 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Nov 12 16:38:58 2003
Subject: [spambayes-dev] (no subject)
In-Reply-To: <E0A7B1E85CB75C44A0D087CD13420E386877@intranet.tech2020.org>
Message-ID: <02f001c3a965$6c8432e0$0500a8c0@eden>

You have not given us any indication of what problem you are seeing.  Please
see the "Troubleshooting Guide" that comes with SpamBayes, and create a new
bug, being sure to upload the log file for your session.

Regards,

Mark

 -----Original Message-----
From: spambayes-dev-bounces@python.org
[mailto:spambayes-dev-bounces@python.org]On Behalf Of Kevin Patterson
Sent: Thursday, 13 November 2003 4:28 AM
To: spambayes-dev@python.org
Subject: [spambayes-dev] (no subject)


  Will spambayes ever work in a Terminal server/Citrix environment? It works
fine when only one instance of outlook is running. Do you know of anything I
could do config wise on spambayes to fix this? Or on the server side. Keep
up the great work.


  Thank you!
  Kevin Patterson
From skip at pobox.com  Wed Nov 12 17:13:34 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed Nov 12 17:13:44 2003
Subject: [spambayes-dev] Re: proposal for more uniform option setting from
	the command line 
Message-ID: <16306.45198.286429.808432@montanaro.dyndns.org>


    me> How about instead of only allowing specific options to be overridden
    me> on the command line we use a consistent syntax for overriding *any*
    me> option from the command line?

I took the first step in this direction, adding a set_from_cmdline method to
OptionsClass.OptionsClass and modifying sb_filter.py to accept -o/--option
flags.  Seems to work for me.  I decided to push as much processing into
OptionsClass.py as possible (including error message display) to make it as
easy as possible to process these options from other apps.  In sb_filter.py,
I had to add 'o:' and 'option=' to the appropriate getopt.getopt() arg and
add

        elif opt in ('-o', '--option'):
            Options.options.set_from_cmdline(arg, sys.stderr)

to the post-getopt() call processing.  If you want to handle error recovery
yourself, simply omit the sys.stderr arg from the call and wrap the call in
the appropriate try/except incantation.  Note that the way I have it set up,
it displays the error but continues processing.  I don't know if that's
necessarily a good way to do it, but under the assumption that this is
mostly for propeller head use, it seems okay to let processing continue and
thus be able to potentially catch multiple errors.
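
To make the two error-handling modes concrete, here is a self-contained sketch of the behaviour described above.  It is not the code that was checked in: a plain dict stands in for the options object, and both functions are hypothetical stand-ins that only mimic the described interface.

```python
import getopt


def set_from_cmdline(options, arg, stream=None):
    """Set options[section, field] from a 'section:field:value' string.

    Stand-in only: with a stream, a malformed argument is reported there
    and False is returned; without one, ValueError is raised so the
    caller can handle recovery itself.
    """
    parts = arg.split(":", 2)
    if len(parts) != 3:
        msg = "invalid option spec %r, expected section:field:value" % arg
        if stream is None:
            raise ValueError(msg)
        stream.write(msg + "\n")
        return False
    section, field, value = parts
    options[section, field] = value
    return True


def process_args(argv, options, stream):
    """Handle -o/--option flags and return the remaining arguments."""
    opts, args = getopt.getopt(argv, "o:", ["option="])
    for opt, arg in opts:
        if opt in ("-o", "--option"):
            # Report and keep going, so several bad flags show up in one run.
            set_from_cmdline(options, arg, stream)
    return args
```

Continuing after an error suits interactive experimentation; a mail-delivery pipeline might prefer the raising form so that a mistyped flag stops the run.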

Skip

From tp at diffenbach.org  Thu Nov 13 23:47:06 2003
From: tp at diffenbach.org (TP Diffenbach)
Date: Thu Nov 13 23:43:40 2003
Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin
Message-ID: <DBECJAAFMJPBJFMMBHFAKELPIJAA.tp@diffenbach.org>

I'd like to extend the Spambayes Outlook plugin a bit.

In the Spambayes Outlook Plugin, in which module are the header lines
(Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted?

In which module is the spam percentage score added to the Outlook mail item?

(Why I'm doing this: the headers aren't accessible in Outlook except via
View|Options, or programmatically. I want an Outlook form that automatically
displays the headers, but doing it in Visual Basic Script is problematic
because of Outlook's security policies. Other work-arounds (using Redemption
for Outlook or writing my own hook) are possible, but I'd prefer just to
leverage Spambayes. So I'd like to add the headers as a user-defined field
(ugh, duplication) and design a form that merely accesses that.)

Thanks,
Tom


From tim.one at comcast.net  Fri Nov 14 01:14:57 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Nov 14 01:14:59 2003
Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin
In-Reply-To: <DBECJAAFMJPBJFMMBHFAKELPIJAA.tp@diffenbach.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCAENMHBAB.tim.one@comcast.net>

[TP Diffenbach]
> I'd like to extend the Spambayes Outlook plugin a bit.
>
> In the Spambayes Outlook Plugin, in which module are the header lines
> (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted?

The plugin doesn't use the CDO API -- it's too problematic across Outlook
variations (e.g., it appears it's not even available in most IMO
configurations, unless the user manually installs CDO from their Office CD).
It uses low-level MAPI instead.  Search the source for the MAPI
PR_TRANSPORT_MESSAGE_HEADERS_A property and you'll soon find it.  Be warned
that raw MAPI can be extremely painful to work with (although it's a hell of
a lot easier to work with from Python than from C!); OTOH, it's much faster
than CDO too, and that's important to the plugin for high-volume users.

> In which module is the spam percentage score added to the Outlook
> mail item? 

The same module you'll find above, in the SetField method of class
MAPIMsgStoreMsg (I'm resisting becoming a remote search button in your text
editor <wink>).

> (Why I'm doing this: the headers aren't accessible in Outlook except
> via View|Options, or programmatically. I want an Outlook form that
> automatically displays the headers, but doing it in Visual Basic
> Script is problematic because of Outlook's security policies.

Using MAPI directly appears to sidestep most Outlook whining.  At least so
far.
From kennypitt at hotmail.com  Fri Nov 14 09:08:29 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Nov 14 09:08:55 2003
Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin
In-Reply-To: <DBECJAAFMJPBJFMMBHFAKELPIJAA.tp@diffenbach.org>
Message-ID: <E1AKed0-0005tg-RY@mail.python.org>

TP Diffenbach wrote:
> I'd like to extend the Spambayes Outlook plugin a bit.
> 
> In the Spambayes Outlook Plugin, in which module are the header lines
> (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted?
> 
> In which module is the spam percentage score added to the Outlook
> mail item? 
> 
> (Why I'm doing this: the headers aren't accessible in Outlook except
> via View|Options, or programmatically. I want an Outlook form that
> automatically displays the headers, but doing it in Visual Basic
> Script is problematic because of Outlook's security policies. Other
> work-arounds (using Redemption for Outlook or writing my own hook)
> are possible, but I'd prefer just to leverage Spambayes. So I'd like
> to add the headers as a user-defined field (ugh, duplication) and
> design a form that merely accesses that.)

If you do "Show spam clues for current message" and scroll past the
Significant Tokens section to the Message Stream section, it shows the
raw content of the e-mail including all headers.  Is that the kind of
information you're trying to get access to?

-- 
Kenny Pitt


From tim.one at comcast.net  Fri Nov 14 20:02:45 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri Nov 14 20:02:53 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <E1AKkQf-0003k0-KG@mail.python.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>

Jeremy (Hylton) sent me some work-related email today, the output from
running a statistics-gathering program over a ZODB database.  We both
wondered why I hadn't gotten the message, and I eventually discovered that
it was actually in my Spam folder, and at "the wrong end" to boot (the view
on my Spam folder is sorted by spam score).  It had an internal ham score of
exactly 0 and an internal spam score of exactly 1.

So I trained on it as ham, and the next time he sent a similar report,
things were reversed:  the new one got ham=1 and spam=0.

So what unforgivable sin had he committed in the first email?  Heh.  It had
virtually no English text, but lots, and lots, and lots of different
integers (about 100KB worth).  There were about a half dozen strong ham
clues that it had come from him, but about 140 spam clues from the variety
of little integers, most hapaxes that had appeared in one training spam
each.

I view that mostly as a danger of mistake-based training:  as I've mentioned
before, mistake-based training tends toward being hapax-driven, and hapaxes
are brittle.  There's nothing *inherently* spammy about, say, 16384, and
because that's a power of 2 and I'm a computer geek, that *would* have
appeared in several training ham if I hadn't fallen into mistake-based
training (yes, 16384 had indeed appeared in one training spam).
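
A rough sketch of why a single sighting carries so much weight follows.  It uses a Robinson-style smoothed estimate; the formula, the constants (strength 0.45, prior 0.5) and the corpus sizes are assumptions chosen for illustration, not figures taken from this thread.

```python
def smoothed_spam_prob(spamcount, hamcount, nspam, nham,
                       strength=0.45, unknown_prob=0.5):
    """Smoothed spam probability for one token (illustrative only)."""
    n = spamcount + hamcount
    if n == 0:
        return unknown_prob  # never seen: fall back to the prior
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    raw = spamratio / (spamratio + hamratio)
    # Pull the raw estimate toward the prior; the pull weakens as n grows.
    return (strength * unknown_prob + n * raw) / (strength + n)


if __name__ == "__main__":
    # A token seen in exactly one training spam and no ham.
    print(smoothed_spam_prob(1, 0, nspam=100, nham=100))
    # The same token if it had also appeared in three training ham.
    print(smoothed_spam_prob(1, 3, nspam=100, nham=100))
```

Under these assumptions one spam sighting already yields a fairly strong clue, and a handful of ham sightings would have been enough to flip it.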

So it's a cute one.  I have to note that it argues in favor of a whitelist
gimmick too -- although that wouldn't have done me any good since I never
would have anticipated that anything Jeremy sent would get scored as spam.
Even if I had anticipated it, I don't remember all the email accounts he
uses, and probably wouldn't have thought to whitelist the account he used to
send this one.

So if any spammers are reading this, here's how to get by my mistake-based
filter now:  add scads of random little integers to your spam.  If the rest
of your spam is brief enough, it will get a spam score of 0, because now my
database has even more little integer hapaxes in the *ham* direction.

amusedly y'rs  - tim


From rob at hooft.net  Sat Nov 15 04:27:57 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 04:31:03 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
Message-ID: <3FB5F19D.5070506@hooft.net>

Tim Peters wrote:

> I view that mostly as a danger of mistake-based training:  as I've mentioned
> before, mistake-based training tends toward being hapax-driven, and hapaxes
> are brittle.  There's nothing *inherently* spammy about, say, 16384, and
> because that's a power of 2 and I'm a computer geek, that *would* have
> appeared in several training ham if I hadn't fallen into mistake-based
> training (yes, 16384 had indeed appeared in one training spam).

I am now training on all mistakes and unsures, plus all ham scoring more 
than 0.02 and all spam scoring less than 0.99. Total trained messages is 
~250 both ways, and 97+% of spam scores 0.99+, leaving only 1-2 new spams 
per day, less than 1 unsure per day, and ~1 new ham per day to train on.

I am really pleased by the performance of this training schedule. It is 
not as brittle as mistake-based training, but it still ignores the 
obvious repeating things like CVS log messages of which I receive a few 
dozen per day. It keeps the database reasonably small, but not really 
hapax driven.
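
Read literally, the schedule above reduces to one test per hand-sorted message. A minimal sketch, assuming the ham cutoff sits above 0.02 and the spam cutoff below 0.99, so that every mistake and unsure already falls inside the two thresholds (the function name is hypothetical):

```python
HAM_TRAIN_ABOVE = 0.02   # ham scoring more than this gets trained
SPAM_TRAIN_BELOW = 0.99  # spam scoring less than this gets trained

def should_train(score, is_spam):
    """Decide whether a message of known class is worth training on.

    score is the classifier's spam score in [0, 1]; is_spam is the
    true class as decided by the person sorting the mail.
    """
    if is_spam:
        return score < SPAM_TRAIN_BELOW
    return score > HAM_TRAIN_ABOVE

for score, is_spam in [(0.00, False), (0.05, False), (0.93, True), (1.00, True)]:
    print(score, "spam" if is_spam else "ham", should_train(score, is_spam))
```

Obvious repeats such as CVS log messages score 0.00 as ham and are skipped, which is what keeps the database small.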

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tp at diffenbach.org  Sat Nov 15 06:33:07 2003
From: tp at diffenbach.org (TP Diffenbach)
Date: Sat Nov 15 06:29:27 2003
Subject: [spambayes-dev] test -- can I use an arbitrary from address?
Message-ID: <DBECJAAFMJPBJFMMBHFACENEIJAA.tp@diffenbach.org>

Can I mail to the list from my real address, or only from the address I
signed up with?
From richie at entrian.com  Sat Nov 15 11:06:52 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Nov 15 11:07:22 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
References: <E1AKkQf-0003k0-KG@mail.python.org>
	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
Message-ID: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com>


[Tim]
> There were about a half dozen strong ham
> clues that it had come from him, but about 140 spam clues from the variety
> of little integers, most hapaxes that had appeared in one training spam
> each.

Perhaps it's an argument for not classifying using hapaxes?  Wait for any
given clue to appear in more than one message before it becomes valid for
classification.  Has anyone tried this?  (And not just for SpamBayes -
Bill?)

It could well have helped with the similar spectacular false positive that
I reported a few weeks ago - that was from a colleague as well, and
consisted of a list of US state codes and state names.  Many of those were
spam hapaxes.

-- 
Richie Hindle
richie@entrian.com


From rob at hooft.net  Sat Nov 15 11:26:24 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 11:29:31 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com>
References: <E1AKkQf-0003k0-KG@mail.python.org>	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
	<6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com>
Message-ID: <3FB653B0.3080507@hooft.net>

Richie Hindle wrote:
> [Tim]
> 
>>There were about a half dozen strong ham
>>clues that it had come from him, but about 140 spam clues from the variety
>>of little integers, most hapaxes that had appeared in one training spam
>>each.
> 
> 
> Perhaps it's an argument for not classifying using hapaxes?  Wait for any
> given clue to appear in more than one message before it becomes valid for
> classification.  Has anyone tried this?  (And not just for SpamBayes -
> Bill?)

I have not tried it, but I am quite sure it would perform worse!

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie at entrian.com  Sat Nov 15 11:30:22 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Nov 15 11:30:44 2003
Subject: [spambayes-dev] Whitelists (was: A spectacular false positive)
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
References: <E1AKkQf-0003k0-KG@mail.python.org>
	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
Message-ID: <p0kcrvket75ggl1uk75i2boq4nnjql3b8m@4ax.com>


[Tim]
> I have to note that it argues in favor of a whitelist
> gimmick too -- although that wouldn't have done me any good since I never
> would have anticipated that anything Jeremy sent would get scored as spam.
> Even if I had anticipated it, I don't remember all the email accounts he
> uses, and probably wouldn't have thought to whitelist the account he used to
> send this one.

I've been thinking about whitelists, and the more I think about them the
more I'm in favour of them.  We can do things with a built-in SpamBayes
whitelist that you just can't do with standard email client filters -
things that I think would address your objections, Tim.

All these rules would be optional, and possibly behind another rule that
says "An address must qualify N times before this happens":

 o Whenever a message is trained as ham, add the From address to the
   whitelist.

 o Whenever a message is trained as spam, remove the From address from the
   whitelist.

 o Whenever a message is received from a whitelisted address, and scores
   as solid (for some value of 'solid') ham, auto-train the message as
   ham.  You'd use this for personal acquaintances only, and not for
   mailing lists or organisations (amazon.com, ebay.com, etc.)

Add a couple of other features:

 o Give it an mbox file (or Outlook folder, etc.) and it adds all the
   addresses to the whitelist.

 o Support wildcard patterns in the whitelist, eg. *@myemployer.com

and I think you have something that would be mostly automated.  You
wouldn't need to dig out all your acquaintances' addresses and add them by
hand, because the act of training would catch many of them.  The ability
to add all the addresses in a folder would catch most of the rest (for
anyone that keeps a good deal of old email around, which I suspect is most
people, especially in a working environment).

The upshot: I still don't trust SpamBayes to delete my Spam without
looking at it.  This feature would mean I *would* trust it, because I could
be sure that when one of my friends or colleagues sends me a spammy
message (cf. the list of US state names I received a while ago) it doesn't
get classified as spam.  I'm prepared to take the risk of forged From
addresses because the time spent weeding out those will be far less than
the time I currently take glancing down my entire list of ~150 spams per
day.  I'm prepared to take the risk that the first ever email a friend
sends me gets deleted as spam (very unlikely).  I keep all my old mail,
sorted into ham and spam, so generating my whitelist will be easy (and
even if you don't keep all your old mail, generating a training-based
whitelist for frequent correspondents, or adding wildcard patterns for all
work addresses, would be easy).

Other features we'd need:

 o Manual editing in web interface / an Outlook dialog - just a
   newline-separated list of names or wildcard patterns.

 o Import / export of whitelists as plain text files (choice of merge or
   replace on import)

Classification would just override whatever the classifier said, adding
"X-Spambayes-Classification: ham".  If you ask for evidence, you get
"X-Spambayes-Evidence: Whitelist rule '<rule>' matches 'From: <address>'".

Questions:

 o How to get the actual address from a To/From header - the address would
   need separating from the real name and any quoting.

 o Which headers to use?  Probably just From to keep it simple; maybe
   Reply-To as well.

 o Should there be a blacklist as well, for symmetry?  Probably not - a
   whitelist is far more useful.  A blacklist would only be useful if you
   were getting persistent false negatives from the same address despite
   repeated training - if that's happening then something's broken 8-)

 o Where to store the whitelist - it could get big, so bayescustomize.ini
   might not be the place.  Ongoing problems with DBRunRecovery errors put
   me off putting it in the clues database.
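
A minimal sketch of the core of this proposal: train-as-ham adds the From address, train-as-spam removes it, and wildcard patterns are matched against incoming mail. The Whitelist class and its method names are hypothetical, storage is just an in-memory set, and the "qualify N times" rule is left out. Address extraction uses the standard library's parseaddr.

```python
from email.utils import parseaddr
from fnmatch import fnmatchcase

class Whitelist:
    """In-memory whitelist of addresses and wildcard patterns."""

    def __init__(self, patterns=()):
        self.patterns = set(p.lower() for p in patterns)

    @staticmethod
    def _address(from_header):
        # Strip the real name and quoting, keep only the bare address.
        return parseaddr(from_header)[1].lower()

    def on_train_ham(self, from_header):
        addr = self._address(from_header)
        if addr:
            self.patterns.add(addr)

    def on_train_spam(self, from_header):
        self.patterns.discard(self._address(from_header))

    def matching_rule(self, from_header):
        """Return the first rule matching the From address, or None."""
        addr = self._address(from_header)
        if not addr:
            return None
        for pattern in sorted(self.patterns):
            if fnmatchcase(addr, pattern):
                return pattern
        return None

wl = Whitelist(["*@myemployer.com"])
wl.on_train_ham("Richie Hindle <richie@entrian.com>")
rule = wl.matching_rule('"A Colleague" <someone@MyEmployer.com>')
if rule is not None:
    print("X-Spambayes-Classification: ham")
    print("X-Spambayes-Evidence: Whitelist rule '%s' matches" % rule)
```

Import and export as plain text would then be a matter of reading or writing one pattern per line.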

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Sat Nov 15 11:37:37 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Nov 15 11:37:58 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB653B0.3080507@hooft.net>
References: <E1AKkQf-0003k0-KG@mail.python.org>	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
	<6ijcrvc1tmqk2effqafko1veiup2seg06h@4ax.com>
	<3FB653B0.3080507@hooft.net>
Message-ID: <vdlcrvorl6uhsin2oml600t170bpj19joh@4ax.com>


[Richie]
> Perhaps it's an argument for not classifying using hapaxes?  Wait for any
> given clue to appear in more than one message before it becomes valid for
> classification.  Has anyone tried this?  (And not just for SpamBayes -
> Bill?)

[Rob]
> I have not tried it, but I am quite sure it would perform worse!

8-)

I'm sure it would perform worse in the short term, but as the size of the
training set increased, I think the performance would pretty much catch up
while the chance of false positives would remain significantly smaller.
(I speak with the conviction of someone with no evidence and negligible
mathematical ability...)

-- 
Richie Hindle
richie@entrian.com


From richie at entrian.com  Sat Nov 15 11:57:27 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Nov 15 11:57:58 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
References: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
Message-ID: <qilcrvkkv2cmod9g8vq56c68i48blahbjh@4ax.com>

Pablo,

[Moving this thread to spambayes-dev@python.org]

> All right, third try, then I'll quit :-)

Our apologies - we don't mean to ignore people who offer to help - far
from it!

> I just want to ask if there's any interest
> at all in having Spambayes available in Spanish or not.

We'd love to have international versions, though there are a lot of issues
involved.  I don't mean to put you off the idea, or to imply that we're
not prepared to put effort into this, but these things need taking into
account...

Many (most?) of the English strings in SpamBayes are mixed in with the
code.  Taking the source code as it is and translating the strings into
Spanish would be unmaintainable - we'd have two entirely separate versions
of the code, and any edits would have to be applied to both.  So the first
job to do would be to pull out all those hard-coded strings into a
language file.  That's not a huge job, and one that any computer-literate
person could probably do 95% of, even if they weren't a programmer.  Still
more effort than simply translating a collection of English phrases into
Spanish, though.

More English text appears in HTML pages.  Some of these are mostly text,
like the Outlook help pages, and maintaining two versions would not be too
bad (though any stylistic changes might have to be applied twice).  Some,
however, are quite strictly defined in ways that make them
machine-readable - the web interface (as used by the POP3 proxy and IMAP
filter) defines its user interface in little pieces of HTML that are
joined together by the program at runtime.  Translating that stuff would
need more technical knowledge, and probably a significant amount of
re-engineering to make it maintainable.

A lot of the Outlook interface (at least the dialogs) is defined in Windows
resource files, which require Visual Studio to edit them (or there may be
third party programs to do it - any free ones that people know of?).  You can
edit them by hand but it's a huge pain.  They also contain a lot of
information that's not just language strings, meaning there's a lot of
duplication between the different language versions that causes
maintenance headaches.  I'd be interested to hear from people out there in
the world with solutions to this problem!

Lastly a social issue.  You could become the support department for
hundreds of Spanish SpamBayes users!

So.  Do you still want to do this?  8-)  And are there SpamBayes
developers (or other Python-literate SpamBayes users - Pablo, you don't
say whether you're a Python programmer?) who have the time to make the
necessary software changes for this to work?

There may be other, cheaper, alternatives to doing the work that I haven't
considered (for instance, translating all the strings in place and then
maintaining the edits as a set of patches that get applied to each release
- are there Open Source projects that work that way?)

I may be painting an unnecessarily bleak picture here, because I'm no
expert on i18n.  I'd love for someone to come in and say "No, you've got
it all wrong, it's easy, look!"  Anyone?

-- 
Richie Hindle
richie@entrian.com


From skip at pobox.com  Sat Nov 15 13:56:19 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 15 13:56:43 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB5F19D.5070506@hooft.net>
References: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
	<3FB5F19D.5070506@hooft.net>
Message-ID: <16310.30419.316274.137644@montanaro.dyndns.org>


    Rob> I am now training on all mistakes and unsures, plus all ham scoring
    Rob> more than 0.02 and all spam scoring less than 0.99. 

I used to use that sort of scheme as well, but it gets tedious after a while
and just grows my training database.  The problem was that most ham scored
0.0 and after concluding a message was ham I let procmail toss it in the
proper mailbox.  This meant that the few hams which didn't score 0.0 were
scattered all over the place, so I had to constantly be on the lookout for
them.  I suppose I could have added a copy rule to my procmailrc file to
save all non-zero ham, but that would have just been another mailbox to look
at.  I already have unsure, lospam and hispam.  That would add hiham.

Also, when you get two of essentially the same spam, do you train on both?
I'm trying to be careful now to minimize that sort of duplication.  I have
so many email addresses feeding into skip@mojam.com that I generally get
multiples of everything.

Finally, I also gave up on training on low-scoring spams.  If it's spam and
not a mistake, it's good enough for me.

At the moment I have a training database of 133 spams and 111 hams.

Skip

From matt at mondoinfo.com  Sat Nov 15 14:20:16 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat Nov 15 14:20:21 2003
Subject: [spambayes-dev] Whitelists (was: A spectacular false positive)
In-Reply-To: <p0kcrvket75ggl1uk75i2boq4nnjql3b8m@4ax.com>
References: <E1AKkQf-0003k0-KG@mail.python.org>
	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
	<p0kcrvket75ggl1uk75i2boq4nnjql3b8m@4ax.com>
Message-ID: <1068923279.1.1879@mint-julep.mondoinfo.com>

> Questions:
>
> o How to get the actual address from a To/From header - the address
> would need separating from the real name and any quoting.

That one's pretty easy:

>>> import email
>>> import email.Utils
>>> f=open("test")
>>> m=email.message_from_file(f)
>>> email.Utils.parseaddr(m.get("From",""))[1]
'richie@entrian.com'

Regards,
Matt


From listas at loquecreas.com  Sat Nov 15 15:02:13 2003
From: listas at loquecreas.com (Pablo Vieira)
Date: Sat Nov 15 15:02:43 2003
Subject: [spambayes-dev] RE: [Spambayes] RV: I18N and L10N
In-Reply-To: <qilcrvkkv2cmod9g8vq56c68i48blahbjh@4ax.com>
Message-ID: <E1AL6cx-00086e-Rp@hostalia03.hostalia.com>

Wow, looks challenging, but very interesting! I'm not a Python programmer
but I'm a C programmer and learning Java right now. I'll join you guys at
the developers list. I might have some suggestions.

Thanks for answering, finally! ;-)

Pablo



From tim.one at comcast.net  Sat Nov 15 16:42:47 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 16:42:43 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB5F19D.5070506@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGOHCAB.tim.one@comcast.net>

[Rob Hooft]
> I am now training on all mistakes and unsures, plus all ham scoring
> more than 0.02 and all spam scoring less than 0.99.

Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to match?
Then you can describe the same thing as just "mistakes and unsures" (which
is what I mean by "mistake-based training").

> Total trained messages is ~250 both ways, and 97+% of spam scores 0.99+
> leaving only 1-2 new spams per day, less than 1 unsure per day, and
> ~1 new ham per day to train on.
>
> I am really pleased by the performance of this training schedule. It
> is not as brittle as mistake-based training, but it still ignores the
> obvious repeating things like CVS log messages of which I receive a
> few dozen per day. It keeps the database reasonably small, but not
> really hapax driven.

Sigh -- we need solid research on training disciplines that work great in
real-life use, respecting that anything requiring human input will barely
get used except by geeks who never tire of watching the training process.
We're getting a lot of anecdotal evidence (which ain't the same thing) about
different schemes, and I'm afraid no two of the developers train in the same
way anymore.  It's a good thing the algorithm appears to have turned out to
be robust against almost any training insanity short of what Outlook users
can stumble into <0.9 wink>.

Oh well.  In the meantime, I think your msg would be a great addition to
Richie's spambayes wiki.  I know *you* know where that is, because a
coworker found your

    http://www.entrian.com/sbwiki/RobsSetup

there yesterday, and it was exactly what he needed to set up our code with
his maildir-based system.


From tim.one at comcast.net  Sat Nov 15 17:02:58 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 17:02:53 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <vdlcrvorl6uhsin2oml600t170bpj19joh@4ax.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>

[Richie Hindle]
>>> Perhaps it's an argument for not classifying using hapaxes?  Wait for
>>> any given clue to appear in more than one message before it becomes
>>> valid for classification.  Has anyone tried this?  (And not just for
>>> SpamBayes - Bill?)

[Rob Hooft]
>> I have not tried it, but I am quite sure it would perform worse!

[Richie]
> 8-)
>
> I'm sure it would perform worse in the short term, but as the size of
> the training set increased, I think the performance would pretty much
> catch up while the chance of false positives would remain
> significantly smaller. (I speak with the conviction of someone with
> no evidence and negligible mathematical ability...)

Graham's original scheme ignored tokens that hadn't appeared at least 5
times in training data.  Some of the very earliest experiments played with
that, moving the cutoff both higher and lower.  The evidence was very clear
(not like the noise-level results most recent experiments have shown -- this
was "0 lost 1 tied 9 won" territory) that a cutoff of 0 worked best.
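
For concreteness, a sketch of what such a cutoff means when picking clues for scoring. The count dictionaries and the function name are invented for this example and are not the real database layout:

```python
def usable_tokens(message_tokens, spam_counts, ham_counts, min_count=0):
    """Return the message's tokens that are allowed to act as clues.

    A token qualifies if it appeared in training at all and was seen
    at least min_count times across ham and spam combined.
    """
    usable = []
    for token in set(message_tokens):
        seen = spam_counts.get(token, 0) + ham_counts.get(token, 0)
        if seen > 0 and seen >= min_count:
            usable.append(token)
    return sorted(usable)

spam_counts = {"16384": 1, "viagra": 7}
ham_counts = {"zodb": 3}
tokens = ["16384", "viagra", "zodb", "never-seen"]

print(usable_tokens(tokens, spam_counts, ham_counts, min_count=0))  # hapaxes count
print(usable_tokens(tokens, spam_counts, ham_counts, min_count=2))  # no hapaxes
print(usable_tokens(tokens, spam_counts, ham_counts, min_count=5))  # Graham-style
```

With min_count=2 (Richie's suggestion) the "16384" hapax drops out; with min_count=5 only the heavily trained token survives.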

Part of the "reason" is surely that *every* token first *enters* the
database as a hapax.  When new kinds of fuzzy ham and spam appear, one
example often introduces enough hapaxes so that the next instance of the
same kind of thing is nailed to the correct category just from scoring the
hapaxes in it.

I noticed this dramatically during the last major round of worm spew, where
I was getting about 1,000 worm-related turds each day.  Like Skip suggested
recently, I only trained on one at a time, and then rescored the morning's
unsures.  Training on 6 total examples turned out to be enough that I never
had to train on another -- and "that worked" almost purely by capturing
different hapaxes unique to about 6 different variations of the worm spew I
was getting.

So hapaxes are (I believe) really the heart of what lets lazy, minimal
mistake-based training work as well as it does.  It will always be brittle,
though.

A scheme I would like to try can't be tried easily anymore because we
removed some of the info it needs from our database:  ignore hapaxes that
haven't been *used* in scoring over the last (say) week.  Spam especially
seems to come in spurts, where I might get 100 copies in a few days of a
spam containing "16384".  That hapax is very valuable in nailing minor
variations of that spam until that spam campaign ends; but after that point,
I probably never use it to score a spam again, yet it stays in the database
forever.  If it stays there long enough, Jeremy is eventually going to use
it too <wink>.

Especially since more & more of us are inclining toward using tiny databases
(compared to what we used to do), making space for a "last used" timestamp
may not be nearly as scary as it used to be.
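
A sketch of how the "last used" idea might look, assuming each token record grows a timestamp field. The record layout, the one-week window, and the function names are all hypothetical, and the timestamp would also need setting when a token is first trained so that a fresh hapax gets its week:

```python
from dataclasses import dataclass

WEEK = 7 * 24 * 3600  # seconds

@dataclass
class TokenRecord:
    spam_count: int = 0
    ham_count: int = 0
    last_used: float = 0.0  # time of training or of last use in scoring

def is_scorable(record, now, max_idle=WEEK):
    """Hapaxes expire after max_idle seconds without use; others never do."""
    is_hapax = record.spam_count + record.ham_count == 1
    return (not is_hapax) or (now - record.last_used <= max_idle)

def lookup_for_scoring(db, token, now):
    """Return the record if it may be used as a clue, refreshing its stamp."""
    record = db.get(token)
    if record is None or not is_scorable(record, now):
        return None
    record.last_used = now
    return record

now = 30 * 24 * 3600.0
db = {
    "16384": TokenRecord(spam_count=1, last_used=now - 2 * WEEK),   # stale hapax
    "zodb": TokenRecord(ham_count=1, last_used=now - 3600),         # fresh hapax
    "viagra": TokenRecord(spam_count=7, last_used=now - 2 * WEEK),  # not a hapax
}
for token in ("16384", "zodb", "viagra"):
    print(token, lookup_for_scoring(db, token, now) is not None)
```

While a spam campaign is running, each scored copy refreshes the stamp and keeps its hapaxes alive; once the campaign ends they go quiet and stop counting.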


From tp at diffenbach.org  Sat Nov 15 17:27:10 2003
From: tp at diffenbach.org (TP Diffenbach)
Date: Sat Nov 15 17:23:31 2003
Subject: FW: [spambayes-dev] Code locations in Spambayes Outlook plugin
Message-ID: <DBECJAAFMJPBJFMMBHFAGEOHIJAA.tp@diffenbach.org>


Tim,

thanks for your help. Knowing what to grep on made this a one line code
change, and running "python addin.py" painlessly installed it.

Thanks too to Kenny, for your response.

A bit of mucking about with the Outlook forms (and ignoring an Outlook
popped-up suggestion after a bit of Googling), made it work, and now I can
see the headers without having to delve into Outlook's menus, and I can do
it without code in the form, which disables Outlook's auto-preview (so much
of using Outlook seems to involve working around stupid design decisions in
Outlook).

BTW, I'm signed up to the spambayes-dev list under a different email than I
use to post to the list; will this cause any problems?

Thanks,
Tom

-----Original Message-----

[TP Diffenbach]
> I'd like to extend the Spambayes Outlook plugin a bit.
>
> In the Spambayes Outlook Plugin, in which module are the header lines
> (Outlook lingo: CdoPR_TRANSPORT_MESSAGE_HEADERS) extracted?

The plugin doesn't use the CDO API -- it's too problematic across Outlook
variations (e.g., it appears it's not even available in most IMO
configurations, unless the user manually installs CDO from their Office CD).
It uses low-level MAPI instead.  Search the source for the MAPI
PR_TRANSPORT_MESSAGE_HEADERS_A property and you'll soon find it.  Be warned
that raw MAPI can be extremely painful to work with (although it's a hell of
a lot easier to work with from Python than from C!); OTOH, it's much faster
than CDO too, and that's important to the plugin for high-volume users.

> In which module is the spam percentage score added to the Outlook
> mail item? 

The same module you'll find above, in the SetField method of class
MAPIMsgStoreMsg (I'm resisting becoming a remote search button in your text
editor <wink>).

From rob at hooft.net  Sat Nov 15 18:09:13 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 18:12:18 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <16310.30419.316274.137644@montanaro.dyndns.org>
References: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>	<3FB5F19D.5070506@hooft.net>
	<16310.30419.316274.137644@montanaro.dyndns.org>
Message-ID: <3FB6B219.6090706@hooft.net>

Skip Montanaro wrote:
>     Rob> I am now training on all mistakes and unsures, plus all ham scoring
>     Rob> more than 0.02 and all spam scoring less than 0.99. 
> 
> I used to use that sort of scheme as well, but it gets tedious after a while
> and just grows my training database. 
[...]
> Also, when you get two of essentially the same spam, do you train on both?
> I'm trying to be careful now to minimize that sort of duplication.  I have
> so many email addresses feeding into skip@mojam.com that I generally get
> multiples of everything.

I do not get a lot of true duplicates, definitely not in the non-obvious 
spam.

This is my .procmailrc; it indeed has the copy-rule you mention.


LOGFILE=/home/h/hooft/procmail.log
:0 fw:hamlock
| /home/h/hooft/bin/sb_filter.py

# Messages that are so obviously spam that we should not train on them
:0
* ^X-SpamBayes-Classification: spam; 1.00
.ztrain.obvious-spam/

# Messages that are spam but we might want to train on them
:0
* ^X-SpamBayes-Classification: spam
.ztrain.spam/

# Unsure messages must be copied to the unsure folder for training
:0 c
* ^X-SpamBayes-Classification: unsure
.ztrain.unsure/

# Ham scoring 0.02 through 0.19 is eligible for training as well
:0 c
* ^X-SpamBayes-Classification: ham; 0.0[2-9]
.ztrain.ham/

:0 c
* ^X-SpamBayes-Classification: ham; 0.1[0-9]
.ztrain.ham/

##
##
## Split into folders
##
##
:0
* ^List-Id:.*python-announce-list
.python.Announce/

## Etc.


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob at hooft.net  Sat Nov 15 18:15:05 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 18:18:08 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGOHCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCIEGOHCAB.tim.one@comcast.net>
Message-ID: <3FB6B379.3070304@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
> 
>>I am now training on all mistakes and unsures, plus all ham scoring
>>more than 0.02 and all spam scoring less than 0.99.
> 
> 
> Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to match?
> Then you can describe the same thing as just "mistakes and unsures" (which
> is what I mean by "mistake-based training").

Because I still "never look" at anything that scores over 0.90. They are 
all spam. But the spammiest of those, the ones over 0.995, are not even 
used for training. At the ham-side you're right: it is the same.

Rob


-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From tim.one at comcast.net  Sat Nov 15 18:37:01 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 18:37:02 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <3FB6B379.3070304@hooft.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEHGHCAB.tim.one@comcast.net>

[Rob Hooft]
>>> I am now training on all mistakes and unsures, plus all ham scoring
>>> more than 0.02 and all spam scoring less than 0.99.

[Tim]
>> Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to
>> match? Then you can describe the same thing as just "mistakes and
>> unsures" (which is what I mean by "mistake-based training").

[Rob]
> Because I still "never look" at anything that scores over 0.90. They
> are all spam.

I don't understand.  Suppose a message scores 0.93.  0.93 > 0.90, so by what
you just said you never look at it.  But 0.93 < 0.99, so by what you first
said you *do* train on it.  Is it possible to simultaneously both train on
a thing and never look at it?  I guess I don't know what "never look" means.
You mean you don't use your eyeballs to physically look at the 0.93 message,
but let spambayes auto-train on its own "it's spam" decision then?  That
would be consistent with all that you said, so I'm assuming now that's the
intended meaning.


From popiel at wolfskeep.com  Sat Nov 15 18:42:51 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat Nov 15 18:42:54 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: Message from "Tim Peters" <tim.one@comcast.net> of "Sat,
	15 Nov 2003 17:02:58 EST."
	<LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net> 
Message-ID: <20031115234251.228272DF6A@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
             "Tim Peters" <tim.one@comcast.net> writes:
>
>Especially since more & more of us are inclining toward using tiny databases
>(compared to what we used to do), making space for a "last used" timestamp
>may not be nearly as scary as it used to be.

This is something that I don't understand... why do we care if the
database is huge?  With 100 gigabyte drives commonplace, why are
we quibbling over 20 or 40 megabytes?

- Alex

From rob at hooft.net  Sat Nov 15 18:54:01 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 18:57:01 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEHGHCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCOEHGHCAB.tim.one@comcast.net>
Message-ID: <3FB6BC99.3080208@hooft.net>

Tim Peters wrote:
> [Rob Hooft]
> 
>>>>I am now training on all mistakes and unsures, plus all ham scoring
>>>>more than 0.02 and all spam scoring less than 0.99.
> 
> 
> [Tim]
> 
>>>Then why not reset your ham and spam cutoffs to 0.02 and 0.99, to
>>>match? Then you can describe the same thing as just "mistakes and
>>>unsures" (which is what I mean by "mistake-based training").
> 
> 
> [Rob]
> 
>>Because I still "never look" at anything that scores over 0.90. They
>>are all spam.
> 
> 
> I don't understand.  Suppose a message scores 0.93.  0.93 > 0.90, so by what
> you just said you never look at it.  But 0.93 < 0.99, so by what you first
> said you *do* train on it.  Is it possible to simultaneously both train on
> a thing and never look at it?  I guess I don't know what "never look" means.
> You mean you don't use your eyeballs to physically look at the 0.93 message,
> but let spambayes auto-train on its own "it's spam" decision then?  That
> would be consistent with all that you said, so I'm assuming now that's the
> intended meaning.

Exactly. I am assuming that the 0.93 message has some "old-fashioned" 
spammy characteristics, but the spammer is looking at new techniques to 
disguise his messages in the future. He is just not radical enough to 
get into my unsure box. My automatic training on these messages now 
makes sure that this new trick will be useless in the future.
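
A rough sketch of the regime described above, for anyone reading the
archives. This is illustration only, not SpamBayes code: the function
names are made up, and the numbers are just the ones quoted in this
thread (0.02, 0.99 and the 0.90 "never look" line).

```python
SPAM_CUTOFF = 0.90       # anything scoring above this is never looked at
HAM_TRAIN_ABOVE = 0.02   # train on ham scoring more than this
SPAM_TRAIN_BELOW = 0.99  # train on spam scoring less than this


def should_train(score, is_spam):
    """Return True if a message of the given class and score gets trained.

    Mistakes and unsures fall inside these margins automatically, which
    is why the same regime can be described as moving the cutoffs.
    """
    if is_spam:
        return score < SPAM_TRAIN_BELOW
    return score > HAM_TRAIN_ABOVE


def auto_trained_as_spam(score):
    """Messages trained on the filter's own verdict, unseen by a human."""
    return SPAM_CUTOFF < score < SPAM_TRAIN_BELOW
```

With these numbers the 0.93 message from the example is trained as spam
without anyone reading it, while a 0.995 message is left alone.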

Rob
-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From rob at hooft.net  Sat Nov 15 18:58:05 2003
From: rob at hooft.net (Rob Hooft)
Date: Sat Nov 15 19:01:09 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
	<20031115234251.228272DF6A@cashew.wolfskeep.com>
Message-ID: <3FB6BD8D.8040902@hooft.net>

T. Alexander Popiel wrote:
> In message:  <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
>              "Tim Peters" <tim.one@comcast.net> writes:
> 
>>Especially since more & more of us are inclining toward using tiny databases
>>(compared to what we used to do), making space for a "last used" timestamp
>>may not be nearly as scary as it used to be.
> 
> 
> This is something that I don't understand... why do we care if the
> database is huge?  With 100 gigabyte drives commonplace, why are
> we quibbling over 20 or 40 megabytes?

My database is on an ISP's server with 100,000 clients and limited disk 
quota?

But then again:

% ll .hammiedb
-rw-rw-r--    1 hooft    hooft     1277952 Nov 14 07:35 .hammiedb

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/


From richie at entrian.com  Sat Nov 15 19:31:29 2003
From: richie at entrian.com (Richie Hindle)
Date: Sat Nov 15 19:31:51 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
References: <vdlcrvorl6uhsin2oml600t170bpj19joh@4ax.com>
	<LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
Message-ID: <t6hdrv4f2rtlgj3ad008u8du2rjbra77ef@4ax.com>


[Tim]
> A scheme I would like to try can't be tried easily anymore because we
> removed some of the info it needs from our database:  ignore hapaxes that
> haven't been *used* in scoring over the last (say) week.

*That* I like.  Best of both worlds.

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Sat Nov 15 22:04:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 22:04:49 2003
Subject: [spambayes-dev] Code locations in Spambayes Outlook plugin
In-Reply-To: <DBECJAAFMJPBJFMMBHFAAEOHIJAA.tp@diffenbach.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEIAHCAB.tim.one@comcast.net>

[TP Diffenbach]
> ...
> A bit of mucking about with the Outlook forms (and ignoring an Outlook
> popped-up suggestion after a bit of Googling), made it work, and now
> I can see the headers without having to delve into Outlook's menus,
> and I can do it without code in the form, which disables Outlook's
> auto-preview (so much of using Outlook seems to involve working
> around stupid design decisions in Outlook).

Also the lack of a full object model (e.g., the convolutions we endure to
deal with the toolbar, to play with the rule system sanely, and the
inability to automate setting up spam-score columns in views, are pretty
dreadful; we won't mention the maddening "new mail" systray icon).  It's a
pain.

> BTW, I'm signed up to the spambayes-dev list under a different email
> than I use to post to the list; will this cause any problems?

For whom?  It's OK by me -- it's not a restricted list (anyone can post
here, subscribed or not; and anyone can read the list, via its archive on
the web).
From tim.one at comcast.net  Sat Nov 15 22:34:59 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 22:35:04 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEIDHCAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> This is something that I don't understand... why do we care if the
> database is huge?  With 100 gigabyte drives commonplace, why are
> we quibbling over 20 or 40 megabytes?

I expect large drives are still rare among consumers, and this has become a
"mass market" application.  It wouldn't be *just* the database size, of
course -- keeping "last access" up to date also requires caching token
timestamps in memory, and most significantly updating the DB on disk after
scoring (we never have to write to disk after scoring now, only after
training).  So there are many costs.  I'd feel a lot better about it if
Berkeley DB were a lot faster on Windows, and wasn't still implicated in so
many maddeningly baffling database corruption reports.
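
To make the cost concrete, here is a minimal sketch of the "ignore
hapaxes not used lately" idea. It is an assumption-laden toy, not the
SpamBayes classifier: the record layout, the name select_tokens and the
one-week limit are all hypothetical, and it treats every token seen in a
message as "used", where a real scorer would count only the clues that
actually contributed.

```python
WEEK = 7 * 24 * 60 * 60  # seconds


def select_tokens(db, tokens, now, max_idle=WEEK):
    """Pick the tokens allowed to contribute to a score.

    db maps token -> {"spam": int, "ham": int, "last_used": float}.
    Returns (selected, dirty); dirty is True when a timestamp changed,
    meaning the database would need writing back after scoring.
    """
    selected = []
    dirty = False
    for tok in tokens:
        rec = db.get(tok)
        if rec is None:
            continue  # never trained on, so no evidence either way
        is_hapax = rec["spam"] + rec["ham"] == 1
        if is_hapax and now - rec["last_used"] > max_idle:
            continue  # stale hapax: ignore it
        selected.append(tok)
        rec["last_used"] = now  # this update is the extra write
        dirty = True
    return selected, dirty
```

The dirty flag is the point: today scoring is read-only, and with
timestamps nearly every scored message would force a write.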


From tim.one at comcast.net  Sat Nov 15 23:49:34 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sat Nov 15 23:49:47 2003
Subject: [spambayes-dev] RE: [Spambayes] RV: I18N and L10N
In-Reply-To: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCIEIGHCAB.tim.one@comcast.net>

[Pablo Vieira]
> ...
> Since you guys state very clearly that no one should email the
> developers directly I'm putting this here. I just want to ask if
> there's any interest at all in having Spambayes available in Spanish
> or not.

I expect that, like most of the rest of the developers, I think that would
be great, but don't have relevant experience, or time, to give to it.  If I
were you <wink>, I'd announce my *intent* on the newsgroup comp.lang.python,
to attract the interest of Python programmers with real-life I/L*N
experience.  There are people who know pretty much exactly what to do, but
they don't hang out on this list, and this has much more to do with using
Python's relevant features (like Unicode) correctly than with SpamBayes
specifically.  You might have luck asking on a Zope mailing list too (Zope
is a popular web content management system coded in Python, with users all
over the world, and within the last couple years has benefited by many
peoples' intense help with I/L*N).

A problem I know came up repeatedly in the Zope experience:  a 100%
commitment to Unicode can make life much easier, but old-time Python
programmers have to be dragged kicking and screaming to Unicode ("it's
inefficient", "it's wasteful", "it's too hard", ..., all the kinds of things
old people say when they're too cranky to learn new tricks <wink>).  You'll
have my support in fighting that battle, but not really much of my help --
because I'm one of the old farts who still hasn't learned anything about how
to live in a Unicode world.

Asians are likely to complain about Unicode too, but adapting SpamBayes to
Asian languages has many deep problems that European languages shouldn't
face (spambayes splits the body into tokens by whitespace, and that's it --
it deliberately didn't assume 7-bit ASCII English).
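
A tiny illustration of why whitespace splitting is the sticking point.
This is a simplified stand-in, not the real tokenizer (which does a good
deal more); the name tokenize_body is made up, and it assumes modern
Python where str is already Unicode.

```python
def tokenize_body(text):
    """Split body text into tokens on runs of whitespace, nothing else."""
    return text.split()


# Spanish splits into words much as English does.
spanish = tokenize_body("¡Oferta increíble!")

# A Japanese sentence written without spaces stays one giant token,
# so it carries almost no reusable evidence from message to message.
japanese = tokenize_body("これはスパムです")
```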

I'm not sure whether the Python email package plays nicely with Unicode.
That could be a real problem at the starting gate, if not.


From tim.one at comcast.net  Sun Nov 16 02:23:55 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Nov 16 02:23:59 2003
Subject: [spambayes-dev] Native Outlook 2003 spam filtering
Message-ID: <LNBBLJKPBEHFEDALKOLCIEINHCAB.tim.one@comcast.net>

Some geeks with too much time on their hands reverse-engineered huge parts
of OL2003's secret spam gimmicks, and wrote a detailed account:

    http://www.mapilab.com/articles/outlook_spam_filter.html

As always, the first release of a thing from MS is so bizarre that
competitors are lulled into laughing MS off.  For example, the dictionary of
words and word weights is fixed:  it doesn't learn, and it's the same for
all users.  So, if you're a spammer, you just mail your spam to your own
OL2K+3, and fiddle it until the filter likes it.  Then all OL3K-997
installations will like it.

If anyone knows Bill Gates, please tell him he's welcome to use our code for
free <wink>.


From jm at jmason.org  Sun Nov 16 03:05:37 2003
From: jm at jmason.org (Justin Mason)
Date: Sun Nov 16 03:05:49 2003
Subject: [spambayes-dev] Native Outlook 2003 spam filtering 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEINHCAB.tim.one@comcast.net> 
Message-ID: <20031116080538.7F4D416EFD@jmason.org>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Tim Peters writes:
> Some geeks with too much time on their hands reverse-engineered huge parts
> of OL2003's secret spam gimmicks, and wrote a detailed account:
> 
>     http://www.mapilab.com/articles/outlook_spam_filter.html
> 
> As always, the first release of a thing from MS is so bizarre that
> competitors are lulled into laughing MS off.  For example, the dictionary of
> words and word weights is fixed:  it doesn't learn, and it's the same for
> all users.  So, if you're a spammer, you just mail your spam to your own
> OL2K+3, and fiddle it until the filter likes it.  Then all OL3K-997
> installations will like it.

Bizarre!  *Great* article, though.  Thanks for the pointer!
(PS: I like link 8 on Appendix B ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.2 (GNU/Linux)
Comment: Exmh CVS

iD8DBQE/ty/RQTcbUG5Y7woRArJHAKDa9HLugCnpyEj51SN6JHp/hTScJgCfcQjy
HmDq9u9Ar72idaAzlqSG2Rc=
=GFVd
-----END PGP SIGNATURE-----


From papaDoc at videotron.ca  Sun Nov 16 14:44:53 2003
From: papaDoc at videotron.ca (Remi Ricard)
Date: Sun Nov 16 14:44:03 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
References: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
Message-ID: <1069011891.3384.8.camel@porsche.hq.simlog.com>

Hi,


> All right, third try, then I'll quit :-),

Please don't quit,


>  I just want to ask if there's any interest
> at all in having Spambayes available in Spanish or not.
I might be interested by a French version but I seem interested by a
Spanish version.

What you should ask is: "Is it easy to translate the English string to 
whatever."

If the answer is yes then "You" can do it and provide a patch.

Remi
-- 
Remi Ricard <papaDoc@videotron.ca>


From tim.one at comcast.net  Sun Nov 16 15:44:42 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Nov 16 15:44:37 2003
Subject: [spambayes-dev] Native Outlook 2003 spam filtering 
In-Reply-To: <20031116080538.7F4D416EFD@jmason.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEEKNHCAB.tim.one@comcast.net>

[Justin Mason, on
    http://www.mapilab.com/articles/outlook_spam_filter.html
]

> Bizarre!  *Great* article, though.  Thanks for the pointer!

Ya, it reminds me of the summer month I spent disassembling the Radio Shack
Model 100's ROM, to track down a suspected bug in its BASIC interpreter.
The horrible fascination of it all keeps you going long after you find the
answer you were looking for.

> (PS: I like link 8 on Appendix B ;)

I didn't mention that on purpose -- I'm not sure the spambayes developers
could stand to see SpamBayes subjected to such no-holds-barred rigorous
criticism <wink>.


From sanjaydarisi at cox.net  Sun Nov 16 18:43:03 2003
From: sanjaydarisi at cox.net (sanjaydarisi@cox.net)
Date: Sun Nov 16 18:43:12 2003
Subject: [spambayes-dev] Quick questions!
Message-ID: <20031116234302.BADI9968.fed1mtao05.cox.net@smtp.west.cox.net>


I have three quick questions regarding spambayes outlook addin. 

Firstly, isn't it possible to add the spam field directly to the Outlook view instead of the user adding it manually from the user-defined fields?

Secondly, if I want to add the delete as spam/recover as ham buttons to the message view that is displayed when an email message is double-clicked in Outlook, how do I do that? Any ideas?

Thirdly, if I want to add some personal signature regarding spambayes after I install it for my Outlook, how do I do that? Any suggestions?

Thanks in advance,
Sanjay.


From tameyer at ihug.co.nz  Sun Nov 16 21:43:52 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 21:41:21 2003
Subject: [spambayes-dev] Who can explain this?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130407C93C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CE@its-xchg4.massey.ac.nz>

> Racking my brain trying to figure out just what persistent 
> storage file I was using (because sometimes it seemed to use 
> ~/hammie.db and sometimes ~/.hammiedb), I came across this in 
> sb_filter.py:
[...]
> Can we just rip this hack out and let the user's options file 
> dictate things?

This is my bad from some time ago.  +1 to getting rid of it.

I'm not sure that 'hammie.db' is the best default for the option, though.
For one it can lead to lots of hammie.db files being created, depending on
what the cwd is at the time spambayes is run.  For another, it's nice to
default to having the personal files separated out from the application
ones.

os.path.expanduser("~/.hammiedb") works nicely enough on WinXP - does it
work happily with other Windows flavours and with pre-OSX Macs?  (I presume
OSX is fine).  Personally, I like that more as a default (although I suppose
if the default is to be changed, the 'hammie' name could also be dropped).
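
For reference, a small sketch of the difference between the two
defaults. The helper name default_db_path is hypothetical; only
os.path.expanduser and friends are real.

```python
import os


def default_db_path(name=".hammiedb"):
    """Build a per-user database path that does not depend on the cwd."""
    return os.path.expanduser(os.path.join("~", name))


# A bare relative name resolves against whatever the cwd happens to be,
# which is how stray hammie.db files end up scattered around.
relative_default = os.path.abspath("hammie.db")

# The expanded form points at the same place on every run.
per_user_default = default_db_path()
```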

FWIW, these days, Windows users with Mark's extensions, for which spambayes
can't find a database, get one created in (effectively) ~/Application
Data/SpamBayes/Proxy, if memory serves me correctly.  (Called
statistics_database.db, I think).

=Tony Meyer


From tameyer at ihug.co.nz  Sun Nov 16 22:07:06 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 22:04:37 2003
Subject: [spambayes-dev] Re: [Spambayes] Lotus Notes filter error
	KeyError:('Hammie', 'header_spam_string')
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F6B@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B132@its-xchg4.massey.ac.nz>

> [Mike]
> > File "C:\Program Files\Python23\Scripts\sb_notesfilter.py", 
> line 237, in processAndTrain
> >     str = options["Hammie", "header_spam_string"]
> 
> I don't know much about the Notes stuff, but that looks like 
> a bug.  That piece of code should probably be:
> 
>     if is_spam:
>         str = options["Headers", "header_spam_string"]
>     else:
>         str = options["Headers", "header_ham_string"]
[...]
> I'm forwarding this to spambayes-dev to see whether anyone 
> there knows for sure whether I'm right about this...?

This is a bit out of date now, but I can't see a message confirming it, so:

Yes, that's right.  When I went through the scripts and updated the options
names I either missed some in notesfilter (as with some elsewhere), or
deliberately left notesfilter alone (I can't recall which) since it needs
updating in various places (TimS created it, used it, and planned to update
it, but never managed to find time).

I notice that this fix has been checked in anyway, so this is just in case
you were wondering, or for anyone reading the archives ;)

=Tony Meyer


From tameyer at ihug.co.nz  Sun Nov 16 22:17:20 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 22:14:50 2003
Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 -
	windowsproxy_tray installation
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F96@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CF@its-xchg4.massey.ac.nz>

> At some point in the not-too-distant past, a decision was 
> made that the Windows scripts pop3proxy_service.py and 
> pop3proxy_tray.py should be installed to the Python Scripts 
> directory along with the other command-line scripts.  It 
> seems this was a bit premature, as pop3proxy_tray obviously 
> isn't designed to be run that way.

This is my fault - I forgot about the icons.  The problem was (several
people reported it after the 1.0a6 release) that everything else *is*
installed into the python scripts directory, and so the readme tells people
to then discard the expanded archive - but they then didn't have the
contents of the (spambayes) windows directory.

> I've CC'd the 
> spambayes-dev list in hopes that someone can take a look at 
> this.

A few comments:
 * For the vast majority of people, this won't be a problem, because they'll
use the binary installer for spambayes and it'll install a frozen
pop3proxy_tray in the requested place.
 * Could the tray handle the icon files the way the web ui handles its
non-python files (with resourcepackage)?  Would this be desirable?
 * What do other python programs do that have 'support' files?

> At the very least, we should probably stop copying it 
> to the Python\Scripts directory until the problem is fixed.

I would rather that we just came up with a solution to the missing icon
files problem and checked that in, rather than stop copying it.  After all,
it's only a copy - running the script from the expanded archive will work
fine.  If we do stop copying it, then the readme (and website?) needs to be
updated to explain which files need to be kept from the expanded archive,
and which don't.

=Tony Meyer


From tameyer at ihug.co.nz  Sun Nov 16 22:20:54 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 22:18:19 2003
Subject: [spambayes-dev] Re: Offer to Help / Development Participation
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130407BE77@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B134@its-xchg4.massey.ac.nz>

> Many thanks for the offers!  Maybe other developers would 
> like to make specific suggestions (hence I've forwarded this 
> to the spambayes-dev mailing list), but there's a whole bunch 
> of things that you could do, starting with the non-technical:

[list of suggestions snipped]

> There are probably a dozen other things that I haven't thought of.

Richie - could you put something like this up on the wiki somewhere?  And
maybe link to it from the "how can I help" FAQ?  It's much more
comprehensive than stuff that I've seen/written before, and it is a fairly
common question.  (The wiki's probably better than the FAQ, since this'll
presumably change as things progress).

=Tony Meyer


From tameyer at ihug.co.nz  Sun Nov 16 22:28:48 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 22:26:14 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1FBF@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz>

> It looks to me like 
> OptionsClass.HEADER_VALUE is too restrictive, but I'll leave 
> it for the author of that code to decide whether or not to 
> loosen it up.

I wrote (many of) the regexes in OptionsClass, and in my defense I'll note
that (somewhere) at the time I pointed out that they needed to be checked
out by someone more expert at them than me.

It's currently "[\w\.\-\*]+".  Someone here must know offhand what the valid
characters in an email header are, yes?  Or do we just go with flexibility
and use ".+"?

=Tony Meyer


From tameyer at ihug.co.nz  Sun Nov 16 22:34:13 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 16 22:31:38 2003
Subject: [spambayes-dev] More CVS branch/tags questions 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1F67@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B136@its-xchg4.massey.ac.nz>

> [Anthony]
> OTOH, I don't know what is stopping us from cutting 1.0b1
> in a couple of
> weeks, with a possible RC a couple of weeks after that. 

[Richie]
> The DBRunRecoveryError problem is stopping us, IMHO.

I would agree with that, but I don't think that there's anything else
(although I haven't read the bug reports for the last couple of weeks...).
It would be interesting to see if the changes in 1.0a7 reduce the
occurrences of the problem in the messageinfo db (I think they should).

On a positive* note, it appears that during the time I was away, my wife's
system (current cvs spambayes, Python 2.2.2) has started corrupting the
stats db every time it's run, so I might be able to chase something down
from that in the next couple of days.

=Tony Meyer

* Not in her opinion, however <wink>.


From skip at pobox.com  Sun Nov 16 22:35:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Nov 16 23:04:41 2003
Subject: [spambayes-dev] Whitelists (was: A spectacular false positive)
In-Reply-To: <p0kcrvket75ggl1uk75i2boq4nnjql3b8m@4ax.com>
References: <E1AKkQf-0003k0-KG@mail.python.org>
	<LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
	<p0kcrvket75ggl1uk75i2boq4nnjql3b8m@4ax.com>
Message-ID: <16312.16884.440384.414721@montanaro.dyndns.org>


    Richie>  o Whenever a message is trained as spam, remove the From
    Richie>    address from the whitelist.

So when a spammer forges Barry Warsaw's address (as I've seen before), Barry
disappears from my whitelist?

Most of my email that comes in as ham is never a candidate for training.
Even if I fed my current ham training database to a whitelist generator it
wouldn't whitelist a single '@python.org' address.  It would get a number of
Python-related email addresses though:

    gerrit@nl.linux.org
    tim_one@users.sourceforge.net
    amk@amk.ca
    anthony@interlink.com.au
    ...

While all these addresses are certainly valuable contacts in the Python
world, they are hardly representative of the email addresses which would
float to the front of my cortex if I decided to build a whitelist manually.
They just happen to be authors of Python-related messages on which I've
trained.  My current set of ham includes 11 python-list messages, two
python-checkins messages, and one each from the spambayes, mailman and
python-dev lists.  My Python mailbox obviously contains a lot more mail, but
it includes messages from random people asking Python questions which I
simply forgot to delete as well as messages I've saved for their content,
not necessarily who they are from.

    Richie>  o Whenever a message is received from a whitelisted address,
    Richie>    and scores as solid (for some value of 'solid') ham,
    Richie>    auto-train the message as ham.  You'd use this for personal
    Richie>    acquaintances only, and not for mailing lists or
    Richie>    organisations (amazon.com, ebay.com, etc.)

Now we're back to growing large databases.  I think over time you might wind
up with a highly unbalanced set of ham and spam.
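
To pin down what the two quoted rules would do, here is a minimal
sketch. Nothing like this exists in SpamBayes as far as this thread
says: the function names, the lower-casing of addresses and the 0.01
"solid ham" threshold are all assumptions, and the addresses are
placeholders.

```python
SOLID_HAM = 0.01  # hypothetical value for "solid" ham


def note_trained_as_spam(whitelist, from_addr):
    """Rule 1: training a message as spam drops its From address."""
    whitelist.discard(from_addr.lower())


def should_auto_train_ham(whitelist, from_addr, score, solid=SOLID_HAM):
    """Rule 2: auto-train solid ham that comes from a whitelisted address."""
    return from_addr.lower() in whitelist and score <= solid


whitelist = {"barry@example.org"}
before = should_auto_train_ham(whitelist, "Barry@Example.org", 0.005)

# One spam with a forged From line is enough to undo the entry.
note_trained_as_spam(whitelist, "barry@example.org")
after = should_auto_train_ham(whitelist, "barry@example.org", 0.005)
```

The last three lines are the forgery case: a single forged spam silently
removes a real contact.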

Of course, as Tim pointed out, we all seem to be flying more-or-less by the
seat of our pants vis a vis training, so one feature is as good as another.
Still, I get so few false positives that I find it hard to believe a
whitelist - even if it included my wife and my boss - would be helpful.

    Richie> The upshot: I still don't trust SpamBayes to delete my Spam
    Richie> without looking at it.

I have auto-deleted spam with a classification of "spam; 1.00" for a couple
months.  My boss hasn't fired me yet for not responding to an email.

Skip

From skip at pobox.com  Sun Nov 16 22:13:16 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Nov 16 23:04:44 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <20031115234251.228272DF6A@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
	<20031115234251.228272DF6A@cashew.wolfskeep.com>
Message-ID: <16312.15564.157062.319322@montanaro.dyndns.org>


    >> Especially since more & more of us are inclining toward using tiny
    >> databases (compared to what we used to do), making space for a "last
    >> used" timestamp may not be nearly as scary as it used to be.

    Alex> This is something that I don't understand... why do we care if the
    Alex> database is huge?  With 100 gigabyte drives commonplace, why are
    Alex> we quibbling over 20 or 40 megabytes?

It's not an issue of 20-40 megabytes, it's how many messages are represented
by that file.  In my case, I had a training database of around 21MB and on
the order of 10,000 ham and somewhat fewer spam (maybe 7,000 or so),
depending on how aggressively I'd been training and how recently I'd whacked
off the oldest 10%-20% of my messages.

I think there's a psychological hurdle to overcome to simply throw away
17,000 messages, even if it's not working optimally, because it does
represent a substantial time investment.  That hurdle is much lower when
your training database is under 500 messages.  Heck, I can rebuild one of
that size in next to no time.

Here's something I think would be interesting.  At the moment I have about
40 unsures awaiting a decision from me (train or discard).  I'm trying
consciously to be conservative.  What I'd like to know is which message, if
added to my training database, would have the greatest effect on the scores
of the other unsure messages.  That would help me decide which ones yield
the most benefit.  OTOH, maybe I'd do just as well to train on every fourth
unsure or select unsures to train on with a probability of 0.25 (1/4 picked
purely out of thin air, so don't ask where I got it :-).
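
The two cheap alternatives at the end are easy to sketch. The function
names are made up for illustration, and 0.25 is just the thin-air number
from above.

```python
import random


def every_nth(unsures, n=4):
    """Take every nth unsure message, starting with the nth."""
    return unsures[n - 1::n]


def sample_with_probability(unsures, p=0.25, rng=None):
    """Keep each unsure message independently with probability p."""
    rng = rng or random.Random()
    return [msg for msg in unsures if rng.random() < p]
```

Passing a seeded random.Random makes the second variant repeatable, which
helps if the two selections are ever compared on the same 40 messages.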

Skip


From tim.one at comcast.net  Mon Nov 17 00:24:50 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 17 00:24:44 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEMFHCAB.tim.one@comcast.net>

[Skip]
>> It looks to me like OptionsClass.HEADER_VALUE is too restrictive, but
>> I'll leave it for the author of that code to decide whether or not to
>> loosen it up.

[Tony]
> I wrote (many of) the regexes in OptionsClass, and in my defense I'll
> note that (somewhere) at the time I pointed out that they needed to
> be checked out by someone more expert at them than me.
>
> It's currently "[\w\.\-\*]+".  Someone here must know offhand what
> the valid characters in an email header are, yes?  Or do we just go
> with flexibility and use ".+"?

RFC 822 sez:

    The  field-name must be composed of printable ASCII characters
    (i.e., characters that  have  values  between  33.  and  126.,
    decimal, except colon).

But who cares?  Not me.
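
For anyone who does care, a quick sketch comparing the pattern quoted
above with the RFC 822 field-name rule. Note the RFC text is about field
*names*, not values. The constant and helper names are made up, and this
is not what OptionsClass does today.

```python
import re

# The pattern quoted in this thread.
CURRENT_HEADER_VALUE = re.compile(r"[\w\.\-\*]+")

# RFC 822 field-name: ASCII 33 through 126, except the colon (58).
RFC822_FIELD_NAME = re.compile(r"[!-9;-~]+")


def fully_matches(pattern, text):
    """Return True if the whole string matches the pattern."""
    return pattern.fullmatch(text) is not None
```

The current pattern rejects anything containing a space or a semicolon,
so a value such as "spam; 1.00" fails it.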

From tameyer at ihug.co.nz  Mon Nov 17 02:33:57 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Nov 17 02:31:25 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1DC2@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>

[This thread seems to have died a week ago, but since I was away, and have
things to say <wink>, and it doesn't seem to be resolved, I figured I'd
resurrect it.  While I'm doing notes: thanks Richie, Anthony and Skip for
outlining the various processes in more detail - great stuff for us cvs
newbies].

[Richie]
> Re-reading Tony's mail, I should have pointed out at the time that we
> shouldn't commit edits to both places, but should use "cvs up [-j
> moving-tag] -j release_1_0" to periodically merge the bugfix 
> branch onto the head.  Nuts.

I've been very guilty of this.  Basically for every edit if I thought it fit
in the 'bug fix' category, then I recommitted it on the branch.  The
changelog outlines (for the most part) which ones I put into the branch, and
which ones were only on the head.

> From looking at the logs, it seems you're right, Mark - 
> bugfixes have been
> hitting the head instead of release_1_0.  Also, some fixes have been
> committed to both the head and release_1_0, which will probably make
> merging release_1_0 back onto the head a pain - you always get more
> conflicts when you do that.

Apart from the last couple of weeks, I committed the majority of changes to
the branch (a mixture of stuff from me, and copying other people's fixes
from the head).  With the exception of the windows specific stuff, I'm
pretty sure that I branch-committed all the changes that looked to me like
bug fixes.  I did do it by copying and pasting, mostly (and then checking),
so hopefully there won't be too many conflicts.

> (I should have encouraged more discussion of
> branch strategy when all this came up - we make heavy use of 
> CVS branches at work, and we know a bit about how best to manage them.)

I should have asked more questions, too, sorry.  I'm very much a newcomer to cvs,
and was probably pushing towards the 1.0 release most for a period, so
should have ensured that I was doing it right.

> How much enhancement work has gone onto the head since release_1_0 was
> taken?

A lot with the web interface.  The changelog details it - it's the stuff in
the 1.1a1 section, rather than the 1.0a7 one.

[Anthony]
> I'd suggest the following:
> 
>   - checkin to the trunk. If the fix is a bugfix, and suitable for the
>     branch, include "bugfix candidate" in the checkin message.
> 
>   - (preferably) check your bugfix into the branch as well. I suggest
>     having two checkouts, one on the branch, one on the trunk.
> 
>   - (otherwise) someone else notices that the "bugfix" needs to be
>     applied to the branch as well, and does so.

This is more-or-less what I was doing, I think, except that I based the
"bugfix candidate" decision on discussion in the lists, the checkin message,
and my own fallible head, rather than an explicit message.

> Having said that, I'd say the time to branch is at the point
> where we're about to cut the first beta. So we've possibly done
> it too soon here. 

This was almost the intent here, too.  The original aim was to create the
branch, release 1.0a6, then in a very short time release 1.0b1; before
release, the code that became 1.0a6 seemed pretty stable, and the main reason
for the release was to have an alpha with the new script/option names before
releasing a beta.  Of course, after it was released, the
db-closing/interface bug surfaced, and there was a resurgence of
dbrunrecovery errors, plus a few others.

[Richie]
> Our failure this time, if there even was a failure, was in
> not advertising the strategy loudly enough.

When a strategy is decided, what would be the best way to advertise it,
given that people may join the development team at any point?  Something in
readme-devel?

And speaking of deciding a strategy, what is the spambayes one? <wink>.
Personally, I'm in favour of someone else deciding and giving me steps to
follow :)  It does seem likely that if we can resolve the db corruption bug,
a beta wouldn't be far off, so it would be good to decide by then :)

=Tony Meyer


From tdickenson at devmail.geminidataloggers.co.uk  Mon Nov 17 04:11:36 2003
From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Mon Nov 17 04:11:47 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCKEDBHCAB.tim.one@comcast.net>
Message-ID: <200311170911.36239.tdickenson@devmail.geminidataloggers.co.uk>

On Saturday 15 November 2003 01:02, Tim Peters wrote:

> It had
> virtually no English text, but lots, and lots, and lots of different
> integers (about 100KB worth).  There were about a half dozen strong ham
> clues that it had come from him, but about 140 spam clues from the variety
> of little integers, most hapaxes that had appeared in one training spam
> each.
>
> I view that mostly as a danger of mistake-based training:  as I've
> mentioned before, mistake-based training tends toward being hapax-driven,
> and hapaxes are brittle.  There's nothing *inherently* spammy about, say,
> 16384, and because that's a power of 2 and I'm a computer geek, that
> *would* have appeared in several training ham if I hadn't fallen into
> mistake-based training (yes, 16384 had indeed appeared in one training
> spam).

I occasionally see the inverse problem. I train on every email I receive, 
including many hams containing lots of numbers like Jeremy sent you. 
Occasionally I get a spam where 2 or 3 numbers (in a price list, usually) are 
enough to classify it as ham.

Would you have been as surprised by the same result if Jeremy had sent you a
long list of effectively random words? 

-- 
Toby Dickenson


From m0davis at pacbell.net  Mon Nov 17 07:23:27 2003
From: m0davis at pacbell.net (Martin Stone Davis)
Date: Mon Nov 17 07:30:25 2003
Subject: [spambayes-dev] Idea to re-energize corpus learning
Message-ID: <bpaej8$cev$2@sea.gmane.org>

I recently started this thread on the POPFile forum, but it applies just 
as well to SpamBayes.

https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099

-Martin


From skip at pobox.com  Mon Nov 17 08:34:15 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 08:34:31 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F1303EE1FBF@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F130212B135@its-xchg4.massey.ac.nz>
Message-ID: <16312.52823.91199.400874@montanaro.dyndns.org>


    >> It looks to me like OptionsClass.HEADER_VALUE is too restrictive...

    Tony> It's currently "[\w\.\-\*]+".  Someone here must know offhand what
    Tony> the valid characters in an email header are, yes?  Or do we just
    Tony> go with flexibility and use ".+"?

Anything printable is okay, yes?  that would be [ -~]+ I think.  Do we need
to worry about people including control characters or high-bit-set stuff?

Skip


From skip at pobox.com  Mon Nov 17 08:47:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 08:47:50 2003
Subject: [spambayes-dev] Idea to re-energize corpus learning
In-Reply-To: <bpaej8$cev$2@sea.gmane.org>
References: <bpaej8$cev$2@sea.gmane.org>
Message-ID: <16312.53618.652677.190274@montanaro.dyndns.org>


    Martin> I recently started this thread on the POPFile forum, but it
    Martin> applies just as well to SpamBayes.

    Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099

See my note from Sunday on spambayes-dev:

    http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html

Just because you train on a gazillion spams and hams doesn't mean the best
course once you've screwed something up isn't to start over.  Like I said in
the above message, I think there's a certain psychological barrier you have
to overcome before you throw out a massive training database.  I suspect
POPfile learns about as quickly as SpamBayes, so without proof I assert that
starting over there is often going to be the right course as well.

For example, it's rather easy for me to scan my current training database
for mistakes, either in a semi-automated fashion using sb_filter.py or
manually, because it only contains about 250 messages.  This was extremely
difficult using my previous monster database (15k-20k messages).

Skip


From m0davis at pacbell.net  Mon Nov 17 09:25:08 2003
From: m0davis at pacbell.net (Martin Stone Davis)
Date: Mon Nov 17 09:24:53 2003
Subject: [spambayes-dev] Re: Idea to re-energize corpus learning
In-Reply-To: <16312.53618.652677.190274@montanaro.dyndns.org>
References: <bpaej8$cev$2@sea.gmane.org>
	<16312.53618.652677.190274@montanaro.dyndns.org>
Message-ID: <bpalne$ufs$1@sea.gmane.org>

Skip Montanaro wrote:

>     Martin> I recently started this thread on the POPFile forum, but it
>     Martin> applies just as well to SpamBayes.
> 
>     Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
> 
> See my note from Sunday on spambayes-dev:
> 
>     http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html
> 
> Just because you train on a gazillion spams and hams doesn't mean the best
> course once you've screwed something up isn't to start over.  Like I said in
> the above message, I think there's a certain psychological barrier you have
> to overcome before you throw out a massive training database.  I suspect
> POPfile learns about as quickly as SpamBayes, so without proof I assert that
> starting over there is often going to be the right course as well.
> 
> For example, it's rather easy for me to scan my current training database
> for mistakes, either in a semi-automated fashion using sb_filter.py or
> manually, because it only contains about 250 messages.  This was extremely
> difficult using my previous monster database (15k-20k messages).
> 
> Skip

Wouldn't it be nice if there were some middle ground between continuing 
to train the huge immovable database and starting over fresh?  After 
all, it's more than just a psychological barrier.  Having to train 100% 
of incoming messages after starting over is real work, and especially 
frustrating when you *know* that 80-90% would have been correctly 
classified anyway if only you hadn't started over.

So why not soften the blow?  That's what my proposal amounts to: 
achieving some sort of middle ground between the status quo and starting 
over.  After performing a "Soften training SEVERELY" (where the counts 
are all set to their square roots), messages would still be classified 
in more-or-less the same way.  However, further training would then be 
far more effective, since the counts would be lower.

Doesn't that sound like a good idea?

-Martin

P.S. I'm also sure that POPfile learns just as quickly as SpamBayes, 
since they are based on the same principle.


From tim.one at comcast.net  Mon Nov 17 09:54:36 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 17 09:54:26 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <16312.52823.91199.400874@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEOBHCAB.tim.one@comcast.net>

[Skip]
> It looks to me like OptionsClass.HEADER_VALUE is too
> restrictive... 
> ...
> Anything printable is okay, yes?

The colon is forbidden in a header field name.

> that would be [ -~]+ I think.

The blank is also forbidden in a header field name.

> Do we need to worry about people including control characters or
> high-bit-set stuff?

Not if people are willing to adhere to the standard <wink>.

From skip at pobox.com  Mon Nov 17 10:04:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 10:04:37 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <LNBBLJKPBEHFEDALKOLCGEOBHCAB.tim.one@comcast.net>
References: <16312.52823.91199.400874@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCGEOBHCAB.tim.one@comcast.net>
Message-ID: <16312.58231.468785.103935@montanaro.dyndns.org>


    Tim> [Skip]

    >> It looks to me like OptionsClass.HEADER_VALUE is too
    >> restrictive... 
    >> ...
    >> Anything printable is okay, yes?

    Tim> The colon is forbidden in a header field name.

    >> that would be [ -~]+ I think.

    Tim> The blank is also forbidden in a header field name.

    >> Do we need to worry about people including control characters or
    >> high-bit-set stuff?

    Tim> Not if people are willing to adhere to the standard <wink>.

I believe OptionsClass.HEADER_VALUE refers to the value of a particular header, not
its name.  Everything you wrote is correct for OptionsClass.HEADER_NAME.
Right now, both have the same value:

    HEADER_NAME = r"[\w\.\-\*]+"
    HEADER_VALUE = r"[\w\.\-\*]+"

I am happy to leave HEADER_NAME as is, but would like to change HEADER_VALUE
to

    HEADER_VALUE = "[ -~]+"

or should that be

    HEADER_VALUE = "[\t -~]+"

?
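
A quick way to compare the candidates is to run a few sample values through
each pattern.  This is only a sketch: it assumes the option validator requires
the whole string to match the pattern, which may not be exactly how
OptionsClass.is_valid applies it, and the names CURRENT_VALUE, PRINTABLE_VALUE,
PRINTABLE_OR_TAB_VALUE and accepts() are made up for illustration.

```python
import re

# Patterns quoted in the message above.
CURRENT_VALUE = r"[\w\.\-\*]+"
PRINTABLE_VALUE = r"[ -~]+"            # space through tilde
PRINTABLE_OR_TAB_VALUE = r"[\t -~]+"   # same, plus horizontal tab


def accepts(pattern, text):
    """Return True if the whole of text matches pattern (assumed semantics)."""
    return re.fullmatch(pattern, text) is not None


samples = ["spam", "text/plain; charset=us-ascii", "folded\tvalue"]
for sample in samples:
    print(repr(sample),
          accepts(CURRENT_VALUE, sample),
          accepts(PRINTABLE_VALUE, sample),
          accepts(PRINTABLE_OR_TAB_VALUE, sample))
```

The current pattern rejects anything containing a space, slash or semicolon,
and only the tab-aware variant accepts a value with an embedded tab.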

Skip

From tim.one at comcast.net  Mon Nov 17 10:26:12 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 17 10:26:06 2003
Subject: [spambayes-dev] OptionsClass.is_valid too picky?
In-Reply-To: <16312.58231.468785.103935@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCKEODHCAB.tim.one@comcast.net>

[Skip Montanaro]
> I believe OptionsClass.HEADER_VALUE refers to the value of a
> particular header, not its name.  Everything you wrote is correct for
> OptionsClass.HEADER_NAME. Right now, both have the same value:
>
>     HEADER_NAME = r"[\w\.\-\*]+"
>     HEADER_VALUE = r"[\w\.\-\*]+"
>
> I am happy to leave HEADER_NAME as is, but would like to change
> HEADER_VALUE to
>
>     HEADER_VALUE = "[ -~]+"
>
> or should that be
>
>     HEADER_VALUE = "[\t -~]+"

        http://www.faqs.org/rfcs/rfc822.html

        The field-body may be composed of any ASCII characters, except
        CR or LF.  (While CR and/or LF may be present in the actual text,
        they are removed by the action of unfolding the field.)

This seems to contradict the definition of "text" given later, which allows
bare CR and bare LF too, just not the CRLF combination.  "ASCII characters"
isn't clearly defined, although the lexical definition for CHAR later is
*described* as "any ASCII character" in English and *defined* as decimal 0
to decimal 127.

One reason email clients get incompatible is that these early standards can
be darned hard to make full sense of.  So "suit yourself" is what many do in
practice, although "be liberal in what you accept" is the Official Mantra
offered as equally fuzzy advice <wink>.
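
For what it's worth, one literal reading of the field-body sentence quoted
above can be sketched as follows.  It assumes unfolding means dropping a CRLF
that is immediately followed by a space or tab, and that "ASCII" means code
points 0 through 127; unfold() and is_valid_field_body() are hypothetical
helpers, not anything in SpamBayes.

```python
import re

# A CRLF followed by linear whitespace marks a folded continuation line.
_FOLD = re.compile(r"\r\n(?=[ \t])")


def unfold(raw):
    """Remove the CRLF of each fold, keeping the whitespace that follows it."""
    return _FOLD.sub("", raw)


def is_valid_field_body(raw):
    """True if the unfolded body is 7-bit ASCII with no CR or LF left in it."""
    body = unfold(raw)
    return all(ord(ch) < 128 and ch not in "\r\n" for ch in body)
```

Under this reading a folded value is fine, while a value containing a CRLF
that is not followed by whitespace, or any high-bit character, is rejected.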


From kennypitt at hotmail.com  Mon Nov 17 10:42:23 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Nov 17 10:43:13 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCIEGOHCAB.tim.one@comcast.net>
Message-ID: <E1ALlWk-0007h5-EU@mail.python.org>

Tim Peters wrote:
> Sigh -- we need solid research on training disciplines that work
> great in real-life use, respecting that anything requiring human
> input will barely get used except by geeks who never tire of watching
> the training process. We're getting a lot of anecdotal evidence
> (which ain't the same thing) about different schemes, and I'm afraid
> no two of the developers train in the same way anymore.  It's a good
> thing the algorithm appears to have turned out to be robust against
> almost any training insanity short of what Outlook users can stumble
> into <0.9 wink>. 

Yes, the Outlook plugin pretty much guarantees mistake-based training
for anyone not familiar enough with the program (or too lazy <wink>) to
update the training through SpamBayes Manager periodically.  The
majority of my ham comes either from the same list of senders at work,
or from the SpamBayes lists, so it didn't take SpamBayes long to start
classifying all of those correctly.  I got up to almost 10:1 spam to ham
ratio pretty quickly.

To try to work around the problem, I implemented two experimental
options to train on all certain ham and train on all certain spam.
Since I can turn them on or off independently, I can use them to get my
ratio back in balance and then turn them off.  What I'd like to
implement is a way to do this automatically.  I'd like to say something
like, "If my spam count reaches twice my ham count then train on all
certain hams until the counts are within 5% of each other again."  These
cutoffs would of course be configurable.
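
The rule might look roughly like the sketch below.  It is not the Outlook
plugin's code: BalanceTrainer and its attribute names are invented for
illustration, "within 5% of each other" is assumed to mean that the difference
is at most 5% of the larger count, and the mirror-image rule for too much ham
is an addition of mine.

```python
class BalanceTrainer:
    """Switch 'train on all certain ham/spam' on and off with hysteresis."""

    def __init__(self, trigger_ratio=2.0, tolerance=0.05):
        self.trigger_ratio = trigger_ratio   # imbalance that switches training on
        self.tolerance = tolerance           # closeness that switches it off again
        self.train_certain_ham = False
        self.train_certain_spam = False

    def _balanced(self, nham, nspam):
        larger = max(nham, nspam)
        return larger == 0 or abs(nham - nspam) <= self.tolerance * larger

    def update(self, nham, nspam):
        """Update both switches from the current training counts."""
        if self._balanced(nham, nspam):
            self.train_certain_ham = self.train_certain_spam = False
        elif nspam >= self.trigger_ratio * nham:
            self.train_certain_ham, self.train_certain_spam = True, False
        elif nham >= self.trigger_ratio * nspam:
            self.train_certain_ham, self.train_certain_spam = False, True
        # Otherwise leave the switches as they are: that is the hysteresis.
        return self.train_certain_ham, self.train_certain_spam
```

Calling update() after each training event keeps the extra training running
from the moment the ratio hits 2:1 until the counts are nearly equal again.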

It will take me a little while to get around to implementing this and
even longer to see if it is effective, but I'll report results (or at
least perceptions) when I have them.

-- 
Kenny Pitt


From eckert at indiana.edu  Mon Nov 17 11:22:12 2003
From: eckert at indiana.edu (Eckert, Robert D)
Date: Mon Nov 17 11:23:21 2003
Subject: [spambayes-dev] Can't move items that are in the results list from
	an Outlook Find when SpamBayes is installed
Message-ID: <885BB3CAB85CBD44B73B52CFBC1FC55EAAF2D5@iu-mssg-mbx08.exchange.iu.edu>

Hi,
   I am using Outlook 2002 with an Exchange 2002 server when I work.
The copy of Outlook is locally installed on my PC which is running 
Windows 2000 Professional. All software is up to date and patched.

When I do a "Find" operation on the Inbox and get a results list, then
select all the items and attempt to drag and drop them into another
folder in my folder list, Outlook shows a dialog box that says "Can't
move the items".

When SpamBayes (or Qurb before SpamBayes) is installed, the move
operation fails, yet when SpamBayes is not installed, the operation
succeeds without a problem.

Can you address what is happening here?

Thank you,

An otherwise *very* satisfied SpamBayes user.

-Bob


                    Bob Eckert - Principal Analyst
         eckert@indiana.edu (812) 855-7209 - (812) 855-8299 Fax
                        Indiana University
              University Information Technology Services
                  University Information Services
                 2711 East 10th Street - Room 101.5
                      Bloomington, IN 47408


From kennypitt at hotmail.com  Mon Nov 17 12:47:18 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Nov 17 12:47:54 2003
Subject: [spambayes-dev] RE: [Spambayes] Spambayes 1.0a7 -
	windowsproxy_tray installation
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29CF@its-xchg4.massey.ac.nz>
Message-ID: <E1ALnTY-0007V5-7o@mail.python.org>

Tony Meyer wrote:
>> At some point in the not-too-distant past, a decision was
>> made that the Windows scripts pop3proxy_service.py and
>> pop3proxy_tray.py should be installed to the Python Scripts
>> directory along with the other command-line scripts.  It
>> seems this was a bit premature, as pop3proxy_tray obviously
>> isn't designed to be run that way.
> 
> This is my fault - I forgot about the icons.  The problem was (several
> people reported it after the 1.0a6 release) that everything else *is*
> installed into the python scripts directory, and so the readme tells
> people to then discard the expanded archive - but they then didn't
> have the contents of the (spambayes) windows directory.
> 
> A few comments:
>  * For the vast majority of people, this won't be a problem, because
> they'll use the binary installer for spambayes and it'll install a
> frozen pop3proxy_tray in the requested place.

Very true, as this has proven the norm for the Outlook plugin.  We just
need to finalize the installer and get it out in front of people.

These Windows-specific scripts seem to be more akin to the Outlook
plugin than to the more Unix-oriented command line scripts, so would the
best course be to handle them the same way?  For the Outlook plugin, the
binary installer is the general case, and those who want/need to run
from source do so from the complete source directory.

>  * Could the tray handle the icon files like the web ui handles its
> non python files?  (with resourcepackage)?  Would this be desirable?

Probably, with enough extra work.  The Web UI is quite happy having the
raw file data available in an in-memory object.  Windows will happily
load an icon from either a disk file or a properly formatted resource
(as the binary will use).  IIRC, though, it doesn't provide much help if
you have the same data as the file but it isn't physically in a file.

>> At the very least, we should probably stop copying it
>> to the Python\Scripts directory until the problem is fixed.
> 
> I would rather that we just came up with a solution to the missing
> icon files problem and checked that in, rather than stop copying it. 

Agreed.  I was thinking only about reducing confusion in the meantime.

-- 
Kenny Pitt


From skip at pobox.com  Mon Nov 17 13:28:36 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 13:31:57 2003
Subject: [spambayes-dev] Re: Idea to re-energize corpus learning
In-Reply-To: <bpalne$ufs$1@sea.gmane.org>
References: <bpaej8$cev$2@sea.gmane.org>
	<16312.53618.652677.190274@montanaro.dyndns.org>
	<bpalne$ufs$1@sea.gmane.org>
Message-ID: <16313.4948.609409.61316@montanaro.dyndns.org>

>>>>> "Martin" == Martin Stone Davis <m0davis@pacbell.net> writes:

    Martin> Skip Montanaro wrote:
    Martin> I recently started this thread on the POPFile forum, but it
    Martin> applies just as well to SpamBayes.
    >> 
    Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
    >> 
    >> See my note from Sunday on spambayes-dev:
    >> 
    >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html

    Martin> Wouldn't it be nice if there were some middle ground between
    Martin> continuing to train the huge immovable database and starting
    Martin> over fresh?

Sure, it would, but why propagate mistakes, even if they are smaller in
magnitude?  I should have continued my previous message instead of leaving
people to draw their own conclusions.  With a small database, if you have an
error, it's easier to find, and if you can't find it, starting from scratch
is not a big problem.  With a large database there's this feeling that,
"but... but... but...  I'll be throwing away all that *good* data and all my
(valuable) work!"

    Martin> Having to train 100% of incoming messages after starting over is
    Martin> real work, and especially frustrating when you *know* that
    Martin> 80-90% would have been correctly classified anyway if only you
    Martin> hadn't started over.

If you only train on mistakes and unsures (as many of us appear to do now),
then the effort is lessened.  I don't see any practical benefit to training
on every Python-related message I receive as ham.  I currently have about 20
in my training database.  If I was smart, I could probably figure out how to
reduce that number.  As far as I can tell, nearly every valid Python-related
message I receive gets a ham score of 0.00 (rounded).  None get scored as
unsure or spam.  How long should I beat that particular dead horse?

Since blowing away my gazillion message training database I've started from
scratch twice.  Considering the volume of mail I get, getting back to a
250-message training database is little effort at all for me.  SpamBayes
seems to start scoring most stuff pretty well after seeing just a few hams
and spams, so the cost is minimal.  The problem with spam is that it varies
all over the map (subject wise).  My hams fall into just a few categories
though, so good messages begin to be correctly classified almost
immediately.  Spam tends to linger in the unsure category much longer.  My
current approach to that problem is to try and push my spam_cutoff down
further.

If you want to seed a training database, you might try initially adding just
the most recent message from each of your active ham mailboxes.  I could add
just ten messages and be almost certain they would all be useful indicators
of ham.  Once I've added a few spams, I'd probably see pretty good
classification results.

Given a 20k-message training database which contains mistakes, I will have a
hard time finding and correcting those mistakes.  Your approach is to reduce
the magnitude of the mistakes by reducing the weight of the current training
database.  I effectively take the same approach, it's just that I've
actually deleted the mistakes.  I've thrown the baby out with the bath water
(you just shrink your babies ;-), but I get plenty of babies in my incoming
mail feed.  If I'm careful, perhaps I'll avoid introducing the same mistakes
next time.

    Martin> Doesn't that sound like a good idea?

I suppose.  Mine doesn't require any new code to be written though.

I'm really not saying your idea is bad, just that mine ought to be "good
enough" and requires no extra code to be written.  You should be able to
write a little Python script which will march through your database and
reduce the counts by appropriate amounts.  You will have to be aware of a
couple corner conditions:

    * The counts for some words will round to zero.  You have to decide
      whether to keep them as hapaxes or delete them altogether.

    * Roundoff error might leave you with some assertion errors like the
      dreaded 

        assert hamcount <= nham
        assert spamcount <= nspam 

      You'll also have to take care to avoid that case.
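
A minimal sketch of such a script, covering both corner conditions, is below.
It works on a plain dict mapping each word to a (hamcount, spamcount) pair
rather than on the real SpamBayes database classes, and shrink_counts() and
its arguments are hypothetical names chosen for illustration.

```python
def shrink_counts(wordinfo, nham, nspam, shrink, keep_hapaxes=True):
    """Return (new_wordinfo, new_nham, new_nspam) with every count shrunk.

    shrink is any function mapping an old count to a smaller new count.
    """
    # Never let a non-empty corpus shrink to zero messages.
    new_nham = max(shrink(nham), 1) if nham else 0
    new_nspam = max(shrink(nspam), 1) if nspam else 0
    new_wordinfo = {}
    for word, (hamcount, spamcount) in wordinfo.items():
        # Clamp so that hamcount <= nham and spamcount <= nspam still hold.
        new_ham = min(shrink(hamcount), new_nham)
        new_spam = min(shrink(spamcount), new_nspam)
        if new_ham == 0 and new_spam == 0:
            if not keep_hapaxes or (hamcount == 0 and spamcount == 0):
                continue                # delete the word altogether
            if hamcount >= spamcount:   # keep it as a hapax on its stronger side
                new_ham = 1
            else:
                new_spam = 1
        new_wordinfo[word] = (new_ham, new_spam)
    return new_wordinfo, new_nham, new_nspam
```

For example, shrink_counts(db, nham, nspam, lambda c: c // 2) halves every
count and keeps words that would otherwise vanish as hapaxes; passing
keep_hapaxes=False deletes them instead.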

One thing I tried in the past was to whack off the oldest 10%-20% of my
training database and retrain on the result.  That's another option to try
to remove errors.  If you as a trainer get better at your job, over time you
will also reduce the number of mistakes in your training database.  This
approach also has the pleasant side effect of deleting old messages, keeping
your training data more current as the nature of spam shifts.  If you
initially trained on a large body of saved mail though, you might wind up
whacking out many/most/all the clues pertaining to a particular subject area
and have to add some new messages in to compensate.
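
Whacking off the oldest slice can be sketched like this, assuming each
training message is represented as a (timestamp, message_id) pair.  That
representation and the drop_oldest() name are illustrative only, and the
retraining step itself is left out.

```python
def drop_oldest(messages, fraction=0.15):
    """Return the messages left after discarding the oldest fraction.

    messages is a list of (timestamp, message_id) pairs in any order.
    """
    ordered = sorted(messages, key=lambda item: item[0])
    cut = int(len(ordered) * fraction)   # number of oldest messages to drop
    return ordered[cut:]
```

The survivors are what you would then retrain on from scratch.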

Skip

From popiel at wolfskeep.com  Mon Nov 17 14:08:26 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov 17 14:08:33 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: Message from "Kenny Pitt" <kennypitt@hotmail.com> 
	of "Mon, 17 Nov 2003 10:42:23 EST." <E1ALlWk-0007h5-EU@mail.python.org> 
References: <E1ALlWk-0007h5-EU@mail.python.org> 
Message-ID: <20031117190826.5EB772DF1B@cashew.wolfskeep.com>

In message:  <E1ALlWk-0007h5-EU@mail.python.org>
             "Kenny Pitt" <kennypitt@hotmail.com> writes:
>Tim Peters wrote:
>> Sigh -- we need solid research on training disciplines that work
>> great in real-life use, respecting that anything requiring human
>> input will barely get used except by geeks who never tire of watching
>> the training process.

FWIW, this sort of research is what I built the incremental harness
for.  It really ought to be named something like the time-sequence
harness, but I didn't think of that at the time.

In any case, using the harness, you can specify (in regimes.py) any
particular training behaviour you want.  Using that, you can run
cv-esque tests to check effectiveness.

Unfortunately, after building the harness, I lost all will to actually
use it. :-/

>To try to work around the problem, I implemented two experimental
>options to train on all certain ham and train on all certain spam.
>Since I can turn them on or off independently, I can use them to get my
>ratio back in balance and then turn them off.  What I'd like to
>implement is a way to do this automatically.  I'd like to say something
>like, "If my spam count reaches twice my ham count then train on all
>certain hams until the counts are within 5% of each other again."  These
>cutoffs would of course be configurable.

This is a training behaviour which is easily emulated using the harness
above.  I'd love to see some quantitative numbers on it vs. training on
everything or training on just mistakes and unsures (both of which are
preexisting regimes).

>It will take me a little while to get around to implementing this and
>even longer to see if it is effective, but I'll report results (or at
>least perceptions) when I have them.

Cool.

- Alex

From popiel at wolfskeep.com  Mon Nov 17 14:26:28 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov 17 14:27:18 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Sun,
	16 Nov 2003 21:13:16 CST."
	<16312.15564.157062.319322@montanaro.dyndns.org> 
References: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
	<20031115234251.228272DF6A@cashew.wolfskeep.com>
	<16312.15564.157062.319322@montanaro.dyndns.org> 
Message-ID: <20031117192628.B11E32DF1B@cashew.wolfskeep.com>

In message:  <16312.15564.157062.319322@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>It's not an issue of 20-40 megabytes, it's how many messages are represented
>by that file.  In my case, I had a training database of around 21MB and on
>the order of 10,000 ham and somewhat fewer spam (maybe 7,000 or so),
>depending on how aggressively I'd been training and how recently I'd whacked
>off the oldest 10%-20% of my messages.
>
>I think there's a psychological hurdle to overcome to simply throw away
>17,000 messages, even if it's not working optimally, because it does
>represent a substantial time investment.  That hurdle is much lower when
>your training database is under 500 messages.  Heck, I can rebuild one of
>that size in next to no time.

I agree that the time investment is an issue... even more than the message
count.  I have a scheduled job that runs every night and retrains from
scratch, but if I were doing it manually, then I too would probably
hesitate to whack the database.  However, with the foreknowledge that
I'd want to whack the database, I set stuff up (saving all mail, and
the categorization of said mail) so it would be easy to retrain.

Perhaps what we really need is some easy way to allow people to retrain
in bulk... with the data required for doing so collected by default
instead of only by unusual forethought.

>Here's something I think would be interesting.  At the moment I have about
>40 unsures awaiting a decision from me (train or discard).  I'm trying
>consciously to be conservative.  What I'd like to know is which message, if
>added to my training database, would have the greatest effect on the scores
>of the other unsure messages.  That would help me decide which ones yield
>the most benefit.

I tend to think that you're over-optimizing... many times over, this
project has shown that stupid beats smart.

>OTOH, maybe I'd do just as well to train on every fourth
>unsure or select unsures to train on with a probability of 0.25 (1/4 picked
>purely out of thin air, so don't ask where I got it :-).

I believe (without proof) this is true.
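
Both of the "stupid" selection rules are a couple of lines each.  The sketch
below assumes the unsures are simply held in a list; every_nth() and
pick_randomly() are illustrative names, not part of the harness or of
regimes.py.

```python
import random


def every_nth(unsures, n=4):
    """Pick every n-th unsure message (the 4th, 8th, 12th, ...)."""
    return unsures[n - 1::n]


def pick_randomly(unsures, probability=0.25, rng=None):
    """Pick each unsure message independently with the given probability."""
    rng = rng or random.Random()
    return [msg for msg in unsures if rng.random() < probability]
```

With 40 unsures, every_nth() always picks exactly 10, while pick_randomly()
picks 10 on average.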

- Alex

From skip at pobox.com  Mon Nov 17 16:04:41 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 16:04:54 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <20031117190826.5EB772DF1B@cashew.wolfskeep.com>
References: <E1ALlWk-0007h5-EU@mail.python.org>
	<20031117190826.5EB772DF1B@cashew.wolfskeep.com>
Message-ID: <16313.14313.535777.235945@montanaro.dyndns.org>


    Alex> FWIW, this sort of research is what I built the incremental
    Alex> harness for.  It really ought to be named something like the
    Alex> time-sequence harness, but I didn't think of that at the time.

I gather that several files in testtools are related to your harness?  A
quick read of incremental.HOWTO.txt suggests:

    incremental.*
    regimes.py
    mksets.py
    dotest.sh

Anything else?  It seems that these files dominate that directory.  Maybe we
should create a time-sequence or incremental-test subdirectory and push them
into that.

Skip

From skip at pobox.com  Mon Nov 17 16:08:44 2003
From: skip at pobox.com (Skip Montanaro)
Date: Mon Nov 17 16:08:58 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <20031117192628.B11E32DF1B@cashew.wolfskeep.com>
References: <LNBBLJKPBEHFEDALKOLCIEGPHCAB.tim.one@comcast.net>
	<20031115234251.228272DF6A@cashew.wolfskeep.com>
	<16312.15564.157062.319322@montanaro.dyndns.org>
	<20031117192628.B11E32DF1B@cashew.wolfskeep.com>
Message-ID: <16313.14556.545461.7393@montanaro.dyndns.org>


    >> Here's something I think would be interesting.  At the moment I have
    >> about 40 unsures awaiting a decision from me (train or discard).  I'm
    >> trying consciously to be conservative.  What I'd like to know is which
    >> message, if added to my training database, would have the greatest
    >> effect on the scores of the other unsure messages.  That would help
    >> me decide which ones yield the most benefit.

    Alex> I tend to think that you're over-optimizing... many times over,
    Alex> this project has shown that stupid beats smart.

Agreed, but we're in more-or-less uncharted territory here.  We all know
that testing strategies haven't received nearly the attention that the basic
algorithm has.  My unsures are dominated by spams at the moment.  I'm just
experimenting with this stuff and trying to be careful about getting my
ham/spam ratio too out-of-whack.

Skip

From tim.one at comcast.net  Mon Nov 17 16:13:57 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 17 16:13:52 2003
Subject: [spambayes-dev] Re: Idea to re-energize corpus learning
In-Reply-To: <bpalne$ufs$1@sea.gmane.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEAMHDAB.tim.one@comcast.net>

[Martin Stone Davis]
> ...
> So why not soften the blow?  That's what my proposal amounts to:
> achieving some sort of middle ground between the status quo and
> starting over.  After performing a "Soften training SEVERELY" (where
> the counts are all set to their square roots), messages would still
> be classified in more-or-less the same way.

You can't know that without running serious tests, and it sounds like
something tests would prove wrong.  SpamBayes effectively computes spamprobs
from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y):  the effective relative ratios
would also get "square rooted", and that's likely to cause massive changes
in scoring.

"The usual" way (in many fields) to diminish counts that have grown "too
large" is to add 1, then shift right by a bit.  The purpose of adding 1
first is to prevent an original count of 1 from becoming 0.  Other than
that, it's basically "cut all the counts in half".  Then (x/2)/(y/2) = x/y,
so that relative ratios aren't affected (much; counts 2*i+1 and 2*i+2, for
any i >= 0, are both reduced to i+1, so relative ratios can still change
some, and especially for small i).
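
A small numeric illustration of the difference, looking only at raw count
ratios (the real spamprob computation also involves the total ham and spam
message counts, which this sketch ignores; halve() and sqrt_soften() are names
made up here):

```python
import math


def halve(count):
    """Add 1, then shift right by a bit: 1 -> 1, 2 -> 1, 3 -> 2, 4 -> 2, ..."""
    return (count + 1) >> 1


def sqrt_soften(count):
    """Replace a count by its (rounded) square root."""
    return int(round(math.sqrt(count)))


spamcount, hamcount = 400, 4                              # original ratio: 100
print(sqrt_soften(spamcount) / sqrt_soften(hamcount))     # 10.0
print(halve(spamcount) / halve(hamcount))                 # 100.0
```

Square-rooting turns a 100:1 clue into a 10:1 clue, while halving leaves it
at 100:1.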

> However, further training would then be far more effective, since the
> counts would be lower.
>
> Doesn't that sound like a good idea?

If test results say that it is, yes; otherwise no.  A problem with
artificially mangling token counts is that you'll probably lose the ability
to meaningfully untrain a message again (the relationship between token
counts and total number of ham and spam trained on is destroyed by reducing
only one of them, but if you reduce the total counts too then you've got
more messages you *could* untrain on than the (reduced) total count believes
is possible; untraining anyway will then lead to worsening inaccuracy until
the reduced total count "goes negative", at which point the code will
probably blow up, or start to deliver pure nonsense results).

> -Martin
>
> P.S. I'm also sure that POPfile learns just as quickly as SpamBayes,
> since they are based on the same principle.

Sorry, but unless you've tested this, you have no basis for such a claim.
May be true, may be false, but "same principle" doesn't determine it a
priori (overlooking that the ways in which SpamBayes and POPfile determine a
category actually have very little in common).


From popiel at wolfskeep.com  Mon Nov 17 16:17:23 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Mon Nov 17 16:17:29 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Mon,
	17 Nov 2003 15:04:41 CST."
	<16313.14313.535777.235945@montanaro.dyndns.org> 
References: <E1ALlWk-0007h5-EU@mail.python.org>
	<20031117190826.5EB772DF1B@cashew.wolfskeep.com>
	<16313.14313.535777.235945@montanaro.dyndns.org> 
Message-ID: <20031117211723.CFCA32DF1B@cashew.wolfskeep.com>

In message:  <16313.14313.535777.235945@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    Alex> FWIW, this sort of research is what I built the incremental
>    Alex> harness for.  It really ought to be named something like the
>    Alex> time-sequence harness, but I didn't think of that at the time.
>
>I gather that several files in testtools are related to your harness?  A
>quick read of incremental.HOWTO.txt suggests:
>
>    incremental.*
>    regimes.py
>    mksets.py
>    dotest.sh
>
>Anything else?  It seems that these files dominate that directory.  Maybe we
>should create a time-sequence or incremental-test subdirectory and push them
>into that.

Docs:
  incremental.HOWTO.txt
  incremental.TODO.txt
Prep:
  es2hs.py
  sort+group.py
  mksets.py
Actual harness:
  incremental.py
  regimes.py
Analysis:
  mkgraph.py
Handy wrapper script:
  dotest.sh

Total of 9 files out of 19 in the testtools dir... not quite to domination,
but close. ;-)  If you think they should be pushed further down a directory
hole, that's OK with me... just update the sys.path mangling at the top of
relevant files to be sure to grab spambayes from the local tree and not
some installed version...
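
As a rough sketch of that kind of sys.path adjustment (the helper name and
the directory layout are assumptions made for illustration, not what the
scripts currently do):

```python
import os
import sys

def prefer_local_tree(script_path, levels_up):
    """Put the directory `levels_up` above the script first on sys.path."""
    root = os.path.dirname(os.path.abspath(script_path))
    for _ in range(levels_up):
        root = os.path.dirname(root)
    if root in sys.path:
        sys.path.remove(root)
    sys.path.insert(0, root)
    return root

# Hypothetical layout: <checkout>/testtools/incremental/incremental.py,
# with the spambayes package directly under <checkout>.
root = prefer_local_tree("/checkout/testtools/incremental/incremental.py", 2)
print(root)
```

Moving a script one directory deeper means bumping levels_up by one, so that
"import spambayes" still resolves to the checkout rather than to an
installed copy.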

FWIW, the prep scripts can be used with stuff other than this harness, too.
They just build the semi-standard 1-message-per-file Data/{Ham,Spam}/Set*
testing tree with specially named files (names indicating sequence and
grouping information).
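
A small sketch of walking such a tree.  The file names used here are
placeholders; the real naming scheme (sequence and grouping information)
is not reproduced:

```python
import glob
import os
import tempfile

def list_sets(data_dir):
    """Map 'Ham'/'Spam' to {set name: sorted file names} under data_dir."""
    tree = {}
    for kind in ("Ham", "Spam"):
        tree[kind] = {}
        for set_dir in sorted(glob.glob(os.path.join(data_dir, kind, "Set*"))):
            tree[kind][os.path.basename(set_dir)] = sorted(os.listdir(set_dir))
    return tree

# Build a tiny throwaway tree so the sketch runs anywhere.
# The file names here are placeholders, not the real naming scheme.
data_dir = os.path.join(tempfile.mkdtemp(), "Data")
for kind in ("Ham", "Spam"):
    for n in (1, 2):
        set_dir = os.path.join(data_dir, kind, "Set%d" % n)
        os.makedirs(set_dir)
        with open(os.path.join(set_dir, "msg-%d.txt" % n), "w") as f:
            f.write("placeholder message\n")

print(list_sets(data_dir))
```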

- Alex

From m0davis at pacbell.net  Mon Nov 17 20:03:12 2003
From: m0davis at pacbell.net (Martin Stone Davis)
Date: Mon Nov 17 20:03:37 2003
Subject: [spambayes-dev] Re: Idea to re-energize corpus learning
In-Reply-To: <LNBBLJKPBEHFEDALKOLCMEAMHDAB.tim.one@comcast.net>
References: <bpalne$ufs$1@sea.gmane.org>
	<LNBBLJKPBEHFEDALKOLCMEAMHDAB.tim.one@comcast.net>
Message-ID: <bpbr4v$784$1@sea.gmane.org>

Tim Peters wrote:

> [Martin Stone Davis]
> 
>>...
>>So why not soften the blow?  That's what my proposal amounts to:
>>achieving some sort of middle ground between the status quo and
>>starting over.  After performing a "Soften training SEVERELY" (where
>>the counts are all set to their square roots), messages would still
>>be classified in more-or-less the same way.
> 
> 
> You can't know that without running serious tests, and it sounds like
> something tests would prove wrong.  SpamBayes effectively computes spamprobs
> from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y):  the effective relative ratios
> would also get "square rooted", and that's likely to cause massive changes
> in scoring.

Yes, scores in my system would get pushed closer to 1.  Which means it 
should act a little more "unsure" about all the words.  I don't see 
anything so terrible about that, but it's something to keep in mind.

> 
> "The usual" way (in many fields) to diminish counts that have grown "too
> large" is to add 1, then shift right by a bit.  The purpose of adding 1
> first is to prevent an original count of 1 from becoming 0.  Other than
> that, it's basically "cut all the counts in half".  Then (x/2)/(y/2) = x/y,
> so that relative ratios aren't affected (much; counts 2*i+1 and 2*i+2, for
> any i >= 0, are both reduced to i+1, so relative ratios can still change
> some, and especially for small i).

This way would be fine too.  As long as the counts are reduced somehow, 
I'd achieve the goal of making further training more effective.  I will 
try it though, so thanks for the tip.

> 
> 
>>However, further training would then be far more effective, since the
>>counts would be lower.
>>
>>Doesn't that sound like a good idea?
> 
> 
> If test results say that it is, yes; otherwise no.  A problem with
> artificially mangling token counts is that you'll probably lose the ability
> to meaningfully untrain a message again (the relationship between token
> counts and total number of ham and spam trained on is destroyed by reducing
> only one of them, but if you reduce the total counts too then you've got
> more messages you *could* untrain on than the (reduced) total count believes
> is possible; untraining anyway will then lead to worsening inaccuracy until
> the reduced total count "goes negative", at which point the code will
> probably blow up, or start to deliver pure nonsense results).

True, but the whole point of my system is that I don't want to have to 
go over previously trained stuff to try to make it work better.  So the 
fact that it's tough to meaningfully untrain messages after softening is 
no problem for me.

(Hmmm, you might still do it: train A, soften, train B, harden, untrain 
A.  That should be kinda meaningful, if a little confusing.  But again, 
it's not a big issue for me.)

> 
> 
>>-Martin
>>
>>P.S. I'm also sure that POPfile learns just as quickly as SpamBayes,
>>since they are based on the same principle.
> 
> 
> Sorry, but unless you've tested this, you have no basis for such a claim.
> May be true, may be false, but "same principle" doesn't determine it a
> priori (overlooking that the ways in which SpamBayes and POPfile determine a
> category actually have very little in common).

True, but I was just expressing my confidence in Skip's assertion to the 
same effect.  I'll be more careful next time.  :P

-Martin

P.S. Someone posted a hack to POPfile which will let me test this idea. 
  So that makes one tester...  I'll try both the "square root" method 
and the "cut all the counts in half" method.


From m0davis at pacbell.net  Mon Nov 17 20:22:09 2003
From: m0davis at pacbell.net (Martin Stone Davis)
Date: Mon Nov 17 20:22:30 2003
Subject: [spambayes-dev] Re: Idea to re-energize corpus learning
In-Reply-To: <16313.4948.609409.61316@montanaro.dyndns.org>
References: <bpaej8$cev$2@sea.gmane.org>	<16312.53618.652677.190274@montanaro.dyndns.org>	<bpalne$ufs$1@sea.gmane.org>
	<16313.4948.609409.61316@montanaro.dyndns.org>
Message-ID: <bpbs8f$9ik$1@sea.gmane.org>

Skip Montanaro wrote:

>>>>>>"Martin" == Martin Stone Davis <m0davis@pacbell.net> writes:
> 
> 
>     Martin> Skip Montanaro wrote:
>     Martin> I recently started this thread on the POPFile forum, but it
>     Martin> applies just as well to SpamBayes.
>     >> 
>     Martin> https://sourceforge.net/forum/forum.php?thread_id=972652&forum_id=213099
>     >> 
>     >> See my note from Sunday on spambayes-dev:
>     >> 
>     >> http://mail.python.org/pipermail/spambayes-dev/2003-November/001679.html
> 
>     Martin> Wouldn't it be nice if there were some middle ground between
>     Martin> continuing to train the huge immovable database and starting
>     Martin> over fresh?
> 
> Sure, it would, but why propagate mistakes, even if they are smaller in
> magnitude?  I should have continued my previous message instead of leaving
> people to draw their own conclusions.  With a small database, if you have an
> error, it's easier to find, and if you can't find it, starting from scratch
> is not a big problem.  With a large database there's this feeling that,
> "but... but... but...  I'll be throwing away all that *good* data and all my
> (valuable) work!"
> 
>     Martin> Having to train 100% of incoming messages after starting over is
>     Martin> real work, and especially frustrating when you *know* that
>     Martin> 80-90% would have been correctly classified anyway if only you
>     Martin> hadn't started over.
> 
> If you only train on mistakes and unsures (as many of us appear to do now),
> then the effort is lessened.  I don't see any practical benefit to training
> on every Python-related message I receive as ham.  I currently have about 20
> in my training database.  If I was smart, I could probably figure out how to
> reduce that number.  As far as I can tell, nearly every valid Python-related
> message I receive gets a ham score of 0.00 (rounded).  None get scored as
> unsure or spam.  How long should I beat that particular dead horse?
> 
> Since blowing away my gazillion message training database I've started from
> scratch twice.  Considering the volume of mail I get, getting back to a
> 250-message training database is little effort at all for me.  SpamBayes
> seems to start scoring most stuff pretty well after seeing just a few hams
> and spams, so the cost is minimal.  The problem with spam is that it varies
> all over the map (subject wise).  My hams fall into just a few categories
> though, so good messages begin to be correctly classified almost
> immediately.  Spam tends to linger in the unsure category much longer.  My
> current approach to that problem is to try and push my spam_cutoff down
> further.
> 
> If you want to seed a training database, you might try initially adding just
> the most recent message from each of your active ham mailboxes.  I could add
> just ten messages and be almost certain they would all be useful indicators
> of ham.  Once I've added a few spams, I'd probably see pretty good
> classification results.
> 
> Given a 20k-message training database which contains mistakes, I will have a
> hard time finding and correcting those mistakes.  Your approach is to reduce
> the magnitude of the mistakes by reducing the weight of the current training
> database.  I effectively take the same approach, it's just that I've
> actually deleted the mistakes.  I've thrown the baby out with the bath water
> (you just shrink your babies ;-), but I get plenty of babies in my incoming
> mail feed.  If I'm careful, perhaps I'll avoid introducing the same mistakes
> next time.
> 
>     Martin> Doesn't that sound like a good idea?
> 
> I suppose.  Mine doesn't require any new code to be written though.
> 
> I'm really not saying your idea is bad, just that mine ought to be "good
> enough" and requires no extra code to be written.  

I get your point.  But for whatever reason, I am just much less tolerant 
than you of having to futz with the training database.  Even if it isn't 
*perfect*, I feel better about shrinking those babies than throwing them 
out, since I really *hate* having to meet new babies.  Okay, we've 
stretched that analogy far enough!

> You should be able to
> write a little Python script which will march through your database and
> reduce the counts by appropriate amounts.  You will have to be aware of a
> couple corner conditions:
> 
>     * The counts for some words will round to zero.  You have to decide
>       whether to keep them as hapaxes or delete them altogether.
> 
>     * Roundoff error might leave you with some assertion errors like the
>       dreaded 
> 
>         assert hamcount <= nham
>         assert spamcount <= nspam 
> 
>       You'll also have to take care to avoid that case.

Ah, but you see: I'm too lazy to learn enough Python to get that to 
work.  But if I ever do try, thanks for the pointers.
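
For what it's worth, the two corner conditions Skip lists might be handled
along these lines.  This is only a sketch: the database is modelled as a
plain dict of token -> (hamcount, spamcount), which is an assumption and
not the real storage layout, and dropping zeroed tokens is just one of the
two policies he mentions:

```python
def soften(wordinfo, nham, nspam):
    """Halve every count in a toy database, handling two corner cases.

    wordinfo maps token -> (hamcount, spamcount).  This is a plain dict
    standing in for the real database, whose layout is not shown here.
    """
    new_nham = (nham + 1) >> 1
    new_nspam = (nspam + 1) >> 1
    softened = {}
    for token, (hamcount, spamcount) in wordinfo.items():
        # Plain integer halving: a count of 1 rounds down to 0.
        hamcount //= 2
        spamcount //= 2
        if hamcount == 0 and spamcount == 0:
            continue            # policy choice: drop the token altogether
        # Defensive clamp so a count can never exceed the message total.
        hamcount = min(hamcount, new_nham)
        spamcount = min(spamcount, new_nspam)
        softened[token] = (hamcount, spamcount)
    return softened, new_nham, new_nspam

wordinfo = {"python": (7, 0), "free": (2, 9), "zemstvo": (1, 0)}
softened, nham, nspam = soften(wordinfo, nham=7, nspam=9)
for hamcount, spamcount in softened.values():
    assert hamcount <= nham
    assert spamcount <= nspam
print(softened, nham, nspam)
```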

> 
> One thing I tried in the past was to whack off the oldest 10%-20% of my
> training database and retrain on the result.  

Hold it right there.  Whack off?  hehe hehehe hheheheheheehehe.

> That's another option to try
> to remove errors.  If you as a trainer get better at your job, over time you
> will also reduce the number of mistakes in your training database.  This
> approach also has the pleasant side effect of deleting old messages, keeping
> your training data more current as the nature of spam shifts.  If you
> initially trained on a large body of saved mail though, you might wind up
> whacking out many/most/all the clues pertaining to a particular subject area
> and have to add some new messages in to compensate.

Let's call it the "kill the oldest babies" method.  I actually thought 
about that one first before I came up with the shrinking babies.  I 
figured that I would prefer shrinking them since I wouldn't usually know 
how much I liked those older babies.
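
The "oldest first" trimming could be sketched like this, assuming the
training messages are available as (date, text) pairs; that shape, and the
function name, are made up for the example:

```python
from datetime import date

def drop_oldest(messages, fraction):
    """Return messages minus the oldest `fraction`, sorted oldest first.

    messages is a list of (date, text) pairs; that shape is an assumption
    made for this sketch.
    """
    ordered = sorted(messages, key=lambda m: m[0])
    cut = int(len(ordered) * fraction)
    return ordered[cut:]

corpus = [(date(2003, month, 1), "msg %d" % month) for month in range(1, 11)]
kept = drop_oldest(corpus, 0.20)    # whack the oldest 20%
print(len(kept), kept[0][0])
```

The database would then be rebuilt by retraining from scratch on what is
kept, rather than by adjusting existing counts.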

Aghhhhhhhhh babies!

Thanks for the input,
-Martin


From sanjaydarisi at cox.net  Mon Nov 17 21:14:25 2003
From: sanjaydarisi at cox.net (Sanjay Darisi)
Date: Mon Nov 17 21:07:33 2003
Subject: [spambayes-dev] Accessing delivery time of an email message!
Message-ID: <3FB98081.6040204@cox.net>


If I want to access the time stamp on the email (Outlook), which 
property should I use? Is it PR_DELIVER_TIME that I need to use? It's a 
PT_SYSTIME type, so the documentation says that it is a PyTime object. 
So, I tried using time.ctime(int(deliverytime))  and it complains 
ValueError: unconvertible time

This is what I've done,

In SB\Outlook2000\msgstore.py

in class MAPIMsgStoreMsg,  I added PR_DELIVER_TIME to message_init_props 
. And

tag, deliverytime = prop_row[8]

self.deliverytime = deliverytime

And at the end, in the test() function

for msg in folder.GetMessageGenerator():
    print time.ctime(int(msg.deliverytime))

When I execute msgstore.py at the command prompt, it says ValueError: 
unconvertible time.  Am I missing anything obvious? How can I access the 
sent/delivery time of an email message in outlook?
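
One workaround that might be worth trying, offered as a guess rather than a
confirmed fix: build the time from calendar fields instead of calling
int() on the value.  The sketch assumes the value exposes year, month, day,
hour, minute and second attributes; FakePyTime is a hypothetical stand-in
so the example runs without Outlook, and is not the real pywin32 type:

```python
import datetime

class FakePyTime:
    """Stand-in for the value Outlook returns; NOT the real pywin32 type."""
    def __init__(self, year, month, day, hour, minute, second):
        self.year, self.month, self.day = year, month, day
        self.hour, self.minute, self.second = hour, minute, second

def to_datetime(t):
    """Build a datetime from calendar fields instead of calling int(t)."""
    return datetime.datetime(t.year, t.month, t.day, t.hour, t.minute, t.second)

stamp = to_datetime(FakePyTime(2003, 11, 17, 21, 14, 25))
print(stamp.ctime())
```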

Thank you in advance,
Sanjay.


From tim.one at comcast.net  Tue Nov 18 00:33:30 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 18 00:33:28 2003
Subject: [spambayes-dev] imbalance within ham or spam training sets? 
Message-ID: <LNBBLJKPBEHFEDALKOLCCEDNHDAB.tim.one@comcast.net>

[Tim, quite a while ago]
>> I'm not sure we've got the best guess
>> to 17 significant digits, though <wink>.  Make the imbalance wilder
>> and the by-counting spamprob gets wilder too:
>>
>> >>> h = 1./20000
>> >>> s = 1./100
>> >>> s/(h+s)
>> 0.99502487562189057
>> >>>
>>
>> That offends my intuition -- the word is so rare (2 of 20100 msgs)
>> that it's hard to believe that 99.5% is a sane guess.  The Bayesian
>> adjustment knocks it down a lot based on how few times it's been
>> seen in total:
>>
>> >>> (.45*.5 + 2.0*_)/(.45 + 2.0)
>> 0.90410193928317584
>> >>>

[Kenny Pitt]
> Wow, that's interesting.  I had always considered words that were
> either ham or spam, but never a little of both.  In a way it makes
> sense because 1/20000 ham is so close to zero that the word should be
> considered spammy.
>
> This seems even more scary, though.  Compare your last example to the
> case where the token has only been seen in 1 spam and no ham:
>
> >>> h = 0./20000
> >>> s = 1./100
> >>> s/(h+s)
> 1.0
> >>> (.45*.5 + 1.*_)/(.45 + 1.)
> 0.84482758620689669
> >>>
>
> The spam prob here is less than the case of 1 ham and 1 spam because
> of the "rare word" adjustment.  So, if the token has only been seen
> once in spam and is later seen once in ham, it gets spammier?  Yikes!
> If we go to h=10:
>
> >>> h = 10./20000
> >>> s = 1./100
> >>> s/(h+s)
> 0.95238095238095233
> >>> (.45*.5 + 11.*_)/(.45 + 11.)
> 0.93460178831357876
>
> And the spam prob is still going up!  So whenever we have an extreme
> imbalance like this, the first n occurrences of a token added to the
> larger corpus, where n depends on the size of the imbalance, actually
> causes the probability of the *opposite* classification to *increase*.

That's an excellent analysis, and I repeated it in full so that it's easier
to find later <wink>.  This is a systematically counterintuitive effect
that's inevitable when working with highly unbalanced training data.  I know
Gary Robinson is thinking "about stuff like this" now, and I hope he has
time to dream up a better way to cope.
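
The three sessions above can be folded into one function.  The constants
.45 and .5 are taken from those sessions; the names S and X and the
function name are labels chosen for this sketch:

```python
S = 0.45    # strength given to the prior, as in the sessions above
X = 0.5     # assumed spamprob for a word never seen before

def spamprob(hamcount, spamcount, nham, nspam):
    """Return (by-counting prob, adjusted prob) for one token."""
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    prob = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    adjusted = (S * X + n * prob) / (S + n)
    return prob, adjusted

# One spam sighting out of 100 spam, against 0, 1 and 10 ham sightings
# out of 20000 ham.
for hamcount in (0, 1, 10):
    prob, adjusted = spamprob(hamcount, 1, nham=20000, nspam=100)
    print(hamcount, round(prob, 4), round(adjusted, 4))
```

Run as is, the adjusted value climbs from about 0.845 to 0.904 to 0.935 as
the ham count goes from 0 to 1 to 10, matching the figures quoted above.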

One gloss:

> I had always considered words that were either ham or spam, but never a
> little of both.

Most words are like that!  If you dig thru your entire database, and ignore
hapaxes (words that appeared only once total across all training data), I
bet you'll find that few appeared only in ham or only in spam.  Ours is a
"preponderance of evidence" scheme, not a "smoking gun" scheme.  That's what
makes it hard to fool (no fixed word, or even collection of words, is/are
strong enough on their own to force a decision).
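
Anyone wanting to check that bet against their own data could start from
something like the following.  The dict of token -> (hamcount, spamcount)
is a toy stand-in for a real training database, and the counts in it are
invented:

```python
def one_sided_fraction(wordinfo):
    """Fraction of non-hapax tokens seen only in ham or only in spam.

    wordinfo maps token -> (hamcount, spamcount); a toy stand-in for
    a real training database.
    """
    non_hapax = [(h, s) for h, s in wordinfo.values() if h + s > 1]
    if not non_hapax:
        return 0.0
    one_sided = [1 for h, s in non_hapax if h == 0 or s == 0]
    return len(one_sided) / len(non_hapax)

# Invented counts: "zemstvo" is a hapax and is ignored.
toy = {"compiler": (40, 1), "sexy": (2, 55), "zemstvo": (1, 0), "mortgage": (0, 12)}
print(one_sided_fraction(toy))
```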


From tim.one at comcast.net  Tue Nov 18 10:53:21 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 18 10:53:12 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <200311170911.36239.tdickenson@devmail.geminidataloggers.co.uk>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEGHHDAB.tim.one@comcast.net>

[Toby Dickenson]
> I occasionally see the inverse problem. I train on every email I
> receive, including many hams containing lots of numbers like Jeremy
> sent you. Occasionally I get a spam where 2 or 3 numbers (in a price
> list, usually) are enough to classify it as ham.

If you train on everything, and you get substantially more ham than spam,
then your training data is unbalanced in a way that would (I think) push in
that direction.

> Would you have been as surprised by the same result if Jeremy had sent
> you a long list of effectively random words?

Yes, I'd expect that to tend toward unsure, given the way I've trained.

I tried generating a random email like so:

>>> f = file('/updates/word.lst')
>>> d = dict.fromkeys(f)
>>> len(d)
173528
>>> import random
>>> for w in random.sample(d, 300):
...    print w,

and then pasting the result into an email.  word.lst is just a list of
English words, one per line.
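
For anyone repeating the experiment on a current Python, a rough
equivalent follows.  The inline list stands in for the word file, and the
function name and seed parameter are additions for this sketch:

```python
import random

def random_words(words, k, seed=None):
    """Pick k distinct words at random and join them with spaces."""
    rng = random.Random(seed)
    unique = sorted(set(w.strip() for w in words if w.strip()))
    return " ".join(rng.sample(unique, k))

# A tiny inline word list stands in for a one-word-per-line file.
lines = ["compiler\n", "initial\n", "male\n", "sexy\n", "zemstvo\n", "burkites\n"]
body = random_words(lines, 3, seed=0)
print(body)
```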

That wasn't particularly revealing:  it scored as a low Unsure (22), but
very few of the words had ever been trained on, so were simply ignored (for
example, I had never trained on burkites, zemstvo, or morphallaxes before).
The few words that remained were solidly hammy (compiler, initial) or
solidly spammy (male, sexy), about the same number of each.  What pushed it
toward the ham side of unsure were the half-dozen header clues claiming that
the message was sent from me, and to me using my real name.

I tried again, boosting the # of random words to 3000, to try to stumble
into more I'd actually trained on.  As expected, that pushed it more toward
exactly Unsure:

Combined Score: 47% (0.465326)
Internal ham score (*H*): 0.571772
Internal spam score (*S*): 0.502424

Little integers are different for me, because while they show up in tons of
geek ham, I've trained on very little of that because that kind of stuff
rarely scores above 1, and almost never scores above my ham cutoff of 20.
So mistake-based training almost never trains on geek ham anymore.  My
non-geek friends don't write much about integers <wink>.
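
The mistake-based rule amounts to something like the sketch below.  The ham
cutoff of 20 is the one mentioned above; the spam cutoff of 90 is assumed
purely for illustration:

```python
HAM_CUTOFF = 20     # from the message above (scores on a 0-100 scale)
SPAM_CUTOFF = 90    # assumed for this sketch; not stated in the thread

def should_train(score, is_spam):
    """Train only on unsures and outright mistakes."""
    if HAM_CUTOFF <= score < SPAM_CUTOFF:
        return True                     # unsure: always worth training
    predicted_spam = score >= SPAM_CUTOFF
    return predicted_spam != is_spam    # confident but wrong

print(should_train(1, is_spam=False))   # confident, correct ham
print(should_train(22, is_spam=False))  # low unsure
print(should_train(95, is_spam=False))  # false positive
```

Geek ham scoring around 1 never triggers training under this rule, which is
why so little of it ends up in the database.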


From mhammond at skippinet.com.au  Tue Nov 18 21:09:19 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov 18 21:09:03 2003
Subject: [spambayes-dev] RE: Accessing delivery time of an email message!
In-Reply-To: <3FB98081.6040204@cox.net>
Message-ID: <117801c3ae42$24c5e750$0500a8c0@eden>

I'm not sure exactly what property you should use - there are a number of
time related properties for a message.  See dump_props.py, and fiddle with
the code there - this dumps all the date properties for a message.

Mark.

> -----Original Message-----
> From: Sanjay Darisi [mailto:sanjaydarisi@cox.net]
> Sent: Tuesday, 18 November 2003 1:14 PM
> To: spambayes-dev@python.org; mhammond@skippinet.com.au
> Subject: Accessing delivery time of an email message!
>
>
>
> If I want to access the time stamp on the email (Outlook), which
> property should I use? Is it PR_DELIVER_TIME that I need to
> use? It's a
> PT_SYSTIME type,  So the documentation says that it is pyTime object.
> So, I tried using time.ctime(int(deliverytime))  and it complains
> ValueError: unconvertible time
>
> This is what i've done,
>
> In SB\Outlook2000\msgstore.py
>
> in class MAPIMsgStoreMsg,  I added PR_DELIVER_TIME to
> message_init_props
> . And
>
> tag, deliverytime = prop_row[8]
>
> self.deliverytime = deliverytime
>
> And at the end, in the test() function
>
> for msg in folder.GetMessageGenerator():
>     print time.ctime(int(msg.deliverytime))
>
> When I execute msgstore.py at the command prompt, it says ValueError:
> unconvertible time.  Am I missing anything obvious? How can I
> access the
> sent/delivery time of an email message in outlook?
>
> Thank you in advance,
> Sanjay.
>
>


From mhammond at skippinet.com.au  Tue Nov 18 21:59:28 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Nov 18 21:59:12 2003
Subject: [spambayes-dev] Can't move items that are in the results list
	from an Outlook Find when SpamBayes is installed
In-Reply-To: <885BB3CAB85CBD44B73B52CFBC1FC55EAAF2D5@iu-mssg-mbx08.exchange.iu.edu>
Message-ID: <118801c3ae49$26afb1c0$0500a8c0@eden>

Hi,
  Please see the "troubleshooting guide" installed with SpamBayes for
information on how to report a bug, ensuring you attach the relevant log
files (also in the troubleshooting guide).

Regards,

Mark.

> -----Original Message-----
> From: spambayes-dev-bounces@python.org
> [mailto:spambayes-dev-bounces@python.org]On Behalf Of Eckert, Robert D
> Sent: Tuesday, 18 November 2003 3:22 AM
> To: spambayes-dev@python.org
> Cc: Eckert, Robert D
> Subject: [spambayes-dev] Can't move items that are in the results list
> from an Outlook Find when SpamBayes is installed
> 
> 
> Hi,
>    I am using Outlook 2002 with an Exchange 2002 server when I work.
> The copy of Outlook is locally installed on my PC which is running 
> Windows 2000 Professional. All software is up to date and patched.
> 
> When I do "Find" operation on Inbox and get a results list, then
> select all the items and attempt to drag and drop them into another
> folder in my folder list, Outlook says: Can't move the items in a
> dialog box.
> 
> When SpamBayes (or Qurb before SpamBayes) is installed, the move
> operation
> fails, yet when SpamBayes is not installed, the operation succeeds
> without 
> a problem.
> 
> Can you address what is happening here?
> 
> Thank you,
> 
> An otherwise *very* satisfied SpamBayes user.
> 
> -Bob
> 
> 
>                     Bob Eckert - Principal Analyst
>          eckert@indiana.edu (812) 855-7209 - (812) 855-8299 Fax
>                         Indiana University
>               University Information Technology Services
>                   University Information Services
>                  2711 East 10th Street - Room 101.5
>                       Bloomington, IN 47408
> 
> 
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev@python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: winmail.dat
Type: application/ms-tnef
Size: 2608 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/a5718a35/winmail.bin
From kennypitt at hotmail.com  Wed Nov 19 12:30:45 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov 19 12:31:14 2003
Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam bayes
	installation on Windows 2000pc
Message-ID: <E1AMWAX-0006w1-7R@mail.python.org>

tony.flury@bt.com wrote:
> The plug-in fails to initialise - I attach the log files from 4
> attempts - all of which mention permissions problems 
> 
>  <<spambayes1.log>>  <<spambayes3.log>>  <<spambayes2.log>> 
> <<spambayes1.log>> 
> 
> Any assistance would be useful

And in reply to a request for more info...

tony.flury@bt.com wrote:
> Outlook is 2002 (SP-2)
> 
> This is a new install into Outlook - outlook is running fine. Outlook
> was installed clean onto this PC.
> 
> Yes - the user I run under is not the Admin user.

This user appears to be experiencing a scenario that I had some concerns
about as I was testing the py2exe-based installer.  It would be great if
someone knowledgeable in this area could look into it.  Unfortunately,
that probably == Mark, as if he doesn't have enough to do. <wink>

Here's what I think is happening.  When we build the binary installer,
we have win32com pre-generate the typelib wrappers (gen_py cache) and
put them into the binary.  We do this using the Outlook 2000 typelib,
which has a typelib version of 9.0.

At runtime, win32com checks to see if that same typelib version is
installed.  If not, it checks to see if it can substitute a typelib with
a higher minor version number.  Outlook 2002 (XP) has a typelib version
of 9.1, and Outlook 2003 is version 9.2.

In this case, win32com does not find the version 9.0 typelib but it does
find the 9.1 typelib.  In order to substitute the newer typelib,
win32com attempts to regenerate the wrappers because they don't exist in
the binary, and it attempts to write them into the app installation
directory.  If the user doesn't have admin privileges then this fails.

Any thoughts on how we should handle this?  Should we include multiple
versions of the typelib wrappers?  Can we force win32com to output to
the user's temp directory?  Am I maybe just missing the root cause
entirely?
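
To make the version check I'm describing concrete, here is a sketch of the
logic as I understand it.  It is not win32com's actual code, and picking
the lowest newer minor version is my own assumption:

```python
def pick_typelib(wanted, installed):
    """Choose an installed (major, minor) typelib version for `wanted`.

    Prefer an exact match; otherwise accept the lowest installed version
    with the same major number and a higher minor number.
    """
    if wanted in installed:
        return wanted
    major, minor = wanted
    newer = sorted(v for v in installed if v[0] == major and v[1] > minor)
    return newer[0] if newer else None

print(pick_typelib((9, 0), [(9, 1)]))           # Outlook 2002 only
print(pick_typelib((9, 0), [(9, 0), (9, 2)]))   # exact match wins
print(pick_typelib((9, 0), [(8, 0)]))           # nothing usable
```

The trouble starts in the first case: the substitute is found, but wrappers
for it then have to be generated and written somewhere.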

-- 
Kenny Pitt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes1.log
Type: application/octet-stream
Size: 2912 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes1.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes3.log
Type: application/octet-stream
Size: 3267 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes3.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes2.log
Type: application/octet-stream
Size: 3267 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes2.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: spambayes1.log
Type: application/octet-stream
Size: 2912 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/42573c48/spambayes1-0001.obj
From mhammond at skippinet.com.au  Wed Nov 19 17:42:10 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Nov 19 17:41:54 2003
Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam
	bayes installation on Windows 2000pc
In-Reply-To: <E1AMWAX-0006w1-7R@mail.python.org>
Message-ID: <140101c3aeee$5ebf7140$0500a8c0@eden>

> Here's what I think is happening.  When we build the binary installer,
> we have win32com pre-generate the typelib wrappers (gen_py cache) and
> put them into the binary.  We do this using the Outlook 2000 typelib,
> which has a typelib version of 9.0.
>
> At runtime, win32com checks to see if that same typelib version is
> installed.  If not, it checks to see if it can substitute a
> typelib with
> a higher minor version number.  Outlook 2002 (XP) has a
> typelib version
> of 9.1, and Outlook 2003 is version 9.2.
>
> In this case, win32com does not find the version 9.0 typelib
> but it does
> find the 9.1 typelib.  In order to substitute the newer typelib,
> win32com attempts to regenerate the wrappers because they
> don't exist in
> the binary, and it attempts to write them into the app installation
> directory.  If the user doesn't have admin privileges then this fails.
>
> Any thoughts on how we should handle this?  Should we include multiple
> versions of the typelib wrappers?  Can we force win32com to output to
> the user's temp directory?  Am I maybe just missing the root cause
> entirely?

I think you are on the money.  However, the world has shifted.  Newer
versions will be released using py2exe, and will have the "gencache" inside
the .zip file.  win32com will then consider it "read-only", and in the
scenario you outline above *should* use the pre-generated 9.0 typelib (as it
has the same minor version).

My testing shows this to work fine with a 9.1 typelib.  I *hope* that the
logic still holds up when a 9.2 exists too <wink>.  The tlb
checking/validation code is pretty horrible and due for a cleanup (but I
didn't write it this time <wink>)

Mark.


From kennypitt at hotmail.com  Wed Nov 19 17:49:28 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov 19 17:49:57 2003
Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam
	bayes installation on Windows 2000pc
In-Reply-To: <140101c3aeee$5ebf7140$0500a8c0@eden>
Message-ID: <Law11-OE35wzGtts9ig00005d2d@hotmail.com>

Mark Hammond wrote:
>> In this case, win32com does not find the version 9.0 typelib
>> but it does
>> find the 9.1 typelib.  In order to substitute the newer typelib,
>> win32com attempts to regenerate the wrappers because they
>> don't exist in
>> the binary, and it attempts to write them into the app installation
>> directory.  If the user doesn't have admin privileges then this
>> fails. 
> 
> I think you are on the money.  However, the world has shifted.  Newer
> versions will be released using py2exe, and will have the "gencache"
> inside the .zip file.  win32com will then consider it "read-only",
> and in the scenario you outline above *should* use the pre-generated
> 9.0 typelib (as it has the same minor version).
> 
> My testing shows this to work fine with a 9.1 typelib.  I *hope* that
> the logic still holds up when a 9.2 exists too <wink>.  The tlb
> checking/validation code is pretty horrible and due for a cleanup
> (but I didn't write it this time <wink>)

I have all 3 typelibs on my system, so don't know if I'm getting the
complete picture.  When I run from the py2exe binary, I've never gotten
an error and it hasn't generated any new wrapper classes into the
dist\bin or dist\lib directories.  I'll try removing any wrappers from
site-packages\win32com\gen_py and my temp dir and run again to make sure
nothing gets regenerated elsewhere.

FYI, I installed the trial version of InBoxer and it has the same
problem since it is based on the old plug-in binary mechanism.  It
created a support\gen_py subdirectory in the app dir and wrote new
wrappers for my 9.2 typelib.

-- 
Kenny Pitt


From mhammond at skippinet.com.au  Wed Nov 19 18:23:35 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Nov 19 18:23:16 2003
Subject: [spambayes-dev] FW: [Spambayes] Problem with Spam
	bayes installation on Windows 2000pc
In-Reply-To: <Law11-OE35wzGtts9ig00005d2d@hotmail.com>
Message-ID: <142101c3aef4$282f0810$0500a8c0@eden>

Kenny:

> I have all 3 typelibs on my system, so don't know if I'm getting the
> complete picture.  When I run from the py2exe binary, I've
> never gotten
> an error and it hasn't generated any new wrapper classes into the
> dist\bin or dist\lib directories.

Excellent - that is the correct behaviour.  It should be exactly the same
regardless of what typelibs you have installed - including *none* of them.
The intent now is that win32com.gen_py knows it is frozen, so *never*
attempts to load typelibs, for either generation or version checking
purposes.  This has performance advantages even in the usual case when the
typelibs are all installed.
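
In outline, the decision looks something like this.  It is a sketch of the
idea only, not the win32com source; the two function names are invented,
and the one thing relied on is that a py2exe-built app has a "frozen"
attribute on the sys module:

```python
import sys

def gen_cache_is_read_only():
    """Treat the generated-wrapper cache as read-only in a frozen app.

    py2exe-style freezers set a `frozen` attribute on the sys module;
    a normal interpreter does not have it.
    """
    return bool(getattr(sys, "frozen", False))

def may_load_typelib():
    """A frozen app never loads typelibs, for generation or checking."""
    return not gen_cache_is_read_only()

print(gen_cache_is_read_only(), may_load_typelib())
```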

> I'll try removing any wrappers from
> site-packages\win32com\gen_py and my temp dir and run again
> to make sure
> nothing gets regenerated elsewhere.

Hopefully your installed Python and the py2exe dll should not be able to
conflict, even if they wanted to!  If you can trick anything into failing
that smells like it might be related, let me know.

> FYI, I installed the trial version of InBoxer and it has the same
> problem since it is based on the old plug-in binary mechanism.  It
> created a support\gen_py subdirectory in the app dir and wrote new
> wrappers for my 9.2 typelib.

Yeah, the "old" way sucks for a number of reasons :)

Mark.


From tim.one at comcast.net  Wed Nov 19 21:38:20 2003
From: tim.one at comcast.net (Tim Peters)
Date: Wed Nov 19 21:38:22 2003
Subject: [spambayes-dev] It's only money
Message-ID: <LNBBLJKPBEHFEDALKOLCAEPLHDAB.tim.one@comcast.net>

Anyone want to chip in $7,500.00 for this once-in-a-month opportunity?  If
so, just make your check out to me.

couldn't-give-or-receive-a-finer-gift-ly y'rs  - tim
-------------- next part --------------
An embedded message was scrubbed...
From: "Cristi Brown" <Cristi_Brown@infoworld.com>
Subject: [PSF-Board] InfoWorld Product Spotlight Advertising Program
	for SpamBayes
Date: Wed, 19 Nov 2003 15:17:27 -0500
Size: 4173
Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031119/ad3cbdc7/attachment-0001.mht
From anthony at interlink.com.au  Wed Nov 19 22:24:30 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Wed Nov 19 22:24:59 2003
Subject: [spambayes-dev] It's only money 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEPLHDAB.tim.one@comcast.net> 
Message-ID: <200311200324.hAK3OWXj027252@localhost.localdomain>


>>> "Tim Peters" wrote
> Anyone want to chip in $7,500.00 for this once-in-a-month opportunity?  If
> so, just make your check out to me.
> 
> couldn't-give-or-receive-a-finer-gift-ly y'rs  - tim

Jeez. One thing the SB project's done for me is to make me realise that most
of these PC review magazines are on the take. So much for the wall between
editorial and advertising...


From kennypitt at hotmail.com  Thu Nov 20 10:14:59 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Nov 20 10:15:53 2003
Subject: [spambayes-dev] FW: [Spambayes] Problem with Spambayes
	installation on Windows 2000pc
In-Reply-To: <142101c3aef4$282f0810$0500a8c0@eden>
Message-ID: <Law11-OE63Dz1jWXStI00006494@hotmail.com>

Mark Hammond wrote:
> Kenny:
>> I'll try removing any wrappers from
>> site-packages\win32com\gen_py and my temp dir and run again
>> to make sure
>> nothing gets regenerated elsewhere.
> 
> Hopefully your installed Python and the py2exe dll should not be able
> to conflict, even if they wanted to!  If you can trick anything into
> failing that smells like it might be related, let me know.

Just wanted to report the results of my test.  Everything seems to be
working as expected.

I renamed my site-packages\win32com\gen_py dir (and recreated it with
only the original __init__.py file to be safe).  I also renamed the
registry keys for the Outlook 2000 and Office 2000 typelib versions so
that they wouldn't be found.  I then registered the py2exe binary addin
and loaded Outlook.  SpamBayes ran fine, and no files were created in
the dist\bin, dist\lib, or site-packages\win32com\gen_py directories.

-- 
Kenny Pitt


From skip at pobox.com  Sat Nov 22 01:17:26 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 22 01:17:31 2003
Subject: [spambayes-dev] more selective Received: header mining...
Message-ID: <16318.65398.374469.490455@montanaro.dyndns.org>


I made a change to the mine_received_headers stuff this evening, adding a
new option, gateway_machines.  The idea is that the only Received: header
which is really useful is the one which crosses the boundary between your
known "good" network and the wild free-for-all part of the net.  Received:
headers from hosts internal to your network are meaningless, since for the
most part, all mail passes through them, while Received: headers from hosts
external to your network probably just contain random garbage which clogs
your database with meaningless tokens.

On the other hand, information from the point at which your mail system
receives a message can be useful.  You can trust your network's mail server
to at least get the IP address of the delivering host.  When processing
Received: headers, I use the gateway_machines option (a regular expression)
to detect when I first encounter an SMTP server I trust.  I have four useful
email addresses: skip@mojam.com, skip@pobox.com, skip@python.org and
montanaro@users.sourceforge.net, so I set gateway_machines to

    mojam\.com|pobox\.com|python\.org|sourceforge\.net

The attached context diff implements the change.  If you leave
gateway_machines an empty string, mine_received_headers will have its
original meaning.  If you set it to something, it will cause only the
earliest Received: header which matches your regular expression to be
processed.
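
As a rough illustration (a sketch only, not the attached diff), the selection
step might look something like the code below.  The helper name
select_received_headers is made up for this example, and it assumes
"earliest" means the oldest matching header in transit order, i.e. the
matching header closest to the bottom of the message.

```python
import re
from email import message_from_string


def select_received_headers(msg, gateway_machines):
    """Return the Received: headers worth tokenizing (hypothetical helper)."""
    received = msg.get_all("Received") or []
    if not gateway_machines:
        # Empty option: keep the original behaviour and mine every header.
        return received
    pattern = re.compile(gateway_machines, re.IGNORECASE)
    # Each relay prepends its header, so reversed() walks oldest to newest.
    for header in reversed(received):
        if pattern.search(header):
            return [header]
    # Assumption: if no trusted gateway is seen, mine nothing at all.
    return []


raw = (
    "Received: from inside.pobox.com by inbox.pobox.com; Sat, 22 Nov 2003\n"
    "Received: from relay.example.net ([192.0.2.7]) by mx.pobox.com; Sat, 22 Nov 2003\n"
    "Received: from nowhere by spammer.example; Sat, 22 Nov 2003\n"
    "Subject: test\n"
    "\n"
    "body\n"
)
msg = message_from_string(raw)
gateway = r"mojam\.com|pobox\.com|python\.org|sourceforge\.net"
selected = select_received_headers(msg, gateway)
print(selected)
```

In this made-up message the only header kept is the one written by
mx.pobox.com, since that is the first trusted host to touch the mail.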

It's hard to tell how well this will work, since improvements are
necessarily very small at this stage of the game.  It certainly seems like
it might be time-sensitive.  Machines which were open relays a year ago may
be closed off now, forcing spammers to use different routes to your mailbox.
I'm thinking it might be more helpful with small training databases and
small messages, as it adds more relevant clues for the classifier to munch
on.

My only testing to this point has been to see how it does on my current
unsure mailbox.  At the moment it contains about 50 messages, a mixture of
ham and spam (though mostly spam) which all scored unsure when they landed
there and which for one reason or another I have yet to delete or save
somewhere else.  Before enabling gateway_machines no messages scored as
spam.  After enabling it to the above regex and retraining from scratch
(~170 hams and 250 spams), three more messages from my unsure mailbox scored
as spam.

Not surprisingly, the number of 'received:' records in my training database
dropped substantially (from 2289 to 1254) after enabling this.

Finally, note that the couple of context diffs here were pulled out of
already modified versions of tokenizer.py and Options.py, so patch will
probably apply them with offsets.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: sb.diffs
Type: application/octet-stream
Size: 5414 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20031122/b4485c26/sb.obj
From skip at pobox.com  Sat Nov 22 23:40:29 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat Nov 22 23:40:50 2003
Subject: [spambayes-dev] "X-" as a prefix for experimental options
Message-ID: <16320.14909.793055.342411@montanaro.dyndns.org>


I think the easiest way for people to play with new options is if they are
in CVS instead of having to apply patches.  Posting context diffs doesn't
seem to be yielding a stampede of testers for several (trivial, perhaps)
recent ideas.  As an alternative, I propose that experimental options be
simply incorporated into CVS with an "X-" prefix on the option name (e.g.,
["Tokenizer", "X-gateway_machines"]) and that they always be off by default.
This allows a couple things to happen:

    * They would be more easily available to early adopters who might not
      have the usual facility we've come to expect with cvs and patch(1).
      As the Outlook plugin-using population continues to grow, the relative
      number of cvs-and-patch aficionados will dwindle.

    * They could be documented as experimental and included in a SpamBayes
      release.

    * User interfaces like sb_server.py or the Outlook plugin could
      recognize such options and display them in a distinctive manner which
      makes it clear they are experimental, and possibly even solicit
      feedback on them (particularly if such applications could report some
      relevant statistics where warranted).

    * Elevating such options to non-experimental status only requires
      removing the "X-" prefix from that option's use in distributed code.
      Instances of the "X-" prefixed names which remain in options files
      might elicit a warning, but still serve to set the now
      non-experimental option value.

    * The options parser could warn (but not fatally) about option file
      settings that have "X-" prefixes which don't correspond to actual
      options.  This way, the code which implements them could be ripped out
      if they are deemed not useful without fear that programs which use
      them will begin to fail, possibly silently in the case of
      non-interactive use.
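
To make the last two points concrete, here is a minimal sketch of how an
options parser might resolve names.  It is not the existing Options.py code;
resolve_option and the set of known options are hypothetical, and only
mine_received_headers and gateway_machines are names taken from this thread.

```python
import warnings


def resolve_option(section, name, known):
    """Map a (section, name) pair from an options file to a known option.

    Returns the key to set, or None if the setting should be skipped.
    `known` is a set of (section, name) pairs the code currently supports.
    """
    key = (section, name)
    if key in known:
        return key
    if name.startswith("X-"):
        promoted = (section, name[2:])
        if promoted in known:
            # Option graduated: still honour the old spelling, but say so.
            warnings.warn("%s:%s is no longer experimental; use %s"
                          % (section, name, name[2:]))
            return promoted
        # Experimental option was ripped out: warn, but do not fail.
        warnings.warn("ignoring unknown experimental option %s:%s"
                      % (section, name))
        return None
    raise KeyError("unknown option %s:%s" % (section, name))


known = {("Tokenizer", "mine_received_headers"),
         ("Tokenizer", "gateway_machines")}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    promoted_key = resolve_option("Tokenizer", "X-gateway_machines", known)
    removed_key = resolve_option("Tokenizer", "X-some_dead_idea", known)
print(promoted_key, removed_key, len(caught))
```

Only non-"X-" names that are unknown raise an error here, which keeps the
current strictness for ordinary options.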

Skip

From mhammond at skippinet.com.au  Sun Nov 23 00:55:40 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun Nov 23 00:55:51 2003
Subject: [spambayes-dev] "X-" as a prefix for experimental options
In-Reply-To: <16320.14909.793055.342411@montanaro.dyndns.org>
Message-ID: <194801c3b186$6d8bd340$0500a8c0@eden>

A problem I see is that the users will have no way of measuring any changes.
The binaries don't come with any of the test tools, and relying on lots of
people giving subjective results doesn't seem useful.

I think we need some kind of better, application based testing framework
first.  The scripts we use now predate all of the applications, and I can
never remember how to run them.  If I could just get a test tool to run
directly over Outlook folders, we would be much closer (for Outlook anyway
<wink>).  This needn't be too hard - just abstracting the test tools a
little so they allow sub-classes to extract the actual message streams for
the test runs.
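
Something along these lines is what I have in mind.  This is only a sketch
under my own assumptions: MessageSource, ListSource and run_test are
invented names, not anything in the current test scripts, and an Outlook
version would be another subclass that walks folders instead of lists.

```python
class MessageSource:
    """Supplies (text, is_spam) pairs; subclasses decide where they come from."""

    def messages(self):
        raise NotImplementedError


class ListSource(MessageSource):
    """Trivial in-memory source, standing in for an mbox or Outlook folder."""

    def __init__(self, ham, spam):
        self.ham, self.spam = ham, spam

    def messages(self):
        for text in self.ham:
            yield text, False
        for text in self.spam:
            yield text, True


def run_test(source, score, spam_cutoff=0.90):
    """Return (false_positives, false_negatives) for a scoring function."""
    fp = fn = 0
    for text, is_spam in source.messages():
        called_spam = score(text) >= spam_cutoff
        if called_spam and not is_spam:
            fp += 1
        elif is_spam and not called_spam:
            fn += 1
    return fp, fn


# Toy scorer: anything mentioning "buy" is called spam.
source = ListSource(["hello", "buy lunch"], ["buy pills", "hi"])
results = run_test(source, lambda text: 0.99 if "buy" in text else 0.01)
print(results)
```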

Ultimately, we end up with a simple way for either Outlook or sb_server to
run tests over the training sets, and report succinct results.  Otherwise, I
doubt anything will change in terms of the number of *users* running tests
(let alone developers <wink>)

Mark.


From skip at pobox.com  Sun Nov 23 08:59:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sun Nov 23 08:59:33 2003
Subject: [spambayes-dev] "X-" as a prefix for experimental options
In-Reply-To: <194801c3b186$6d8bd340$0500a8c0@eden>
References: <16320.14909.793055.342411@montanaro.dyndns.org>
	<194801c3b186$6d8bd340$0500a8c0@eden>
Message-ID: <16320.48442.882585.738466@montanaro.dyndns.org>


    Mark> If I could just get a test tool to run directly over Outlook
    Mark> folders, we would be much closer (for Outlook anyway <wink>).
    Mark> This needn't be too hard - just abstracting the test tools a
    Mark> little so they allow sub-classes to extract the actual message
    Mark> streams for the test runs.

You may be able to extend mboxutils.getmbox() to handle Outlook folders.
That would allow many tools (though not timcv.py) to handle them
automagically.  timcv.py may need more than sequential access to the
messages, however.  I've never looked at it.

    Mark> Ultimately, we end up with a simple way for either Outlook or
    Mark> sb_server to run tests over the training sets, and report succinct
    Mark> results.  Otherwise, I doubt anything will change in terms of the
    Mark> number of *users* running tests (let alone developers <wink>)

Yes, it would be nice to move in that direction.  PEP time? <wink>

Skip

From bernie at pobox.com  Mon Nov 24 05:47:52 2003
From: bernie at pobox.com (Bernard Payne)
Date: Mon Nov 24 05:47:53 2003
Subject: [spambayes-dev] Reviewing trained messages
Message-ID: <00a801c3b278$68f719f0$30132352@nec>

Hi -

Could you consider an entry in the FAQ section on how to access the database of messages which have been trained - i.e. in the case that you misclassify a message and want to sort it out?  In my case I have a message that was correctly classified, but it seems like it was not delivered to my Outlook Express inbox - a little odd I know, but if I could access the "already trained" database I could see if this is really the case or not.  I use the web interface pointing to localhost:8800 to review & train messages, in case this is relevant.

Thanks...Bernie Payne
From barry at python.org  Mon Nov 24 09:53:37 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 09:53:40 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
Message-ID: <1069685616.2365.17.camel@geddy>

I finally spent some time this weekend trying to integrate sb and Ximian
Evolution (I use 1.4.5 as my primary mail reader).  As others have
pointed out before, Evolution lets you create a filter, one of the
criteria of which can be "pipe message to shell script".  Then Evolution
can match the exit code of the script to determine what to do with the
message.

So I use sb_imapfilter.py to train a local database, then I run
sb_xmlrpcserver.py to do the scoring against this database.  I wrote a
small client called sb_score.py which basically just pipes stdin to the
server, calls XMLHammie.score() and compares the float return value
against spam_cutoff and ham_cutoff.  The script returns 0 for ham, 1 for
unsure, and 2 for spam (also -1 if there's an error).  Currently, I'm
just moving spam to a separate folder, leaving ham and unsure in my
inbox.  I'll probably refine that soon to move unsures as well.
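
The exit-code logic is roughly the following.  This is a sketch rather than
the uploaded script: the function names are made up, the cutoff defaults of
0.20 and 0.90 are assumed rather than read from a config file, and the
scoring call is passed in as a plain callable instead of going through the
XML-RPC server.

```python
import io
import sys

HAM, UNSURE, SPAM, ERROR = 0, 1, 2, -1


def exit_code_for(score, ham_cutoff=0.20, spam_cutoff=0.90):
    """Translate a spam probability into the exit code the filter matches on."""
    if score < ham_cutoff:
        return HAM
    if score < spam_cutoff:
        return UNSURE
    return SPAM


def classify_stream(score_func, stream):
    """Read one message from `stream`, score it, and return an exit code."""
    try:
        return exit_code_for(score_func(stream.read()))
    except Exception:
        # Note: a process exiting with -1 is seen as status 255 on POSIX.
        return ERROR


if __name__ == "__main__":
    # Hypothetical scorer that calls everything unsure; swap in a real one.
    code = classify_stream(lambda text: 0.5, io.StringIO("Subject: hi\n\nbody\n"))
    print(code)
    # A real script would finish with sys.exit(code) after reading sys.stdin.
```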

I've uploaded three patches to SF.  848311 is a small patch to
sb_imapfilter.py so that it honors html_ui::launch_browser when the -b
option is given.  I wanted it to start the web server and not start a
browser, but there didn't seem to be any way to make this happen.

In 848314 I had to make several changes to sb_xmlrpcserver.py to 1) make
the socket reusable, 2) fix XMLHammie.score().  The latter method was
trying to wrap the float return value in a Binary, but that's both
broken and unnecessary <wink>.  Now it just returns the float directly.
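
For anyone curious, both changes are small.  The sketch below uses the
modern xmlrpc modules rather than the code in the patch, and the class and
function names are invented; it only shows that a float marshals as a native
XML-RPC <double> and how a server class can ask for a reusable socket.

```python
from xmlrpc.client import dumps, loads
from xmlrpc.server import SimpleXMLRPCServer


class ReusableXMLRPCServer(SimpleXMLRPCServer):
    # Lets a restarted server rebind its port right away (SO_REUSEADDR).
    # Setting it explicitly is harmless if the base class already does so.
    allow_reuse_address = True


def score(message_text):
    """Stand-in for the real scoring method; returns a plain float."""
    return 0.5


# No Binary wrapper needed: marshal the float as a method response and
# unmarshal it again, as a client would.
payload = dumps((score("dummy message"),), methodresponse=True)
(result,), _method = loads(payload)
print(result)
```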

848319 is my sb_score.py script.

I don't like sb_imapfilter.py's tendency to create copies of messages
(with one marked deleted) in my ham and spam training folders.  I
vaguely remember some discussion about this and I'm not sure if it's
fixable or not (I'm using uw_imap -- yeah, yeah, I know, I know).  I may
try, but if I fail, I'll probably just rsync over those two folders and
do an mbox train on them.

Evolution is applying the filter and moving the message, and that all
looks good.  Evolution doesn't seem any slower <wink>.   I had two
problems though.  As I was getting things going, I'd stop my xmlrpc
server and retrain, then restart the server.  This seemed to give
Evolution fits, with it spitting up cryptic error messages and forcing
a restart.  This only happens occasionally though, but definitely seems
to be related to my training regimen.

It also wasn't doing a very good job of classifying messages.  Of
course, maybe that had something to do with bugs in my
bayescustomize.ini file where I swapped my ham and spam training folders
;).  I've fixed that now, blown away my database, retrained, and now am
awaiting the daily flood of messages, both tasty and rancid.

-Barry


From jens.rantil at telia.com  Mon Nov 24 16:13:34 2003
From: jens.rantil at telia.com (Jens Rantil)
Date: Mon Nov 24 16:14:46 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <qilcrvkkv2cmod9g8vq56c68i48blahbjh@4ax.com>
References: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
	<qilcrvkkv2cmod9g8vq56c68i48blahbjh@4ax.com>
Message-ID: <20031124221334.0d4cfb22.jens.rantil@telia.com>

Hi Richie,
I am one of those passive readers at this forum =) and finally have something to
say...

On Sat, 15 Nov 2003 16:57:27 +0000
Richie Hindle <richie@entrian.com> wrote:

> We'd love to have international versions, though there are a lot of issues
> involved.  I don't mean to put you off the idea, or to imply that we're
> not prepared to put effort into this, but these things need taking into
> account...
> 
> Many (most?) of the English strings in SpamBayes are mixed in with the
code.  Taking the source code as it is and translating the strings into
> Spanish would be unmaintainable - we'd have two entirely separate versions
> of the code, and any edits would have to be applied to both.  So the first
> job to do would be to pull out all those hard-coded strings into a
> language file.  That's not a huge job, and one that any computer-literate
> person could probably do 95% of, even if they weren't a programmer.  Still
> more effort than simply translating a collection of English phrases into
> Spanish, though.

No one here seems to have mentioned the GNU gettext project. Why not try to
integrate SB with gettext instead? I believe that should be the best
solution...and if no one is happy doing it I might have a look at it, however, I
can't promise having time. =)

See
http://www.python.org/doc/current/lib/module-gettext.html
and
http://www.gnu.org/software/gettext/gettext.html
for more info. 

Regards,

Jens Rantil, a fan of yours =)

PS. When reading through all the messages in the user mailing list I find that
many of the questions concerning the Outlook plugin aren't replied to. I would
suggest adding at least a line in the FAQ on how to set the logging frequency
and where to find the log file so that it can be attached to the mails. Perhaps
that would help a lot in solving some of the errors which seem to occur in
the plugin? DS.

From kennypitt at hotmail.com  Mon Nov 24 16:33:56 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Nov 24 16:34:49 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com>
Message-ID: <LAW11-OE166QqQaoQi500002b61@hotmail.com>

Jens Rantil wrote:
> Hi Richie,
> I am one of those passive readers at this forum =) and finally have
> something to say...
> 
> On Sat, 15 Nov 2003 16:57:27 +0000
> Richie Hindle <richie@entrian.com> wrote:
> 
>> We'd love to have international versions, though there are a lot of
>> issues involved.  I don't mean to put you off the idea, or to imply
>> that we're not prepared to put effort into this, but these things
>> need taking into account... 
> 
> No one here seems to have mentioned the GNU gettext project. Why not
> try to integrate SB with gettext instead?

IIRC, the GNU license is not compatible with the SpamBayes/PSF license.

-- 
Kenny Pitt


From barry at python.org  Mon Nov 24 16:59:30 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 16:59:37 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <LAW11-OE166QqQaoQi500002b61@hotmail.com>
References: <LAW11-OE166QqQaoQi500002b61@hotmail.com>
Message-ID: <1069711169.10090.5.camel@anthem>

On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote:

> > No one here seems to have mentioned the GNU gettext project. Why not
> > try to integrate SB with gettext instead?
> 
> IIRC, the GNU license is not compatible with the SpamBayes/PSF license.

The GPL and the PSF license are compatible.

-Barry


From mhammond at skippinet.com.au  Mon Nov 24 17:03:14 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Mon Nov 24 17:04:15 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <1069711169.10090.5.camel@anthem>
Message-ID: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>

> On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote:
>
> > > No one here seems to have mentioned the GNU gettext
> project. Why not
> > > try to integrate SB with gettext instead?
> >
> > IIRC, the GNU license is not compatible with the
> SpamBayes/PSF license.
>
> The GPL and the PSF license are compatible.

Here we go again :)  Wouldn't it mean that we must release SB under the GPL?
That is what my understanding of Python being "compatible" means - Python
can be released in a GPL'd project, but not the other way around - using GNU
with Python doesn't allow us to re-licence the GNU stuff.

Mark.


From tim.one at comcast.net  Mon Nov 24 17:19:49 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 24 17:19:57 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCGEKMHFAB.tim.one@comcast.net>

[Jens Rantil]
> ...
> PS. When reading through all the messages in the user mailing list I
> find that many of the questions concerning the Outlook plugin aren't
> replied to.

Yup, and I blame the users <wink>.  Seriously, there's exactly one person
with a deep understanding of the Outlook plugin internals, against more than
a hundred thousand people who have downloaded it.  Mark couldn't keep up
with the stream of questions even if he were paid to work on it full-time
(and, of course, he's not paid to work on it at all -- it has to come out of
his spare time).  So users with problems are going to have to help each
other solve them.

> I would suggest adding at least a line in the FAQ on how to
> set the logging frequency and where to find the log file so that
> it can be attached to the mails.

Suggesting more work for Mark to do probably isn't going to help.  If you
think you can write FAQ entries that would help people, please do!

> Perhaps that would help a lot in solving some of the errors which
> seem to occur in the plugin?

Alas, quite possibly not.  From all I can tell, the plugin works fine for
the vast majority of people who install it.  The cases where it doesn't work
are the ones we hear about, and if the developers had ever seen these
failures themselves, they would already be fixed.  Most people seem able to
find the log (the troubleshooting guide already contains lots of info about
exactly where to find it); a problem is that the logs don't seem to be much
help in resolving the problems that remain.


From barry at python.org  Mon Nov 24 17:24:21 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 17:24:27 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>
References: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>
Message-ID: <1069712660.10090.18.camel@anthem>

On Mon, 2003-11-24 at 17:03, Mark Hammond wrote:
> > On Mon, 2003-11-24 at 16:33, Kenny Pitt wrote:
> >
> > > > No one here seems to have mentioned the GNU gettext
> > project. Why not
> > > > try to integrate SB with gettext instead?
> > >
> > > IIRC, the GNU license is not compatible with the
> > SpamBayes/PSF license.
> >
> > The GPL and the PSF license are compatible.
> 
> Here we go again :)  Wouldn't it mean that we must release SB under the GPL?
> That is what my understanding of Python being "compatible" means - Python
> can be released in a GPL'd project, but not the other way around - using GNU
> with Python doesn't allow us to re-licence the GNU stuff.

I haven't been following this thread, but in this specific example, I
doubt it's necessary even to worry about it <wink>.  Python has its own
gettext implementation that doesn't share any code with GNU gettext.  In
fact, Python's class-based API works better for Python code than classic
gettext API.
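
A minimal sketch of the class-based API, for the curious.  The "spambayes"
domain and the "locale" directory are hypothetical; with fallback=True the
call returns a do-nothing translation object when no catalog is found, so
the strings simply come back untranslated.

```python
import gettext

# Look for locale/sv/LC_MESSAGES/spambayes.mo (hypothetical layout).
# fallback=True avoids an error when that catalog does not exist.
translations = gettext.translation(
    "spambayes", localedir="locale", languages=["sv"], fallback=True
)
_ = translations.gettext

# Strings in the source are wrapped in _() and looked up at run time.
label = _("Train as spam")
print(label)
```

Each translation object carries its own catalog, so a program can hold
several languages at once instead of relying on one global setting.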

ducking-ly y'rs,
-Barry


From jens.rantil at telia.com  Mon Nov 24 17:29:21 2003
From: jens.rantil at telia.com (Jens Rantil)
Date: Mon Nov 24 17:30:55 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <20031124221334.0d4cfb22.jens.rantil@telia.com>
References: <E1AKopS-0005WE-Sp@hostalia03.hostalia.com>
	<qilcrvkkv2cmod9g8vq56c68i48blahbjh@4ax.com>
	<20031124221334.0d4cfb22.jens.rantil@telia.com>
Message-ID: <20031124232922.790e7369.jens.rantil@telia.com>

Once again,

On Mon, 24 Nov 2003 22:13:34 +0100
Jens Rantil <jens.rantil@telia.com> wrote:

> No one here seems to have mentioned the GNU gettext project.

Also...if there was such an implementation I would gratefully add a Swedish
translation. =)

/Jens

From richie at entrian.com  Mon Nov 24 17:31:10 2003
From: richie at entrian.com (Richie Hindle)
Date: Mon Nov 24 17:31:42 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <1cc901c3b2d6$c21f6a10$0500a8c0@eden>
References: <1069711169.10090.5.camel@anthem>
	<1cc901c3b2d6$c21f6a10$0500a8c0@eden>
Message-ID: <bu05svsfnm1j5qnm4l9hdccp8igmjb256i@4ax.com>


[Jens]
> No one here seems to have mentioned the GNU gettext project.

[Kenny]
> IIRC, the GNU license is not compatible with the SpamBayes/PSF license.

[Barry]
> The GPL and the PSF license are compatible.

[Mark]
> Wouldn't it mean that we must release SB under the GPL?

I think that's a red herring.  I'm sure someone will correct me if I'm
wrong, but I believe Python's gettext support does not include any GNU
code.  It's a re-implementation of a subset of the GNU system in Python
(plus extra bits) and is PSF-licensed along with the rest of Python.
Certainly it ships with Python (unlike readline, for example).

Jens: I don't know anything about gettext apart from its existence 8-) and
the basic idea.  It would be fine for strings in source code (and was the
kind of job I meant when I said "any computer-literate person could
probably do 95% of [it]").  It certainly couldn't be used for the Windows
dialogs, and probably not for the HTML (or could it? - you seem to know
more about it than I do).

> ...a line in the FAQ on how to set the logging frequency...

I'll let one of the Outlook guys field that one - the Outlook plugin is
something else whose existence is pretty much the sum total of my
knowledge.  8-)

-- 
Richie Hindle
richie@entrian.com


From tameyer at ihug.co.nz  Mon Nov 24 17:51:32 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Nov 24 17:51:47 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315108@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E3@its-xchg4.massey.ac.nz>

> PS. When reading through all the messages in the user mailing
> list I find that many of the questions concerning the Outlook
> plugin aren't replied to.

I don't reply to Outlook plug-in messages when I'm busy with other things
because:
 * I know that the FAQ, help pages, or recent archives have the answer to
the question, 
 * The user hasn't included enough information, even though the help pages
say exactly what needs to be included (including finding the log and
including it),
 * There's an open bug report about it already (so the info should be added
to that), or
 * I don't know the answer.

The people here volunteer their own time to help out with this project, so
it's not too much to expect that the users put some effort into asking for
help.  The first three of those reasons are the user's fault.  If info is
too hard to find, then a message about *that* would be welcome, and probably
dealt with reasonably promptly.

> I would suggest adding at least a line 
> in the FAQ on how to set the logging frequency and where to 
> find the log file so that it can be attached to the mails. 

The Outlook help pages do this - does it need to be in the FAQ as well?  (If
the users don't read the help pages, will they read the FAQ?).

=Tony Meyer


From barry at python.org  Mon Nov 24 18:02:29 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 18:02:41 2003
Subject: [spambayes-dev] Re: [Spambayes] RV: I18N and L10N
In-Reply-To: <bu05svsfnm1j5qnm4l9hdccp8igmjb256i@4ax.com>
References: <1069711169.10090.5.camel@anthem>
	<1cc901c3b2d6$c21f6a10$0500a8c0@eden>
	<bu05svsfnm1j5qnm4l9hdccp8igmjb256i@4ax.com>
Message-ID: <1069714948.7132.0.camel@anthem>

On Mon, 2003-11-24 at 17:31, Richie Hindle wrote:

> I think that's a red herring.  I'm sure someone will correct me if I'm
> wrong, but I believe Python's gettext support does not include any GNU
> code.  It's a re-implementation of a subset of the GNU system in Python
> (plus extra bits) and is PSF-licensed along with the rest of Python.

Correct.
-Barry


From barry at python.org  Mon Nov 24 21:22:05 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 21:22:16 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1069685616.2365.17.camel@geddy>
References: <1069685616.2365.17.camel@geddy>
Message-ID: <1069726924.7132.14.camel@anthem>

On Mon, 2003-11-24 at 09:53, Barry Warsaw wrote:

> I've uploaded three patches to SF.

I think these patches are pretty stable now.  Evolution is very nicely
filtering spam and unsures.  Anybody mind if I check these into the cvs
head?  (Anthony thought it would be okay.)

-Barry


From tameyer at ihug.co.nz  Mon Nov 24 21:43:33 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Nov 24 21:43:44 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043151BA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz>

> I think these patches are pretty stable now.  Evolution is 
> very nicely filtering spam and unsures.  Anybody mind if I 
> check these into the cvs head?  (Anthony thought it would be okay.)

Quickly adding comments, both here and to the trackers...

#848319: sb_score.py - would sb_xmlscore.py be a better name? (I don't care
enough to debate this if you don't agree ;) so +1 to checking this in).
#848314: +1 to checking this in.
#848311: At the moment if you run "sb_imapfilter.py", you get the web server
with no browser launched.  I would prefer a different fix to add following
the config option, so that it matches sb_server more closely.  Patch
attached to tracker.

=Tony Meyer


From tim.one at comcast.net  Mon Nov 24 21:49:54 2003
From: tim.one at comcast.net (Tim Peters)
Date: Mon Nov 24 21:50:05 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1069726924.7132.14.camel@anthem>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEMHHFAB.tim.one@comcast.net>

[Barry Warsaw]
> I've uploaded three patches to SF.
>
> I think these patches are pretty stable now.  Evolution is very nicely
> filtering spam and unsures.  Anybody mind if I check these into the
> cvs head?  (Anthony thought it would be okay.)

Well, you're a SpamBayes project admin, so if you can't check something in,
you're sorely in need of learning to abuse your power!  Sounds fine to me.

For those who don't know, Ximian's Evolution is a free email client for
Linux and Solaris, obviously aiming to mimic Outlook look-&-feel (but, one
hopes, not Outlook's variety of baffling bugs).  If Barry keeps honking on
this day and night for the next month, Ximian might be half as pleasant to
use with SpamBayes as Mark's Outlook addin <wink>.


From tameyer at ihug.co.nz  Mon Nov 24 21:56:38 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Nov 24 21:56:46 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304314FFA@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>

> I don't like sb_imapfilter.py's tendency to create copies of 
> messages (with one marked deleted) in my ham and spam 
> training folders.  I vaguely remember some discussion about 
> this and I'm not sure if it's fixable or not (I'm using 
> uw_imap -- yeah, yeah, I know, I know).  I may try, but if I 
> fail, I'll probably just rsync over those two folders and do 
> an mbox train on them.

The reason it does that is to mark the messages with an id so that it can
identify them in future (by adding an "X-SpamBayes-ID" header).  IMAP
doesn't let you modify a message (or even move one <sigh>), so the filter
makes a copy instead.  IMAP has ids of its own (one for the message and one
for the folder), but they're not guaranteed to be permanent, and there were
early problems because with some servers they aren't (the best that the spec
offers is to let you know when the ids will be all wrong).

The reason for marking the messages is so that they aren't continually
trained (hence also the reliance on the 'message info' db).  If you can come
up with a way around this, that would be fantastic, and make imapfilter a
lot simpler.  If you can't be bothered trying, and have access to the mail
via something other than IMAP, then yes, that would be much easier.
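
A minimal sketch of the tagging step described above, assuming modern
Python and the standard email package (not the actual sb_imapfilter code).
The header name comes from the message; the function name tag_message and
the use of a random hex id are illustrative assumptions.

```python
import email
import uuid
from email import policy

ID_HEADER = "X-SpamBayes-ID"  # header name taken from the discussion above


def tag_message(raw_bytes, new_id=None):
    """Return (message_id, tagged_bytes) for one raw RFC 2822 message.

    Hypothetical helper: if the message already carries the id header it is
    returned unchanged, otherwise a new id is added to a copy.
    """
    msg = email.message_from_bytes(raw_bytes, policy=policy.compat32)
    existing = msg[ID_HEADER]
    if existing is not None:
        # Already tagged, so no new copy is needed.
        return str(existing), raw_bytes
    if new_id is None:
        new_id = uuid.uuid4().hex  # assumption: any unique string would do
    msg[ID_HEADER] = new_id
    return new_id, msg.as_bytes()
```

On a real server the tagged bytes would then be uploaded as a new message
(IMAP APPEND) and the original flagged \Deleted, which is where the extra
copies in the training folders come from.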

=Tony Meyer


From barry at python.org  Mon Nov 24 23:05:28 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 23:05:38 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E4@its-xchg4.massey.ac.nz>
Message-ID: <1069733127.31869.15.camel@anthem>

On Tue, 2003-11-25 at 10:43, Tony Meyer wrote:

> #848319: sb_score.py - would sb_xmlscore.py be a better name? (I don't care
> enough to debate this if you don't agree ;) so +1 to checking this in).

I went with sb_evoscore.py for reasons given in the tracker. :)

> #848314: +1 to checking this in.

Done, thanks.

> #848311: At the moment if you run "sb_imapfilter.py", you get the web server
> with no browser launched.  I would prefer a different fix, one that follows
> the config option, so that it matches sb_server more closely.  Patch
> attached to tracker.

Here's the problem.  When I start sb_imapfilter.py with no options, I
get an error saying "You need to specify both a server and a username". 
Since I didn't have to when I used -b, I don't know why I should need to do that
now.

The basic problem seems to be that launchUI is used both to determine
whether the server gets started and whether the browser gets started. 
The alternative patch doesn't fix this, so it doesn't help me too much. 
But I'm not sure what the intent is so I won't make any changes to this
script for now.

Thanks!
-Barry


From barry at python.org  Mon Nov 24 23:08:26 2003
From: barry at python.org (Barry Warsaw)
Date: Mon Nov 24 23:08:32 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>
Message-ID: <1069733305.31869.19.camel@anthem>

On Tue, 2003-11-25 at 10:56, Tony Meyer wrote:

> The reason it does that is to mark the messages with an id so that it can
> identify them in future (by adding an "X-SpamBayes-ID" header).

Why won't Message-ID work for this?  I know that's not guaranteed to be
unique, or even present on the messages in the imap server, but they'll
/probably/ exist, and they'll /probably/ be unique, so it might be good
enough.  Alternatively, you could fingerprint some part of the message
that won't change and then store the fingerprint in the message
database.
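
A rough sketch of the two ideas above, assuming modern Python: prefer
Message-ID when it is present, otherwise fall back to a fingerprint of parts
of the message that should not change.  The function name message_key, the
choice of headers, the 4096-byte body prefix and SHA-1 are all illustrative
assumptions, not anything SpamBayes does.

```python
import email
import hashlib


def message_key(raw_bytes, stable_headers=("From", "To", "Date", "Subject")):
    """Return a key suitable for a 'message info' style database.

    Hypothetical helper: uses Message-ID if the message has one, else a
    hash over a few headers plus the start of the body.
    """
    msg = email.message_from_bytes(raw_bytes)
    message_id = msg.get("Message-ID")
    if message_id:
        return "mid:" + str(message_id).strip()
    digest = hashlib.sha1()
    for name in stable_headers:
        value = str(msg.get(name) or "")
        digest.update((name + ":" + value + "\n").encode("utf-8", "replace"))
    # get_payload(decode=True) returns None for multipart messages.
    payload = msg.get_payload(decode=True) or b""
    digest.update(payload[:4096])
    return "sha1:" + digest.hexdigest()
```

Two messages with the same Message-ID (or the same fingerprint) would share
a key, so one of them would be wrongly skipped; whether that is acceptable
depends on how rare such collisions are in practice.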

-Barry


From tameyer at ihug.co.nz  Tue Nov 25 00:29:12 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Nov 25 00:29:20 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043151E0@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E6@its-xchg4.massey.ac.nz>

[Tony]
> The reason it does that is to mark the messages with an id so that it 
> can identify them in future (by adding an "X-SpamBayes-ID" header).

[Barry]
> Why won't Message-ID work for this?  I know that's not 
> guaranteed to be unique, or even present on the messages in 
> the imap server, but they'll /probably/ exist, and they'll 
> /probably/ be unique, so it might be good enough.  

I considered this early on, and it wouldn't be all that difficult to change
to it (the 'message info' db doesn't care what key it's given).  It seemed
simpler to just go with something definite, rather than 'if there is a
message id, use it, if not, then do something else'.  (As long as duplicates
are very rare, I don't care about that, since it would just mean that a
message was wrongly ignored).

Would you be willing to guess what percentage of messages have a message-id?
I thought that maybe the number of tokens in my db that start with
"message-id" would be an indication, but that's only 900/3779, which seems
far too low. (A portion of those messages never left the Exchange server, so
don't have anything much in the way of headers, but not that many).

> Alternatively, you could fingerprint some part of the message 
> that won't change and then store the fingerprint in the 
> message database.

I considered this, too, but wasn't sure whether it would work or not.  The
message can't really change (apart from the IMAP id and flags), but I wasn't
sure how much of the message I would need to use to make sure that it was
unique, what the best way of fingerprinting would be, or anything much,
really <wink>.  I think (depending on the results of "probably", above),
this is the best way to go, if anyone does want to implement it.

Basically what it comes down to is that (TimS and) I wrote imapfilter
because we were sick of answering requests for it (and as an exercise).  If
someone were to say "I'll take over imapfilter completely", I would be very
happy :).  As it is, I doubt that I'll get around to doing much in the way
of improvements (although I'm happy to work on bugs).  Hopefully someone
with the time & interest will come along at some point in the
not-too-distant future.

=Tony Meyer


From tameyer at ihug.co.nz  Tue Nov 25 01:00:36 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Nov 25 01:00:46 2003
Subject: [spambayes-dev] Re: Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315211@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E8@its-xchg4.massey.ac.nz>

[Barry]
> Here's the problem.  When I start sb_imapfilter.py with no 
> options, I get an error saying "You need to specify both a 
> server and a username". 

This is a bug hanging over from old code.  I'll put a new patch on the
tracker.

> The basic problem seems to be that launchUI is used both to 
> determine whether the server gets started and whether the 
> browser gets started. 

It looks like this, although that's not actually the case (whatever the
value of launchUI is, if neither doClassify nor doTrain is true, the server
is started).  Again, hangover from older code.  I'll include fixing this in
the new patch.
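
One way to picture the separation being discussed, as a sketch only: the
decision to start the web server depends on whether any classify/train work
was requested, and the decision to open a browser is a second, independent
check.  The function startup_plan and its return shape are hypothetical and
are not taken from sb_imapfilter.py or from the patch.

```python
def startup_plan(do_classify, do_train, launch_ui):
    """Return (start_server, open_browser) for one invocation.

    Assumption: the server runs only when no classify/train work was
    requested, and the browser opens only if the server runs AND the
    launch option is set.
    """
    start_server = not (do_classify or do_train)
    open_browser = start_server and launch_ui
    return start_server, open_browser
```

With this split, running with no options starts the server regardless of
the launch setting, and the setting only controls the browser.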

> The alternative patch doesn't fix this, so it doesn't help me 
> too much. 
> But I'm not sure what the intent is so I won't make any 
> changes to this script for now.

When you have a chance, if you could take a look at the revised patch and
see if it meets your needs, that would be great :)

=Tony Meyer


From sjoerd at acm.org  Tue Nov 25 05:00:05 2003
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Tue Nov 25 05:00:29 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29E5@its-xchg4.massey.ac.nz>
Message-ID: <3FC32825.9000600@acm.org>

Tony Meyer wrote:
> The reason for marking the messages is so that they aren't continually
> trained (hence also the reliance on the 'message info' db).  If you can come
> up with a way around this, that would be fantastic, and make imapfilter a
> lot simpler.  If you can't be bothered trying, and have access to the mail
> via something other than IMAP, then yes, that would be much easier.

I was thinking, could you use the IMAP command to add a flag to a 
message, such as STORE +FLAGS Classified and STORE +FLAGS Trained.  You 
can select messages with SEARCH KEYWORD Classified.  You wouldn't have 
to change the message, so you don't need to make copies (unless of 
course you have to move the message to a different folder).
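
A minimal sketch of the flag idea, assuming an imaplib.IMAP4-style
connection object with a folder already selected.  The helper names are
hypothetical; store() and search() are the standard imaplib calls, and
KEYWORD/UNKEYWORD are standard IMAP search keys.

```python
def mark(conn, message_set, keyword):
    """Add a keyword flag (e.g. 'Classified') to the given messages."""
    typ, _data = conn.store(message_set, "+FLAGS", "(%s)" % keyword)
    return typ == "OK"


def _search(conn, key, keyword):
    typ, data = conn.search(None, key, keyword)
    if typ != "OK" or not data or not data[0]:
        return []
    return data[0].split()


def find_marked(conn, keyword):
    """Message numbers that already carry the keyword."""
    return _search(conn, "KEYWORD", keyword)


def find_unmarked(conn, keyword):
    """Message numbers that do not carry the keyword yet."""
    return _search(conn, "UNKEYWORD", keyword)
```

This only helps on servers that accept new keywords, and it does not remove
the need for a copy when a message has to move to another folder.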

-- 
Sjoerd Mullender <sjoerd@acm.org>

From tameyer at ihug.co.nz  Tue Nov 25 18:15:26 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Nov 25 18:15:06 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13043152C9@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz>

[Sjoerd Mullender]
> I was thinking, could you use the IMAP command to add a flag to a
> message, such as STORE +FLAGS Classified and STORE +FLAGS 
> Trained.  You can select messages with SEARCH KEYWORD Classified.  You 
> wouldn't have to change the message, so you don't need to make
> copies (unless of course you have to move the message to a
> different folder).

The problem is that not all IMAP servers allow you to store arbitrary flags
(at least this is how the RFC reads to me; correct me if I'm wrong).  So
this would mean we only support a subset of IMAP servers.  Again, if someone
decides that this is the way to go, I don't really mind, but likewise, I
don't care enough to do it myself.
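
For what it's worth, the IMAP4rev1 spec lets a server advertise whether new
keywords can be created: a "\*" entry in the PERMANENTFLAGS response code
sent when a folder is selected.  A small sketch of checking for that,
assuming the caller has already extracted the parenthesised flag list; the
helper name is hypothetical.

```python
def server_allows_new_keywords(permanentflags):
    """Return True if the PERMANENTFLAGS list contains the special '\\*' flag.

    permanentflags is the parenthesised flag list as bytes or str, e.g.
    b"(\\Answered \\Seen \\*)".
    """
    if isinstance(permanentflags, bytes):
        permanentflags = permanentflags.decode("ascii", "replace")
    flags = permanentflags.strip().strip("()").split()
    return "\\*" in flags
```

A filter could use a check like this to choose between flagging messages in
place and falling back to the copy-with-header approach.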

=Tony Meyer


From richie at entrian.com  Tue Nov 25 19:05:17 2003
From: richie at entrian.com (Richie Hindle)
Date: Tue Nov 25 19:06:16 2003
Subject: [spambayes-dev] RE: Bug in UserInterface.py
In-Reply-To: <E1AJcA7-0000bm-Bm@mail.python.org>
References: <LNBBLJKPBEHFEDALKOLCAENDHAAB.tim.one@comcast.net>
	<E1AJcA7-0000bm-Bm@mail.python.org>
Message-ID: <tdr7svob896meiqth36ttcruvud9mk3mv3@4ax.com>


[Mats]
> I have a minor bug that triggers when you have enabled
> header_score_logarithm, have a spam probability of more than 
> 0.995 (or less than 0.005) and try to view clues from the 
> review page of the POP3PROXY.

[Kenny]
> This would probably work if and when the fix for bug #831388 is applied.

I've applied that fix, and a simplified version of Mats' patch - thanks to
all concerned!  UserInterface.py 1.34.

-- 
Richie Hindle
richie@entrian.com


From tim.one at comcast.net  Tue Nov 25 21:36:20 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 25 21:36:25 2003
Subject: [spambayes-dev] more selective Received: header mining...
In-Reply-To: <16318.65398.374469.490455@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCEECNHGAB.tim.one@comcast.net>

[Skip Montanaro]
> I made a change to the mine_received_headers stuff this evening,
> adding a new option, gateway_machines.  The idea is that the only
> Received: header which is really useful is the one which crosses the
> boundary between your known "good" network and the wild free-for-all
> part of the net.  Received: headers from hosts internal to your
> network are meaningless, since for the most part, all mail passes
> through them, while Received: headers from hosts external to your
> network probably just contain random garbage which clogs your
> database with meaningless tokens.

I don't know that that's so.  On the spam side, some spammers forge a
sequence of Received headers to make it appear as if the path to your
machine was legitimate, and the specific paths they forge can be clues.  On
the ham side, different senders' emails often take different paths that
leave behind distinctive clues on their end of the pipe.

If a token in the database is indeed worthless, that can be detected by (1)
the token is never used for scoring anymore; and/or, (2) the token has a
spamprob in the range we ignore.  If your real concern is purging useless
tokens, then analysis based on #1 and #2 should identify huge masses of
useless tokens, including all due to Received headers.  #1 is hard to do
now, of course (since we don't save any token access-time info in the
database).
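
A sketch of check #2 above, under two assumptions: that "the range we
ignore" is a band around 0.5, and that its half-width corresponds to an
option like minimum_prob_strength with a value of 0.1.  The input here is a
plain dict of token -> spamprob, not the real database format.

```python
def ignorable_tokens(spamprobs, minimum_prob_strength=0.1):
    """Return the tokens whose spamprob is too close to 0.5 to affect scoring.

    spamprobs: dict mapping token -> spam probability in [0.0, 1.0].
    minimum_prob_strength: assumed half-width of the ignored band.
    """
    return sorted(
        token
        for token, prob in spamprobs.items()
        if abs(prob - 0.5) < minimum_prob_strength
    )
```

Check #1 (tokens never used for scoring) would need last-access information
stored per token, which, as noted, the database does not currently keep.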

BTW, the Outlook addin currently leaves mine_received_headers at its default
False, so I don't have any tokens due to Received lines in my databases.


From tim.one at comcast.net  Tue Nov 25 22:03:25 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue Nov 25 22:03:37 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <16312.15564.157062.319322@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCCEDBHGAB.tim.one@comcast.net>

[Skip Montanaro]
> ...
> Here's something I think would be interesting.  At the moment I have
> about 40 unsures awaiting a decision from me (train or discard).  I'm
> trying consciously to be conservative.  What I'd like to know is which
> message, if added to my training database, would have the greatest
> effect on the scores of the other unsure messages.  That would help
> me decide which ones yield the most benefit.

If you can define what "greatest effect on the scores of the other unsure
messages" means, exactly, then it should be easy to automate that decision
(for each unsure:  train on it, score all the other unsures, compute "the
effect" on their scores (whatever that means to you), untrain it; then pick
the one with the greatest whatever-it-is you measured).
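
The loop in parentheses can be sketched as follows, assuming a classifier
object with train(msg, is_spam), untrain(msg, is_spam) and score(msg)
methods (hypothetical names), and assuming a ham/spam guess is available
for each unsure.  The default "effect" is the total absolute change in the
other scores, which is only one possible definition.

```python
def pick_most_influential(classifier, unsures, labels, effect=None):
    """Return (index, effect) of the unsure whose training moves the others most.

    unsures: list of messages; labels: list of bools (True = spam guess).
    effect: function(old_scores, new_scores) -> number; pluggable.
    """
    if effect is None:
        # Assumption: measure the effect as the sum of absolute score changes.
        effect = lambda old, new: sum(abs(o - n) for o, n in zip(old, new))
    best_index, best_effect = None, -1.0
    for i, (msg, is_spam) in enumerate(zip(unsures, labels)):
        others = unsures[:i] + unsures[i + 1:]
        old = [classifier.score(m) for m in others]
        classifier.train(msg, is_spam)
        try:
            new = [classifier.score(m) for m in others]
        finally:
            # Always undo the trial training so the database is unchanged.
            classifier.untrain(msg, is_spam)
        measured = effect(old, new)
        if measured > best_effect:
            best_index, best_effect = i, measured
    return best_index, best_effect
```

Each candidate costs one train, one untrain and a rescoring of all the
other unsures, so the whole pass is quadratic in the number of unsures.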

Google on

    "active learning" classification

to get a warm fuzzy feeling that this may be a fine thing to do <wink>.

I train on "the worst" Unsure first (lowest-scoring spam or highest-scoring
ham), then rescore Unsures, and repeat until they're all gone.  A number of
Unsures usually get resolved on their own this way, especially
near-duplicates of a new spam.  I don't spend any time any more trying to
guess whether a message "really is" ham or spam -- if it's not obvious after
5 seconds, I toss it without training on it at all.
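
The "worst first" routine can be sketched like this, again assuming a
classifier with train(msg, is_spam) and score(msg) methods (hypothetical
names) and assuming cutoffs of 0.20 and 0.90 for the Unsure range.  "Worst"
is taken to mean the score furthest from where the message's true class
should put it.

```python
def train_worst_first(classifier, unsures, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train on the worst Unsure, rescore, and repeat; return how many were trained.

    unsures: list of (msg, is_spam) pairs the user has already judged.
    """
    pending = list(unsures)
    trained = 0
    while True:
        # Drop anything the last round of training resolved on its own.
        pending = [
            (msg, is_spam)
            for msg, is_spam in pending
            if ham_cutoff <= classifier.score(msg) < spam_cutoff
        ]
        if not pending:
            return trained

        def badness(item):
            msg, is_spam = item
            score = classifier.score(msg)
            # Low-scoring spam and high-scoring ham are the worst cases.
            return (1.0 - score) if is_spam else score

        worst = max(pending, key=badness)
        classifier.train(*worst)
        pending.remove(worst)
        trained += 1
```

With near-duplicates in the pile, the returned count is typically smaller
than the number of Unsures, since training on one resolves the others.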


From sjoerd at acm.org  Wed Nov 26 05:39:56 2003
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Wed Nov 26 05:40:02 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes
	message.py, 1.40, 1.41
In-Reply-To: <E1AOmL2-00059z-00@sc8-pr-cvs1.sourceforge.net>
References: <E1AOmL2-00059z-00@sc8-pr-cvs1.sourceforge.net>
Message-ID: <3FC482FC.7030707@acm.org>

Richie Hindle wrote:
> Update of /cvsroot/spambayes/spambayes/spambayes
> In directory sc8-pr-cvs1:/tmp/cvs-serv19794
> 
> Modified Files:
> 	message.py 
> Log Message:
> Patch 831388: Make message.py respect the header_score_digits option.
> 
> Bugfix candidate (probably).
> 
> 
> Index: message.py
> ===================================================================
> RCS file: /cvsroot/spambayes/spambayes/spambayes/message.py,v
> retrieving revision 1.40
> retrieving revision 1.41
> diff -C2 -d -r1.40 -r1.41
> *** message.py	8 Oct 2003 04:04:35 -0000	1.40
> --- message.py	25 Nov 2003 23:11:18 -0000	1.41
> ***************
> *** 342,346 ****
>   
>           if options['Headers','include_score']:
> !             disp = str(prob)
>               if options["Headers", "header_score_logarithm"]:
>                   if prob<=0.005 and prob>0.0:
> --- 342,346 ----
>   
>           if options['Headers','include_score']:
> !             disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob
>               if options["Headers", "header_score_logarithm"]:
>                   if prob<=0.005 and prob>0.0:
> 

This can be done with
disp = "%.*f" % (options["Headers", "header_score_digits"], prob)
which looks more readable to me.
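
A quick standalone check that the two spellings agree, with the option
lookup replaced by a plain digits argument (the function name format_score
is just for illustration):

```python
def format_score(prob, digits):
    """Format prob with a precision chosen at run time via '*'."""
    # The '*' takes the precision from the argument tuple.
    return "%.*f" % (digits, prob)


def format_score_concat(prob, digits):
    """The string-concatenation spelling from the checkin, for comparison."""
    return ("%." + str(digits) + "f") % prob
```

Both build the same format at run time; the "*" form just avoids assembling
the format string by hand.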

-- 
Sjoerd Mullender <sjoerd@acm.org>

From sjoerd at acm.org  Wed Nov 26 05:42:35 2003
From: sjoerd at acm.org (Sjoerd Mullender)
Date: Wed Nov 26 05:42:45 2003
Subject: [spambayes-dev] Three patches for better Evolution integration
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130212B191@its-xchg4.massey.ac.nz>
Message-ID: <3FC4839B.90206@acm.org>

Tony Meyer wrote:
> [Sjoerd Mullender]
> 
>>I was thinking, could you use the IMAP command to add a flag to a
>>message, such as STORE +FLAGS Classified and STORE +FLAGS 
>>Trained.  You can select messages with SEARCH KEYWORD Classified.  You 
>>wouldn't have to change the message, so you don't need to make
>>copies (unless of course you have to move the message to a
>>different folder).
> 
> 
> The problem is that not all IMAP servers allow you to store arbitrary flags
> (at least this is how the RFC reads to me; correct me if I'm wrong).  So
> this would mean we only support a subset of IMAP servers.  Again, if someone
> decides that this is the way to go, I don't really mind, but likewise, I
> don't care enough to do it myself.

I can't say I care very much either at the moment.
But just for the record, it seems that the Cyrus and UW IMAP servers 
both implement this.

-- 
Sjoerd Mullender <sjoerd@acm.org>

From barry at python.org  Wed Nov 26 09:10:10 2003
From: barry at python.org (Barry Warsaw)
Date: Wed Nov 26 09:10:34 2003
Subject: [spambayes-dev] A spectacular false positive
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEDBHGAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCCEDBHGAB.tim.one@comcast.net>
Message-ID: <1069855810.3419.58.camel@anthem>

On Tue, 2003-11-25 at 22:03, Tim Peters wrote:

> If you can define what "greatest effect on the scores of the other unsure
> messages" means, exactly, then it should be easy to automate that decision
> (for each unsure:  train on it, score all the other unsures, compute "the
> effect" on their scores (whatever that means to you), untrain it; then pick
> the one with the greatest whatever-it-is you measured).

Sounds like a genetic algorithm.  The trick is deciding what it is you
want to maximize.

-Barry


From richie at entrian.com  Wed Nov 26 17:09:48 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Nov 26 17:10:37 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/spambayes
	message.py, 1.40, 1.41
In-Reply-To: <3FC482FC.7030707@acm.org>
References: <E1AOmL2-00059z-00@sc8-pr-cvs1.sourceforge.net>
	<3FC482FC.7030707@acm.org>
Message-ID: <rv8asvk96632fc6hot4g8lg2oq9spj74l5@4ax.com>


[Me, applying patch 831388]
> disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob

[Sjoerd]
> This can be done with
> disp = "%.*f" % (options["Headers", "header_score_digits"], prob)
> which looks more readable to me.

You're right, that is better.  Checked in as message.py 1.43 - thanks!

-- 
Richie Hindle
richie@entrian.com


From kennypitt at hotmail.com  Wed Nov 26 17:19:03 2003
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Nov 26 17:19:44 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/spambayes message.py, 1.40, 1.41
In-Reply-To: <rv8asvk96632fc6hot4g8lg2oq9spj74l5@4ax.com>
Message-ID: <Law11-OE53K1qLIG8zw000047e5@hotmail.com>

Richie Hindle wrote:
> [Me, applying patch 831388]
>> disp = ("%."+str(options["Headers", "header_score_digits"])+"f") % prob
> 
> [Sjoerd]
>> This can be done with
>> disp = "%.*f" % (options["Headers", "header_score_digits"], prob)
>> which looks more readable to me.
> 
> You're right, that is better.  Checked in as message.py 1.43 - thanks!

Identical format string is used in hammie.py.  Might be worth changing
it there as well.

-- 
Kenny Pitt


From richie at entrian.com  Wed Nov 26 18:01:32 2003
From: richie at entrian.com (Richie Hindle)
Date: Wed Nov 26 18:02:04 2003
Subject: [spambayes-dev] Re: [Spambayes-checkins]
	spambayes/spambayes message.py, 1.40, 1.41
In-Reply-To: <Law11-OE53K1qLIG8zw000047e5@hotmail.com>
References: <rv8asvk96632fc6hot4g8lg2oq9spj74l5@4ax.com>
	<Law11-OE53K1qLIG8zw000047e5@hotmail.com>
Message-ID: <j5casvc1qoujq39bi3iuc5j9n8sabf7p09@4ax.com>


[Kenny]
> Identical format string is used in hammie.py.  Might be worth changing
> it there as well.

Good spot.  Done.

-- 
Richie Hindle
richie@entrian.com


From mhammond at skippinet.com.au  Wed Nov 26 18:08:13 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed Nov 26 18:08:22 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>
Message-ID: <027801c3b472$2b4c06a0$0200a8c0@eden>

> [This thread seems to have died a week ago, but since I was
> away, and have
> things to say <wink>, and it doesn't seem to be resolved, I
> figured I'd
> resurrect it.  While I'm doing notes: thanks Richie, Anthony
> and Skip for
> outlining the various processes in more detail - great stuff
> for us cvs
> newbies].

That sounds great - but I am afraid I am still not sure what the resolution
is.

My specific issue is that the branch does not include patches needed to run
effectively on Windows (notably, the patches made to the server, service,
and tray to ensure only one instance is running, and to ensure the tray
handles the service correctly).

So, as far as I can tell, the branch has what we want to release as a "stand
alone" sb_server, but the trunk includes what I want to release for a
Windows binary.

So I find myself unable to move on the Windows binary, but not understanding
what has happened well enough to fix things.

Should we abandon the branch, merging everything back to the trunk?  I don't
see the branch is offering us any value, firstly as it is now very old, and
secondly as the trunk doesn't seem to have had any changes that truly should
be post 1.0 - unless we don't count a Windows binary in 1.0.

I will start moving on the binary again as soon as someone can help me
resolve this.  In the meantime it looks like resolving 355 bugs as
duplicates <wink/frown>

Mark.


From richie at entrian.com  Thu Nov 27 03:41:34 2003
From: richie at entrian.com (Richie Hindle)
Date: Thu Nov 27 03:42:04 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <027801c3b472$2b4c06a0$0200a8c0@eden>
References: <1ED4ECF91CDED24C8D012BCF2B034F13026F29D1@its-xchg4.massey.ac.nz>
	<027801c3b472$2b4c06a0$0200a8c0@eden>
Message-ID: <93ebsvkek2u6gjfrll6e6d7ks98ni49ta4@4ax.com>


[Mark]
> Should we abandon the branch, merging everything back to the trunk?

+1

We're still in alpha, and we never really decided what our branch
management strategy should be.  Let's start again with the trunk.

-- 
Richie Hindle
richie@entrian.com


From tameyer at ihug.co.nz  Fri Nov 28 01:00:35 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Nov 28 01:00:40 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315896@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F1@its-xchg4.massey.ac.nz>

[Mark]
> Should we abandon the branch, merging everything
> back to the trunk?  I don't see the branch is offering 
> us any value, firstly as it is now very old, and secondly
> as the trunk doesn't seem to have had any changes that
> truly should be post 1.0 - unless we don't count a
> Windows binary in 1.0.

The sb_server interface has had quite a few changes on the trunk that
haven't been put into the branch.  I certainly hope that they didn't
introduce new bugs (since I wrote most of it), but they may have done.  I
think it would certainly mean that we should do another release before we
consider that things are stable.  OTOH, the changes were all sugar, and they
could be ripped out and added back in another day.

Anyone have any idea how many people are using sb_server from cvs?  Enough
that the changes (they're fairly old now, even though they've never made a
release) seem fairly stable?

-0 from me, anyway.

> I will start moving on the binary again as 
> soon as someone can help me resolve this.

I can't help resolve it, really, but now I'm back and have caught up with
things, I'm happy to help out however I can with the 'full' binary.  Are you
still thinking of doing one final Outlook only one?  Give me a list of
things to do <wink>.  (Or maybe you already did; I'll have to look through
the stuff I have on my to-do list).

> In the meantime
> it looks like resolving 355 bugs as duplicates <wink/frown>

I tried to get some of those out of the way; there sure are a lot of Outlook
ones there at the moment.  If the new release solves the 0x000000 install
problem, that'd get rid of a lot...

=Tony Meyer


From mhammond at skippinet.com.au  Fri Nov 28 05:37:06 2003
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri Nov 28 05:37:16 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F1@its-xchg4.massey.ac.nz>
Message-ID: <018e01c3b59b$925be480$0200a8c0@eden>

> [Mark]
> > Should we abandon the branch, merging everything
> > back to the trunk?  I don't see the branch is offering
> > us any value, firstly as it is now very old, and secondly
> > as the trunk doesn't seem to have had any changes that
> > truly should be post 1.0 - unless we don't count a
> > Windows binary in 1.0.
>
> The sb_server interface has had quite a few changes on the trunk that
> haven't been put into the branch.  I certainly hope that they didn't
> introduce new bugs (since I wrote most of it), but they may
> have done.  I
> think it would certainly mean that we should do another
> release before we
> consider that things are stable.  OTOH, the changes were all
> sugar, and they
> could be ripped out and added back in another day.
>
> Anyone have any idea how many people are using sb_server from
> cvs?  Enough
> that the changes (they're fairly old now, even though they've
> never made a
> release) seem fairly stable?
>
> -0 from me, anyway.

I should clarify that by "abandon the branch", I don't actually mean abandon
<wink>.  I mean we merge the branch back onto the trunk (cvs up -j ...),
handling any conflicts and resolving the pain it may cause.  I haven't tried
to see what conflicts actually arise, but am willing to.  I expect the only
real conflicts will be where a bug *has* been fixed in both places.

Or is that what you thought I meant, and still -0?

> still thinking of doing one final Outlook only one?  Give me a list of
> things to do <wink>.  (Or maybe you already did; I'll have to
> look through
> the stuff I have on my to-do list).

OK - this is my strawman plan:
1) Merge as above.
2) Let things settle for a week or so, so poor CVS users all get to suffer
alone.
3) Put together a binary from my current py2exe setup script, which includes
CVS and a number of sb_ programs.
4) Announce this binary as a "binary-beta", calling it 0.75 or something.
5) Any major bugs will presumably be part of the "binary framework", so
maybe 0.76 etc, depending on the damage.
6) Move towards release 0.8 - this will be simultaneous windows-binary and
source.
7) Move towards release 0.9 - aim for 4 weeks after 0.8, addressing only
bugs.
8) Just *before* the 0.9 release, cut a new 1.0 branch.  Release 0.9.
9) Move towards 1.0, again aiming for 4 weeks, possibly with 2x release
candidates.

As far as I can tell, almost everyone is in bug-fix-only mode already
(except that damn bass-playing Warsaw <wink>).  I'm pretty much in that mode
for Outlook too.  So up until (8), which is when we cut a new 1.0 branch,
all new real features go on some development branch (a branch per feature -
whatever).  We all reserve our right to use a fairly liberal definition of
"bug" for low-risk, high-benefit tweaks, but I think our app is mature
enough that we can happily tell people who want truly new features to grab a
CVS branch.  After the branch is cut, we get seriously anal about "bugfix
only", with the expectation the branch lasts only 4 weeks (which flies when
everyone is busy!)

This is likely to impact almost no-one once we are re-merged.  Moving fast
towards 1.0 seems in everyone's benefit, and this would get us there around
the end of Feb.  If we can get there sooner due to the only bugs being old
ones we don't know how to fix, all the better.

Happy-new-year ly,

Mark.


From barry at python.org  Fri Nov 28 08:37:07 2003
From: barry at python.org (Barry Warsaw)
Date: Fri Nov 28 08:37:12 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <018e01c3b59b$925be480$0200a8c0@eden>
References: <018e01c3b59b$925be480$0200a8c0@eden>
Message-ID: <1070026627.20553.10.camel@anthem>

On Fri, 2003-11-28 at 05:37, Mark Hammond wrote:

> As far as I can tell, almost everyone is in bug-fix-only mode already
> (except that damn bass-playing Warsaw <wink>).

Yeah, if we could only get him to read this mailing list once in a
while, then he might even agree with the plan.  <wink>

-Not That Guy


From skip at pobox.com  Thu Nov 27 07:01:26 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri Nov 28 18:16:18 2003
Subject: [spambayes-dev] A spectacular false positive 
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEDBHGAB.tim.one@comcast.net>
References: <16312.15564.157062.319322@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCCEDBHGAB.tim.one@comcast.net>
Message-ID: <16325.59286.217664.627909@montanaro.dyndns.org>


    >> What I'd like to know is which message, if added to my training
    >> database, would have the greatest effect on the scores of the other
    >> unsure messages.  That would help me decide which ones yield the most
    >> benefit.

    Tim> If you can define what "greatest effect on the scores of the other
    Tim> unsure messages" means, exactly, then it should be easy to automate
    Tim> that decision (for each unsure: train on it, score all the other
    Tim> unsures, compute "the effect" on their scores (whatever that means
    Tim> to you), untrain it; then pick the one with the greatest
    Tim> whatever-it-is you measured).

I mean "pushes the remaining unsures the furthest away from their current
scores".  I guess I want to maximize:

    sum([abs(old-new) for (old,new) in zip(oldprobs, newprobs)])
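
A minimal sketch of the loop Tim describes, using the sum-of-absolute-changes measure above. The ToyClassifier class and its train/untrain/score methods are stand-ins invented for illustration, not the SpamBayes API; any classifier that supports incremental training and untraining could be dropped in instead.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable classifier (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def train(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def untrain(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def score(self, tokens):
        # Crude per-token average; tokens never seen in training count as 0.5.
        probs = []
        for tok in set(tokens):
            s, h = self.spam[tok], self.ham[tok]
            probs.append(s / (s + h) if s + h else 0.5)
        return sum(probs) / len(probs) if probs else 0.5


def most_informative(classifier, unsures):
    """Return the index of the unsure whose training moves the other scores most.

    `unsures` is a list of (tokens, is_spam) pairs, where is_spam is the
    label the message would be trained with.
    """
    best_index, best_effect = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        others = [t for j, (t, _) in enumerate(unsures) if j != i]
        oldprobs = [classifier.score(t) for t in others]
        classifier.train(tokens, is_spam)
        newprobs = [classifier.score(t) for t in others]
        classifier.untrain(tokens, is_spam)  # restore the original state
        effect = sum(abs(old - new) for old, new in zip(oldprobs, newprobs))
        if effect > best_effect:
            best_index, best_effect = i, effect
    return best_index
```

Note that this costs one train/untrain pair plus two scoring passes over the remaining unsures per candidate, so it is quadratic in the number of unsures.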

    Tim> Google on

    Tim>     "active learning" classification

    Tim> to get a warm fuzzy feeling that this may be a fine thing to do
    Tim> <wink>.

Thanks.  When I get a chance, I may.  On the other hand, I may just take
your word for it.

    Tim> I train on "the worst" Unsure first (lowest-scoring spam or
    Tim> highest-scoring ham), then rescore Unsures, and repeat until
    Tim> they're all gone.  A number of Unsures usually get resolved on
    Tim> their own this way, especially near-duplicates of a new spam

I've been doing this sort of thing, though perhaps not consistently enough.

    Tim> I don't spend any time any more trying to guess whether a message
    Tim> "really is" ham or spam -- if it's not obvious after 5 seconds, I
    Tim> toss it without training on it at all.

Ditto.

Skip

From gward at python.net  Sun Nov 30 12:16:14 2003
From: gward at python.net (Greg Ward)
Date: Sun Nov 30 12:16:18 2003
Subject: [spambayes-dev] Clever avoidance technique
Message-ID: <20031130171614.GA10222@cthulhu.gerg.ca>

Here's a nifty variation on the invisible-text-in-HTML tactic: make the
invisible text vaguely relevant to the recipient of the spam.  I just
got one this morning that's immediately, obviously spam from these
headers:

  From: "Inconvenience O. Imprecision" <esteves@belice.com>
  To: Gward <gward@python.net>
  Subject: Gward, meet singles in your area          U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL
  Date: Sat, 29 Nov 2003 16:42:22 -0500

but if I look in the HTML body, I see this:

  <p><font color=3d"#FFFFFF">The Defense Technical Information Center (DTIC=
  =ae) is the central facility for the collection and dissemination of scie=
  ntific and technical information for the Department of Defense (DoD)=2e M=
  uch of this information is made available by DTIC in the form of technica=
  l reports about completed research, and research summaries of ongoing res=
  earch=2e u62Mb6TFJNptB0duTKrhqDiJDdBNRazm</font></p>

which isn't terribly relevant to me... but a little farther on (after
the actual spam payload, encoded of course), we see this:

  <p><font color=3d"#FFFFFF">The Handle System allows handles to be both cr=
  eated and resolved in a distributed fashion (see the diagram on this page=
   for an overview of the Handle System architecture)=2e Both creation and =
  resolution can be accomplished using dedicated clients, common clients su=
  ch as web browsers using special extensions or plug-ins, or unextended cl=
  ients going through various proxies=2e In all cases, communication with t=
  he Handle System is carried out using the Handle System protocol which ha=
  s a formal specification and some specific implementations, all freely av=
  ailable from CNRI=2e The protocol has a corresponding client library avai=
  lable in C and Java=2e The C client library has been used by CNRI in the =
  creation of a handle-aware extension to the Netscape and Microsoft web br=
  owsers=2e The Java client library has been used to create an http-to-hand=
  [...]

Interesting!  This would probably count as ham for any computer geek.
However, the above blurb describes software produced by my former
employer, and you can probably get to it with 3 or 4 clicks from my home
page.  And, knowing CNRI, the first blurb is probably vaguely related --
most of their money comes from the US military-industrial-entertainment
complex, after all.

This feels very much like it's targeted at Bayesian filters -- e.g. I
suspect SpamAssassin pre-2.6 would have had a better chance at calling
this one spam than Spambayes (which scored it 0.198, just barely ham for
my thresholds).
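
For anyone who wants to read the hidden text without decoding it by eye, here is a small sketch using Python's standard quopri module. The sample is a shortened, re-typed fragment of the second blurb with its hex escapes written in upper case, not the actual message, and decode_qp is just a name picked for this example.

```python
import quopri


def decode_qp(raw: bytes) -> str:
    """Decode a quoted-printable body, joining soft line breaks ('=' at end of line)."""
    return quopri.decodestring(raw).decode("latin-1")


# Shortened, re-typed fragment for illustration only.
sample = (
    b'<p><font color=3D"#FFFFFF">The Handle System allows handles to be both cr=\n'
    b'eated and resolved in a distributed fashion=2E</font></p>\n'
)
decoded = decode_qp(sample)
```

After decoding, the font tag and the words split across lines are readable again, which makes the white-on-white trick easy to spot.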

Full message attached in case you're curious.

        Greg
-- 
Greg Ward <gward@python.net>                         http://www.gerg.ca/
Earn cash in your spare time -- blackmail your friends!
-------------- next part --------------
An embedded message was scrubbed...
From: "Inconvenience O. Imprecision" <esteves@belice.com>
Subject: Gward,
	meet singles in your area          U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL
Date: Sat, 29 Nov 2003 16:42:22 -0500
Size: 4908
Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20031130/d32a569a/meet-singles.mht
From tim.one at comcast.net  Sun Nov 30 17:09:43 2003
From: tim.one at comcast.net (Tim Peters)
Date: Sun Nov 30 17:09:49 2003
Subject: [spambayes-dev] Clever avoidance technique
In-Reply-To: <20031130171614.GA10222@cthulhu.gerg.ca>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEKBHHAB.tim.one@comcast.net>

[Greg Ward]
> Here's a nifty variation on the invisible-text-in-HTML tactic: make
> the invisible text vaguely relevant to the recipient of the spam.

Jeremy got exactly the same white-on-white text as you got below, about two
weeks ago, although the container spam had different content (albeit the
same thrust).  I think the CNRI connection is coincidence -- lots of spam
contains color-on-close-color decoy text, but you never notice it except on
those rare occasions it ends up being hammy to you.

> I just got one this morning that's immediately, obviously spam from
> these headers:

Why, are you married now, or you just don't get ham on Saturdays anymore
<wink>?

>   From: "Inconvenience O. Imprecision" <esteves@belice.com>
>   To: Gward <gward@python.net>
>   Subject: Gward, meet singles in your area
> U7n2QHvxKLmBOhTROl57D5Q7crCNQzbL
>   Date: Sat, 29 Nov 2003 16:42:22 -0500
>
> but if I look in the HTML body, I see this:
>
>   <p><font color=3d"#FFFFFF">The Defense Technical Information Center
>   (DTIC= =ae) is the central facility for the collection and
>   dissemination of scie= ntific and technical information for the
>   Department of Defense (DoD)=2e M= uch of this information is made
>   available by DTIC in the form of technica= l reports about
>   completed research, and research summaries of ongoing res= earch=2e
> u62Mb6TFJNptB0duTKrhqDiJDdBNRazm</font></p>
>
> which isn't terribly relevant to me...

Jeremy thought it was, as DTIC worked closely with CNRI, and even hosted a
symposium on CNRI's handle system (the topic of the next blurb below).

> but a little farther on (after the actual spam payload, encoded of
> course), we see this:
>
>   <p><font color=3d"#FFFFFF">The Handle System allows handles to be
>   both cr= eated and resolved in a distributed fashion (see the
>    diagram on this page= for an overview of the Handle System
>   architecture)=2e Both creation and = resolution can be accomplished
>   using dedicated clients, common clients su= ch as web browsers
>   using special extensions or plug-ins, or unextended cl= ients going
>   through various proxies=2e In all cases, communication with t= he
>   Handle System is carried out using the Handle System protocol which
>   ha= s a formal specification and some specific implementations, all
>   freely av= ailable from CNRI=2e The protocol has a corresponding
>   client library avai= lable in C and Java=2e The C client library
>   has been used by CNRI in the = creation of a handle-aware extension
>   to the Netscape and Microsoft web br= owsers=2e The Java client
>   library has been used to create an http-to-hand= [...]
>
> Interesting!  This would probably count as ham for any computer geek.
> However, the above blurb describes software produced by my former
> employer, and you can probably get to it with 3 or 4 clicks from my
> home page.

Ya, and Johnny Carrero's 1998 Folsom Fitness Extravaganza is only two clicks
from your home page, the McConnell Brain Imaging Centre only one.  If they
were targeting you specifically, they hit stuff relatively *hard* to find.

> And, knowing CNRI, the first blurb is probably vaguely
> related -- most of their money comes from the US
> military-industrial-entertainment complex, after all.
>
> This feels very much like it's targeted at Bayesian filters -- e.g. I
> suspect SpamAssassin pre-2.6 would have had a better chance at calling
> this one spam than Spambayes (which scored it 0.198, just barely ham
> for my thresholds).

Jeremy and Guido both got spam a while back with a sure way to beat
SpamBayes:  the spam was added to replies to mailing list postings of
theirs, with their original subject lines and the quoted text of their
original messages.  That trick is all but guaranteed to find lots of tokens
hammy to you, and seems a lot cheaper & simpler than crawling over web pages
looking for "related interests".  But after a couple of those, we never saw
that trick again.  It's more expensive than spraying the same set of spam
content at every address you can find, and I expect the response rate from
targeting tech mailing-list posters was so low as to make it a net monetary
loss.

It would be nice to "do something" about the color-on-close-color trick, but
I don't yet see it *working* often enough to be worth the expense and
bother.
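
One cheap way to "do something" might be to flag text whose font color is close to the background color. The sketch below is only an illustration: the function names, the assumed white default background, and the distance threshold of 60 are all made up here, and nothing like this is claimed to exist in SpamBayes.

```python
def parse_hex_color(value):
    """Turn '#RRGGBB' into an (r, g, b) tuple of ints."""
    value = value.lstrip("#")
    if len(value) != 6:
        raise ValueError("expected #RRGGBB, got %r" % value)
    return tuple(int(value[i:i + 2], 16) for i in (0, 2, 4))


def is_nearly_invisible(text_color, background="#FFFFFF", threshold=60):
    """True if the colors differ by less than `threshold` (Manhattan distance over RGB)."""
    fg = parse_hex_color(text_color)
    bg = parse_hex_color(background)
    return sum(abs(a - b) for a, b in zip(fg, bg)) < threshold
```

A tokenizer could, for instance, emit a synthetic token whenever this returns true for a font tag, rather than trying to strip the decoy text; whether that would pay for itself is exactly the open question above.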


From tameyer at ihug.co.nz  Sun Nov 30 19:17:25 2003
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Nov 30 19:17:31 2003
Subject: [spambayes-dev] More CVS branch/tags questions
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304315B7A@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F29F3@its-xchg4.massey.ac.nz>

[Mark]
> I haven't tried to see 
> what conflicts actually arise, but am willing to.  I expect 
> the only real conflicts will be where a bug *has* been fixed 
> in both places.

I'm pretty sure that you'll find that the branch contains *nothing* that the
trunk doesn't already, apart from the WHAT_IS_NEW.TXT file.  In fact, I'll
bet you 50% of your SpamBayes Outlook plug-in profits to date that that's
the case <wink>.

> Or is that what you thought I meant, and still -0?

Well, yes.  My only worry is that the trunk contains various code that the
branch doesn't which is definitely not bug-fix and does change sb_server.py,
UserInterface.py and ProxyUI.py a fair bit.  Of course, if I did it right, I
didn't introduce any bugs adding that code anyway, but ...

OTOH, it would be nice to have some of those features in a release sooner
than May 04 :)  (Especially the one that lets people submit a decent bug
report).

I'll upgrade to +0 :)  I think everyone else is in favour, anyway.

> OK - this is my strawman plan:
[...]
> 3) Put together a binary from my current py2exe setup script, 
> which includes CVS and a number of sb_ programs.

Does one need a special version of py2exe for this?  If so, is it one that
there's a binary available for?  (i.e. can I do this without VC++?)

> 4) Announce this binary as a "binary-beta", calling it 0.75 
> or something.
> 5) Any major bugs will presumably be part of the "binary 
> framework", so maybe 0.76 etc, depending on the damage.

And on the 'encourage people to try it out' side, there are a few bugs that
have been fixed since 1.0a7, so they may wish to get those benefits.

[...rest of steps...]

All this looks +1 to me.

> As far as I can tell, almost everyone is in bug-fix-only mode 
> already (except that damn bass-playing Warsaw <wink>).

I kinda am, but certainly wasn't for a period after we went into 'feature
freeze' (hence the new features in the trunk, and not in the branch).  I
would be willing to hold off adding any new features for the (NZ/Au) summer,
although I would like to integrate the Japanese/Asian languages patches
(which have been patiently waiting, and continually updated, for a while
now).  I don't know if that's bug or feature tampering <wink>.

It would be interesting to try and put together a testing framework that
works with the apps (as in the other thread), too, but that could easily be
on a branch.

> So up until (8), 
> which is when we cut a new 1.0 branch, all new real features 
> go on some development branch (a branch per feature - 
> whatever).

I would be happy doing this, though.  Instead of a branch per feature, what
about a branch per app?  (So (as needed), an sb_server_experimental branch,
an sb_imapfilter_experimental branch, and so on (with better names)).

> We all reserve our right to use a fairly liberal 
> definition of "bug" for low-risk, high-benefit tweaks, but I 
> think our app is mature enough that we can happily tell 
> people who want truly new features to grab a CVS branch.  

Agreed.

> After the branch is cut, we get seriously anal about "bugfix 
> only", with the expectation the branch lasts only 4 weeks 
> (which flies when everyone is busy!)

Agreed.

> Moving fast towards 1.0 seems in everyone's benefit

Agreed :)

If everyone else agrees with this, we really ought to put a copy of the
above list somewhere, too, so that Barry can be pointed to it later <wink>.
README-DEVEL.TXT, maybe?

=Tony Meyer