From chris at penar.com  Thu Jun  3 09:56:57 2004
From: chris at penar.com (Chris)
Date: Thu Jun  3 09:57:33 2004
Subject: [spambayes-dev] Training
Message-ID: <AAEKLJLLBFNDBPJMKIPGCENCDGAA.chris@penar.com>

When I move items from "junk suspects" to my INBOX, the program is still not
learning to recognize these messages as good, even though I have this option
selected.  Do you have any tips?

Chris Penar


From rick at unc.edu  Mon Jun 14 14:29:47 2004
From: rick at unc.edu (Rick Peterson)
Date: Mon Jun 14 14:30:46 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
Message-ID: <200406141829.i5EITmMZ017077@smtp.unc.edu>


Any help appreciated...

I am using the SpamBayes Outlook Plug-in. When I 'delete-as-spam' a message it 'leaves' my Inbox for a second but then is redelivered back into my Inbox. This doesn't happen all the time and I have retrained SpamBayes a couple of times and it still happens.

Any thoughts/clues?

Ric


From tim.one at comcast.net  Mon Jun 14 14:46:02 2004
From: tim.one at comcast.net (Tim Peters)
Date: Mon Jun 14 14:46:11 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
In-Reply-To: <200406141829.i5EITmMZ017077@smtp.unc.edu>
Message-ID: <mailman.471.1087238771.6944.spambayes-dev@python.org>

[Rick Peterson]
> I am using the SpamBayes Outlook Plug-in. When I 'delete-as-spam' a
> message it 'leaves' my Inbox for a second but then is redelivered back
> into my Inbox. This doesn't happen all the time and I have retrained
> SpamBayes a couple of times and it still happens.
>
> Any thoughts/clues?

Unsure -- have never seen this myself.  Perhaps you have the Inbox
configured as your Spam folder, or perhaps you have an Outlook rule that's
moving it back?


From rick at unc.edu  Mon Jun 14 15:09:55 2004
From: rick at unc.edu (Rick Peterson)
Date: Mon Jun 14 15:10:41 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
Message-ID: <200406141909.i5EJ9tMZ027520@smtp.unc.edu>


Thanks for the quick response. My spam folder is "junkemail" and I have no rules configured. I will try uninstalling and reinstalling the client. Can't think of anything else to do at this point. :-(

 


From gtoal at gtoal.com  Wed Jun 16 02:47:57 2004
From: gtoal at gtoal.com (Graham Toal)
Date: Wed Jun 16 02:06:13 2004
Subject: [spambayes-dev] Deletes-as-spam come right back into InBox
In-Reply-To: <200406152211.i5FMBwR5000787@gtoal.com>
References: <200406152211.i5FMBwR5000787@gtoal.com>
Message-ID: <40CFED1D.mailKP1FQ84F@gtoal.com>

I've seen the same thing with an Outlook client talking to a regular
IMAP server.  In fact my wife's computer does this consistently - and
she is not using spambayes.  If it is outlook you are using, look there
for an answer rather than the mail server or the spam filter.

(and if you work it out let me know because I've spent weeks trying
to work this out...)

Graham

From melis1 at freeler.nl  Thu Jun 17 17:09:29 2004
From: melis1 at freeler.nl (M.P. vd Heiden)
Date: Sun Jun 20 23:57:13 2004
Subject: [spambayes-dev] update
Message-ID: <002801c454af$62c42c10$d782153e@amd>

please inform me about an update
From ulf at linial.de  Thu Jun 17 03:31:47 2004
From: ulf at linial.de (Linial)
Date: Mon Jun 21 00:08:47 2004
Subject: [spambayes-dev] Question Outlook PlugIn
Message-ID: <NHBBJJOHGNBGMPPKIPBGEEBEDBAA.ulf@linial.de>

How does SpamBayes-Outlook PlugIn work together with existing-Outlook rules?
For example: I have a rule that moves all incoming mails that contain
"example@test.de" to my "private" - folder. Does SpamBayes move them to the
"unsure" folder if it hasn't been trained? In that case I would have to sort my
mail by hand again and the Outlook rules would be useless.

Sorry but I can't test it by myself because I have many rules and tons of
subfolders.

Thank you in advance

Ulf Klarmann


From kennypitt at hotmail.com  Thu Jun 24 13:10:24 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Jun 24 13:12:23 2004
Subject: [spambayes-dev] Question Outlook PlugIn
In-Reply-To: <NHBBJJOHGNBGMPPKIPBGEEBEDBAA.ulf@linial.de>
Message-ID: <BAY16-DAV8g8yJblQml0000fe0d@hotmail.com>

Linial wrote:
> How does SpamBayes-Outlook PlugIn work together with existing-Outlook
> rules? 
> For example: I have a rule that moves all incoming mails that contain
> "example@test.de" to my "private" - folder. Does SpamBayes move them
> to the "unsure" folder if it hasn't been trained?

You need to make sure that background filtering is enabled (go to the
Advanced tab in SpamBayes Manager and check the "Enabled background
filtering" option).  It should be enabled by default in newer builds, but
check it to be sure.

This option gives the Outlook rules time to process messages before
SpamBayes looks at them, so messages will already be moved to the
destination folder before SpamBayes checks for spam.  Configure SpamBayes to
only look for spam in your Inbox and not in the destination folders for your
rules and everything should work just the way you want.

-- 
Kenny Pitt


From kennypitt at hotmail.com  Thu Jun 24 15:01:43 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Thu Jun 24 15:03:45 2004
Subject: [spambayes-dev] update
In-Reply-To: <002801c454af$62c42c10$d782153e@amd>
Message-ID: <BAY16-DAV8b3ndu1vtm00010503@hotmail.com>

M.P. vd Heiden wrote:
> please inform me about an update

At the following URL, you can subscribe to the Spambayes-announce mailing
list.  A notice is sent to this mailing list whenever a new version of
SpamBayes is released.

http://mail.python.org/mailman/listinfo/spambayes-announce

If you are already running SpamBayes, you can also use the "Check for new
version" menu item in the Outlook addin or the "Check for latest version"
menu item from the Proxy application tray icon to check if you have the
latest update.

-- 
Kenny Pitt


From skip at pobox.com  Thu Jun 24 22:42:50 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Jun 24 22:42:59 2004
Subject: [spambayes-dev] are checkins on head okay?
Message-ID: <16603.37162.198572.350989@montanaro.dyndns.org>

Can I check stuff in on the spambayes cvs head or are we frozen awaiting the
1.0 release?

Skip

From tameyer at ihug.co.nz  Fri Jun 25 19:58:39 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Jun 25 19:58:45 2004
Subject: [spambayes-dev] are checkins on head okay?
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E92A13@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677FDA@its-xchg4.massey.ac.nz>

> Can I check stuff in on the spambayes cvs head or are we 
> frozen awaiting the 1.0 release?

There's a 1.0 release branch, so knock yourself out :)

(Mark and I are very close to having 1.0rc2 done - the delay is my fault -
and then, hopefully, the mythical 1.0).

=Tony Meyer

---
Please always include the list (spambayes@python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


From ta-meyer at ihug.co.nz  Sat Jun 26 03:27:49 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sat Jun 26 03:27:59 2004
Subject: [spambayes-dev] 1.0 and beyond
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>

As you might have noticed, Mark & I have put out 1.0rc2.  The plan this time
is that it'll work flawlessly <wink> and in about a week we'll put the same
thing together for 1.0 (and do a better job of publicising it as someone
(Anthony?) suggested a while back).

As I noted in answer to Skip the HEAD is free for anything that you want to
do, and the 1.0_release branch should be left alone apart from really major
bugs or packaging issues, until 1.0 is done.

Does anyone have a plan after that?  Do we want to keep the 1.0 branch alive
and copy minor things across to it for a 1.1 release (someday!), or do we
want 1.1 to have lots of new features and stuff (i.e. the HEAD)?

Cheers,
Tony

BTW (on an unrelated note) sorry about my double-up replies to some
spambayes@python messages recently - the mail arrived here rather
sporadically (maybe the mail.python.org problems, maybe mine) and I didn't
realise that many of them had been dealt with already :)


From tdickenson at geminidataloggers.com  Sat Jun 26 09:47:52 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Sat Jun 26 09:47:57 2004
Subject: [spambayes-dev] correlated clues
Message-ID: <200406261447.52347.tdickenson@geminidataloggers.com>

I'm seeing a significant number of misclassified spams that come through 
mailing lists. If the original spam body is small then it doesn't generate 
enough tokens to outweigh those added by the mailing list. Manually removing 
those tokens from the list causes it to be firmly nailed as spam.

(To be fair, most of these small ones are viruses not spams. But spambayes 
does a good job of classifying those viruses that I receive direct, rather 
than via a list.)

Example evidence below.

Has anyone implemented or tested any mechanism to inhibit these gangs of 
tokens?

X-Spambayes-Classification: ham; 0.25
X-Spambayes-Evidence: '*H*': 0.67; '*S*': 0.16; 'so?': 0.11;
        'header:Received:4': 0.15; 'subject:] ': 0.16; 'url:zope': 0.19;
        'sender:addr:zope.org': 0.19; 'zope': 0.20;
        'email addr:zope.org': 0.20; 'think': 0.20;
        'to:addr:zope.org': 0.21; 'subject:Zope': 0.21;
        'sender:no real name:2**0': 0.23; 'url:mailman': 0.24;
        'url:listinfo': 0.24; 'url:mail': 0.26; 'subject:[': 0.29;
        'maillist': 0.31; 'url:org': 0.31; 'header:Errors-To:1': 0.32;
        'content-disposition:inline': 0.33; 'reply-to:none': 0.34;
        'subject:!': 0.72; 'charset:windows-1252': 0.88;
        'from:addr:info': 0.93; 'message-id:@mail.zope.org': 0.94;
        'subject:you': 0.95;
        'content-type:application/x-zip-compressed': 0.98;
        'filename:fname piece:zip': 0.98
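
An editorial sketch of how the reported score relates to the '*H*' and '*S*'
evidence values above. It assumes the chi-squared combining used by SpamBayes,
where the final score is (S - H + 1) / 2; the function name combined_score is
made up for illustration and is not part of SpamBayes.

```python
def combined_score(ham_evidence, spam_evidence):
    """Fold the '*H*' and '*S*' evidence values into one spam score.

    Assumption: the combiner reports (S - H + 1) / 2, so strong ham
    evidence pulls the score toward 0 and strong spam evidence toward 1.
    """
    return (spam_evidence - ham_evidence + 1.0) / 2.0


# Values taken from the X-Spambayes-Evidence header above.
score = combined_score(ham_evidence=0.67, spam_evidence=0.16)
print(score)  # approximately 0.245
```

The result is close to the 0.25 in the X-Spambayes-Classification header; the
small gap is expected because '*H*' and '*S*' are themselves shown rounded to
two places. It also shows why the list tokens matter: they raise '*H*', which
drags the combined score down even when a few clues are strongly spammy.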

-- 
Toby Dickenson

From sethg at GoodmanAssociates.com  Sun Jun 27 19:31:49 2004
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun Jun 27 19:31:53 2004
Subject: [spambayes-dev] spam on lists
Message-ID: <MHEGIFHMACFNNIMMBACACEPAIBAA.sethg@GoodmanAssociates.com>

I know that running a server-side filter is a pain and requiring
registration is seen as a hurdle for new posters.  However, I wonder if your
mail admins have considered using blacklists?  In addition to the usual
FCrDNS tests (I hope they're doing that), just checking the IP against a few
DNSBL's would probably stop a lot of the junk.  If you're not already doing
this, a good place to start would be these three lists:

dnsbl.sorbs.net
sbl-xbl.spamhaus.org
bl.spamcop.net
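
A minimal sketch of how such a DNSBL check is usually formed, assuming IPv4 and
the common convention of reversing the octets of the connecting address and
appending the list's zone. The helper names are hypothetical, and the reading
of the answer (an address in 127.0.0.0/8 means "listed") should be confirmed
against each list's own documentation before relying on it.

```python
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to look up for an IPv4 address in a DNSBL zone."""
    # IPv4Address() validates the input and raises ValueError on bad addresses.
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def is_listed(ip, zone):
    """Return True if the zone answers with a 127.x.x.x address for this IP.

    Assumption: any resolution failure is treated as "not listed".
    """
    try:
        answer = socket.gethostbyname(dnsbl_query_name(ip, zone))
    except socket.gaierror:
        return False
    return answer.startswith("127.")
```

For example, checking 192.0.2.1 against bl.spamcop.net means resolving the name
1.2.0.192.bl.spamcop.net; only is_listed() touches the network.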

--

Seth Goodman


From tim.peters at gmail.com  Sun Jun 27 20:21:38 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Sun Jun 27 20:21:43 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To: <MHEGIFHMACFNNIMMBACACEPAIBAA.sethg@GoodmanAssociates.com>
References: <MHEGIFHMACFNNIMMBACACEPAIBAA.sethg@GoodmanAssociates.com>
Message-ID: <1f7befae04062717213a8f8f52@mail.gmail.com>

[Seth Goodman]
> I know that running a server-side filter is a pain and requiring
> registration is seen as a hurdle for new posters.  However, I wonder if your
> mail admins have considered using blacklists?

Well, nobody here has any access to the machines running
mail.python.org.  If you want to talk to them,
mailto:postmaster@python.org is the way to do it.  But they're all
volunteers too, and generally can't make time to do anything except
crisis mgmt.

From skip at pobox.com  Sun Jun 27 20:30:55 2004
From: skip at pobox.com (Skip Montanaro)
Date: Sun Jun 27 20:31:07 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To: <MHEGIFHMACFNNIMMBACACEPAIBAA.sethg@GoodmanAssociates.com>
References: <MHEGIFHMACFNNIMMBACACEPAIBAA.sethg@GoodmanAssociates.com>
Message-ID: <16607.26303.639870.86327@montanaro.dyndns.org>


    Seth> I know that running a server-side filter is a pain and requiring
    Seth> registration is seen as a hurdle for new posters.  However, I
    Seth> wonder if your mail admins have considered using blacklists?  

What are you referring to?  Who are "your mail admins"?  Are you referring
to just the spambayes mailing lists or to something more global (all of the
lists on python.org)?  The spambayes lists are explicitly not filtered
because people need to send spam to them for study purposes on occasion.  If
you're referring to the more global issue of mailing lists hosted on
mail.python.org, a number of us are working on bringing up a new machine.
I'm not sure exactly what all the various bits will be at this point, but
the plan is for it to be better at rejecting spam and viruses than the current
machine.

-- 
Skip Montanaro
Got gigs? http://www.musi-cal.com/submit.html
Got spam? http://www.spambayes.org/
skip@pobox.com

From tim.one at comcast.net  Sun Jun 27 23:24:26 2004
From: tim.one at comcast.net (Tim Peters)
Date: Sun Jun 27 23:24:37 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <200406261447.52347.tdickenson@geminidataloggers.com>
Message-ID: <mailman.1.1088393077.9339.spambayes-dev@python.org>

[Toby Dickenson]
> I'm seeing a significant number of misclassified spams that come through
> mailing lists.  If the original spam body is small then it doesn't generate
> enough tokens to outweigh those added by the mailing list.  Manually
> removing those tokens from the list causes it to be firmly nailed as
> spam.

Toby, which training strategy do you use?  I don't have a real problem with
this, but I'm using train-on-error (mistakes and unsures).  A consequence is
that I train on only a tiny fraction of the mailing-list ham I receive, and
on a roughly equal number of mailing-list spam.  As a result, the "Mailman
clues" are roughly neutral for me.  Under train-on-everything, Mailman clues
would be strongly hammy (because I get much more ham than spam from most
Mailman lists I'm on).
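
A rough sketch of the train-on-error strategy described above, to make the
difference from train-on-everything concrete. The score and train callables
stand in for a real classifier and are hypothetical, and the cutoffs of 0.20
and 0.90 are assumed values, not settings taken from this thread.

```python
def train_on_error(messages, score, train, ham_cutoff=0.20, spam_cutoff=0.90):
    """Train only on mistakes and unsures; return how many messages were used.

    messages: iterable of (text, is_spam) pairs with known correct labels.
    score:    callable mapping text to a spam probability in [0, 1].
    train:    callable taking (text, is_spam) to update the classifier.
    """
    trained = 0
    for text, is_spam in messages:
        probability = score(text)
        if probability >= spam_cutoff:
            verdict = "spam"
        elif probability <= ham_cutoff:
            verdict = "ham"
        else:
            verdict = "unsure"
        expected = "spam" if is_spam else "ham"
        # Correctly classified mail is skipped, so plentiful list ham
        # does not keep pushing list-specific tokens toward "hammy".
        if verdict != expected:
            train(text, is_spam)
            trained += 1
    return trained
```

Under train-on-everything the final if test would simply be dropped, and every
ham message from a busy list would add to the ham counts of the list's tokens.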

> (To be fair, most of these small ones are viruses not spams. But
> spambayes does a good job of classifying those viruses that I receive
> direct, rather than via a list.)

And it's still a mystery to me as to why <wink>.

> Example evidence below.
>
> Has anyone implemented or tested any mechanism to inhibit these gangs of
> tokens?

Not any I know of that aren't already implemented.  Ignoring most header
lines by default is implemented, and your situation would be worse if it
weren't:  Mailman inserts a large pile of Mailman-specific header lines too,
like:

X-BeenThere: spambayes-dev@python.org
X-Mailman-Version: 2.1.5
List-Id: Development of the Pythonic Bayesian classifier
	<spambayes-dev.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/spambayes-dev>,
	<mailto:spambayes-dev-request@python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/spambayes-dev>
List-Post: <mailto:spambayes-dev@python.org>
List-Help: <mailto:spambayes-dev-request@python.org?subject=help>
List-Subscribe: <http://mail.python.org/mailman/listinfo/spambayes-dev>,
	<mailto:spambayes-dev-request@python.org?subject=subscribe>
Sender: spambayes-dev-bounces@python.org
Errors-To: spambayes-dev-bounces@python.org

for email on this list, and we ignore almost all of those by default.

A general problem with "doing something about this" is that correlation
*usually* helps in determining a correct classification, so I don't think
correlation is evil per se.  For example, your two strongest spam clues were


        'content-type:application/x-zip-compressed': 0.98;
        'filename:fname piece:zip': 0.98

and those are certainly correlated.  The cases where correlation works in
the wrong direction are noteworthy enough that people write about them when
they occur.  I don't have a good idea for identifying "bad correlation"
automatically and efficiently.


From stephena at hiwaay.net  Mon Jun 28 03:42:29 2004
From: stephena at hiwaay.net (Stephen Anderson)
Date: Mon Jun 28 03:41:38 2004
Subject: [spambayes-dev] Mental Musings on Spam Catching
Message-ID: <40DF6975.22835.1DAFD79@localhost>

[Right off the bat, my email is long-winded and I apologize.  I think Ben Franklin once said 
something like,  "I apologize but I don't have time to be brief."]

I've been slowly trying to design a spam filter program using many of the lessons learned 
tuning spambayes as a starting place.  I'm not saying I tuned spambayes but I've followed 
most of the discourses on the list and read through the source documentation for several 
ideas that predated my joining.  I'm going to give you a little background (cause somebody 
always asks) but I was hoping maybe some of you with more experience could give me your 
opinion on a filtering idea that I had.  (I know, theorizing is nothing, testing is everything)

There are really four reasons I've been trying to do this. 1) It's fun and I like it; 2) My gut tells 
me SB is missing a lot of potentially useful information; 3) SB algorithm development has 
really slowed down from its heyday (which I can understand since it seems to work for most 
of the developers) and; 4) SB really underperforms on my father's email in comparison to 
most other people's, my own included.  That last point isn't so much why I started trying to 
build my own but has been leading me to try to figure out what exactly is SB's weakness in 
that it is so unsure (and so often wrong) about his email.

The general direction I had been working towards is something very SB'ish with three main 
exceptions: 1) much more reliant on meta-tokens for email characteristics (ala SA) than SB,  
2) Very different token scoring using mean and std-dev of occurrence / # of total body 
tokens; and 3) automatic expiration and removal of statistics after a certain time-period.

Now a facet that I had been thinking about was elimination of duplicate messages.  In the 
system I'm building, I don't want the system to ever be trained on the same message more 
than once.  But obviously I don't want to store every message it has seen up to the 
expiration date (which I'm thinking would be 9-16 months).  I also wanted to eliminate 
substantially duplicate emails.  My googling brought me to the Distributed Checksum 
Clearinghouse and DCC's use of a "fuzzy checksum" algorithm to do almost exactly what I 
was looking for.

After discovering that, I started contemplating how else I might incorporate a fuzzy 
checksum.  I hadn't really thought of anything, but it put me in the right frame of mind for my 
next thought.  I was reading through Tim Peters' very recent list response about "Correlated 
Values."  I have reflected a lot on that very subject.  How in the world can the system know 
that some highly correlated values should have more weight than other highly correlated 
values?  Honestly, I've never been able to come up with any reasonable scheme that might 
work.

I was sitting on my think chair (flush..) contemplating all that when I asked myself how some 
small emails could be so obviously spam to me and yet be so hard for SB because they 
have a few highly correlated ham tokens and fewer highly correlated spam tokens.  I 
realized that it's much easier for me because I'm looking at the whole picture, not on a token 
by token basis.

So this brought me round to the bi-grams and n-grams thinking.  I'm no real fan of them 
because I don't like the database growth and past poor tested performance.  I was also 
mulling over DSPAM's n-gram offshoot scheme which I can't remember the name of.  Finally 
it all kind of coalesced in my mind to the idea I wanted to run past everybody.  Flavors.

Emails have flavors and you can mix and match flavors.  Some mixes lead to good tasting 
emails and some lead to bad tasting emails.  Just like desserts.  I can have ice cream, nuts, 
fruit, chocolate syrup, sugar sprinkles, and other stuff.  Ice cream, fruit, and chocolate syrup 
is usually a winner.  Nuts, sugar sprinkles, and chocolate syrup is usually not a good dessert.

What if we could use n-gram statistics to process the email as a more gestalt entity.  But, 
instead of using n-grams on the words, what if we used it on a fixed set of potential tokens.

I was thinking how one might take the token stream from the body and come up with a fuzzy 
checksum if you will.  But instead of going the fuzzy checksum route, could we analyse the 
token stream to come up with meta-tokens that would indicate a more complex characteristic 
of the email than just a single word or token could provide?

For instance, what if after all the tokens are generated, we can look at the token stream and 
perform some defined analysis to make decisions about certain specific characteristics of 
email messages.  I know the devil is in the details and I haven't thought that far ahead.  Right 
now I'm thinking things like: HIGH-PRCNT-HTML, or TYPICAL-MAILLIST, or 
HIGH-PIC-CONTENT, or NESTED-ENCODING-SCHEMES, or HIGH-HAPAXE-CONTENT, or 
HIGH-CODE-CONTENT or HAM-BLOCKS-AND-SPAM-BLOCKS.

There should be characteristics we could objectively define about email messages where 
the characteristics are orthogonal and each one conveys information aggregated from the 
token stream but more than any single regular token could convey.  Most of my examples 
are crap, but the high mailing list probability is a good one.  There should be several email 
traits that we could see in the token stream that when taken together are characteristic of a 
mailing list message.  This could then be one flavor token.

After this, we can use an n-gram scheme to give every email message a flavor list.  Like this 
message I'm writing could be FLAVS-A,D,E where A is mailing list content, D is long-winded, 
and E is high sentence length.  With 10 email characteristics there would be 1023 possible 
combinations.  That's not a lot to track in the database but I think it would be very valuable 
because certain flavor combinations will be very spammy and certain ones will be very 
hammy.
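
A small sketch of the flavor-list idea, mainly to check the arithmetic: with 10
characteristics, the non-empty combinations number 2**10 - 1 = 1023. The letters
A through J and the helper names are placeholders chosen for illustration; only
the "FLAVS-A,D,E" token shape comes from the text above.

```python
from itertools import combinations

# Ten hypothetical email characteristics, one letter each.
FLAVORS = "ABCDEFGHIJ"


def flavor_token(present):
    """Turn the set of characteristics found in a message into one token."""
    return "FLAVS-" + ",".join(sorted(present))


def all_flavor_tokens(flavors=FLAVORS):
    """Enumerate every non-empty combination of characteristics."""
    tokens = []
    for size in range(1, len(flavors) + 1):
        for combo in combinations(flavors, size):
            tokens.append(flavor_token(combo))
    return tokens


print(flavor_token({"E", "A", "D"}))
print(len(all_flavor_tokens()))
```

Each message contributes exactly one such token, so the database cost is bounded
by the number of combinations rather than by vocabulary size.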

So many of my father's emails that SB stumbles on are tripped up because spammers (or 
dumb luck) have been able to put highly-correlated ham words inside with their short spam 
message.  But these emails don't have the same flavor of a real ham message.  I guess 
what I'm trying to say is that it's relatively easy to bury spam in fake ham "content" because 
we look at the words in isolation.  It would be considerably harder to fake multiple 
characteristics that consider the email in its entirety.

Okay, I should sleep more and type less.  Thanks for your time.

Cheers,
Steve


From skip at pobox.com  Mon Jun 28 07:05:31 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Jun 28 07:05:36 2004
Subject: [spambayes-dev] spam on lists
In-Reply-To: <MHEGIFHMACFNNIMMBACAKEPDIBAA.sethg@GoodmanAssociates.com>
References: <16607.26303.639870.86327@montanaro.dyndns.org>
	<MHEGIFHMACFNNIMMBACAKEPDIBAA.sethg@GoodmanAssociates.com>
Message-ID: <16607.64379.201895.103938@montanaro.dyndns.org>


    Seth> However, I wonder if your mail admins have considered using
    Seth> blacklists?

    >> What are you referring to?

    Seth> All I suggested was to consider using a few DNSBL's to reject
    Seth> incoming posts based on the connecting IP's being listed.  

Ah, well, I think most of us have been erroneously blacklisted.  Makes me
gun-shy.  At any rate, I believe that's probably going to fall to XS4ALL, the
host for the new machine.  They will be the MX for python.org and have a
suite of stuff they do before passing the mail along to the new
mail.python.org.

Note also that although you mentioned DNSBLs, you also used the very generic
term "blacklist" right off the bat, which has very different connotations in
a Spambayes context.

Skip

From missmoney at alist.co.uk  Mon Jun 28 12:12:22 2004
From: missmoney at alist.co.uk (Miss Moneypennys)
Date: Mon Jun 28 07:12:27 2004
Subject: [spambayes-dev] Music2Dance2 2nd July
Message-ID: PM200012:12:22

An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040628/0e570dda/attachment.html
From anthony at interlink.com.au  Mon Jun 28 09:13:20 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Mon Jun 28 09:13:38 2004
Subject: [spambayes-dev] 1.0 and beyond
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F13064C02A1@its-xchg4.massey.ac.nz>
Message-ID: <40E01970.3090803@interlink.com.au>

Tony Meyer wrote:
> As you might have noticed, Mark & I have put out 1.0rc2.  The plan this time
> is that it'll work flawlessly <wink> and in about a week we'll put the same
> thing together for 1.0 (and do a better job of publicising it as someone
> (Anthony?) suggested a while back).
> 
> As I noted in answer to Skip the HEAD is free for anything that you want to
> do, and the 1.0_release branch should be left alone apart from really major
> bugs or packaging issues, until 1.0 is done.

I would make this statement even stronger - unless you're involved in
the release, you should NOT be touching the 1.0 branch until 1.0 is
out the door.

> Does anyone have a plan after that?  Do we want to keep the 1.0 branch alive
> and copy minor things across to it for a 1.1 release (someday!), or do we
> want 1.1 to have lots of new features and stuff (i.e. the HEAD)?

Don't put new features into a 1.0.1 &c release - only include bug fixes.
This saves effort (no wasted effort to backport new features, which will
become harder as the trunk drifts away from the 1.0 branch), and means
that people with 1.0 know that they can safely install 1.0.1, 1.0.2 and
not have it break things. We do this for Python, and the feedback I get
is overwhelmingly positive.

Speaking as a former sysadmin, I'm going to be much happier if I know
that a bugfix release is only a bugfix release.

Anthony


-- 
Anthony Baxter     <anthony@interlink.com.au>
It's never too late to have a happy childhood.

From kennypitt at hotmail.com  Mon Jun 28 12:12:11 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Jun 28 12:14:23 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677FDB@its-xchg4.massey.ac.nz>
Message-ID: <BAY16-DAV2niiRA6Tdz00011732@hotmail.com>

Tony Meyer wrote:
>> This screenshot is taken from a development version running from
>> source code.  SpamBayes detects if it is running from the released
>> binaries or from source code, and it does not show the "Execute test
>> suite" option if running from the released binaries.
> 
> We really ought to use a shot from the binary and use that.  Any
> chance you want to do that? <wink> 

OK, I grabbed the attached screenshot.  I'm running Outlook 2003 so the
toolbar style is a little different.  If that's OK with everyone then I'll
go ahead and check it in on the trunk and Tony can migrate it back to the
1.0 branch if desired.

-- 
Kenny Pitt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: manager-select.jpg
Type: image/jpeg
Size: 14556 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040628/1b1280b1/manager-select.jpg
From iimspg at massey.ac.nz  Mon Jun 28 18:33:32 2004
From: iimspg at massey.ac.nz (IIMS Postgraduate Representative)
Date: Mon Jun 28 18:33:41 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E930E6@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678014@its-xchg4.massey.ac.nz>

> OK, I grabbed the attached screenshot.  I'm running Outlook 
> 2003 so the toolbar style is a little different.  If that's OK
> with everyone then I'll go ahead and check it in on the trunk
> and Tony can migrate it back to the 1.0 branch if desired.

Go ahead and check it in (thanks!).  I notice now that the old one is
actually missing an item (Filter Messages), too.  The style doesn't really
matter - it's not that different.

I'll leave the 1.0 one alone (it's a pretty minor thing, even though it's
only documentation), but copy the change across for any 1.0.1 release.

=Tony Meyer

---
Please always include the list (spambayes@python.org) in your replies
(reply-all), and please don't send me personal mail about SpamBayes. This
way, you get everyone's help, and avoid a lack of replies when I'm busy.


From tameyer at ihug.co.nz  Mon Jun 28 18:35:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jun 28 18:35:47 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E931D5@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678015@its-xchg4.massey.ac.nz>

> -----Original Message-----
> From: spambayes-dev-bounces@python.org 
> [mailto:spambayes-dev-bounces@python.org] On Behalf Of IIMS 
> Postgraduate Representative
[...]

Oops.  Don't clean your keyboard while in the middle of typing a message, or
look what might happen <wink>.

=Tony Meyer


From kstocky at yahoo.com  Mon Jun 28 22:37:10 2004
From: kstocky at yahoo.com (Kim Stockdale)
Date: Mon Jun 28 22:37:13 2004
Subject: [spambayes-dev] FAQ quest? Using spam bayes with outlook 2003 spam
	filter 
Message-ID: <20040629023710.20126.qmail@web40413.mail.yahoo.com>


I've been using spam bayes with Outlook 2000 and loved
it.  I upgraded to Outlook 2003, and I understand
Outlook's built in spam filter is automatically on. 
Is there anything I need to do to the Outlook 2003
spam settings in order to use Spam Bayes?  Can/should
they both be used?

Thank you,
Kim Stockdale


__________________________________
Do you Yahoo!?
Yahoo! Mail - 50x more storage than other providers!
http://promotions.yahoo.com/new_mail

From ta-meyer at ihug.co.nz  Mon Jun 28 22:53:22 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Mon Jun 28 22:53:29 2004
Subject: [spambayes-dev] Difference between "show clues" scoring and filter
	scoring
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B0@its-xchg4.massey.ac.nz>

I just had a very odd experience - a message (the one that Tim just replied
to on spambayes@python.org) arrived and ended up in my spam folder.  The
spam field has a value of 1.00 in it - I was quite surprised at the false
positive, so I did a "Show clues", and it scores 15% (high, but ok).

I can't figure out why there is a discrepancy.  There was definitely no
training (I even checked the log, which confirms the move to the spam
folder) between the two scorings.

Does "show clues" get the message differently somehow?  Is there some other
explanation for this?  FWIW, there aren't any errors in the log, either.

=Tony Meyer


From mhammond at skippinet.com.au  Tue Jun 29 00:29:51 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Jun 29 00:29:53 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and
	filter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B0@its-xchg4.massey.ac.nz>
Message-ID: <002701c45d91$b9bdeed0$0200a8c0@eden>

> I just had a very odd experience - a message (the one that Tim just
> replied to on spambayes@python.org) arrived and ended up in my spam
> folder.  The spam field has a value of 1.00 in it - I was quite
> surprised at the false positive, so I did a "Show clues", and it
> scores 15% (high, but ok).

See also:

https://sourceforge.net/tracker/index.php?func=detail&aid=972359&group_id=61702&atid=498103

If the message you are referring to is a MIME message, it may be the same
thing.  I believe the issue will be the butchery we do with the content-type
and mime-armour.

In that bug, I notice that the headers, as displayed by outlook, end with:
"""
X-OriginalArrivalTime: 14 Jun 2004 01:39:10.0757 (UTC)
FILETIME=[64083550:01C451B0]

------=_NextPart_000_0D33_D859C01A.71B0FC9A
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit

------=_NextPart_000_0D33_D859C01A.71B0FC9A
Content-Type: text/html; charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable


------=_NextPart_000_0D33_D859C01A.71B0FC9A--
"""

Unless I am mistaken, everything after the blank line is part of the body -
but as I mentioned, Outlook is returning it in the headers.  I assume, but
have not verified, that us fetching the headers via the MAPI property will
give us the same string.  I also suspect, but haven't confirmed, that this
will screw up what we do with the headers, especially if we append headers
*after* the blank line.
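
A minimal sketch of the suspected failure, using only the standard
library's email parser.  The header string is shortened from the bug
report above; the appended X-Spambayes-Classification header and the
helper name split_at_blank_line are assumptions for illustration, not
the plug-in's actual code.

```python
from email import message_from_string

# Illustrative "headers" string as Outlook appears to return it: real
# headers, a blank line, then leaked MIME part markers from the body.
leaky_headers = (
    "MIME-Version: 1.0\n"
    "X-OriginalArrivalTime: 14 Jun 2004 01:39:10.0757 (UTC)\n"
    "\n"
    "------=_NextPart_000_0D33_D859C01A.71B0FC9A\n"
    'Content-Type: text/plain; charset="iso-8859-1"\n'
)


def split_at_blank_line(text):
    """Return (header_block, leaked_body), split at the first blank line."""
    head, _sep, rest = text.partition("\n\n")
    return head.rstrip("\n") + "\n", rest


# Naive approach: append a header to the end of the returned string.
# It lands after the blank line, so the parser treats it as body text.
naive = message_from_string(
    leaky_headers + "X-Spambayes-Classification: spam\n")

# Safer approach: insert the header before the blank line.
head, leaked = split_at_blank_line(leaky_headers)
fixed = message_from_string(
    head + "X-Spambayes-Classification: spam\n\n" + leaked)

print(naive["X-Spambayes-Classification"])  # None
print(fixed["X-Spambayes-Classification"])  # spam
```

If the MAPI property really does return the leaked part markers, any
code that appends to that string would need a split like this first.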

Mark.


From tameyer at ihug.co.nz  Tue Jun 29 01:03:41 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jun 29 01:03:43 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and
	filter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E93294@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02B3@its-xchg4.massey.ac.nz>

> If the message you are referring to is a mime message,
> it may be the same thing.  I believe the issue will be
> the butchery we do with the content-type and mime-armour.

The message has a "MIME-Version" header, but no other mime headers.  I've
attached the message to the tracker.

> Unless I am mistaken, everything after the blank line
> is part of the body - but as I mentioned, Outlook is
> returning it in the headers.  I assume, but have not
> verified, that us fetching the headers via the MAPI
> property will give us the same string.  I also suspect, but
> haven't confirmed, that this will screw up what we do
> with the headers, especially if we append headers
> *after* the blank line.

Does that still fit with my message that didn't have the multipart stuff?

=Tony Meyer


From tim.peters at gmail.com  Tue Jun 29 01:06:34 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Tue Jun 29 01:06:41 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and
	filter scoring
In-Reply-To: <002701c45d91$b9bdeed0$0200a8c0@eden>
References: <002701c45d91$b9bdeed0$0200a8c0@eden>
Message-ID: <1f7befae04062822063069a9d0@mail.gmail.com>

[Mark Hammond]
> ...
> If the message you are referring to is a mime message, it may be the same
> thing.  I believe the issue will be the butchery we do with the content-type
> and mime-armour.

All email is MIME now <0.9 wink>.  I think Tony must mean this msg:

http://mail.python.org/pipermail/spambayes/2004-June/013665.html

If so, it's 

Content-Type: text/plain; charset=us-ascii

and has no non-trivial MIME structure (one text/plain part, no boundaries).

I'd be curious to know even why it scored 15 for Tony!  There's
nothing clearly spammy about it, apart from a Yahoo blurb at the
bottom, and yahoo sender address.

From tameyer at ihug.co.nz  Tue Jun 29 01:16:47 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jun 29 01:16:52 2004
Subject: [spambayes-dev] RE: Difference between "show clues" scoring and
	filter scoring
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306E932A7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678028@its-xchg4.massey.ac.nz>

> I'd be curious to know even why it scored 15 for Tony!
> There's nothing clearly spammy about it, apart from a
> Yahoo blurb at the bottom, and yahoo sender address.

I do get a bit of spam from Yahoo (lists), so that's part of it.  The other
spammy clues are all from 6 or fewer messages, so maybe 'small database'
accounts for some of it.  The imbalance probably doesn't help (though
wouldn't an imbalance in this direction account more for false negatives?).

Combined Score: 15% (0.150615)
Internal ham score (*H*): 0.802823
Internal spam score (*S*): 0.104054
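
For reference, the three numbers above are consistent with the usual
chi-squared combining step, where the combined score is the midpoint
formula (S - H + 1) / 2.  A small sketch (the function name is
hypothetical):

```python
def combined_score(ham_score, spam_score):
    """Fold the internal *H* and *S* scores into one 0..1 spam score."""
    return (spam_score - ham_score + 1.0) / 2.0


# Values copied from the "Show clues" output above.
score = combined_score(ham_score=0.802823, spam_score=0.104054)
print("%.4f" % score)
```

This agrees with the reported 0.150615 to the precision shown, so the
15% at least follows from the internal scores.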

# ham trained on: 55
# spam trained on: 322

50 Significant Tokens
token                               spamprob         #ham  #spam
'bi:2000 and'                       0.0918367           2      0
'bi:outlook 2003'                   0.0918367           2      0
'outlook'                           0.108078            4      2
'it.'                               0.116936           10      7
'filter'                            0.135074            3      2
'bi:they both'                      0.155172            1      0
'skip:_ 30'                         0.155172            1      0
'built'                             0.164747            4      4
'spam'                              0.164747            4      4
'anything'                          0.173132            7      8
'there'                             0.186543           22     29
'settings'                          0.210929            1      1
'upgraded'                          0.210929            1      1
"i've"                              0.218646            7     11
'check'                             0.257231           14     28
'skip:_ 40'                         0.260614            6     12
'use'                               0.261442           19     39
'url:listinfo'                      0.270438            8     17
'url:mailman'                       0.270438            8     17
'before'                            0.271457           13     28
'on.'                               0.271748            2      4
'loved'                             0.286634            1      2
'url:org'                           0.288353           17     40
'need'                              0.289067           14     33
'order'                             0.289706            6     14
'yahoo!'                            0.294234            3      7
'bi:url:mail url:python'            0.329461            6     17
'skip:a 10'                         0.336552           25     74
'subject:] '                        0.339934           16     48
'url:sf'                            0.344635            3      9
'header:Errors-To:1'                0.360176           21     69
'subject:Spambayes'                 0.371204            7     24
'sender:no real name:2**0'          0.371437           20     69
'url:yahoo'                         0.377219            4     14
'more'                              0.38184            23     83
'subject:['                         0.390077           15     56
'bi:header:Received:6 header:From:1' 0.393314            9     34
'sender:addr:spambayes-bounces'     0.397277            6     23
'url:html'                          0.603261           12    107
'bi:header:MIME-Version:1 proto:http' 0.62941             4     40
'bi:been using'                     0.844828            0      1
'bi:than other'                     0.844828            0      1
'bi:thank you,'                     0.844828            0      1
'bi:subject:?  subject: '           0.934783            0      3
'subject:2003'                      0.934783            0      3
'understand'                        0.934783            0      3
'2003,'                             0.949438            0      4
'subject:with'                      0.958716            0      5
'subject:\n\t'                      0.965116            0      6
'from:addr:yahoo.com'               0.987106            0     17
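
The spamprob column can be reproduced from the #ham/#spam counts and the
training totals (55 ham, 322 spam).  The sketch below assumes the default
"unknown word" settings of strength 0.45 and probability 0.5; those two
values and the function name are assumptions, although they do match the
rows checked here.

```python
def token_spamprob(hamcount, spamcount, nham, nspam,
                   strength=0.45, unknown_prob=0.5):
    """Per-token spam probability with a Bayesian pull toward unknown_prob."""
    hamratio = hamcount / float(nham or 1)
    spamratio = spamcount / float(nspam or 1)
    # Raw estimate from training ratios (not raw counts), which is what
    # compensates for the 55 vs 322 imbalance.
    prob = spamratio / (hamratio + spamratio)
    # Tokens seen only a few times are pulled toward unknown_prob.
    n = hamcount + spamcount
    return (strength * unknown_prob + n * prob) / (strength + n)


for token, ham, spam in [("bi:2000 and", 2, 0),
                         ("outlook", 4, 2),
                         ("from:addr:yahoo.com", 0, 17)]:
    print("%-22s %.6f" % (token, token_spamprob(ham, spam, 55, 322)))
```

Note how a token seen in only two ham messages cannot get below about
0.09, while 17 spam-only sightings push 'from:addr:yahoo.com' to about
0.987.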

=Tony Meyer


From mhammond at skippinet.com.au  Tue Jun 29 01:50:42 2004
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Tue Jun 29 01:50:48 2004
Subject: [spambayes-dev] RE: [Spambayes] Execute test suite
In-Reply-To: <BAY16-DAV2niiRA6Tdz00011732@hotmail.com>
Message-ID: <004401c45d9d$05d302a0$0200a8c0@eden>

> OK, I grabbed the attached screenshot.  I'm running Outlook 2003 so the
> toolbar style is a little different.  If that's OK with everyone then
> I'll go ahead and check it in on the trunk and Tony can migrate it back
> to the 1.0 branch if desired.

Sounds and looks great!  Thanks,

Mark.


From tdickenson at geminidataloggers.com  Wed Jun 30 12:36:48 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Wed Jun 30 12:36:52 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
Message-ID: <200406301736.48606.tdickenson@geminidataloggers.com>

On Monday 28 June 2004 04:24, Tim Peters wrote:
> [Toby Dickenson]
>
> > I'm seeing a significant number of misclassified spams that come through
> > mailing lists. 
>
> Toby, which training strategy do you use?

As you guessed, I train on everything. It's very low effort to maintain, but it
has left me with several training imbalances like this that have adversely
affected classification accuracy.

> A general problem with "doing something about this" is that correlation
> *usually* helps in determining a correct classification, so I don't think
> correlation is evil per se.  

> I don't have a good idea for identifying "bad correlation"
> automatically and efficiently.

My standards are lower... For now I would be happy with a manual and inefficient
process that picked out just this one special type of correlated clue. A
couple of prototypes later, and even that isn't as easy as I hoped :-(

I'll let you know if anything good comes of this.

-- 
Toby Dickenson

From lyris at newsletter.stern.de  Wed Jun 30 15:39:05 2004
From: lyris at newsletter.stern.de (Lyris ListManager)
Date: Wed Jun 30 15:45:05 2004
Subject: [spambayes-dev] Automatic reply - Newsletter Lifestyle
Message-ID: <LYRIS0-1088624345--115087-lyris@newsletter.stern.de>

Dear users,

This is an automatically generated mail. If you want to unsubscribe from
your newsletter, you must click the "Unsubscribe" button at the end of
the newsletter. Note: you must be online for this process.

If your mail client has disabled the button, you can visit the following
page: http://www.stern.de/mein-stern-de/newsletter/
In the box in the right-hand column, select the newsletter you no longer
wish to receive, then enter your e-mail address.  You will be removed
from the distribution list immediately.

Your stern.de team.


From gbrown at alumni.caltech.edu  Wed Jun 30 15:46:38 2004
From: gbrown at alumni.caltech.edu (Glenn Brown)
Date: Wed Jun 30 15:46:39 2004
Subject: [spambayes-dev] Phishing detection?
In-Reply-To: <200406301736.48606.tdickenson@geminidataloggers.com>
Message-ID: <01d101c45eda$f5c3fc40$0d08a8c0@Glenn>

HTML like the following is a very strong indicator of phishing.  That is, if
both the HREF and content of an anchor are URLs, but are different URLs,
then someone is phishing; especially if the servers are different.  Heck,
just the visible content part of an anchor being a URL is phishy.

Joe Spammer > 	<A href="http://219.148.127.66/scripts/confirmation.htm">
Joe Spammer >	https://wwww.citibank.com/signin/confirmation.jsp</A>

Anybody think this worth a "URL:in_anchor" token (easier to implement) or
"URL:contradicts_anchor" token (harder)?  Easier yet would be
"URL:IPaddr_instead_of_hostname", but I know the weaker "URL:0" to "URL:255"
tokens are already generated.
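
A rough sketch of how the two proposed tokens might be generated, using
only the standard library.  The class and function names are
hypothetical, the token spellings are taken from the proposal above, and
none of this is existing SpamBayes tokenizer code.

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class AnchorTokenizer(HTMLParser):
    """Collect phishing-style tokens from <a href=...>text</a> pairs."""

    def __init__(self):
        super().__init__()
        self.href = None      # href of the anchor currently open, if any
        self.text = []        # visible text collected inside that anchor
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self.href is None:
            return
        visible = "".join(self.text).strip()
        if visible.lower().startswith(("http://", "https://")):
            # The visible text is itself a URL.
            self.tokens.append("URL:in_anchor")
            if urlsplit(visible).hostname != urlsplit(self.href).hostname:
                # ...and it names a different server than the real target.
                self.tokens.append("URL:contradicts_anchor")
        self.href = None


def anchor_tokens(html):
    parser = AnchorTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


sample = ('<A href="http://219.148.127.66/scripts/confirmation.htm">\n'
          'https://wwww.citibank.com/signin/confirmation.jsp</A>')
print(anchor_tokens(sample))
```

Comparing only host names keeps the "harder" token fairly cheap, though
it would also fire on legitimate redirectors and click-tracking links,
so testing would have to show whether it helps.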

--Glenn


From tim.peters at gmail.com  Wed Jun 30 20:00:38 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Wed Jun 30 20:00:46 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <200406301736.48606.tdickenson@geminidataloggers.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
	<200406301736.48606.tdickenson@geminidataloggers.com>
Message-ID: <1f7befae040630170030eef019@mail.gmail.com>

[Tim Peters]
>> I don't have a good idea for identifying "bad correlation"
>> automatically and efficiently.

[Toby Dickenson]
> My standards are lower... For now I would be happy with a manual and inefficient
> process that picked out just this one special type of correlated clue. A
> couple of prototypes later, and even that isn't as easy as I hoped :-(
>
> I'll let you know if anything good comes of this.

If you like, talk about what you tried, and what happened.  The
history of trying new gimmicks here is overwhelmingly a story of
failure, and that's normal for this kind of classifier.  Sharing what
doesn't work saves overall effort too.

We have two anti-bad-correlation gimmicks now, driven by early testing
results, and rationalized after the fact <wink>:

1. As mentioned last time, ignoring most header lines.  If we didn't,
virtually all spam on mailing lists would score unsure or FN (thanks
to a large number of distinct but correlated "I came from a mailing
list" header tokens).

2. Stripping most evidence of HTML (like throwing away all HTML tags).
 If we didn't, virtually all HTML email would score unsure or FP
(thanks to a huge number of distinct but correlated "HTML was used"
body tokens).
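
To make the two gimmicks concrete, here is a much-simplified sketch of
both: tokenizing only a short whitelist of headers, and tokenizing only
the text a reader would see in an HTML body.  The whitelist contents and
all names are assumptions for illustration; the real tokenizer is
considerably more involved.

```python
from email import message_from_string
from html.parser import HTMLParser

# Hypothetical whitelist: only these headers contribute word tokens.
TOKENIZED_HEADERS = {"subject", "from"}


def header_tokens(msg):
    """Yield tokens from whitelisted headers; ignore all other headers."""
    for name, value in msg.items():
        if name.lower() in TOKENIZED_HEADERS:
            for word in value.split():
                yield "%s:%s" % (name.lower(), word)


class _TextOnly(HTMLParser):
    """Keep character data, discard every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_words(html):
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


msg = message_from_string(
    "List-Id: <zope.zope.org>\n"
    "Errors-To: zope-bounces@zope.org\n"
    "Subject: cheap pills\n"
    "\n"
    "<html><body><p><b>Buy</b> <font color=red>now</font></p></body></html>\n")

print(list(header_tokens(msg)))
print(visible_words(msg.get_payload()))
```

With the mailing-list headers and the markup out of the way, the only
evidence left is the subject and the words "Buy now", which is the
intended effect in both cases.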

#1 bothers me more than #2 -- I view #2 mostly as a way of scoring
content (what the user sees) instead of encoding, and I think that's
a good thing to strive for independent of the correlation aspect.  #1
is more purely a hack.

Maybe another pure but personalized hack would be to add a list of
specific tokens you want the classifier to pretend didn't exist.  For
example, 'url:zope'  is probably strongly hammy for you, but isn't
needed to get Zope mailing list ham scored as ham.  If that's so, the
only visible effect it has is to reduce the score on Zope mailing list
spam.
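
A sketch of what such a personalized ignore list could look like, as a
filter applied to the token stream before scoring.  Nothing like this
exists in the code today; the names and the sample tokens are
hypothetical.

```python
# Hypothetical per-user list of tokens the classifier should never see.
IGNORED_TOKENS = frozenset(["url:zope"])


def drop_ignored(tokens, ignored=IGNORED_TOKENS):
    """Return the tokens in order, minus any on the ignore list."""
    return [tok for tok in tokens if tok not in ignored]


tokens = ["subject:cheap", "url:zope", "url:org", "pills"]
print(drop_ignored(tokens))
```

The same filter would have to be applied during training as well as
scoring, otherwise the ignored tokens would still accumulate counts in
the database.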

OTOH, most Zope mailing lists will soon have a members-only posting
policy, so fighting Zope mailing list spam may soon be yesterday's
war.