From skip at pobox.com  Tue Aug  1 12:51:00 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 1 Aug 2006 05:51:00 -0500
Subject: [spambayes-dev] Trouble w/ zodb persistence of dnscache
Message-ID: <17615.12820.699706.298379@montanaro.dyndns.org>

I've been using Matt Cowles's x-lookup-ip extension with some success
recently to reveal the real IP addresses behind spammers' hostnames.  For
example, the following hostnames are mentioned in pharma come-ons:

    % host www.astlehover.com
    www.astlehover.com has address 211.144.68.87
    % host www.tornetseen.com
    www.tornetseen.com has address 211.144.68.87
    % host www.erlikuvera.com
    www.erlikuvera.com has address 211.144.68.87
    % host www.oplimazexu.com
    www.oplimazexu.com has address 211.144.68.87

The rest of the message content is pretty well disguised (very little
content, random common text boilerplate, etc), so without IP lookup they
tend to plop into my unsure mailbox.  They sometimes score low enough to
land in my regular inbox.

Matt's extension solves that by looking up the IP addresses for hosts it
encounters and generating a number of new tokens:

    % spamcounts -r :211          
    token,nspam,nham,spam prob
    url-ip:211.144.68.87/32,1,0,0.844827586207
    url-ip:211.144.68/24,1,0,0.844827586207
    url-ip:211/8,4,0,0.949438202247
    url-ip:211.20.189/24,1,0,0.844827586207
    url-ip:211.189.18/24,1,0,0.844827586207
    url-ip:211.144/16,1,0,0.844827586207
    received:211.95.72.130,1,0,0.844827586207
    url-ip:211.189.18.186/32,1,0,0.844827586207
    url-ip:211.22.166.116/32,1,0,0.844827586207
    received:211.96,1,0,0.844827586207
    received:211.95,1,0,0.844827586207
    url-ip:211.22.166/24,1,0,0.844827586207
    received:211.95.72,1,0,0.844827586207
    url-ip:211.20/16,1,0,0.844827586207
    url-ip:211.20.189.50/32,1,0,0.844827586207
    received:211.96.42,1,0,0.844827586207
    url-ip:211.22/16,1,0,0.844827586207
    received:211,2,0,0.908163265306
    received:211.96.42.103,1,0,0.844827586207
    url-ip:211.189/16,1,0,0.844827586207
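
The url-ip tokens in the listing above appear to follow a simple pattern: the full address is emitted as a /32, then each shorter dotted prefix as /24, /16 and /8. A minimal sketch of that pattern follows. The function name url_ip_tokens is hypothetical, the DNS lookup itself is left out, and this is a reconstruction from the listing, not Matt's actual code.

```python
def url_ip_tokens(ip):
    """Build url-ip tokens for a dotted-quad IPv4 address, widest prefix last."""
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % (ip,))
    # Keep 4, 3, 2, then 1 leading octets; each octet contributes 8 prefix bits.
    return ["url-ip:%s/%d" % (".".join(octets[:n]), 8 * n) for n in (4, 3, 2, 1)]


if __name__ == "__main__":
    for token in url_ip_tokens("211.144.68.87"):
        print(token)
```

Because every throwaway hostname in the example resolves to the same address, all four messages would share these four tokens even though their URLs differ.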

Unfortunately it doesn't cache IP addresses across sessions.  Because my
train-to-exhaustion scheme scores my entire training database, the first
round of scoring has to look up every host again and is very time-consuming.

I decided to solve that shortcoming.  I added "dbm" and "zodb" support to
Matt's dnscache module, since those are probably the two most prevalent
storage schemes (default and emeritus default).  I've been testing the zodb
scheme but having trouble with it.  If I start with no ~/.dnscache* files it
correctly creates a new one.  If I have an existing database already, it
doesn't update the database file, though the timestamps on the .index and
.tmp files are updated.

I asked on zodb-dev and got some partial help (I was relying on __del__ to
close() the FileStorage object), but even with that fixed it's not working
properly.  My recent pleas for help have gone unanswered, so I'm turning to
this list.  My zodb code was cribbed from the support in SpamBayes itself,
so maybe the author of that code will see what I've done wrong.

I set up the cache in tokenizer.py like so:

    try:
        import dnscache
        cache = dnscache.cache(cachefile=os.path.expanduser("~/.dnscache"))
        cache.printStatsAtEnd = True
    except (IOError, ImportError):
        cache = None
    else:
        import atexit
        atexit.register(cache.close)

In the cache class's __init__ I open the cachefile if given:

    if cachefile:
      self.open_cachefile(cachefile)
    else:
      self.caches={ "A": {}, "PTR": {} }

    def open_cachefile(self, cachefile):
      filetype = options["Storage", "persistent_use_database"]
      cachefile = os.path.expanduser(cachefile)
      if filetype == "dbm":
        if os.path.exists(cachefile):
          self.caches=shelve.open(cachefile)
        else:
          self.caches=shelve.open(cachefile)
          self.caches["A"] = {}
          self.caches["PTR"] = {}
      elif filetype == "zodb":
        from ZODB import DB
        from ZODB.FileStorage import FileStorage
        self._zodb_storage = FileStorage(cachefile, read_only=False)
        self._DB = DB(self._zodb_storage, cache_size=10000)
        self._conn = self._DB.open()
        root = self._conn.root()
        self.caches = root.get("dnscache")
        if self.caches is None:
          # There is no cache, so create one.
          from BTrees.OOBTree import OOBTree
          self.caches = root["dnscache"] = OOBTree()
          self.caches["A"] = {}
          self.caches["PTR"] = {}
          print "opened new cache"
        else:
          print "opened existing cache with", len(self.caches["A"]), "A records",
          print "and", len(self.caches["PTR"]), "PTR records"

and when it's closed, this code executes:

    def close(self):
      filetype = options["Storage", "persistent_use_database"]
      if filetype == "dbm":
        self.caches.close()
      elif filetype == "zodb":
        self._zodb_close()

    def _zodb_store(self):
        import transaction
        from ZODB.POSException import ConflictError
        from ZODB.POSException import TransactionFailedError

        try:
            transaction.commit()
        except ConflictError, msg:
            # We'll save it next time, or on close.  It'll be lost if we
            # hard-crash, but that's unlikely, and not a particularly big
            # deal.
            if options["globals", "verbose"]:
                print >> sys.stderr, "Conflict on commit.", msg
            transaction.abort()
        except TransactionFailedError, msg:
            # Saving isn't working.  Try to abort, but chances are that
            # restarting is needed.
            if options["globals", "verbose"]:
              print >> sys.stderr, "Store failed.  Need to restart.", msg
            transaction.abort()

    def _zodb_close(self):
        # Ensure that the db is saved before closing.  Alternatively, we
        # could abort any waiting transaction.  We need to do *something*
        # with it, though, or it will be still around after the db is
        # closed and cause problems.  For now, saving seems to make sense
        # (and we can always add abort methods if they are ever needed).
        self._zodb_store()

        # Do the closing.        
        self._DB.close()

        # We don't make any use of the 'undo' capabilities of the
        # FileStorage at the moment, so might as well pack the database
        # each time it is closed, to save as much disk space as possible.
        # Pack it up to where it was 'yesterday'.
        # XXX What is the 'referencesf' parameter for pack()?  It doesn't
        # XXX seem to do anything according to the source.
        ## self._zodb_storage.pack(time.time()-60*60*24, None)
        self._zodb_storage.close()

        self._zodb_closed = True
        if options["globals", "verbose"]:
            print >> sys.stderr, 'Closed dnscache database'
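
One thing worth checking in the dbm branch of open_cachefile above: with shelve's default writeback=False, each read of a top-level key returns a fresh copy, so in-place changes to a nested dict such as caches["A"] are not written to disk. The sketch below demonstrates this with a hypothetical scratch file and made-up record values. The helper names roundtrip and store are illustrative only. The zodb branch also stores plain dicts inside the OOBTree, so a similar effect may be at work there, but that is an assumption and has not been verified against this code.

```python
import os
import shelve
import tempfile


def roundtrip(writeback):
    """Store one A record in a nested dict, reopen the shelf, return what survived."""
    # Hypothetical scratch file; the real cache would live at ~/.dnscache.
    path = os.path.join(tempfile.mkdtemp(), "dnscache-demo")

    db = shelve.open(path, writeback=writeback)
    db["A"] = {}
    # Without writeback, db["A"] hands back a freshly unpickled copy, so this
    # assignment changes only the copy and never reaches the file.
    db["A"]["www.example.com"] = "192.0.2.1"
    db.close()

    db = shelve.open(path)
    survived = dict(db["A"])
    db.close()
    return survived


def store(db, rrtype, name, value):
    """Works with the default writeback=False: reassign the top-level key."""
    records = db[rrtype]
    records[name] = value
    db[rrtype] = records


if __name__ == "__main__":
    print("writeback=False:", roundtrip(False))
    print("writeback=True: ", roundtrip(True))
```

Opening the shelf with writeback=True costs memory, since every accessed entry is cached until close(); reassigning the top-level key as in store() avoids that cost.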

When run, it correctly announces that it's either creating a new cache or
that it opened an existing cache, e.g.:

    opened existing cache with 479 A records and 0 PTR records

No errors appear on stdout or stderr during the run.  At completion it
prints "Closed dnscache database".

I can see that the database isn't getting updated because a) its timestamp
doesn't change and b) running strings over the file and grepping for new
names doesn't find them:

    % # this one exists...
    % strings -a ~/.dnscache* | egrep -i timsblogger
    www.timsbloggers.comq
    % # this one is new...
    % strings -a ~/.dnscache* | egrep -i tradelink
    % # bummer...

Does anyone have any suggestions about getting this beast to work properly?

Thx,

Skip

From spambayes at masters.me.uk  Wed Aug  2 09:45:03 2006
From: spambayes at masters.me.uk (spambayes at masters.me.uk)
Date: Wed, 2 Aug 2006 08:45:03 +0100
Subject: [spambayes-dev] Spambayes is starting not to work due to
	retaliatory action by spammers
Message-ID: <000f01c6b607$9161a640$1202a8c0@trump>

Dear Spambayes developers,

I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
excellent.  However, over the last couple of months, it has become
compromised by a particular type of spam that I believe, over time, will
render Spambayes much less effective unless something is done.

I expect you've seen these Spams - at the moment, they are always the
stock-market related ones but I'm sure once others catch on, they will start
to use the same technique.  The start of the email is a picture that looks
like ordinary text but isn't.  All the spam info is in the text.  The
picture is followed by a whole load of randomly selected words.

There are 2 bad things about this:

1.  These spams are successfully evading Spambayes in some cases.  Firstly
the Spam usually reaches the "possible Spam" folder.  As a result, I am now
spending significant time clearing out the possible spam folder whereas 2 or
3 months ago I wasn't.   Secondly, the odd spam is actually managing to get
through as ham.  This is the first time this has happened ever.

2.  Because I obviously mark these as Spam, all the randomly generated words
in each spam email have their spam likelihood scores increased.  The result
of this is that over time, the spam-scores for loads of perfectly
non-spam-like words are being gradually increased.  The more this goes on,
the more these "ham words" are being compromised.  I suspect that this is
why, to begin with, I only saw a few of these stock market emails, now I'm
seeing loads and over the last 2 or 3 weeks some have started to come in as
ham.  I fear that the long term effect of this will be to spoil spambayes
bigtime.

I know that Spambayes has a deep-rooted principle in only using the bayesian
algorithm and I wouldn't suggest changing that.  However, I am wondering if
it might be possible to analyse these messages and include some parts of the
hidden text relating to the picture that are not presently included in the
bayesian statistics.  My thesis is this - I rarely get pictures in my email
that are not just attachments - virtually all pictures that are embedded
into the mail seem to be spam.  So if there is some token or tag in the
email that represents the embedded picture that can be included in the
bayesian analysis, this might fix the problem.

I hope that this suggestion is useful - I certainly fear for the future of
Spambayes if this new spam threat is not dealt with....

thanks for reading,

James Masters.


From ron at ridic.com  Wed Aug  2 13:21:01 2006
From: ron at ridic.com (Ron Theis)
Date: Wed, 02 Aug 2006 04:21:01 -0700
Subject: [spambayes-dev] Correct formatting of HTTP Post for training
Message-ID: <44D08A9D.1030100@ridic.com>

 > Apparently I'm formatting the requests
 > incorrectly, because the server is returning a 500 error.

Whoops, sorry, I was missing the "text" parameter in the POST. Dumb 
diddly. It seems to be training fine now.

Ron

From ron at ridic.com  Wed Aug  2 13:05:16 2006
From: ron at ridic.com (Ron Theis)
Date: Wed, 02 Aug 2006 04:05:16 -0700
Subject: [spambayes-dev] Correct formatting of HTTP Post for training
Message-ID: <44D086EC.6070703@ridic.com>

Hi,
I'm trying to submit spam/ham via manually assembled HTTP POSTs to
SpamBayes on Windows. Apparently I'm formatting the requests
incorrectly, because the server is returning a 500 error. The error
message includes a traceback of:

File "spambayes\Dibbler.pyc", line 470, in found_terminator
TypeError: onTrain() takes exactly 4 non-keyword arguments (2 given)


Does anyone have a sample of what such a POST should look like? I
suspect I'm bungling the formatting.

Thanks,
Ron


From tim.peters at gmail.com  Thu Aug  3 09:25:32 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Thu, 3 Aug 2006 03:25:32 -0400
Subject: [spambayes-dev] Spambayes is starting not to work due to
	retaliatory action by spammers
In-Reply-To: <000f01c6b607$9161a640$1202a8c0@trump>
References: <000f01c6b607$9161a640$1202a8c0@trump>
Message-ID: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>

[spambayes at masters.me.uk]
> I've used Spambayes for 2 or 3 years (Outlook add-in) - it has been
> excellent.  However, over the last couple of months, it has become
> compromised by a particular type of spam that I believe, over time, will
> render Spambayes much less effective unless something is done.
>
> I expect you've seen these Spams - at the moment, they are always the
> stock-market related ones

I've seen a few drug spams using the same techniques, starting in July
-- but they seemed to dry up quickly.

> but I'm sure once others catch on, they will start
> to use the same technique.  The start of the email is a picture that looks
> like ordinary text but isn't.  All the spam info is in the text.  The
> picture is followed by a whole load of randomly selected words.

You're probably not getting any reaction here because exactly the same
thing is currently being discussed on the SpamBayes "user" mailing
list, in this thread:

    Spam in Images
    http://mail.python.org/pipermail/spambayes/2006-August/date.html

> There are 2 bad things about this:
>
> 1.  These spams are successfully evading Spambayes in some cases.  Firstly
> the Spam usually reaches the "possible Spam" folder.  As a result, I am now
> spending significant time clearing out the possible spam folder whereas 2 or
> 3 months ago I wasn't.

Same here, except the time isn't significant.  If you don't believe
me, stop using SpamBayes for a week to rediscover what "significant"
means ;-)

>  Secondly, the odd spam is actually managing to get through as ham.  This
> is the first time this has happened ever.

Not here -- they're very good at scoring Unsure, but haven't seen any
false negatives yet.

> 2.  Because I obviously mark these as Spam, all the randomly generated words
> in each spam email have their spam likelihood scores increased.  The result
> of this is that over time, the spam-scores for loads of perfectly
> non-spam-like words are being gradually increased.  The more this goes on,
> the more these "ham words" are being compromised.

I certainly haven't seen any ham pushed into "unsure" because of this,
and doubt it matters -- it generally doesn't hurt at all to have any
number of "ham words" show up in a few spam.  One of the
characteristics of the spam you're talking about that /makes/ it
effective is that it's very good at /not/ repeating gibberish phrases
across messages.  That's exactly why training on the gibberish is
ineffective at catching future messages of the same ilk.  But, OTOH,
the non-repetition also prevents it from "poisoning" your strong ham
tokens.  They get slightly less hammy, and that doesn't hurt because
most ham is nowhere near the unsure range.

> I suspect that this is why, to begin with, I only saw a few of these stock market
> emails, now I'm seeing loads

The only reason you see loads of any kind of spam is that it's making
a profit for the sender.  Pump-&-dump scams violate major securities
laws, and it's quite possible these scammers will quit before getting
too greedy (= getting caught).

> and over the last 2 or 3 weeks some have started to come in as ham.

While I haven't seen that, it's inconsistent with your explanation
above:  if your "ham tokens" /were/ being compromised, that makes it
/less/ likely that a message containing your ham tokens will be scored
as ham, not more likely.

A more likely explanation is simply "loads":  gibberish does have a
real chance of scoring as ham, and the more attempts are made, the
more likely one will succeed.  What they can't do is craft a message
that scores as ham for all users, or even for most.
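
The "loads" argument can be made concrete with a little arithmetic. If each gibberish message independently has some small chance p of scoring as ham, the chance that at least one of n messages gets through is 1 - (1 - p)**n. The sketch below assumes independence, and the value of p is purely illustrative, not a measurement.

```python
def chance_of_at_least_one(p, attempts):
    """Probability that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p) ** attempts


if __name__ == "__main__":
    # Assumed per-message chance of a false negative; illustrative only.
    p = 0.001
    for n in (1, 10, 100, 1000):
        print(n, round(chance_of_at_least_one(p, n), 4))
```

Even with a tiny per-message chance, the overall chance climbs steadily as the volume grows, without any token being "compromised".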

> I fear that the long term effect of this will be to spoil spambayes bigtime.

Possibly.  People have panicked prematurely before ;-)

> I know that Spambayes has a deep-rooted principle in only using the bayesian
> algorithm and I wouldn't suggest changing that.  However, I am wondering if
> it might be possible to analyse these messages and include some parts of the
> hidden text relating to the picture that are not presently included in the
> bayesian statistics.

See the thread above.  Nobody knows a realistic way to extract the
text from these images (there is no "text" here -- just a large matrix
of individual pixels, something the human eye/brain system is very
much better at decoding than programs).  OTOH, the images themselves
probably have many statistical characteristics not shared with
"legitimate" images, and those can be computed/extracted with finite
effort.

> My thesis is this - I rarely get pictures in my email that are not just attachments -
> virtually all pictures that are embedded into the mail seem to be spam.

Of course that varies.  For example, it's very easy to create embedded
pictures in Outlook, and even small children know how to do it.
Worse, their grandparents are required by law to consider such email
"ham" :-)

> So if there is some token or tag in the email that represents the embedded picture
> that can be included in the bayesian analysis, this might fix the problem.

This is harder in Outlook because Outlook destroys the original MIME
structure of the email before SpamBayes sees it.  There are already
several such tokens generated when the original MIME structure is
available.  In Outlook, it's most likely you'll get the single
synthesized token:

    virus:src="cid:

or a simple variation on that, and that's all that remains of the
embedded GIF.  A single token helps a bit, but not enough.  Do note
that pump-&-dump scams don't even contain a URL to click on:  they
want you to buy the stock on the open market, not send them money
directly.  That also makes it a unique (and uniquely effective) kind
of spam:  the pitch is /entirely/ buried in the GIF, with no useful
text (not even a URL) of any kind to tokenize.

> I hope that this suggestion is useful - I certainly fear for the future of
> Spambayes if this new spam threat is not dealt with....

Don't assume that most spammers are capable of becoming competent :-)

From skip at pobox.com  Fri Aug  4 17:20:36 2006
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 4 Aug 2006 10:20:36 -0500
Subject: [spambayes-dev] Maybe a little OCR would help...
Message-ID: <17619.26052.556242.798290@montanaro.dyndns.org>


This is just one simple little test...

I took two pump & dump messages for HLVK that I received overnight.  The GIF
image is actually sliced into pieces horizontally, so I wrote a little shell
script to convert the pieces to netpbm and concatenate them.  I then sent the
result through ocrad, sorted, uniq'd and downshifted the whole mess, and
checked for words the two messages had in common.  I came up with:

    _
    __
    and
    co
    company
    hlv
    hlvc
    lnc.
    low
    new
    news
    nlv
    now!
    now!!!
    on
    the
    tnis
    wl_
    |_

While that is not a huge increase in the number of tokens and some aren't
going to help, it's still better than what we have today.  Time will tell if
the cost is worth it.  Perhaps if we generate some further interest in ocrad
it will improve as well.

Skip

From james at masters.me.uk  Sat Aug  5 23:16:23 2006
From: james at masters.me.uk (James Masters)
Date: Sat, 5 Aug 2006 22:16:23 +0100
Subject: [spambayes-dev] Spambayes is starting not to work due to
	retaliatory action by spammers
In-Reply-To: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>
Message-ID: <002701c6b8d4$68031d90$1202a8c0@trump>

Dear Tim,

Thank you very much for your comprehensive reply and apologies to the group
for putting my email to the wrong place.  If I have anything more to write,
I'll put it in the forum you mention.

thanks,

James.



From skip at pobox.com  Sun Aug  6 19:25:47 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 6 Aug 2006 12:25:47 -0500
Subject: [spambayes-dev] Several new tokenizing gimmicks checked in
Message-ID: <17622.9755.800465.16215@montanaro.dyndns.org>

With the current crop of pump & dump spams I decided to break down and
actually see if ocrad (http://www.gnu.org/software/ocrad/ocrad.html) would
help.  It does a miserable job from a readability standpoint at extracting
text from an image, but SpamBayes seems to love what it does generate.  This
morning I thought, "what the hell", and checked in all the current new
tricks I've been working on/with:

    * IP address lookup and more extensive tokenization.  This is from Matt
      Cowles.  I added persistence beyond the current run.  Unfortunately,
      the dbm persistence is untested (though should probably work okay)
      while the zodb persistence still has problems (writes the file the
      first time, but doesn't update it on successive runs).  Maybe someone
      can look at those issues.  This seems to work very well for those
      spams where the only useful clue is a URL, but with a domain name that
      changes each time.  They seem to pretty much all point to the same IP
      address as far as I can tell.  Enabled using the x-lookup_ip and
      lookup_ip_cache options.  Requires installation of PyDNS.

    * Note image size.  This was my first stab at trying to get some
      information out of an image.  Seems to work pretty well.  Enabled
      using the x-image_size option.

    * Note short runs of too-short words.  Text spammers (as opposed to
      image spammers) seem to like to use this technique:

          X j A m N j A d X h
          M k E z R d I p D u I m A c
          C o I d A t L j I v S j
 
      to hide their tokens from spam filters.  Enabled using the
      x-short_runs option.  Based on my current database I'm skeptical this
      will add much over what else we already have.

    * Try OCR on images.  The latest technique we've all encountered seems
      to be the pump and dump stock scams where the entire come-on is
      embedded in one or more GIF images.  I wrote a small ImageStripper
      module which handles these.  It grabs the image parts, converts them
      to netpbm format, concatenates them left-to-right, then submits the
      result to ocrad.  This is just a proof-of-concept.  It requires ocrad
      and netpbm to be available.  As such I suspect it will only run
      currently on Unix-like systems.  Enabled using the x-crack_images and
      max_image_size options.
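
The short-runs idea in the list above can be sketched as follows. This is only an illustration of the technique: the thresholds, the function names and the "short-run:N" token format are assumptions, not what the x-short_runs option actually generates.

```python
def short_run_lengths(text, max_word_len=2, min_run=4):
    """Return the lengths of runs of consecutive very short words."""
    runs = []
    current = 0
    for word in text.split():
        if len(word) <= max_word_len:
            current += 1
        else:
            # A longer word ends the run; keep it only if it was long enough.
            if current >= min_run:
                runs.append(current)
            current = 0
    if current >= min_run:
        runs.append(current)
    return runs


def short_run_tokens(text):
    """Turn run lengths into tokens (hypothetical token format)."""
    return sorted({"short-run:%d" % n for n in short_run_lengths(text)})


if __name__ == "__main__":
    print(short_run_tokens("X j A m N j A d X h"))
    print(short_run_tokens("an ordinary sentence with normal words"))
```

Ordinary prose rarely strings four or more one- or two-letter words together, so a minimum run length keeps these tokens out of most ham.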

I added these extensions using multiple checkins, so if we decide to back
one or more of them out it shouldn't be a major PITA.

Skip

From skip at pobox.com  Mon Aug  7 00:50:44 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 6 Aug 2006 17:50:44 -0500
Subject: [spambayes-dev] Some test results
Message-ID: <17622.29252.971244.847129@montanaro.dyndns.org>


I put together some test databases today using spam received in the past
week or so (about 1800 messages) and a reasonable cross-section of my ham
(all saved python-related mail plus my regular non-specific mailbox, about
2300 messages) and did some 5x5 cross-validation tests (that's the correct
term, right?).  For the control test I set all these options False:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

but otherwise used my standard configuration.  I then made four runs,
setting one option True for each run, then compared each test with the
control run.  The results are summarized briefly below.

    control v. x-lookup_ip
    ----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.412  tied          
        4.533  4.533  tied          
        4.222  4.222  tied          

    won   0 times
    tied  5 times
    lost  0 times

    control v. x-short_runs
    -----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.412  tied          
        4.533  4.533  tied          
        4.222  4.222  tied          

    won   0 times
    tied  5 times
    lost  0 times

    control v. x-image_size
    -----------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.434  lost  +100.00%
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  4 times
    lost  1 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.118  won     -6.66%
        4.533  4.533  tied          
        4.222  3.958  won     -6.25%

    won   2 times
    tied  3 times
    lost  0 times

    control v. x-crack_images
    -------------------------

    false positive percentages
        0.000  0.000  tied          
        0.217  0.217  tied          
        0.000  0.000  tied          
        0.219  0.219  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    ...

    false negative percentages
        4.199  4.199  tied          
        1.404  1.404  tied          
        4.412  4.118  won     -6.66%
        4.533  3.966  won    -12.51%
        4.222  3.430  won    -18.76%

    won   3 times
    tied  2 times
    lost  0 times

I didn't do anything to verify the accuracy of my spam and ham data.  I'm
doing that now.  Also, the fact that the first two tests were identical to
the control seems a bit suspicious, so I'm going to try them again after
picking over my training database.  Still, the image_size and crack_images
runs look promising, perhaps because my recent spam is so full of these pump
and dump spams.

Skip

From dave at boost-consulting.com  Mon Aug  7 19:15:00 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Mon, 07 Aug 2006 13:15:00 -0400
Subject: [spambayes-dev] sb_imapfilter: bad FETCH response
References: <87wtgg4goi.fsf@boost-consulting.com>
Message-ID: <uy7u0sc17.fsf@boost-consulting.com>

A non-text attachment was scrubbed...
Name: sb_imapfilter.diff
Type: text/x-patch
Size: 1049 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060807/d747a971/attachment.bin 

From tim.peters at gmail.com  Tue Aug  8 04:47:00 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Mon, 7 Aug 2006 22:47:00 -0400
Subject: [spambayes-dev] Spambayes is starting not to work due to
	retaliatory action by spammers
In-Reply-To: <002701c6b8d4$68031d90$1202a8c0@trump>
References: <1f7befae0608030025i3ce746eatbd264fd5ad094725@mail.gmail.com>
	<002701c6b8d4$68031d90$1202a8c0@trump>
Message-ID: <1f7befae0608071947t24c28cafwc5892d335ce8f8af@mail.gmail.com>

[James Masters]
> Thank you very much for your comprehensive reply and apologies to the group
> for putting my email to the wrong place.

No apology necessary.  I didn't even intend to imply you were posting
in a wrong place, just pointing out that the same topic just happened
to be actively discussed elsewhere.  Since SpamBayes in fact does a
much poorer job on image-based spam than on "traditional" spam, and
that's A Problem for both users and developers, discussing it on both
the user and developer lists is thoroughly appropriate.

> If I have anything more to write, I'll put it in the forum you mention.

But only if it's appropriate there, else we'll have to ask you to apologize ;-)

From skip at pobox.com  Tue Aug  8 06:10:15 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 7 Aug 2006 23:10:15 -0500
Subject: [spambayes-dev] Updated test results
Message-ID: <17624.3751.511224.861381@montanaro.dyndns.org>


I picked through my new training database, found one or two outright
mistakes, deleted a few other administrative mails, fixed a few bugs in my
recent checkins and rebalanced my database.  I then made a baseline run with
the following settings:

    [globals]
    verbose: True

    [Headers]
    include_evidence: True

    [Tokenizer]
    record_header_absence: True
    summarize_email_prefixes: True
    summarize_email_suffixes: True
    mine_received_headers:True
    x-pick_apart_urls:True
    x-fancy_url_recognition:False
    x-lookup_ip:False
    lookup_ip_cache:~/src/spambayes/ip.pickle
    x-short_runs:False
    x-image_size:False
    x-crack_images:False
    x-max_image_size:100000

    [Categorization]
    ham_cutoff: 0.15
    spam_cutoff: 0.50

    [Storage]
    persistent_storage_file: ~/src/spambayes/test.pickle
    persistent_use_database: pickle

followed by a series of test runs, each one with one of the following
options set to True:

    x-lookup_ip
    x-short_runs
    x-image_size
    x-crack_images

All tests were run against the same combination of ham and spam:

    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams
    -> <stat> tested 459 hams & 359 spams against 1836 hams & 1436 spams

baseline vs. x-lookup_ip:

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.218  0.218  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    false negative percentages
        2.228  1.671  won    -25.00%
        3.343  3.064  won     -8.35%
        5.292  4.735  won    -10.53%
        4.735  4.457  won     -5.87%
        2.786  2.507  won    -10.01%

    won   5 times
    tied  0 times
    lost  0 times

baseline vs. x-short_runs:

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.218  0.218  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    false negative percentages
        2.228  2.228  tied          
        3.343  3.343  tied          
        5.292  5.292  tied          
        4.735  4.735  tied          
        2.786  2.786  tied          

    won   0 times
    tied  5 times
    lost  0 times

baseline vs. x-image_size:

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.218  0.218  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    false negative percentages
        2.228  1.950  won    -12.48%
        3.343  3.343  tied          
        5.292  5.014  won     -5.25%
        4.735  4.457  won     -5.87%
        2.786  2.786  tied          

    won   3 times
    tied  2 times
    lost  0 times

baseline vs. x-crack_images:

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.218  0.218  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    false negative percentages
        2.228  1.671  won    -25.00%
        3.343  3.064  won     -8.35%
        5.292  4.457  won    -15.78%
        4.735  4.457  won     -5.87%
        2.786  2.786  tied          

    won   4 times
    tied  1 times
    lost  0 times

Based on the mixture of ham and spam I have it would appear only the
x-short_runs option doesn't help discriminate ham from spam.
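For anyone reading the tables above, the percent column appears to be the
relative change from the baseline rate to the test rate.  The following is
only a sketch of that arithmetic, not the comparison script itself; the
function names are made up for illustration, and the handling of a zero
baseline is an assumption.

```python
def relative_change(before, after):
    """Percent change from the baseline rate to the test rate."""
    if before == 0:
        # Assumption: a 0 -> 0 comparison counts as no change.
        return 0.0 if after == 0 else float("inf")
    return (after - before) / before * 100.0


def verdict(before, after):
    """Lower error rates win, equal rates tie."""
    if after < before:
        return "won"
    if after > before:
        return "lost"
    return "tied"


# False negative rates from the baseline vs. x-lookup_ip run above.
folds = [(2.228, 1.671), (3.343, 3.064), (5.292, 4.735),
         (4.735, 4.457), (2.786, 2.507)]

for before, after in folds:
    print("%7.3f %7.3f  %-5s %+7.2f%%"
          % (before, after, verdict(before, after),
             relative_change(before, after)))
```

Because the inputs are already rounded to three decimals, the last digit
of a recomputed percentage can differ slightly from what the real script
prints from unrounded counts.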

Skip


From matt at mondoinfo.com  Tue Aug  8 21:37:02 2006
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Tue, 8 Aug 2006 14:37:02 -0500 (CDT)
Subject: [spambayes-dev] Updated test results
In-Reply-To: <17624.3751.511224.861381@montanaro.dyndns.org>
References: <17624.3751.511224.861381@montanaro.dyndns.org>
Message-ID: <1155050781.71.19994@mint-julep.mondoinfo.com>

> baseline vs. x-lookup_ip:

[. . .]

>     false negative percentages
>         2.228  1.671  won    -25.00%
>         3.343  3.064  won     -8.35%
>         5.292  4.735  won    -10.53%
>         4.735  4.457  won     -5.87%
>         2.786  2.507  won    -10.01%
> 
>     won   5 times
>     tied  0 times
>     lost  0 times

I'm glad to see that. That's the sort of improvement that I see with
that code, but I think it's the first time that anyone else has
reproduced it.

Still, as people have pointed out before, there's at least one
potential problem in the code. That's that data from DNS isn't
necessarily stable. If someone needed to un-train their database on a
message a day or two later, the tokens generated might easily not be
the same as they were when the message was first trained on. That
could send a token's count below zero.

That doesn't affect me in practice, but it would surely affect
someone if the code were used widely. Fixing it in general would
require some rather elaborate persistence mechanism, I think.
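The failure mode described above can be sketched with a toy counter.
This is not the SpamBayes classifier, just an illustration; the helper
names are hypothetical, and the second address simply stands in for
"whatever DNS returns a few days later".

```python
from collections import Counter


def ip_tokens(ip):
    """Build url-ip tokens in the /8, /16, /24, /32 style shown by spamcounts."""
    parts = ip.split(".")
    return ["url-ip:%s/%d" % (".".join(parts[:n]), 8 * n)
            for n in range(1, 5)]


spam_counts = Counter()


def train(tokens):
    for token in tokens:
        spam_counts[token] += 1


def untrain(tokens):
    for token in tokens:
        spam_counts[token] -= 1


# At training time the hostname resolves to one address...
train(ip_tokens("211.144.68.87"))

# ...but by the time the message is untrained, DNS returns another one
# (assumed here), so different tokens are generated for the same message.
untrain(ip_tokens("211.20.189.50"))

negative = sorted(t for t, n in spam_counts.items() if n < 0)
stale = sorted(t for t, n in spam_counts.items() if n > 0)
print("below zero:", negative)
print("never untrained:", stale)
```

Only the shared url-ip:211/8 token returns to zero.  The tokens from the
old address keep their counts, and the tokens from the new address go
negative, which is why some record of the tokens actually used at
training time would be needed to fix this in general.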

Regards,
Matt


From skip at pobox.com  Tue Aug  8 22:12:09 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 8 Aug 2006 15:12:09 -0500
Subject: [spambayes-dev] Updated test results
In-Reply-To: <1155050781.71.19994@mint-julep.mondoinfo.com>
References: <17624.3751.511224.861381@montanaro.dyndns.org>
	<1155050781.71.19994@mint-julep.mondoinfo.com>
Message-ID: <17624.61465.232582.676437@montanaro.dyndns.org>


    Matt> Still, as people have pointed out before, there's at least one
    Matt> potential problem in the code. That's that data from DNS isn't
    Matt> necessarily stable....

    Matt> That doesn't affect me in practice, but it would surely affect
    Matt> someone if the code were used widely. Fixing it in general would
    Matt> require some rather elaborate persistence mechanism, I think.

Or simply retraining from scratch after deleting your cache.  Speaking of
which, I gave up on persistence via the dbm or zodb routes.  Instead I just
save/restore the cache using pickle.  I'll probably check that into CVS this
evening.
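A minimal sketch of the pickle save/restore idea, assuming the cache is a
plain dict mapping hostnames to addresses.  The function names and the
example entry are hypothetical and this is not the code checked into
dnscache.py; it is also written for a current Python 3 rather than the
Python of 2006.

```python
import os
import pickle
import tempfile


def load_cache(path):
    """Return the pickled cache, or an empty dict if it is missing or unreadable."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return {}


def save_cache(cache, path):
    """Write the cache to a temporary file, then move it into place."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(cache, f, pickle.HIGHEST_PROTOCOL)
    os.replace(tmp, path)


# Round trip with a made-up entry in a throwaway directory.
cache_path = os.path.join(tempfile.mkdtemp(), "ip.pickle")
save_cache({"www.example.com": ["192.0.2.1"]}, cache_path)
restored = load_cache(cache_path)
missing = load_cache(cache_path + ".does-not-exist")
```

Deleting the file and retraining from scratch then just means starting
over with the empty dict that load_cache() returns.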

Skip

From dave at boost-consulting.com  Wed Aug  9 02:28:58 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Tue, 08 Aug 2006 20:28:58 -0400
Subject: [spambayes-dev] Is IMAP supported?
Message-ID: <uvep2kb05.fsf@boost-consulting.com>


Hi,

I made a bug report in January, and recently followed up with a
partial diagnosis, but have received no reply to either one.  Is
sb_imapfilter still supported?  Is there someone I should contact
directly about this problem?  I'd like to be able to make an informed
decision about what to do about it next...

Thanks in advance,

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Wed Aug  9 17:55:41 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 9 Aug 2006 10:55:41 -0500
Subject: [spambayes-dev] [Spambayes] Posting problems
In-Reply-To: <uoduthq9o.fsf@boost-consulting.com>
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
Message-ID: <17626.1405.993809.368716@montanaro.dyndns.org>

>>>>> "Dave" == David Abrahams <dave at boost-consulting.com> writes:

    Dave> David Abrahams <dave at boost-consulting.com> writes:
    >> I've posted several messages to this list through GMane, and wondered
    >> why nobody answered them.  Well as it turned out, I wasn't subscribed,
    >> and you don't seem to be accepting posts from nonsubscribers (totally
    >> understandable).  But I got no clue that subscription was needed, and
    >> GMane shows my posts anyway:
    >> 
    >> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613
    >> 
    >> They just don't show up in your email archive.  Furthermore, for a
    >> little extra weirdness, I've posted successfully here before:
    >> 
    >> http://mail.python.org/pipermail/spambayes/2003-December/author.html
    >> 
    >> I dunno what's going on here.

    Dave> Um, I guess I was confusing the -dev list (on which nobody is
    Dave> answering me) with this one.  Sorry for the noise.

    Dave> But if someone could get back to me on the -dev list I'd really
    Dave> appreciate it!  Even just an ACK would be useful at this point!

I saw no pending moderator requests on either spambayes or spambayes-dev.  I
saw nothing in the Mailman config for spambayes-dev that would prevent you
from posting.  Perhaps GMane isn't actually posting the messages.

Skip

From dave at boost-consulting.com  Wed Aug  9 18:06:53 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Wed, 09 Aug 2006 12:06:53 -0400
Subject: [spambayes-dev] [Spambayes] Posting problems
In-Reply-To: <17626.1405.993809.368716@montanaro.dyndns.org> (skip@pobox.com's
	message of "Wed, 9 Aug 2006 10:55:41 -0500")
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
Message-ID: <uejvphp0i.fsf@boost-consulting.com>

skip at pobox.com writes:

>>>>>> "Dave" == David Abrahams <dave at boost-consulting.com> writes:
>
>     Dave> David Abrahams <dave at boost-consulting.com> writes:
>     >> I've posted several messages to this list through GMane, and wondered
>     >> why nobody answered them.  Well as it turned out, I wasn't subscribed,
>     >> and you don't seem to be accepting posts from nonsubscribers (totally
>     >> understandable).  But I got no clue that subscription was needed, and
>     >> GMane shows my posts anyway:
>     >> 
>     >> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613
>     >> 
>     >> They just don't show up in your email archive.  Furthermore, for a
>     >> little extra weirdness, I've posted successfully here before:
>     >> 
>     >> http://mail.python.org/pipermail/spambayes/2003-December/author.html
>     >> 
>     >> I dunno what's going on here.
>
>     Dave> Um, I guess I was confusing the -dev list (on which nobody is
>     Dave> answering me) with this one.  Sorry for the noise.
>
>     Dave> But if someone could get back to me on the -dev list I'd really
>     Dave> appreciate it!  Even just an ACK would be useful at this point!
>
> I saw no pending moderator requests on either spambayes or spambayes-dev.  I
> saw nothing in the Mailman config for spambayes-dev that would prevent you
> from posting.  Perhaps GMane isn't actually posting the messages.

No, it is posting the messages:
http://mail.python.org/pipermail/spambayes-dev/2006-August/003701.html

But the archive, at least, seems to have scrubbed out all the content
along with the patch.

The original message in the thread, however, does appear:
http://mail.python.org/pipermail/spambayes-dev/2006-January/003616.html

Again, you can see what I actually posted at
http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613
-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com

From skip at pobox.com  Wed Aug  9 19:53:38 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 9 Aug 2006 12:53:38 -0500
Subject: [spambayes-dev] [Spambayes] Posting problems
In-Reply-To: <uejvphp0i.fsf@boost-consulting.com>
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
Message-ID: <17626.8482.159339.985701@montanaro.dyndns.org>


    Dave> No, it is posting the messages:
    Dave> http://mail.python.org/pipermail/spambayes-dev/2006-August/003701.html

    Dave> But the archive, at least, seems to have scrubbed out all the
    Dave> content along with the patch.

Looking at the gmane version of the message, I do remember seeing it, so
it's clearly getting to the list.  The attachment is here:

    http://mail.python.org/pipermail/spambayes-dev/attachments/20060807/d747a971/attachment.bin

though as you indicated the message body seems to have been vaporized.

    Dave> Again, you can see what I actually posted at
    Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613

My best guess is that it's a pipermail bug.

Skip

From dave at boost-consulting.com  Wed Aug  9 20:32:57 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Wed, 09 Aug 2006 14:32:57 -0400
Subject: [spambayes-dev] [Spambayes] Posting problems
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
Message-ID: <upsf9g3om.fsf@boost-consulting.com>

skip at pobox.com writes:

>     Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613
>
> Looking at the gmane version of the message, I do remember seeing it, so
> it's clearly getting to the list.  

Any idea why I'm not getting an answer?

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Wed Aug  9 20:49:56 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 9 Aug 2006 13:49:56 -0500
Subject: [spambayes-dev] [Spambayes] Posting problems
In-Reply-To: <upsf9g3om.fsf@boost-consulting.com>
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
Message-ID: <17626.11860.614678.323143@montanaro.dyndns.org>


    Dave> skip at pobox.com writes:
    Dave> http://thread.gmane.org/gmane.mail.spam.spambayes.devel/3613/focus=3613
    >> 
    >> Looking at the gmane version of the message, I do remember seeing it, so
    >> it's clearly getting to the list.  

    Dave> Any idea why I'm not getting an answer?

Lack of round tuits perhaps?

Skip

From tim.peters at gmail.com  Wed Aug  9 21:34:37 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Wed, 9 Aug 2006 15:34:37 -0400
Subject: [spambayes-dev] [Spambayes] Posting problems
In-Reply-To: <upsf9g3om.fsf@boost-consulting.com>
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
Message-ID: <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>

[David Abrahams]
> Any idea why I'm not getting an answer?

Perhaps because you've become the leading expert on sb_imapfilter ;-)

It might help to drop the meta-discussion and start over with what the
problem is.

From dave at boost-consulting.com  Wed Aug  9 21:54:38 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Wed, 09 Aug 2006 15:54:38 -0400
Subject: [spambayes-dev] [Spambayes] Posting problems
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
	<1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
Message-ID: <u4pwlfzwh.fsf@boost-consulting.com>

"Tim Peters" <tim.peters at gmail.com>
writes:

> [David Abrahams]
>> Any idea why I'm not getting an answer?
>
> Perhaps because you've become the leading expert on sb_imapfilter ;-)

That's what I was afraid of.

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From dave at boost-consulting.com  Wed Aug  9 22:04:18 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Wed, 09 Aug 2006 16:04:18 -0400
Subject: [spambayes-dev] sb_imapfilter: problem parsing result of FETCH
	(was: Posting problems)
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
	<1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
Message-ID: <uy7txekvx.fsf_-_@boost-consulting.com>

A non-text attachment was scrubbed...
Name: sb_imapfilter.diff
Type: text/x-patch
Size: 1049 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060809/040a726c/attachment.bin 

From skip at pobox.com  Thu Aug 10 07:00:04 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 10 Aug 2006 00:00:04 -0500
Subject: [spambayes-dev] Latest image spam/OCR update
Message-ID: <17626.48468.837805.425784@montanaro.dyndns.org>


I just checked in a couple significant changes to the OCR stuff.  First, I
added support for conversion of input images using PIL.  That means netpbm
is no longer required.  PIL is faster and more robust than netpbm, and is
platform-independent.  Perhaps someone in Windows-land can take the time to
see if it's possible to build ocrad on Windows.  We could then (in theory,
at least) distribute an ocrad installer alongside the SpamBayes Windows
installer and perform crude, but apparently effective, OCR analysis of
image-based spam.  The second change to the OCR code was the addition of a
simple pickled cache file (controlled by the "crack_image_cache" option).
The conversion to netpbm format is still required, however the ocrad step is
skipped if the md5 hexdigest of the generated image is present in the cache.
In this case any cached text and tokens are returned.
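
The cache lookup described above can be sketched roughly as follows.
This is an illustration under assumptions, not the checked-in code: the
cache is taken to be a dict keyed by the md5 hexdigest of the converted
image bytes, run_ocr stands in for the call out to ocrad, and the
whitespace split is a placeholder for the real tokenizing step.

```python
import hashlib


def ocr_with_cache(image_bytes, cache, run_ocr):
    """Return (text, tokens) for an image, running OCR only on a cache miss."""
    key = hashlib.md5(image_bytes).hexdigest()
    if key not in cache:
        text = run_ocr(image_bytes)
        cache[key] = (text, text.split())
    return cache[key]


# A stand-in for ocrad that records how often it is invoked.
calls = []


def fake_ocr(image_bytes):
    calls.append(len(image_bytes))
    return "buy now"


cache = {}
first = ocr_with_cache(b"fake image bytes", cache, fake_ocr)
second = ocr_with_cache(b"fake image bytes", cache, fake_ocr)
print(first, second, "OCR runs:", len(calls))
```

The second call finds the digest in the cache and never reaches the OCR
step, which is the expensive part being avoided.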

I have no Windows capability, so someone else will have to take the steps
necessary to make this all play on Windows.

There are a few other things that need testing, but I'm out of time.  First,
I arbitrarily set an upper limit of 100kbytes on input images (per image
before converting to netpbm).  I think that allows all images that would
hold spam content, but I'm not sure I have many images in my training
database besides spam.  I don't know if that's a useful cutoff or if there
should even be a cutoff.  Second, I observed that ocrad routinely seemed to
get the letter case wrong (e.g. coming up with "EGLy" instead of "EGLY"), so
I blindly downshift its output.  I have nothing other than that simple
observation to suggest that should be done.  Third, if other people have
training databases, running N-fold cross validation tests of these new
gimmicks would be beneficial.  It would be nice if others could verify my
results before a new release is made.  Finally, if you're a Python
programmer (or aspire to be one), picking through the new code would be a
good check.
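
The two arbitrary choices mentioned above (the per-image size cutoff and
the lowercasing of OCR output) amount to something like the sketch below.
The names are hypothetical, and whether the real cutoff comparison is
inclusive is an assumption.

```python
MAX_IMAGE_SIZE = 100000  # bytes per image, before conversion


def usable_images(images, max_size=MAX_IMAGE_SIZE):
    """Keep only images small enough to be worth sending to OCR."""
    return [img for img in images if len(img) <= max_size]


def normalize_ocr_text(text):
    """Fold case so that 'EGLy' and 'EGLY' yield the same tokens."""
    return text.lower()


small = b"x" * 10
large = b"x" * (MAX_IMAGE_SIZE + 1)
kept = usable_images([small, large])
print(len(kept), normalize_ocr_text("EGLy"), normalize_ocr_text("EGLY"))
```

Folding case trades away any signal carried by capitalization in return
for not splitting one word across several mis-cased variants.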

Too bad the summer's nearly over.  We could use a Summer of Code intern...

Skip

From tameyer at ihug.co.nz  Thu Aug 10 07:54:56 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu, 10 Aug 2006 17:54:56 +1200
Subject: [spambayes-dev] Posting problems
In-Reply-To: <1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
	<1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
Message-ID: <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz>

> [David Abrahams]
>> Any idea why I'm not getting an answer?

[Tim Peters]
> Perhaps because you've become the leading expert on sb_imapfilter ;-)

Or it could be because the wife of the previously leading expert on  
sb_imapfilter is due to have their first child any day now ;).  Those  
round tuits are pretty scarce here at the moment.

sb_imapfilter has always been an unloved child.  Unlike most of the  
rest of the SpamBayes code, it wasn't scratching an itch, but  
shutting up people asking for it on spambayes at python.org.  Back then  
I had time to spare, so Tim Stone & I put it together - I didn't have  
an IMAP account at the time.

I still dislike IMAP, so use POP for all my accounts, so although I  
probably know the code better than anyone else (although it has been  
a while), I rarely exercise it.  I've put it up for adoption on  
spambayes-dev at various times, but no-one has taken up the offer.

I'll try to take a look at your message & the problem this weekend.   
I know that the code I added between 1.1a1 and 1.1a2 to both  
sb_server and sb_imapfilter to deal with the "changed database type  
causes a crash" bug wasn't as well designed as it should have been,  
and does cause the odd problem.  I plan to fix that as soon as I can.

=Tony.Meyer

From mhammond at skippinet.com.au  Thu Aug 10 09:32:25 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 10 Aug 2006 17:32:25 +1000
Subject: [spambayes-dev] [Spambayes] Latest image spam/OCR update
In-Reply-To: <17626.48468.837805.425784@montanaro.dyndns.org>
Message-ID: <192c01c6bc4f$26046ba0$0200a8c0@enfoldsystems.local>

> Perhaps someone in Windows-land can
> take the time to
> see if it's possible to build ocrad on Windows.

Using cygwin and gcc I was able to build an ocrad.exe on Windows (with one
simple patch necessary; a complaint about std::sprintf - just removing the
'std::' prefix got it building)

Sadly that is all I have time for today too though, but if anyone wants that
.exe to fiddle with, let me know.

Mark


From dave at boost-consulting.com  Thu Aug 10 18:48:45 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Thu, 10 Aug 2006 12:48:45 -0400
Subject: [spambayes-dev] Posting problems
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
	<1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
	<6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz>
Message-ID: <u8xlwa64y.fsf@boost-consulting.com>

Tony Meyer <tameyer at ihug.co.nz> writes:

>> [David Abrahams]
>>> Any idea why I'm not getting an answer?
>
> [Tim Peters]
>> Perhaps because you've become the leading expert on sb_imapfilter ;-)
>
> Or it could be because the wife of the previously leading expert on  
> sb_imapfilter is due to have their first child any day now ;).  Those  
> round tuits are pretty scarce here at the moment.

Understood.  Congratulations, though!

>
> sb_imapfilter has always been an unloved child.  Unlike most of the  
> rest of the SpamBayes code, it wasn't scratching an itch, but  
> shutting up people asking for it on spambayes at python.org.  Back then  
> I had time to spare, so Tim Stone & I put it together - I didn't have  
> an IMAP account at the time.
>
> I still dislike IMAP, so use POP for all my accounts, 

I don't see POP as an option if I want server-side mail storage.

> so although I probably know the code better than anyone else
> (although it has been a while), I rarely exercise it.  I've put it
> up for adoption on spambayes-dev at various times, but no-one has
> taken up the offer.

I might be willing to be trained toward that end (there's lots I want
to do with IMAP and so it would be good to learn how), but I'm sure
not competent to do it right now.

> I'll try to take a look at your message & the problem this weekend.   

Thanks, I really appreciate it.

> I know that the code I added between 1.1a1 and 1.1a2 to both  
> sb_server and sb_imapfilter to deal with the "changed database type  
> causes a crash" bug wasn't as well designed as it should have been,  
> and does cause the odd problem.  I plan to fix that as soon as I can.

Thanks again, Tony

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com


From skip at pobox.com  Sun Aug 13 18:56:20 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 13 Aug 2006 11:56:20 -0500
Subject: [spambayes-dev] [Spambayes-checkins] spambayes/spambayes
	dnscache.py, 1.2, 1.3
In-Reply-To: <20060813020548.AA6721E4002@bag.python.org>
References: <20060813020548.AA6721E4002@bag.python.org>
Message-ID: <17631.22964.232555.383050@montanaro.dyndns.org>


    Tony> Remove reference to Skip, probably left there by mistake :)

Yes, probably...  Thanks for catching it.

S

From skip at pobox.com  Sun Aug 13 20:48:37 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 13 Aug 2006 13:48:37 -0500
Subject: [spambayes-dev] Patch for ocrad to run on Windows?
Message-ID: <17631.29701.373791.625191@montanaro.dyndns.org>

Mark Hammond and Sean True both said they had an ocrad.exe executable built
under cygwin.  (Hopefully it doesn't require cygwin runtime?)  Was the only
change you made to the source the "std::fprintf" -> "fprintf" replacement?
Ocrad is GPL'd, so all we have to do to make it available is also distribute
the modified source.  If you can stick the .exe file somewhere and let me
know if there are any Windows version restrictions, I'll put together the
requisite modified Ocrad source distribution and place both the distribution
and the executable on the SpamBayes website for Windows users to try out.
I'll also send a (second) note to Antonio Diaz Diaz, the Ocrad author,
letting him know where it is.

Thx,

Skip


From tameyer at ihug.co.nz  Sun Aug 13 23:32:46 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon, 14 Aug 2006 09:32:46 +1200
Subject: [spambayes-dev] Patch for ocrad to run on Windows?
In-Reply-To: <17631.29701.373791.625191@montanaro.dyndns.org>
References: <17631.29701.373791.625191@montanaro.dyndns.org>
Message-ID: <31F368D4-4103-40B6-955A-698D8F813BD6@ihug.co.nz>

> Mark Hammond and Sean True both said they had an ocrad.exe  
> executable built
> under cygwin.  (Hopefully it doesn't require cygwin runtime?)

AFAIK, it will require cygwin1.dll, unless a change is also made (it  
is in the attached patch) to compile with -mno-cygwin.  This seems to  
run fine on my machine, without any of the cygwin DLLs (they are  
installed, of course, but shouldn't be accessible outside of a Cygwin  
shell).

> Was the only
> change you made to the source the "std::fprintf" -> "fprintf"  
> replacement?

Two of these, plus the Makefile.in as above.

> Ocrad is GPL'd, so all we have to do to make it available is also  
> distribute
> the modified source.  If you can stick the .exe file somewhere and  
> let me
> know if there are any Windows version restrictions, I'll put  
> together the
> requisite modified Ocrad source distribution and place both the  
> distribution
> and the executable on the SpamBayes website for Windows users to  
> try out.

Patch is attached.  .exe is at:

http://tangomu.com/ocrad.exe

I have no idea about Windows version restrictions.  My assumption  
would be it will run on any version from Win95 to WinXP (no idea  
about Vista).

=Tony.Meyer

-------------- next part --------------
A non-text attachment was scrubbed...
Name: ocrad.patch
Type: application/octet-stream
Size: 2004 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060814/b54b7471/attachment.obj 


From skip at pobox.com  Mon Aug 14 05:37:20 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 13 Aug 2006 22:37:20 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
Message-ID: <17631.61424.222629.225936@montanaro.dyndns.org>


I updated the OCR capabilities a bit more today.  I added more intelligent
assembly of split images into a single image after noticing that the
spammers don't simply chop up multi-part GIF images horizontally.  I also
added a couple extra options (ocrad_scale and ocrad_charset) which control
the image scaling factor (default is 2) and character set (default is
"ascii") Ocrad uses.  Scaling the image by a factor of 2 was a pretty
obvious win:

    false positive percentages
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          
        0.000  0.000  tied          

    won   0 times
    tied  5 times
    lost  0 times

    total unique fp went from 0 to 0 tied          
    mean fp % went from 0.0 to 0.0 tied          

    false negative percentages
        4.213  4.213  tied          
        1.404  0.843  won    -39.96%
        3.371  2.809  won    -16.67%
        2.528  2.247  won    -11.12%
        4.213  3.652  won    -13.32%

    won   4 times
    tied  1 times
    lost  0 times

    total unique fn went from 56 to 49 won    -12.50%
    mean fn % went from 3.14606741573 to 2.75280898876 won    -12.50%

Scaling by a factor of three was even better in the false negative
department but regressed a bit in the false positive category so I checked
Options.py in with a default scaling factor of 2.  A couple things could
stand to be further tested:

    * I have no idea how good Ocrad's scaling algorithm is.  It's possible
      that PIL or NetPBM's scaling code is better.  If so, it would make
      sense to scale the images before feeding to Ocrad.

    * The images I've seen so far were all plain English, so I blindly made
      ascii the default charset.  The other choices were iso-8859-9 and
      iso-8859-15.  I simply assumed ascii would be the most appropriate
      default, but didn't test it.

Finally, I put together a really simpleminded Ocrad-for-Windows release
based upon the ocrad.exe binary that Tony built.  Check the Files section of
the SpamBayes project site:

    http://sourceforge.net/project/showfiles.php?group_id=61702

and grab ocrad-cygwin.

There are a few caveats:

    1. I don't do Windows.  (No, really, I don't, strange as that may seem.)
       This is no fancy-schmancy point-and-shoot Windows installer.  It's
       just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe
       file and the patch he applied to the source.

    2. I don't do Windows.  The code I've written so far has been done
       entirely on my Mac.  I've made no obvious concessions to portability.
       That said, I hope portability issues won't be daunting for any early
       adopters.

    3. I don't do Windows.  If you have problems it won't do you any good to
       mail me directly.  Post about problems on the SpamBayes bug tracker:

           http://sourceforge.net/tracker/?group_id=61702&atid=498103

    4. If you do Windows you will need PIL to take advantage of the recent
       changes:

           http://www.pythonware.com/products/pil/

       (unless you want to put hair on your chest and build NetPBM on
       Windows).  Fredrik Lundh provides prebuilt Windows versions of PIL.
       Grab the one appropriate for the version of Python you have
       installed.

    5. If you do Windows (or any other platform for that matter), feedback
       to the lists about successes and failures would be helpful.

Cheers,

Skip


From skip at pobox.com  Sat Aug 19 23:06:10 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sat, 19 Aug 2006 16:06:10 -0500
Subject: [spambayes-dev] How about a 1.1a3 release?
Message-ID: <17639.32066.677940.963348@montanaro.dyndns.org>

Any thought on making a 1.1a3 release?  I'd like to get the image spam stuff
into more people's hands.  (Has anyone tried it yet?)  Tony is extremely
busy, and doesn't have the requisite Win2K setup to create a widely runnable
Windows installer.  Can someone else do that?

While I have the ocrad-cygwin zipfile available for people to download, what
do people think about bundling the ocrad.exe file and the source patch as
part of a SpamBayes Windows installer?  I confirmed with the Ocrad author
that all we need to distribute is the (very small) patch, not the entire
distribution as I originally did.

Thx,

Skip

From tameyer at ihug.co.nz  Sat Aug 19 23:38:02 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun, 20 Aug 2006 09:38:02 +1200
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <17639.32066.677940.963348@montanaro.dyndns.org>
References: <17639.32066.677940.963348@montanaro.dyndns.org>
Message-ID: <97CA4326-5332-466F-A2AA-454C6E1C3F91@ihug.co.nz>

> Any thought on making a 1.1a3 release?

+1

> I'd like to get the image spam stuff
> into more peoples' hands.  (Has anyone tried it yet?)

Bits of it.  I'll report more when I can :)  (If this baby would  
hurry up and be born, that would help ;)

>   Tony is extremely
> busy, and doesn't have the requisite Win2K setup to create a widely  
> runnable
> Windows installer.  Can someone else do that?

To clarify this: this is the same issue I had with 1.1a2 - I don't  
have access to Outlook 2000 any more (I have 2002 and 2007b2).  Last  
time Mark did this for me (basically, just cvs-up, run setup_all.py,  
and either do the Inno part or just email me the dist folder and I  
can do the rest).

Alternatively, we could do a Windows build with Outlook 2002 and see  
how much complaining there is ;)

> While I have the ocrad-cygwin zipfile available for people to  
> download, what
> do people think about bundling the ocrad.exe file and the source  
> patch as
> part of a SpamBayes Windows installer?  I confirmed with the Ocrad  
> author
> that all we need to distribute in the (very small) patch, not the  
> entire
> distribution as I originally did.

Fine by me.  I can make the changes to the Inno installer script if  
this is ok with everyone.

=Tony.Meyer

From dave at boost-consulting.com  Sun Aug 20 04:05:15 2006
From: dave at boost-consulting.com (David Abrahams)
Date: Sat, 19 Aug 2006 22:05:15 -0400
Subject: [spambayes-dev] Posting problems
In-Reply-To: <6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz> (Tony Meyer's
	message of "Thu, 10 Aug 2006 17:54:56 +1200")
References: <uoduuhs4v.fsf@boost-consulting.com>
	<uoduthq9o.fsf@boost-consulting.com>
	<17626.1405.993809.368716@montanaro.dyndns.org>
	<uejvphp0i.fsf@boost-consulting.com>
	<17626.8482.159339.985701@montanaro.dyndns.org>
	<upsf9g3om.fsf@boost-consulting.com>
	<1f7befae0608091234m6289a3fnfea19aaa69e05053@mail.gmail.com>
	<6624CE5A-8846-40F5-AE2C-8F57CE263584@ihug.co.nz>
Message-ID: <uu048nov8.fsf@boost-consulting.com>

Tony Meyer <tameyer at ihug.co.nz> writes:

> I'll try to take a look at your message & the problem this weekend.
> I know that the code I added between 1.1a1 and 1.1a2 to both
> sb_server and sb_imapfilter to deal with the "changed database type
> causes a crash" bug wasn't as well designed as it should have been,
> and does cause the odd problem.  I plan to fix that as soon as I can.

Hi Tony,

Any progress on this one?

-- 
Dave Abrahams
Boost Consulting
www.boost-consulting.com

From mhammond at skippinet.com.au  Sun Aug 20 09:22:54 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Sun, 20 Aug 2006 17:22:54 +1000
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <97CA4326-5332-466F-A2AA-454C6E1C3F91@ihug.co.nz>
Message-ID: <DAELJHBGPBHPJKEBGGLNKEFOCDAE.mhammond@skippinet.com.au>

Tony writes:

> To clarify this: this is the same issue I had with 1.1a2 - I don't
> have access to Outlook 2000 any more (I have 2002 and 2007b2).  Last
> time Mark did this for me (basically, just cvs-up, run setup_all.py,
> and either do the Inno part or just email me the dist folder and I
> can do the rest).

I'm happy to turn that crank - just say the word (and let me know which of
those cranks you prefer)

> Alternatively, we could do a Windows build with Outlook 2002 and see
> how much complaining there is ;)

I'm still on Office-2k - although I expect that to change shortly - so this
may well be the last release I can simply make using outlook 2000.

Mark


From tameyer at ihug.co.nz  Sun Aug 20 11:08:13 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun, 20 Aug 2006 21:08:13 +1200
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <DAELJHBGPBHPJKEBGGLNKEFOCDAE.mhammond@skippinet.com.au>
References: <DAELJHBGPBHPJKEBGGLNKEFOCDAE.mhammond@skippinet.com.au>
Message-ID: <EC917904-8B0F-4001-A2E7-30CB55182071@ihug.co.nz>

[Building a 1.1a3 binary]
> I'm happy to turn that crank - just say the word (and let me know  
> which of
> those cranks you prefer)

Great.  Skip - just tell Mark when you feel everything is ready, and  
Mark can (cvs-up and) run setup_all.py, compress the resulting dist  
folder, and send that to me via FTP (details offlist).

>> Alternatively, we could do a Windows build with Outlook 2002 and see
>> how much complaining there is ;)
>
> I'm still on Office-2k - although I expect that to change shortly -  
> so this
> may well be the last release I can simply make using outlook 2000.

Alternatively, we could drop OL2K support for 1.1, at least for now,  
and see if anyone complains (and if they do, they can maybe volunteer  
the price of a 2nd-hand copy of Office-2k <0.5 wink>).

=Tony.Meyer

From sethg at GoodmanAssociates.com  Mon Aug 21 01:56:41 2006
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Sun, 20 Aug 2006 18:56:41 -0500
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <EC917904-8B0F-4001-A2E7-30CB55182071@ihug.co.nz>
Message-ID: <MHEGIFHMACFNNIMMBACAAEILNBAA.sethg@GoodmanAssociates.com>

Tony Meyer wrote:

> > > Alternatively, we could do a Windows build with Outlook 2002
> > > and see how much complaining there is ;)
> >
> > I'm still on Office-2k - although I expect that to change shortly
> > - so this may well be the last release I can simply make using
> > outlook 2000.
>
> Alternatively, we could drop OL2K support for 1.1, at least for now,
> and see if anyone complains (and if they do, they can maybe
> volunteer the price of a 2nd-hand copy of Office-2k <0.5 wink>).

I'm still stuck on Win2K/Office2K for quite a while, yet.  I'd be
willing to obtain a copy of Outlook2K for someone.  Does this mean
shipping to NZ?

--
Seth Goodman
not in NZ


From skip at pobox.com  Mon Aug 21 16:07:36 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 21 Aug 2006 09:07:36 -0500
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <EC917904-8B0F-4001-A2E7-30CB55182071@ihug.co.nz>
References: <DAELJHBGPBHPJKEBGGLNKEFOCDAE.mhammond@skippinet.com.au>
	<EC917904-8B0F-4001-A2E7-30CB55182071@ihug.co.nz>
Message-ID: <17641.48680.645677.461714@montanaro.dyndns.org>


    >> I'm happy to turn that crank - just say the word (and let me know
    >> which of those cranks you prefer)

    Tony> Great.  Skip - just tell Mark when you feel everything is ready,
    Tony> and Mark can (cvs-up and) run setup_all.py, compress the resulting
    Tony> dist folder, and send that to me via FTP (details offlist).

I think we're about ready except for boosting the version info in
spambayes/__init__.py:

    __version__ = "1.1a3"
    __date__ = _("August 2006")

Feel free to turn the crank.  I agree that trying a build on a more recent
version of Outlook would be a good idea.  For testing purposes that probably
opens up the pool of potential release builders a bit.  When we near a final
release, if OL2K is still deemed desirable, we can cut a release that
supports it.

Skip

From mhammond at skippinet.com.au  Tue Aug 22 23:51:05 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 23 Aug 2006 07:51:05 +1000
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <17641.48680.645677.461714@montanaro.dyndns.org>
Message-ID: <04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local>

Skip writes:
> Feel free to turn the crank.  I agree that trying a build on
> a more recent version of Outlook would be a good idea.  For testing
> purposes that probably
> opens up the pool of potential release builders a bit.

I currently *only* have Office2k installed.  Thus, it is not possible for me
to build a version that depends on a later version.

The next time someone without Office2k installed wants to build a new
version, they should just try and do so, patching the code where necessary.
This would include 'addin.py', with the lines starting:

gencache.EnsureModule('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0,
                        bForDemand=True, bValidateFile=bValidateGencache) # Outlook 9

The existing code should be kept in place, but wrapped with an exception
handler that 'falls back' to the newer version.

This code should probably be cloned into setup_all.py, and depending on
success or failure, change the 'typelibs' option passed to py2exe,
reflecting what is known to be installed.  I'd suggest that this print a
fairly noisy warning so the packager is aware the built version will not
work on Office 2k.
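
The fallback described above could be sketched roughly as follows.  This is
only an illustration under stated assumptions: ensure_with_fallback is a
hypothetical helper, ensure_module stands in for a wrapper around
gencache.EnsureModule, and the sketch assumes an unavailable typelib shows up
either as an exception or as a None return.  The version tuples passed in are
placeholders for whatever typelib versions are really involved.

```python
import sys


def ensure_with_fallback(ensure_module, preferred, fallbacks):
    """Return the first (major, minor) version that ensure_module accepts.

    ensure_module(major, minor) is a caller-supplied stand-in; it is assumed
    to raise, or to return None, when that version is not installed.
    """
    for version in [preferred] + list(fallbacks):
        try:
            module = ensure_module(*version)
        except Exception:
            module = None
        if module is not None:
            if version != preferred:
                # Noisy warning so the packager knows what was really built.
                sys.stderr.write(
                    "WARNING: typelib %d.%d not found; using %d.%d instead. "
                    "This build will not work with the preferred version.\n"
                    % (preferred + version))
            return version
    raise RuntimeError("none of the requested typelib versions is installed")
```

The returned version is what setup_all.py could use to decide which
'typelibs' entry to hand to py2exe.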

On a more general note though, I think it is fairly clear that for all
official releases, Office2k should remain supported for a few years yet - when a
few people on the -dev list still use Office2k, I would guess that many more
users also do.

I can't make the changes I recommend above as I don't have OfficeXP
installed - but if someone else makes the change so it works for them, I'd
be happy to repair any unintended breakage on Office2k systems.

Mark


From mhammond at skippinet.com.au  Wed Aug 23 15:26:28 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 23 Aug 2006 23:26:28 +1000
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <17631.61424.222629.225936@montanaro.dyndns.org>
Message-ID: <007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>

Hi Skip,

>  Scaling the image by a factor of 2 was a pretty
> obvious win:
>
>     false positive percentages
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied

  I'm playing a little with the new code and am trying to get things working
with outlook.  I'm a little stuck working out how to get some test data (and
it doesn't help that I'm a little rusty wrt spambayes :)

I'm trying to run the testtools code.  The Outlook code that sets up the
Data/Ham, Data/Spam directories etc just exports the text body of the
message, but completely ignores 'attachments'.  I'm out of time for
tonight - can you offer any quick clues how your test environment is set up?

Thanks,

Mark.


From skip at pobox.com  Wed Aug 23 17:33:03 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 23 Aug 2006 10:33:03 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>
References: <17631.61424.222629.225936@montanaro.dyndns.org>
	<007201c6c6b7$be617290$020a0a0a@enfoldsystems.local>
Message-ID: <17644.29999.936120.882486@montanaro.dyndns.org>


    Mark> I'm trying to run the testtools code.  The Outlook code that sets
    Mark> up the Data/Ham, Data/Spam directories etc just exports the text
    Mark> body of the message, but completely ignores 'attachments'.  I'm
    Mark> out of time for tonight - can you offer any quick clues how your
    Mark> test environment is setup?

Quick clue: I'm not using Outlook or Windows. ;-) I don't know what to do
given that Outlook shreds email so completely.  Maybe this stuff can only be
tested on Unix-y machines.  Maybe the image analysis code won't even work
because there's no such thing as an attachment with MIME content-type
image/*... in Outlook.

As for actual setup, it's done in what I think is the "usual" way.  I start
with two or more Unix mbox format files (at least one full of ham, one full
of spam).  I then run utilities/splitndirs.py to allocate them to the
desired number of Data/{Ham,Spam}/SetN directories.  I then make a series of
runs like so:

    # control run
    python testtools/timcv.py ... args ... > std.txt
    python testtools/rates.py std.txt

    # one or more test runs with various parameters changed
    python testtools/timcv.py ... slightly different args ... > testN.txt
    python testtools/rates.py testN.txt
    python testtools/cmp.py stds.txt testNs.txt

My guess is there's an easier way to run the tests and summarize the
results, but it had been a while since I'd done any testing either.  This was
the first "working" setup I stumbled upon, and thanks to my enormous bash
command history buffer, I just recall the commands as I need them, so the
pain of re-remembering is small.
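
The allocation step is the only part of that setup that isn't just running a
script, so here is a rough picture of what "allocate them to the desired
number of SetN directories" amounts to.  This is a minimal sketch of the
idea, not a description of how utilities/splitndirs.py is actually written;
deal_into_sets and its seed parameter are hypothetical.

```python
import random


def deal_into_sets(messages, n, seed=None):
    """Shuffle messages and deal them round-robin into n roughly equal sets."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    shuffled = list(messages)
    rng.shuffle(shuffled)
    sets = [[] for _ in range(n)]
    for i, msg in enumerate(shuffled):
        sets[i % n].append(msg)
    return sets


# Ten message ids dealt into three sets; sizes differ by at most one.
for number, members in enumerate(deal_into_sets(range(10), 3, seed=1), 1):
    print("Set%d: %d messages" % (number, len(members)))
```

Ham and spam would each be dealt separately, so every Set directory ends up
with a share of both.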

HTH,

Skip

From mhammond at skippinet.com.au  Wed Aug 23 23:52:45 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 24 Aug 2006 07:52:45 +1000
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <17644.29999.936120.882486@montanaro.dyndns.org>
Message-ID: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>

> Quick clue: I'm not using Outlook or Windows. ;-)

Yep, I know that :)  My mail was sent fairly late, so I didn't explain very
well.

> I don't
> know what to do
> given that Outlook shreds email so completely.  Maybe this
> stuff can only be
> tested on Unix-y machines.  Maybe the image analysis code
> won't even work
> because there's no such thing as an attachment with MIME content-type
> image/*... in Outlook.

I can manage all of that.  What I need to know is in what format your Ham
and Spam directories are.  Currently mine are in plain-text.  A quick look
at the code showed that these were *not* expected to be a dump of a mime
message, but instead a simple "word stream" - which didn't seem to fit with
the binary data inside attachments.  I was guessing they had already been
processed to some degree, but gave up before digging deeper.

> As for actual setup, it's done in what I think is the "usual"
> way.  I start
> with two or more Unix mbox format files (at least one full of
> ham, one full
> of spam).  I then run utilities/splitndirs.py to allocate them to the
> desired number of Data/{Ham,Spam}/SetN directories.  I then
> make a series of
> runs like so:

hrm - so maybe they *are* just the complete dump of the message including
the encoded image data and mime boundaries etc - I'll play a little more and
look inside splitndirs.

Thanks,

Mark


From tameyer at ihug.co.nz  Thu Aug 24 08:18:11 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu, 24 Aug 2006 18:18:11 +1200
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
References: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
Message-ID: <40724010-C6F3-410C-9FAF-1F866F86B30C@ihug.co.nz>

>> I don't know what to do given that Outlook shreds email so  
>> completely.
>> Maybe this stuff can only be tested on Unix-y machines.  Maybe the
>> image analysis code won't even work because there's no such thing  
>> as an
>> attachment with MIME content-type image/*... in Outlook.
>
> I can manage all of that.  What I need to know is in what format  
> your Ham
> and Spam directories are.

They're RFC2822.  So for mail in a .pst, presumably the job (of  
export_messages.py) would be to get the attachments and insert them  
into the messages (encoded in base64 or whatever) with the  
appropriate headers.  I planned to write code to do this at some  
point last year, but don't recall getting around to it (and then I  
switched to Mail as my main email client).

> hrm - so maybe they *are* just the complete dump of the message  
> including
> the encoded image data and mime boundaries etc

Yup.  Is that accessible in Outlook?  I had the feeling it wasn't.   
If you can get the attachments then it's easy enough to use the email  
package to build up the message with those and the plain text.
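
A minimal sketch of that last step, assuming the headers, the plain text and
the raw attachment bytes have already been pulled out of Outlook by some
other means (that part is not shown and is the hard bit).  It uses the
email.message.EmailMessage API from current Python versions, which is newer
than the Python this thread was written against; build_message is a
hypothetical helper name.

```python
from email.message import EmailMessage


def build_message(headers, body_text, attachments):
    """Assemble a MIME message from plain text plus binary attachments.

    headers is a dict of header name -> value; attachments is a list of
    (filename, content_type, data_bytes) tuples.
    """
    msg = EmailMessage()
    for name, value in headers.items():
        msg[name] = value
    msg.set_content(body_text)
    for filename, content_type, data in attachments:
        maintype, subtype = content_type.split("/", 1)
        # Binary payloads are base64-encoded by default.
        msg.add_attachment(data, maintype=maintype, subtype=subtype,
                           filename=filename)
    return msg


example = build_message(
    {"From": "sender@example.com", "Subject": "test"},
    "plain text body",
    [("pill.gif", "image/gif", b"GIF89a")])
print(example.get_content_type())
```

Writing msg.as_bytes() to a file in a Data/Spam/SetN directory would then
give the test tools a message with a real image/* part in it.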

=Tony.Meyer


From skip at pobox.com  Thu Aug 24 13:00:49 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 06:00:49 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
References: <17644.29999.936120.882486@montanaro.dyndns.org>
	<01ef01c6c6fe$7c23ed80$020a0a0a@enfoldsystems.local>
Message-ID: <17645.34529.624745.399216@montanaro.dyndns.org>

    Mark> hrm - so maybe they *are* just the complete dump of the message
    Mark> including the encoded image data and mime boundaries etc - I'll
    Mark> play a little more and look inside splitndirs.

Yup, plain old RFC 2822 messages...

Skip

From kenny.pitt at gmail.com  Thu Aug 24 16:40:05 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Thu, 24 Aug 2006 10:40:05 -0400
Subject: [spambayes-dev] How about a 1.1a3 release?
In-Reply-To: <04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local>
References: <17641.48680.645677.461714@montanaro.dyndns.org>
	<04d801c6c635$12848980$2f0a0a0a@enfoldsystems.local>
Message-ID: <2a052b990608240740w280bcdc9h7505034fbc45c034@mail.gmail.com>

On 8/22/06, Mark Hammond <mhammond at skippinet.com.au> wrote:
> I currently *only* have Office2k installed.  Thus, it is not possible for me
> to build a version that depends on a later version.
>
> [...]
>
> On a more general note though, I think it is fairly clear that for all
> official releases, Office2k remain supported for a few years yet - when a
> few people on the -dev list still use Office2k, I would guess that many more
> users also do.

Maybe it would be a good idea to check a copy of the generated COM
wrappers for 2k into CVS while we still have the capability to build
them. It might require some tweaking to py2exe and/or win32com, but
I'm sure we could find a way to utilize a pre-built wrapper instead of
regenerating it from the installed typelibs on every build. That would
certainly make it easier to build compatible versions in the future.

-- 
Kenny Pitt

From g12__ at hotmail.com  Thu Aug 24 16:48:18 2006
From: g12__ at hotmail.com (Greg)
Date: Thu, 24 Aug 2006 14:48:18 +0000
Subject: [spambayes-dev] My Humble Thanks To You
Message-ID: <BAY108-W1CFDCCF402A0653C89C62AB440@phx.gbl>

 
Guys,
 
I run a small corporate network with around 100 e-mail users.  We've been using Sophos PureMessage as an anti-SPAM solution, but it doesn't work very well.  It's just too simplistic and easy for the spammers to work around.
 
After some online research I decided to give SpamBayes a go.  I downloaded the Outlook plugin but didn't know what to expect.  I don't get any SPAM, but I have access to all the e-mail inboxes and can see that some users get around 50-60 per day, so I was pleased when I discovered I could get the plug-in to look at any folder I had access to !!
 
I trialled it on 4 users, 2 who get heavy amounts of SPAM and 2 who get light amounts.  Day 1, I had to go through a lot of e-mail and tell it what was SPAM and what was HAM.  Day 2, I only had to tell it about a few.  Day 3, I think there was 1.  Day 4, well....  you get the picture.
 
The users were delighted.  They think I am a god now!!  I now have the plug-in filtering every Inbox on the system and it doesn't miss a beat.  We, at last, have a clean e-mail system.  And the truth is that you guys are gods!  My thanks for all your efforts.  This has made everyone's work life here much easier.
 
Greg.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20060824/8446e281/attachment.html 

From skip at pobox.com  Thu Aug 24 17:40:50 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 10:40:50 -0500
Subject: [spambayes-dev] My Humble Thanks To You
In-Reply-To: <BAY108-W1CFDCCF402A0653C89C62AB440@phx.gbl>
References: <BAY108-W1CFDCCF402A0653C89C62AB440@phx.gbl>
Message-ID: <17645.51330.474614.22601@montanaro.dyndns.org>

 
    Greg> The users were delighted.  They think I am a god now!!  

I won't tell them if you won't.  All hail Greg the God!!!

Glad we could help.

Skip

From tameyer at ihug.co.nz  Fri Aug 25 02:39:59 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri, 25 Aug 2006 12:39:59 +1200
Subject: [spambayes-dev] [Spambayes-checkins] spambayes/windows/py2exe
	setup_all.py, 1.26, 1.27
In-Reply-To: <20060824131835.EB71E1E4005@bag.python.org>
References: <20060824131835.EB71E1E4005@bag.python.org>
Message-ID: <FF231618-8416-4D79-AF2D-C7D398630FA7@ihug.co.nz>

On 25/08/2006, at 1:18 AM, Mark Hammond wrote:

> Update of /cvsroot/spambayes/spambayes/windows/py2exe
> In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540
>
> Modified Files:
> 	setup_all.py
> Log Message:
> Ship with PIL (but no Tkinter) and pyDNS
>
>
> [...]
> !     excludes = "Tkinter," # side-effect of PIL and markh doesn't  
> have it :)
> !                 "win32ui,pywin,pywin.debugger," # *sob* - these  
> still appear
> !                 # Keep zope out else outlook users lose training.
> !                 # (sob - but some of these may still appear!)
> !                 
> "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",

I don't care about this for 1.1a3, but is this right?  Outlook users  
(any users, really) would only lose training if they chose not to  
convert the database on installation and didn't change their  
configuration to continue to use bsddb.

=Tony.Meyer

From skip at pobox.com  Fri Aug 25 04:08:20 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 24 Aug 2006 21:08:20 -0500
Subject: [spambayes-dev] SpamBayes 1.1a3
Message-ID: <17646.23444.394006.480373@montanaro.dyndns.org>

The SpamBayes team is pleased to announce release 1.1a3 of SpamBayes.

As is now usual, this is both a release of the source code and of an
installation program for all Microsoft Windows users.

This is an *ALPHA* release.  It should only be installed by users willing to
try out experimental software, and almost certainly contains new bugs.  If
you don't know what an alpha release is, please stick with 1.0.4 for the
moment.

The 1.1 release has been worked on since May of 2004, so contains a vast
number of improvements over the 1.0.x line.  These include, but are not
limited to:

  * New database backends, including ZODB and ZEO.

  * Internationalisation support, including partial translations into French
    and Spanish.

  * Improved statistics reporting.

  * The ability to set audio notifications with the Outlook plug-in.

  * The ability to set the Outlook plug-in to move/copy ham, as well as
    spam/unsures.

  * Partial POP3 over SSL support for sb_server.

  * A vastly improved sb_imapfilter.

  * Several new experimental options, including one designed to help extract
    text content from image-based spams.

Suggestions about what to try out can be found here:

    http://entrian.com/sbwiki/TryOutThePreRelease

This release, like the ill-fated 1.0.2 and 1.0.3, is built with Python 2.4.
We believe that the remaining incompatibilities with Python 2.4 have been
resolved, and so this release should also include superior email parsing to
the 1.0.x line.

Details about the changes in this release can be found at

    http://sourceforge.net/project/shownotes.php?release_id=442102

You can get the release via the 'Download' page at

    http://spambayes.org/download.html

Enjoy the new release and your spam-free mailbox.

As always, thanks to everyone involved in this release!

Skip Montanaro.
(on behalf of the SpamBayes team)

--- What is SpamBayes? ---

The SpamBayes project is working on developing a Bayesian (of sorts)
anti-spam filter (in Python), initially based on the work of Paul Graham,
but since modified with ideas from Robinson, Peters, et al.

The project includes a number of different applications, all using the same
core code, ranging from a plug-in for Microsoft Outlook, to a POP3 proxy, to
various command-line tools and a command-line-based framework for testing
new anti-spam techniques.

The Windows installation program will install either the Outlook add-in (for
Microsoft Outlook users), the SpamBayes server program (for all other POP3
mail client users, including Microsoft Outlook Express), or the SpamBayes
IMAP filter (for all IMAP mail client users). All Windows users (including
existing users of the Outlook add-in) are encouraged to use the installation
program.

If you wish to use the source-code version, you will also need to install
Python - see README.txt in the source tree for more information.

From mhammond at skippinet.com.au  Fri Aug 25 04:44:17 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 25 Aug 2006 12:44:17 +1000
Subject: [spambayes-dev] [Spambayes-checkins]
	spambayes/windows/py2exe setup_all.py, 1.26, 1.27
In-Reply-To: <FF231618-8416-4D79-AF2D-C7D398630FA7@ihug.co.nz>
Message-ID: <023d01c6c7f0$5c25d6e0$050a0a0a@enfoldsystems.local>

> > Update of /cvsroot/spambayes/spambayes/windows/py2exe
> > In directory sc8-pr-cvs8.sourceforge.net:/tmp/cvs-serv6540
> >
> > Modified Files:
> > 	setup_all.py
> > Log Message:
> > Ship with PIL (but no Tkinter) and pyDNS
> >
> >
> > [...]
> > !     excludes = "Tkinter," # side-effect of PIL and markh doesn't
> > have it :)
> > !                 "win32ui,pywin,pywin.debugger," # *sob* - these
> > still appear
> > !                 # Keep zope out else outlook users lose training.
> > !                 # (sob - but some of these may still appear!)
> > !
> > "ZODB,_zope_interface_coptimizations,_OOBTree,cPersistence",
>
> I don't care about this for 1.1a3, but is this right?  Outlook users
> (any users, really) would only lose training if they chose not to
> convert the database on installation and didn't change their
> configuration to continue to use bsddb.

If the inno installer offers to convert databases, then you may be correct.
However, for my testing I didn't use the inno installer, so suddenly and
without warning 'lost' the training info.  I wonder if people who roll
spambayes out to many seats all use our Inno setup to achieve that - if not,
they too will lose.

More generally though, even if I was prompted about converting the
databases, if I answered 'No' I would expect my old existing database would
still work as before.  An upgrade that *forces* pain on you (answer yes,
wait while 1x20MB and 1x10MB pickles are migrated, or answer 'no' and take
the pain of retraining from scratch) doesn't sound friendly.  A better
approach may be that before *creating* a database in the new format, check
to see if the old format exists and continue to use it.
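
As a sketch of that last suggestion, under the assumption that each format
can be recognised by the presence of a single file (pick_database and the
file names in the example are hypothetical, not SpamBayes's real storage
code or names):

```python
import os


def pick_database(data_dir, old_name, new_name, exists=os.path.exists):
    """Decide which database file to open, preferring existing training.

    Returns ("new", path) or ("old", path).  A new-format file is only
    created when neither format is present.
    """
    new_path = os.path.join(data_dir, new_name)
    old_path = os.path.join(data_dir, old_name)
    if exists(new_path):
        return ("new", new_path)   # already converted (assumed to win)
    if exists(old_path):
        return ("old", old_path)   # keep using the existing training
    return ("new", new_path)       # fresh install: create the new format


print(pick_database("data", "old.db", "new.fs", exists=lambda p: False))
```

The exists parameter is only there so the decision can be exercised without
touching the filesystem.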

And more generally still, the ZODB that I have installed is built from Zope3
from SVN - from a branch, but not (necessarily) corresponding to an official
release.  This didn't seem prudent (but OTOH, probably would not itself have
caused me to exclude it without the above :)

Cheers,

Mark


From vilisch at wmw.com  Mon Aug 28 14:43:58 2006
From: vilisch at wmw.com (Vilmos Schnedarek)
Date: Mon, 28 Aug 2006 15:43:58 +0300
Subject: [spambayes-dev] Integrate SpamBayes into a Win32 application
Message-ID: <44F2E50E.4040607@wmw.com>

An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20060828/52648bab/attachment.html 

From skip at pobox.com  Thu Aug 31 14:26:31 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 07:26:31 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
Message-ID: <17654.54647.420342.345139@montanaro.dyndns.org>

It didn't take long for the spammers to start with the blinking GIF images.
Now I think they are using blinkers where the first image in the sequence is
pretty much empty.  The real content is in the second frame.  I need to
handle that.  I don't think I can just blindly overwrite one frame with the
next since they could just make the last one the blankish image.  The way I
concatenate images left-to-right and top-to-bottom makes it impossible to
just concatenate the frames together either.  Ideas?  The code is in
spambayes/ImageStripper.py in the distribution.  Look at PIL_decode_parts.

Skip


From skip at pobox.com  Thu Aug 31 15:48:57 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 08:48:57 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <17654.54647.420342.345139@montanaro.dyndns.org>
References: <17654.54647.420342.345139@montanaro.dyndns.org>
Message-ID: <17654.59593.363231.433652@montanaro.dyndns.org>

I was a little rushed this morning heading out the door, so didn't
completely dump my brain in my earlier message:

    skip> I don't think I can just blindly overwrite one frame with the next
    skip> since they could just make the last one the blankish image.  The
    skip> way I concatenate images left-to-right and top-to-bottom makes it
    skip> impossible to just concatenate the frames together either.  Ideas?

The implication I should have stated explicitly is that we need to select an
image that's most likely the one with text in it.  If spammers are going to
blink their GIFs I suspect one or more of the images will have to be mostly
background, while other images will have to be a mixture of colors.  That
suggests choosing one based on histograms.  Another possibility is to decide
which color is the background, make it transparent, then overlay all the
images on top of each other.
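
A minimal sketch of both ideas, not the ImageStripper code.  It assumes each
frame is already available as a histogram (a list of pixel counts per color
bin) or as a flat list of pixel values.  The names background_fraction,
pick_text_frame, overlay_frames and frame_histograms are hypothetical, and
frame_histograms assumes a Pillow-style "from PIL import ..." layout.

```python
def background_fraction(histogram):
    """Fraction of pixels that fall in the single most common color bin."""
    total = sum(histogram)
    if total == 0:
        return 1.0  # treat an empty histogram as pure background
    return max(histogram) / float(total)


def pick_text_frame(histograms):
    """Index of the frame least dominated by one color.

    Assumption: a mostly blank frame has nearly all of its pixels in one
    bin, while a frame carrying text spreads them over more bins.
    """
    return min(range(len(histograms)),
               key=lambda i: background_fraction(histograms[i]))


def overlay_frames(frames, background):
    """Overlay equally sized frames, treating `background` as transparent.

    Each frame is a flat sequence of pixel values.  Later frames win where
    they are not background.
    """
    merged = list(frames[0])
    for pixels in frames[1:]:
        for i, value in enumerate(pixels):
            if value != background:
                merged[i] = value
    return merged


def frame_histograms(path):
    """Greyscale histogram of every frame in an animated GIF (needs PIL)."""
    from PIL import Image, ImageSequence  # imported lazily on purpose
    image = Image.open(path)
    return [frame.convert("L").histogram()
            for frame in ImageSequence.Iterator(image)]
```

With real images, pick_text_frame(frame_histograms(path)) would give the
index of the frame to hand to the OCR step.  Whether "least dominated by one
color" actually tracks "contains text" on real spam is untested here.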

I don't have time to look at this right now.  Perhaps someone else does.

Skip


From kenny.pitt at gmail.com  Thu Aug 31 17:38:37 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Thu, 31 Aug 2006 11:38:37 -0400
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <17654.59593.363231.433652@montanaro.dyndns.org>
References: <17654.54647.420342.345139@montanaro.dyndns.org>
	<17654.59593.363231.433652@montanaro.dyndns.org>
Message-ID: <2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>

On 8/31/06, skip at pobox.com <skip at pobox.com> wrote:
> The implication I should have stated explicitly is that we need to select an
> image that's most likely the one with text in it.  If spammers are going to
> blink their GIFs, I suspect one or more of the images will have to be mostly
> background, while other images will have to be a mixture of colors.  That
> suggests choosing one based on histograms.  Another possibility is to decide
> which color is the background, make it transparent, then overlay all the
> images on top of each other.

Could we extract a list of text tokens from each frame separately, and
then choose the token list that has the most tokens in it?
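
A rough sketch of that suggestion, assuming a callable named ocr (a
hypothetical stand-in for whatever wraps the OCR program) that takes one
frame and returns the recognized text as a string:

```python
def best_token_list(frames, ocr):
    """Run OCR on each frame separately and keep the longest token list.

    `ocr` is an assumed callable: frame -> recognized text (str).
    Tokens here are just whitespace-separated words.
    """
    best = []
    for frame in frames:
        tokens = ocr(frame).split()
        if len(tokens) > len(best):
            best = tokens
    return best
```

Note that this calls the OCR step once per frame, so the cost grows with the
number of frames in the animation.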

-- 
Kenny Pitt

From skip at pobox.com  Thu Aug 31 19:14:10 2006
From: skip at pobox.com (skip at pobox.com)
Date: Thu, 31 Aug 2006 12:14:10 -0500
Subject: [spambayes-dev] Need slightly better logic for blinking gifs
In-Reply-To: <2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>
References: <17654.54647.420342.345139@montanaro.dyndns.org>
	<17654.59593.363231.433652@montanaro.dyndns.org>
	<2a052b990608310838x53dc9f5ftac4a89eb64822911@mail.gmail.com>
Message-ID: <17655.6370.856714.140457@montanaro.dyndns.org>


    Kenny> Could we extract a list of text tokens from each frame
    Kenny> separately, and then choose the token list that has the most
    Kenny> tokens in it?

In theory, yes, though that would require running ocrad on each possibly
partial image (could get expensive) and would require code restructuring.
At the moment, the images come in one of three forms:

    * a single non-blinking image

    * a set of images, non-blinking, which, when assembled, make a single
      larger image

    * a single blinking image

Right now, I assume there might be multiple parts to the image, so I convert
from the source to PIL's internal format, concatenate them together, then
run ocrad on the total image.
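
For illustration only, here is one way a left-to-right, top-to-bottom
concatenation could be laid out.  This is not the code in ImageStripper.py.
The function name layout and the max_width wrapping rule are assumptions
made for the sketch.

```python
def layout(sizes, max_width):
    """Place images left to right, starting a new row when one won't fit.

    `sizes` is a list of (width, height) pairs.  Returns the top-left
    position of each image and the (width, height) of the canvas needed.
    """
    positions = []
    x = y = row_height = canvas_width = 0
    for width, height in sizes:
        if x > 0 and x + width > max_width:
            # Current row is full: move down by the tallest image in it.
            y += row_height
            x = row_height = 0
        positions.append((x, y))
        x += width
        row_height = max(row_height, height)
        canvas_width = max(canvas_width, x)
    return positions, (canvas_width, y + row_height)
```

With PIL, the positions could then be used to paste each decoded part onto a
blank canvas of the returned size (Image.new followed by paste) before the
combined image is handed to the OCR step.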

I imagine it's not going to be long before the spammers start splitting up
their blinking images into parts.

Skip