From jklein at magnetstreet.com  Fri Dec  1 21:39:42 2006
From: jklein at magnetstreet.com (Jim Klein)
Date: Fri, 1 Dec 2006 14:39:42 -0600
Subject: [spambayes-dev] Question for Spambayes
Message-ID: <001401c71588$d493cca0$8800010a@magnetstreet.net>

How do I do a silent automated install of Spambayes?
 
 
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061201/4628b07a/attachment.html 

From seandarcy2 at gmail.com  Thu Dec 14 23:42:02 2006
From: seandarcy2 at gmail.com (sean darcy)
Date: Thu, 14 Dec 2006 17:42:02 -0500
Subject: [spambayes-dev] Dec 14 cvs seg faults on python-2.5: _weakref.so
Message-ID: <c195ebf70612141442s498f4539kb84433ca7cd29d87@mail.gmail.com>

Made the mistake of updating to python-2.5 :(

That seg faulted sb. So I updated to cvs. Rebuilt and installed.  Same result:


 python -v /usr/bin/sb_server.py
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# /usr/lib64/python2.5/site.pyc matches /usr/lib64/python2.5/site.py
import site # precompiled from /usr/lib64/python2.5/site.pyc
# /usr/lib64/python2.5/os.pyc matches /usr/lib64/python2.5/os.py
import os # precompiled from /usr/lib64/python2.5/os.pyc
import posix # builtin
................
import spambayes.Version # precompiled from
/usr/lib/python2.5/site-packages/spambayes/Version.pyc
# /usr/lib/python2.5/site-packages/spambayes/ProxyUI.pyc matches
/usr/lib/python2.5/site-packages/spambayes/ProxyUI.py
import spambayes.ProxyUI # precompiled from
/usr/lib/python2.5/site-packages/spambayes/ProxyUI.pyc
SpamBayes POP3 Proxy Version 1.1a3 (August 2006)

import bsddb # directory /usr/lib64/python2.5/bsddb
# /usr/lib64/python2.5/bsddb/__init__.pyc matches
/usr/lib64/python2.5/bsddb/__init__.py
import bsddb # precompiled from /usr/lib64/python2.5/bsddb/__init__.pyc
dlopen("/usr/lib64/python2.5/lib-dynload/_bsddb.so", 2);
import _bsddb # dynamically loaded from
/usr/lib64/python2.5/lib-dynload/_bsddb.so
# /usr/lib64/python2.5/bsddb/dbutils.pyc matches
/usr/lib64/python2.5/bsddb/dbutils.py
import bsddb.dbutils # precompiled from /usr/lib64/python2.5/bsddb/dbutils.pyc
# /usr/lib64/python2.5/bsddb/db.pyc matches /usr/lib64/python2.5/bsddb/db.py
import bsddb.db # precompiled from /usr/lib64/python2.5/bsddb/db.pyc
# /usr/lib64/python2.5/weakref.pyc matches /usr/lib64/python2.5/weakref.py
import weakref # precompiled from /usr/lib64/python2.5/weakref.pyc
dlopen("/usr/lib64/python2.5/lib-dynload/_weakref.so", 2);
import _weakref # dynamically loaded from
/usr/lib64/python2.5/lib-dynload/_weakref.so
Segmentation fault

sean

From spambayes-dev at spandex.nildram.co.uk  Fri Dec 15 14:46:48 2006
From: spambayes-dev at spandex.nildram.co.uk (Spandex)
Date: Fri, 15 Dec 2006 13:46:48 +0000
Subject: [spambayes-dev] Problem with struct.unpack in oe_mailbox.py
Message-ID: <307650593.20061215134648@nildram.co.uk>


Hi,

I previously sent this mail to the spambayes users list, without
response.  Apologies for the repost... I'm hoping it's more
appropriate here:-

I'm running spambayes (1.0.4-3) on Debian unstable with Python 2.4.4c0
and a custom compiled 2.6.17 kernel. I'm using an AMD64 chip.

sb_server starts up ok and proxies pop3 and smtp connections ok.
I can train from the commandline ok.

The problem comes when I try to train it from the web interface (using
either mbox or dbx format).  It bombs with the following error:-

----------------
Traceback (most recent call last):

  File "/usr/lib/python2.4/site-packages/spambayes/Dibbler.py", line 470, in found_terminator
    getattr(plugin, name)(**params)

  File "/usr/lib/python2.4/site-packages/spambayes/UserInterface.py", line 494, in onTrain
    content = self._convertToMbox(content)

  File "/usr/lib/python2.4/site-packages/spambayes/UserInterface.py", line 536, in _convertToMbox
    content = oe_mailbox.convertToMbox(content)

  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 444, in convertToMbox
    if header.isValid() and header.isMessages():

  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 117, in isValid
    return self.getEntry(0) == dbxFileHeader.MAGIC_NUMBER

  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 126, in getEntry
    self.dbxBuffer[dbxEntry * 4:(dbxEntry * 4) + 4])[0]

error: unpack str size does not match format
----------------

I'm wondering whether this is something to do with my machine
architecture and the sizes of datatypes?  But I'm stabbing in the
dark.

I can easily disable dbx support by commenting out..
content = oe_mailbox.convertToMbox(content)
.. around line 536 of UserInterface.py, and this does enable me to
train on mbox format via the web interface, but I'd rather keep dbx
support if possible.  I don't speak Python so commenting out the
offending code was about as far as I could go.

Any ideas?

Thanks,

Matt


From skip at pobox.com  Wed Dec 20 06:05:39 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 19 Dec 2006 23:05:39 -0600
Subject: [spambayes-dev] Applying SpamBayes to website spamming
Message-ID: <17800.50339.820370.517536@montanaro.dyndns.org>

I'm sure many of you are aware that spamming of the submission forms on
blogs and other websites is a large and increasing problem.  The Mojam and
Musi-Cal concert websites suffered from the same malady.  I originally
considered implementing some sort of CAPTCHA scheme:

    http://en.wikipedia.org/wiki/Captcha

but that has limitations and would have required changes to all submission
forms on the websites.  I decided to instead implement a SpamBayes-based
solution in our XML-RPC server.  It has a couple distinct advantages:

    * It has none of the CAPTCHA gotchas.
    * It is implemented at a single point in the system.
    * No changes to the Web interface were required so users don't have to
      learn something new.

I'll give you a quick sketch of what I did to solve this problem.  If you'd
like more details, drop me a note.

When someone submits concert dates to our sites the submission is
represented as a simple dictionary.  A valid submission will have
information about who's performing, a date in the future, valid location
information, etc.  In contrast, when someone spams the submission forms the
dictionary often contains bogus information or is missing some fields
altogether.  For example, if the spammer puts something in the date fields
it's likely to be garbage which won't parse properly, resulting in a default
date of 1900-01-01.  Similarly, the city/state/country is likely to be
invalid, so we won't be able to find lat/long info.

The dictionary is preprocessed into a string of tokens which includes the
obvious text which was part of the submission, but which also contains
synthetic tokens.  Here's a spammer's entry represented as text:

    Bradyn Maximus Ty jordan at e-mailanywhere.com 1900-01-01 Jerald kwds:False
    kwds-private:False Malcom 1900-01-01 Jarod date:ancient perflen:1
    infolen:1 hasphone:False hasprice:False city:unknown venue:present

Here's a valid entry represented as text:

    Anchorage skip at mojam.com 2006-10-07 kwds:True kwds-private:True .bl.1348
    .ra LaVette,Bettye 2006-10-07 AK Discovery Theatre date:current
    perflen:1 infolen:0 hasphone:False hasprice:False city:known
    venue:present

The synthetic tokens that suggest problems are such huge red flags for the
classifier that after training on just a couple of these bad boys the
rejection rate of spam submissions seems to be 100%.  Of course, this sort
of spamming is probably still in its infancy, so I expect we might
eventually see some sort of arms race develop as has been true for email
spam.  I'm not too worried about that though because for the most part I
think the spammers' primary target is the blogosphere with its ubiquitous
comment feature, not specialized websites like ours.

The tokenizer class is quite simple.  I post it here in its entirety.  Note
that major bits of it were just pasted from the default tokenizer.

    from spambayes.tokenizer import log2, Tokenizer, numeric_entity_re, \
         numeric_entity_replacer, crack_urls, breaking_entity_re, html_re, \
         tokenize_word

    class Tokenizer(Tokenizer):
        def tokenize(self, text):
            maxword = 20
            # Replace numeric character entities (like &#97; for the letter
            # 'a').
            text = numeric_entity_re.sub(numeric_entity_replacer, text)

            # Normalize case.
            text = text.lower()

            # Get rid of uuencoded sections, embedded URLs, <style gimmicks,
            # and HTML comments.
            for cracker in (crack_urls,):
                text, tokens = cracker(text)
                for t in tokens:
                    yield t

            # Remove HTML/XML tags.  Also &nbsp;.  <br> and <p> tags should
            # create a space too.
            text = breaking_entity_re.sub(' ', text)
            # It's important to eliminate HTML tags rather than, e.g.,
            # replace them with a blank (as this code used to do), else
            # simple tricks like
            #    Wr<!$FS|i|R3$s80sA >inkle Reduc<!$FS|i|R3$s80sA >tion
            # can be used to disguise words.  <br> and <p> were special-
            # cased just above (because browsers break text on those,
            # they can't be used to hide words effectively).
            text = html_re.sub('', text)

            # Tokenize everything in the body.
            for w in text.split():
                n = len(w)
                # Make sure this range matches in tokenize_word().
                if 3 <= n <= maxword:
                    yield w

                elif n >= 3:
                    for t in tokenize_word(w):
                        yield t

The only thing I found a bit frustrating was that I had to override the
Hammie class to provide an alternate tokenizer class:

    class Hammie(hammie.Hammie):
        def __init__(self, bayes):
            hammie.Hammie.__init__(self, bayes)
            self.tokenizer = Tokenizer()

        def _scoremsg(self, msg, evidence=False):
            return self.bayes.spamprob(self.tokenizer.tokenize(msg), evidence)

        def train(self, msg, is_spam, add_header=False):
            self.bayes.learn(self.tokenizer.tokenize(msg), is_spam)

I think it would be a bit more general if the Hammie class accepted an
optional tokenizer to avoid this.

So far I've trained on 338 hams and 26 spams.  (I have a new guy I'm
breaking in who is not experienced with SpamBayes.  I see a number of
entries in the ham data I would not have added there.  I expect I can
probably reduce the ham data size to 200 or less.)  The BerkDB file
containing the token database is a whopping 86KBytes.  I somewhat
arbitrarily set the ham cutoff at 0.15 and the spam cutoff at 0.60.  The
true spam seems to so far all land in the 0.95-1.0 range.  I see some
"possible spam" in the 0.16-0.2 range.  Most of the time that's because the
submitter forgot to enter a date (or entered it incorrectly) or misspelled
the city.

Skip

From mhammond at skippinet.com.au  Wed Dec 20 08:04:59 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 20 Dec 2006 18:04:59 +1100
Subject: [spambayes-dev] Stats are very slow
Message-ID: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>

Hi guys,

I'm playing a little more with integrating the outlook addin with Skip's
ocrad code and making some progress.  However, I noticed that while
spambayes was taking much less than a second to load my (bsddb) databases,
it takes nearly 30 seconds to load the stats!

Specifically, this line in addin.py is the culprit:

        self.stats = bayes_stats.Stats(bayes_options,
                                       self.classifier_data.message_db)

I've not even looked inside that module yet, but that seems quite extreme,
to the point I'm not sure the feature is worth that cost...  I guess the
code is reading each record of my message DB (which is 85MB) - but does
anyone have any insights?

Cheers,

Mark


From skip at pobox.com  Wed Dec 20 10:51:27 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 20 Dec 2006 03:51:27 -0600
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
References: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
Message-ID: <17801.1951.114274.679266@montanaro.dyndns.org>

    Mark> Specifically, this line in addin.py is the culprit:

    Mark>         self.stats = bayes_stats.Stats(bayes_options,
    Mark>                                        self.classifier_data.message_db)

    Mark> I've not even looked inside that module yet, but that seems quite
    Mark> extreme, to the point I'm not sure the feature is worth that
    Mark> cost...  I guess the code is reading each record of my message DB
    Mark> (which is 85MB) - but does anyone have any insights?

Yes, it appears to be doing just that.  At the end of __init__ it calls
self.CalculatePersistentStats() which loops over all the keys in the
message_db.  The author anticipated this in the docstring:

    Calculate the statistics totals (i.e. not this session).

    This is done by running through the messageinfo database and
    adding up the various information.  This could get quite time
    consuming if the messageinfo database gets very large, so
    some consideration should perhaps be made about what to do
    then.

It might be worth deferring that call until it's really needed (say, in
GetStats()).

Skip

From dwalden at ghaca.org  Wed Dec 20 19:49:25 2006
From: dwalden at ghaca.org (Donna Walden)
Date: Wed, 20 Dec 2006 13:49:25 -0500
Subject: [spambayes-dev] SpamBayes
Message-ID: <13724DB0FBBD5F4DB46D4AC8D95A8B4594F5D8@ghadc001.gha.org>

Good Afternoon,

Email trained as safe domain and working properly for a few months, has
all of a sudden begun going into the Spam folder. 
What would cause this?

Thank you,
Donna

Donna Walden
Information System Manager
Georgia HAP Administrators, Inc.
2296 Henderson Mill Road, Suite 306
Atlanta, GA 30345-2739
(770)939-3939, ext. 2014
(770)939-3886 (Fax)
dwalden at ghaca.org


-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061220/e698ac09/attachment.html 

From kenny.pitt at gmail.com  Wed Dec 20 21:32:22 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Wed, 20 Dec 2006 15:32:22 -0500
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <17801.1951.114274.679266@montanaro.dyndns.org>
References: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
	<17801.1951.114274.679266@montanaro.dyndns.org>
Message-ID: <2a052b990612201232n56573583x358b6b0cc53176c7@mail.gmail.com>

On 12/20/06, skip at pobox.com <skip at pobox.com> wrote:
>    Mark> Specifically, this line in addin.py is the culprit:
>
>    Mark>         self.stats = bayes_stats.Stats(bayes_options,
>    Mark>                                        self.classifier_data.message_db)
[...]
> It might be worth deferring that call until it's really needed (say, in
> GetStats()).

The Stats object tracks two types of statistics: statistics for the
current session and total statistics across all Outlook sessions. The
total statistics are calculated as the value of the persistent
statistics plus the accumulated statistics for the current session.
The persistent statistics need to be totalled up before we start
accumulating anything into the session statistics using the
RecordClassification or RecordTraining methods. Otherwise, session
stats accumulated up to the point where the persistent stats are
calculated will be included twice.

We can probably still defer the call if we are smart about the
relationship between the persistent stats and session stats. At
whatever point we actually calculate the value of the persistent
stats, we need to be aware that the session statistics accumulated up
to that point are already included in the message db and subtract
those values from the persistent statistics values.

Of course, this only solves part of the problem because we would still
take a huge hit when displaying the statistics. It might be worth
considering an optimization to store the actual statistics values
instead of calculating them at the start of every Outlook session. The
reason the stats are calculated from the message db is so that the
user can reset the starting date for the statistics and still get
accurate results. We could recalculate the persistent statistics only
when the user changes the start date for the statistics, and store the
summary values as a separate record in the message db or in a separate
statistics db file.

I've been incredibly swamped lately with the work that pays the bills,
but I'll try to find some time over the holidays to take a look at
this.

-- 
Kenny Pitt

From skip at pobox.com  Wed Dec 20 22:24:58 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 20 Dec 2006 15:24:58 -0600
Subject: [spambayes-dev] SpamBayes
In-Reply-To: <13724DB0FBBD5F4DB46D4AC8D95A8B4594F5D8@ghadc001.gha.org>
References: <13724DB0FBBD5F4DB46D4AC8D95A8B4594F5D8@ghadc001.gha.org>
Message-ID: <17801.43562.353358.808997@montanaro.dyndns.org>

    Donna> Email trained as safe domain and working properly for a few
    Donna> months, has all of a sudden begun going into the Spam folder.
    Donna> What would cause this?

It's virtually impossible to speculate without some data.  Can you post a
message or at least the evidence that SpamBayes generated for one of these
messages?

Skip

From skip at pobox.com  Wed Dec 20 22:26:56 2006
From: skip at pobox.com (skip at pobox.com)
Date: Wed, 20 Dec 2006 15:26:56 -0600
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <2a052b990612201232n56573583x358b6b0cc53176c7@mail.gmail.com>
References: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
	<17801.1951.114274.679266@montanaro.dyndns.org>
	<2a052b990612201232n56573583x358b6b0cc53176c7@mail.gmail.com>
Message-ID: <17801.43680.688154.369277@montanaro.dyndns.org>


    Kenny> Of course, this only solves part of the problem because we would
    Kenny> still take a huge hit when displaying the statistics. It might be
    Kenny> worth considering an optimization to store the actual statistics
    Kenny> values instead of calculating them at the start of every Outlook
    Kenny> session.

That occurred to me after my reply.  I suspect it's probably the way to go.

Skip

From kenny.pitt at gmail.com  Wed Dec 20 23:55:10 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Wed, 20 Dec 2006 17:55:10 -0500
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <17801.43680.688154.369277@montanaro.dyndns.org>
References: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
	<17801.1951.114274.679266@montanaro.dyndns.org>
	<2a052b990612201232n56573583x358b6b0cc53176c7@mail.gmail.com>
	<17801.43680.688154.369277@montanaro.dyndns.org>
Message-ID: <2a052b990612201455q3a1b40fawdf8325407d06a25a@mail.gmail.com>

On 12/20/06, skip at pobox.com <skip at pobox.com> wrote:
>
>    Kenny> Of course, this only solves part of the problem because we would
>    Kenny> still take a huge hit when displaying the statistics. It might be
>    Kenny> worth considering an optimization to store the actual statistics
>    Kenny> values instead of calculating them at the start of every Outlook
>    Kenny> session.
>
> That occurred to me after my reply.  I suspect it's probably the way to go.

I checked in an initial update to delay the calculation of the
persistent stats until the GetStats() call because that was the easy
update. In the case where you never actually view the stats in
SpamBayes Manager, this should help. Let me know if you see any
oddities in the stats calculation after this.

The complete fix is a little more involved, so I'll have to defer that
until I have more time to test it thoroughly.

-- 
Kenny Pitt

From mhammond at skippinet.com.au  Thu Dec 21 02:38:55 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 21 Dec 2006 12:38:55 +1100
Subject: [spambayes-dev] anyone able to rebuild docs?
Message-ID: <3ad401c724a0$c7ec2de0$110a0a0a@enfoldsystems.local>

I just checked in some changes to the docs to fix some broken links that
have been reported.  However, I have failed miserably in the process of
regenerating the HTML and uploading to sourceforge.

Would it be possible for someone else (presumably on Linux :) to update the
website dir and execute the process (which appears to simply be "make; make
install")?

Thanks!

Mark


From mhammond at skippinet.com.au  Thu Dec 21 08:40:11 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Thu, 21 Dec 2006 18:40:11 +1100
Subject: [spambayes-dev] image cracking results
Message-ID: <3b3c01c724d3$3fbe1630$110a0a0a@enfoldsystems.local>

I've finally managed to get something working with the Outlook addin and
Skip's cool new ocrad stuff.  the results look promising! :)  A summary of
my results are below.  The runs are 'Tokenizer:x-image_size' and
'Tokenizer:x-crack_images' both set to False, versus both set to True.  It
looks like a 13% improvement in false negatives, which is nothing to sneeze
at!  I've never been an expert at reading these results though, so let me
know if there is anything interesting I missed or neglected to send.

Cheers,

Mark

false positive percentages
<snip 10 lines of zeros>
won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    5.859  5.455  won     -6.90%
    5.410  4.363  won    -19.35%
    5.794  5.234  won     -9.67%
    3.008  2.820  won     -6.25%
    5.588  4.817  won    -13.80%
    4.469  3.166  won    -29.16%
    5.051  4.646  won     -8.02%
    5.829  5.647  won     -3.12%
    6.842  5.789  won    -15.39%
    7.356  5.964  won    -18.92%

won  10 times
tied  0 times
lost  0 times

total unique fn went from 293 to 254 won    -13.31%
mean fn % went from 5.52048164035 to 4.79002154629 won    -13.23%

ham mean                     ham sdev
   0.00    0.00 +(was 0)        0.05    0.05   +0.00%
   0.03    0.04  +33.33%        0.48    0.70  +45.83%
   0.06    0.04  -33.33%        0.82    0.53  -35.37%
   0.08    0.08   +0.00%        1.53    1.53   +0.00%
   0.00    0.00 +(was 0)        0.01    0.01   +0.00%
   0.08    0.09  +12.50%        1.89    1.90   +0.53%
   0.00    0.00 +(was 0)        0.08    0.08   +0.00%
   0.00    0.00 +(was 0)        0.08    0.07  -12.50%
   0.01    0.01   +0.00%        0.17    0.17   +0.00%
   0.01    0.01   +0.00%        0.21    0.18  -14.29%

ham mean and sdev for all runs
   0.03    0.03   +0.00%        0.83    0.82   -1.20%

spam mean                    spam sdev
  90.19   90.67   +0.53%       24.58   23.81   -3.13%
  91.43   91.74   +0.34%       22.99   22.24   -3.26%
  91.51   91.85   +0.37%       23.66   22.79   -3.68%
  93.62   93.94   +0.34%       18.73   17.90   -4.43%
  90.62   91.11   +0.54%       23.83   22.99   -3.52%
  91.07   91.66   +0.65%       22.71   21.31   -6.16%
  90.85   91.31   +0.51%       23.28   22.58   -3.01%
  89.74   90.19   +0.50%       25.29   24.57   -2.85%
  89.49   90.15   +0.74%       25.99   24.73   -4.85%
  88.57   89.45   +0.99%       26.84   25.00   -6.86%

spam mean and sdev for all runs
  90.72   91.21   +0.54%       23.91   22.90   -4.22%

ham/spam mean difference: 90.69 91.18 +0.49


From mhammond at skippinet.com.au  Thu Dec 21 14:18:19 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 22 Dec 2006 00:18:19 +1100
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <2a052b990612201455q3a1b40fawdf8325407d06a25a@mail.gmail.com>
Message-ID: <004901c72502$7c154700$160a0a0a@enfoldsystems.local>

> I checked in an initial update to delay the calculation of the
> persistent stats until the GetStats() call because that was the easy
> update. In the case where you never actually view the stats in
> SpamBayes Manager, this should help. Let me know if you see any
> oddities in the stats calculation after this.

Thanks Kenny!  I don't personally check the stats often enough to notice,
but it certainly solved my problem :)

Cheers,

Mark.


From mhammond at skippinet.com.au  Fri Dec 22 03:56:19 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Fri, 22 Dec 2006 13:56:19 +1100
Subject: [spambayes-dev] image cracking results
In-Reply-To: <3b3c01c724d3$3fbe1630$110a0a0a@enfoldsystems.local>
Message-ID: <023c01c72574$c1ea99c0$160a0a0a@enfoldsystems.local>

I wrote:
> I've finally managed to get something working with the
> Outlook addin and
> Skip's cool new ocrad stuff.  the results look promising! :)

Here are a few more details on what I am doing.

To make things work with the image cracking code, I took the route of having
the Outlook addin generate a valid multipart message when there are images.
If there are no images, we return the same as we did in the past (ie, a
singlepart message with text and HTML in the normal "body"), so where
possible, the tokens generated for a message will be the same.  When there
are images, the tokens will now be different - due to the extra image
cracking tokens (obviously), but also due to the different mime related
tokens that will now be seen by the standard tokenizer.

This is a fairly subtle change, but could be signficant to the classifier.
For the purposes of comparison, I exported all ham and spam using the "old"
scheme (ie, before images were handled), and with the new scheme but with
image options disabled (but importantly, the new scheme *does* include the
image data).  The idea is to test only the impact of the new mime structure
without looking at image content.

I *think* these results are OK, but they are a little strange.  Below is the
result of cmp.py comparing the "old" scheme with the "new" scheme - note we
won 6 times, lost 4 times, and never tied, with the best win by 29%, but the
worst loss by 25%.  Another value of "+900.00%" in "ham sdev" also appears
extreme, but as I mentioned, I'm not very good at reading these.  One thing
I noticed is that the fact a message has a .gif attached is now a signficant
spam clue - I expect those new tokens account for the significant swings in
the results.

Does anyone have comments about this?

Cheers, and Happy Holidays!

Mark

<snip false positive percentages - all zero>

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied
    0.000  0.000  tied

won   0 times
tied 10 times
lost  0 times

total unique fp went from 0 to 0 tied
mean fp % went from 0.0 to 0.0 tied

false negative percentages
    6.897  4.906  won    -28.87%
    6.777  6.367  won     -6.05%
    5.206  6.526  lost   +25.36%
    5.323  6.655  lost   +25.02%
    6.397  6.430  lost    +0.52%
    6.391  4.727  won    -26.04%
    5.587  5.204  won     -6.86%
    5.415  5.769  lost    +6.54%
    6.186  5.495  won    -11.17%
    6.470  6.239  won     -3.57%

won   6 times
tied  0 times
lost  4 times

total unique fn went from 331 to 319 won     -3.63%
mean fn % went from 6.06478720218 to 5.8317685488 won     -3.84%

ham mean                     ham sdev
   0.04    0.01  -75.00%        0.53    0.11  -79.25%
   0.01    0.01   +0.00%        0.12    0.13   +8.33%
   0.00    0.00 +(was 0)        0.01    0.10 +900.00%
   0.00    0.00 +(was 0)        0.02    0.08 +300.00%
   0.00    0.03 +(was 0)        0.00    0.66 +(was 0)
   0.00    0.00 +(was 0)        0.06    0.01  -83.33%
   0.05    0.02  -60.00%        1.05    0.28  -73.33%
   0.10    0.03  -70.00%        1.67    0.47  -71.86%
   0.01    0.05 +400.00%        0.14    0.87 +521.43%
   0.02    0.08 +300.00%        0.29    1.36 +368.97%

ham mean and sdev for all runs
   0.02    0.02   +0.00%        0.66    0.59  -10.61%

spam mean                    spam sdev
  89.52   91.50   +2.21%       25.85   23.32   -9.79%
  88.98   89.37   +0.44%       25.99   25.72   -1.04%
  91.25   89.82   -1.57%       23.36   25.43   +8.86%
  90.74   89.59   -1.27%       23.72   25.61   +7.97%
  89.76   89.78   +0.02%       26.01   25.75   -1.00%
  89.98   90.99   +1.12%       25.19   23.12   -8.22%
  90.89   89.93   -1.06%       23.97   24.20   +0.96%
  91.34   90.33   -1.11%       23.41   24.35   +4.02%
  89.88   90.23   +0.39%       25.39   24.58   -3.19%
  88.73   90.43   +1.92%       26.01   25.24   -2.96%

spam mean and sdev for all runs
  90.11   90.19   +0.09%       24.94   24.77   -0.68%

ham/spam mean difference: 90.09 90.17 +0.08


From kenny.pitt at gmail.com  Thu Dec 28 16:45:30 2006
From: kenny.pitt at gmail.com (Kenny Pitt)
Date: Thu, 28 Dec 2006 10:45:30 -0500
Subject: [spambayes-dev] Stats are very slow
In-Reply-To: <2a052b990612201455q3a1b40fawdf8325407d06a25a@mail.gmail.com>
References: <390a01c72405$2a820930$110a0a0a@enfoldsystems.local>
	<17801.1951.114274.679266@montanaro.dyndns.org>
	<2a052b990612201232n56573583x358b6b0cc53176c7@mail.gmail.com>
	<17801.43680.688154.369277@montanaro.dyndns.org>
	<2a052b990612201455q3a1b40fawdf8325407d06a25a@mail.gmail.com>
Message-ID: <2a052b990612280745s5e561e7cgb6a0fa3cddb74feb@mail.gmail.com>

On 12/20/06, Kenny Pitt <kenny.pitt at gmail.com> wrote:
> On 12/20/06, skip at pobox.com <skip at pobox.com> wrote:
> >
> >    Kenny> Of course, this only solves part of the problem because we would
> >    Kenny> still take a huge hit when displaying the statistics. It might be
> >    Kenny> worth considering an optimization to store the actual statistics
> >    Kenny> values instead of calculating them at the start of every Outlook
> >    Kenny> session.
> >
> > That occurred to me after my reply.  I suspect it's probably the way to go.
>
> I checked in an initial update to delay the calculation of the
> persistent stats until the GetStats() call because that was the easy
> update. In the case where you never actually view the stats in
> SpamBayes Manager, this should help. Let me know if you see any
> oddities in the stats calculation after this.
>
> The complete fix is a little more involved, so I'll have to defer that
> until I have more time to test it thoroughly.

I just checked in an update to add permanent caching of the
statistics. With an existing message info db that doesn't yet contain
the cached statistics, you'll have the old startup delay one time to
recalculate the missing statistics. After that, the statistics should
be reloaded directly from the cache record in the message info db and
startup will be much faster.

There is a minimal performance hit on each message classification
because I have to update the statistics in the db every time to keep
them in sync. I think this will be pretty much unnoticeable in the
grand scheme of things, but let me know if you find otherwise.

-- 
Kenny Pitt

From skip at pobox.com  Fri Dec 29 12:31:14 2006
From: skip at pobox.com (skip at pobox.com)
Date: Fri, 29 Dec 2006 05:31:14 -0600
Subject: [spambayes-dev] [Spambayes] Images of commercial text with
 decoy text are mushing my index
In-Reply-To: <HDEBLJKBDDEOHCKJJPIACEPFINAC.jkmurdock@cox.net>
References: <HDEBLJKBDDEOHCKJJPIACEPFINAC.jkmurdock@cox.net>
Message-ID: <17812.64642.873558.253609@montanaro.dyndns.org>


    Jamie> I am being assaulted by a new sort of spam where the commercial
    Jamie> message is contained in an image, and the text of the message is
    Jamie> just there as a decoy ...

The current alpha of SpamBayes has some OCR capability to deal with this.
We've yet to make an official release.  Mark Hammond was working with it a
week or so ago and seemed to think it was working for him on Windows.  I
developed the new code in a Unix (actually MacOSX) environment, so once Mark
is satisfied perhaps we can turn the knobs on an official release.

-- 
Skip Montanaro - skip at pobox.com - http://www.mojam.com/
"The hippies and the hipsters did some great stuff in the sixties,
but the geeks pulled their weight too." -- Billy Bragg

From bargeo at gmail.com  Sun Dec 31 06:01:51 2006
From: bargeo at gmail.com (Geo.)
Date: Sun, 31 Dec 2006 14:01:51 +0900
Subject: [spambayes-dev] FAQ 6.5
Message-ID: <001601c72c98$ca8feeb0$0201010a@goodgrief>

"FAQ 6.5   Why can't I bounce spam back to the sender?"

The reasons you give are of course perfectly valid. However, to my mind the most obvious reason for not bouncing spam and one that is seldom acknowledged is the much more important consideration of traffic congestion. The internet is already groaning under the strain of billions of spam messages. If every one of them were to be bounced spam-mail traffic would in effect be doubled and the system would probably be unable to cope. At the very least, it would necessitate a great increase in world-wide server capacity the cost of which would almost certainly be passed on by ISPs to their clients. It is unfortunate that some of your competitors proudly proclaim as a 'feature', the ability of their programme to bounce spam messages. I can only conclude that they are either very shortsighted or irresponsible morons.

Please consider the foregoing purely as comment. It is not intended to be critical of SpamBayes in any way.

Perhaps you could expand the answer to FAQ 6.5 to reflect the above, e.g. something along the lines "In addition, bouncing spam messages simply increases email traffic leading to even further congestion on the internet." might suffice.

Regards,
Geo.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20061231/4f242e25/attachment.htm