[Spambayes] Bugs? Yup. Fixes? Definitely. Downloadable installer and patches? Yes.

Thomas Hruska thruska at cubiclesoft.com
Tue Dec 25 20:14:28 CET 2007


[For those interested in an installable build of 1.0.4 with my fixes 
and/or want to see the source code modifications I've made, scroll down]


While applying the modifications I suggested a couple weeks ago, I'm 
fairly (99%) certain I've run into at least one critical bug in the 
Spambayes 1.0.4 POP3 proxy.  If you use the proxy, read on.

1)  Messages are not saved in their original format.  There is actually 
a hack that _should NOT exist_ in Corpus.py that is referenced from 
ProxyUI.py:

----- ProxyUI.py -----
                                 # fromCache is a fix for sf #851785.
                                 # See the comments in Corpus.py
                                 targetCorpus.takeMessage(id, sourceCorpus,
                                                          fromCache=True)
----- ProxyUI.py -----

----- Corpus.py -----
         # If the notate_to or notate_subject options are set, then the
         # message in the cache has this information, and it will get used
         # in training, which is not ideal.  So if that option is set, strip
         # that data before training.  The only time I can see this failing
         # is if the option is changed at some point, so older messages
         # don't have the notation, but some other program did do the same
         # notation, which would be lost. This shouldn't be a big deal,
         # though.
         if fromCache:
           ...Modify message to attempt to make it look like the original...
----- Corpus.py -----

This hack is WRONG.  Training on messages that include the modified 
headers is also WRONG (as far as I can tell, the code doesn't undo that 
either - but having to undo anything in the first place is WRONG). 
Training should only be done using the original, unmodified message.  It 
is difficult to impossible to revert a modified message back to its 
original state.
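
To illustrate why undoing a notation is unreliable, here is a minimal 
sketch of the failure case the Corpus.py comment itself describes.  The 
helpers notate_subject and strip_notation and the "spam," tag are 
hypothetical, not the actual Spambayes code, and the sketch is written 
for current Python 3 rather than the 2.x-era code base.

```python
# Hypothetical helpers for illustration only; not Spambayes functions.
TAG = "spam,"

def notate_subject(subject, tag=TAG):
    """Prefix the subject with a classification tag, as a proxy might."""
    return tag + " " + subject

def strip_notation(subject, tag=TAG):
    """Best-effort undo: drop one leading tag if it is present."""
    prefix = tag + " "
    if subject.startswith(prefix):
        return subject[len(prefix):]
    return subject

# Case 1: an ordinary message notated by the proxy.  The undo works.
plain = "Cheap watches"
assert strip_notation(notate_subject(plain)) == plain

# Case 2: the message was cached while notation was switched off, but the
# sender's own subject happens to start with the tag.  The undo cannot
# tell the difference and removes text that was part of the original.
original = "spam, eggs and ham recipe"
cached = original                 # stored without any proxy notation
recovered = strip_notation(cached)
print(repr(original), "->", repr(recovered))
```

Keeping an untouched copy avoids having to guess which parts of a cached 
message were added by the proxy.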

2)  sb_server.py classifies a message and modifies the headers but does 
not save a copy of the original message.  Instead, it saves the modified 
message.  It should save a copy of the original message for training 
purposes.

3)  message.py does not offer any method of retrieving/storing a copy of 
the original message.  This seems like the most logical place but would 
probably be hard to implement here.


My improvement/fix:  Due to the nature of how the POP3 proxy operates, 
it would be seriously detrimental to either performance or usability if 
the modified message were eliminated altogether.  Also, changing how 
messages are stored would probably create a disaster of modifications to 
the entire code base.  Therefore, the simplest (and only _somewhat_ 
hacky) solution is to create an 'unknown-orig' ExpiryFileCorpus cache 
that tracks the original message for training purposes.
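
The idea can be sketched roughly as follows.  This is not the code in 
the patch: OriginalCache and proxy_deliver are made-up names, a plain 
dict stands in for the ExpiryFileCorpus, and the header name is used 
only for illustration.  It assumes Python 3 and the standard email 
package.

```python
from email import message_from_bytes

class OriginalCache:
    """Hypothetical stand-in for the 'unknown-orig' cache: id -> raw bytes."""

    def __init__(self):
        self._raw = {}

    def store(self, msg_id, raw):
        # Keep an independent copy of the exact bytes received.
        self._raw[msg_id] = bytes(raw)

    def fetch(self, msg_id):
        return self._raw[msg_id]

def proxy_deliver(msg_id, raw, verdict, originals):
    """Save the untouched message, then return a notated copy for the client."""
    originals.store(msg_id, raw)          # original kept for later training
    msg = message_from_bytes(raw)         # work on a parsed copy
    msg["X-Spambayes-Classification"] = verdict
    return msg.as_bytes()                 # modified copy goes to the mail client

originals = OriginalCache()
raw = b"From: a@example.com\nSubject: hi\n\nbody\n"
delivered = proxy_deliver("msg-1", raw, "spam", originals)
```

Training would then read from originals.fetch("msg-1") rather than from 
the delivered copy, so nothing has to be stripped back out.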

For those who are interested in my solution and use the POP3 proxy, 
here's the Windows installer version:
http://www.cubiclesoft.com/Unrelated/spambayes-1.0.4.exe

That applies all the recommended changes I've made to date to the 1.0.4 
branch.  You can train on any message and Spambayes defends itself from 
letting its database get too large by rejecting messages that are 
already classified correctly.  It does this by running the original 
message through the classifier before training on the message - thus 
training only on "mistakes and unsures" as per the recommendations of 
the developers.  Since each message trained alters the database, I've 
factored this fact in as well*.  You could train on 10 messages or 
60,000 messages and Spambayes will still correctly pick and choose what 
to train on.  In layman's terms:  Spambayes is smarter than before.
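
A rough sketch of that selection logic, under stated assumptions: 
ToyClassifier is a deliberately crude stand-in and not the Spambayes 
classifier or its API, train_selectively is a made-up name, and the 0.2 
and 0.9 cutoffs are assumed values.  The point is only that each message 
is scored against the database as it stands at that moment, so earlier 
training in the same batch affects later decisions.

```python
from collections import Counter

class ToyClassifier:
    """Hypothetical word-count classifier used only to drive the example."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def score(self, text):
        """Return a crude spam score in [0, 1]; 0.5 means no evidence."""
        s = h = 0
        for word in set(text.lower().split()):
            s += self.spam[word]
            h += self.ham[word]
        return 0.5 if s + h == 0 else s / (s + h)

    def learn(self, text, is_spam):
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

HAM_CUTOFF, SPAM_CUTOFF = 0.2, 0.9   # assumed thresholds

def train_selectively(clf, batch):
    """Train only on messages that are currently wrong or unsure.

    batch is a list of (original_text, is_spam) pairs.  Returns the number
    of messages actually trained on.
    """
    trained = 0
    for text, is_spam in batch:
        p = clf.score(text)          # score against the current database
        already_right = p >= SPAM_CUTOFF if is_spam else p <= HAM_CUTOFF
        if not already_right:
            clf.learn(text, is_spam)  # changes the database for later messages
            trained += 1
    return trained

batch = [
    ("buy cheap pills", True),
    ("buy cheap pills", True),    # duplicate: already classified correctly
    ("lunch at noon", False),
    ("lunch at noon", False),     # duplicate: already classified correctly
]
clf = ToyClassifier()
first_pass = train_selectively(clf, batch)
second_pass = train_selectively(clf, batch)
```

Feeding the same batch in a second time trains on nothing, because every 
message is already classified correctly by then.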

* The changes I've made allow you to train on more than one message at a 
time.  If you've been following along with my recent rants, you know 
that the existing 1.0.4 has some major statistical issues with training 
beyond one message at a time.  If you train on a message that would 
already be classified that way, the database flattens out over time and 
Spambayes will eventually not be able to figure out what is ham and what 
is spam.  Additionally, really large databases (e.g. 30,000+ messages) 
have performance issues.

Consider wiping your Spambayes database and starting over after 
installing this - especially if you have trained on more than 2000 
messages.  The entire database so far was based on training on modified 
messages (not the original messages!).

There is a new configuration option in the Advanced configuration page 
of the POP3 proxy for controlling the name of the directory used for the 
original message cache.


For those who want to see my changes (12MB file):
http://www.cubiclesoft.com/Unrelated/spambayes-1.0.4.zip

That contains all the source code modifications I made, binaries, etc. 
The two most critical modifications are in sb_server.py and ProxyUI.py. 
I heavily commented my changes in ProxyUI.py.  I also updated the 
InnoSetup script to build properly under InnoSetup 5.2.2.


This took:  4 hours to learn the basics of Python and how Spambayes 
works.  30 minutes to apply my fixes.  12 hours to get the darn thing to 
build properly under Python 2.5.1.  Sort of.  I'm fairly certain I hosed 
the Outlook add-in part of the build really good due to not having 
Outlook 2000.  A few hours of testing (spread over several days). 
Total time spent on getting this all working was roughly 20 hours.  Time 
I'd rather have spent doing something else.  However, I'm quite happy 
with the result.

And all this is from someone who doesn't know Python**.  Probably not 
the most inspiring/reassuring statement, but I'll leave it up to the 
developers to decide if my modifications are sound enough to merge into 
the 1.0.4 branch (and hopefully good enough to consider releasing a 
1.0.5).  They might not go for the 2.5-specific changes due to the 
problems packaging the email package with py2exe (case-sensitivity 
issues in py2exe), but it would be nice if they did.

** I can edit code in almost any programming language without actually 
knowing the language.


Merry Christmas!  I can't think of a better Christmas present than to 
know there will be more spam blocked next year thanks to my changes to 
Spambayes.

-- 
Thomas Hruska
CubicleSoft President
Ph: 517-803-4197

*NEW* MyTaskFocus 1.1
Get on task.  Stay on task.

http://www.CubicleSoft.com/MyTaskFocus/


