From jklein at magnetstreet.com  Fri Dec  1 21:39:42 2006
From: jklein at magnetstreet.com (Jim Klein)
Date: Fri, 1 Dec 2006 14:39:42 -0600
Subject: [spambayes-dev] Question for Spambayes
Message-ID: <001401c71588$d493cca0$8800010a@magnetstreet.net>

How do I do a silent automated install of SpamBayes?

From seandarcy2 at gmail.com  Thu Dec 14 23:42:02 2006
From: seandarcy2 at gmail.com (sean darcy)
Date: Thu, 14 Dec 2006 17:42:02 -0500
Subject: [spambayes-dev] Dec 14 cvs seg faults on python-2.5: _weakref.so
Message-ID:

Made the mistake of updating to python-2.5 :( That seg faulted sb, so I
updated to cvs, rebuilt, and installed. Same result:

python -v /usr/bin/sb_server.py
# installing zipimport hook
import zipimport # builtin
# installed zipimport hook
# /usr/lib64/python2.5/site.pyc matches /usr/lib64/python2.5/site.py
import site # precompiled from /usr/lib64/python2.5/site.pyc
# /usr/lib64/python2.5/os.pyc matches /usr/lib64/python2.5/os.py
import os # precompiled from /usr/lib64/python2.5/os.pyc
import posix # builtin
................
import spambayes.Version # precompiled from /usr/lib/python2.5/site-packages/spambayes/Version.pyc
# /usr/lib/python2.5/site-packages/spambayes/ProxyUI.pyc matches /usr/lib/python2.5/site-packages/spambayes/ProxyUI.py
import spambayes.ProxyUI # precompiled from /usr/lib/python2.5/site-packages/spambayes/ProxyUI.pyc
SpamBayes POP3 Proxy Version 1.1a3 (August 2006)
import bsddb # directory /usr/lib64/python2.5/bsddb
# /usr/lib64/python2.5/bsddb/__init__.pyc matches /usr/lib64/python2.5/bsddb/__init__.py
import bsddb # precompiled from /usr/lib64/python2.5/bsddb/__init__.pyc
dlopen("/usr/lib64/python2.5/lib-dynload/_bsddb.so", 2);
import _bsddb # dynamically loaded from /usr/lib64/python2.5/lib-dynload/_bsddb.so
# /usr/lib64/python2.5/bsddb/dbutils.pyc matches /usr/lib64/python2.5/bsddb/dbutils.py
import bsddb.dbutils # precompiled from /usr/lib64/python2.5/bsddb/dbutils.pyc
# /usr/lib64/python2.5/bsddb/db.pyc matches /usr/lib64/python2.5/bsddb/db.py
import bsddb.db # precompiled from /usr/lib64/python2.5/bsddb/db.pyc
# /usr/lib64/python2.5/weakref.pyc matches /usr/lib64/python2.5/weakref.py
import weakref # precompiled from /usr/lib64/python2.5/weakref.pyc
dlopen("/usr/lib64/python2.5/lib-dynload/_weakref.so", 2);
import _weakref # dynamically loaded from /usr/lib64/python2.5/lib-dynload/_weakref.so
Segmentation fault

sean

From spambayes-dev at spandex.nildram.co.uk  Fri Dec 15 14:46:48 2006
From: spambayes-dev at spandex.nildram.co.uk (Spandex)
Date: Fri, 15 Dec 2006 13:46:48 +0000
Subject: [spambayes-dev] Problem with struct.unpack in oe_mailbox.py
Message-ID: <307650593.20061215134648@nildram.co.uk>

Hi,

I previously sent this mail to the spambayes users list without a
response. Apologies for the repost; I'm hoping it's more appropriate
here.

I'm running SpamBayes (1.0.4-3) on Debian unstable with Python 2.4.4c0
and a custom-compiled 2.6.17 kernel, on an AMD64 chip. sb_server starts
up fine and proxies POP3 and SMTP connections, and I can train from the
command line. The problem comes when I try to train it from the web
interface (using either mbox or dbx format). It bombs with the
following error:

----------------
Traceback (most recent call last):
  File "/usr/lib/python2.4/site-packages/spambayes/Dibbler.py", line 470, in found_terminator
    getattr(plugin, name)(**params)
  File "/usr/lib/python2.4/site-packages/spambayes/UserInterface.py", line 494, in onTrain
    content = self._convertToMbox(content)
  File "/usr/lib/python2.4/site-packages/spambayes/UserInterface.py", line 536, in _convertToMbox
    content = oe_mailbox.convertToMbox(content)
  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 444, in convertToMbox
    if header.isValid() and header.isMessages():
  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 117, in isValid
    return self.getEntry(0) == dbxFileHeader.MAGIC_NUMBER
  File "/usr/lib/python2.4/site-packages/spambayes/oe_mailbox.py", line 126, in getEntry
    self.dbxBuffer[dbxEntry * 4:(dbxEntry * 4) + 4])[0]
error: unpack str size does not match format
----------------

I'm wondering whether this is something to do with my machine
architecture and the sizes of datatypes, but I'm stabbing in the dark.
I can easily disable dbx support by commenting out

    content = oe_mailbox.convertToMbox(content)

around line 536 of UserInterface.py, and this does let me train on mbox
format via the web interface, but I'd rather keep dbx support if
possible. I don't speak Python, so commenting out the offending code
was about as far as I could go.

Any ideas?

Thanks,

Matt
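Matt's hunch about machine architecture and datatype sizes is very
plausible. The failing call unpacks a 4-byte slice of the dbx header,
and if it uses a native-size format character such as "L" (unsigned
long), the expected size is 4 bytes on 32-bit platforms but 8 on AMD64,
which raises exactly this struct.error. A minimal sketch of the
mismatch and the portable alternative; the byte string below is a
made-up header field, not real dbx data:

import struct

data = "\xcf\xad\x12\xfe"       # hypothetical 4-byte header field

print struct.calcsize("L")      # native size: 4 on x86, 8 on AMD64
# struct.unpack("L", data)      # on AMD64 this raises:
                                # "unpack str size does not match format"
print struct.unpack("<L", data)[0]   # "<L" is a standard-size (4-byte)
                                     # little-endian unsigned long, so
                                     # this works on both architectures

If that is what's going on, switching the format strings in
oe_mailbox.py to the explicit little-endian "<L" (dbx files are written
by Outlook Express on little-endian x86 machines) should restore dbx
support on 64-bit builds.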
From skip at pobox.com  Wed Dec 20 06:05:39 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 19 Dec 2006 23:05:39 -0600
Subject: [spambayes-dev] Applying SpamBayes to website spamming
Message-ID: <17800.50339.820370.517536@montanaro.dyndns.org>

I'm sure many of you are aware that spamming of the submission forms on
blogs and other websites is a large and growing problem. The Mojam and
Musi-Cal concert websites suffered from the same malady. I originally
considered implementing some sort of CAPTCHA scheme:

    http://en.wikipedia.org/wiki/Captcha

but that has limitations and would have required changes to all
submission forms on the websites. I decided instead to implement a
SpamBayes-based solution in our XML-RPC server. It has a few distinct
advantages:

  * It has none of the CAPTCHA gotchas.
  * It is implemented at a single point in the system.
  * No changes to the web interface were required, so users don't have
    to learn anything new.

I'll give you a quick sketch of what I did to solve this problem. If
you'd like more details, drop me a note.

When someone submits concert dates to our sites, the submission is
represented as a simple dictionary. A valid submission will have
information about who's performing, a date in the future, valid
location information, etc. In contrast, when someone spams the
submission forms, the dictionary often contains bogus information or is
missing some fields altogether. For example, if the spammer puts
something in the date fields, it's likely to be garbage that won't
parse properly, resulting in a default date of 1900-01-01. Similarly,
the city/state/country is likely to be invalid, so we won't be able to
find lat/long info.

The dictionary is preprocessed into a string of tokens that includes
the obvious text from the submission, but also contains synthetic
tokens, as sketched below.
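To make the idea concrete, here is a minimal sketch of such a
preprocessor; the field names, defaults, and checks are hypothetical
stand-ins, not the actual Mojam code:

def submission_to_text(sub):
    # Start with the obvious text from the submission dictionary.
    tokens = [str(v) for v in sub.values()]
    # Add synthetic tokens summarizing validity checks; these become
    # strong clues for the classifier.
    if sub.get("date", "1900-01-01") <= "1900-01-01":
        tokens.append("date:ancient")     # unparseable/defaulted date
    else:
        tokens.append("date:current")
    if sub.get("latlong"):                # did the lat/long lookup work?
        tokens.append("city:known")
    else:
        tokens.append("city:unknown")
    tokens.append("hasphone:%s" % bool(sub.get("phone")))
    tokens.append("infolen:%d" % len(sub.get("info", "")))
    return " ".join(tokens)

The resulting string is then tokenized and scored just like an ordinary
message body.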
Here's a spammer's entry represented as text:

    Bradyn Maximus Ty jordan at e-mailanywhere.com 1900-01-01 Jerald
    kwds:False kwds-private:False Malcom 1900-01-01 Jarod date:ancient
    perflen:1 infolen:1 hasphone:False hasprice:False city:unknown
    venue:present

Here's a valid entry represented as text:

    Anchorage skip at mojam.com 2006-10-07 kwds:True kwds-private:True
    .bl.1348 .ra LaVette,Bettye 2006-10-07 AK Discovery Theatre
    date:current perflen:1 infolen:0 hasphone:False hasprice:False
    city:known venue:present

The synthetic tokens that suggest problems are such huge red flags for
the classifier that after training on just a couple of these bad boys,
the rejection rate of spam submissions seems to be 100%. Of course,
this sort of spamming is probably still in its infancy, so I expect we
might eventually see the kind of arms race that has developed around
email spam. I'm not too worried about that, though, because for the
most part I think the spammers' primary target is the blogosphere with
its ubiquitous comment feature, not specialized websites like ours.

The tokenizer class is quite simple; I post it here in its entirety.
Note that major bits of it were just pasted from the default tokenizer.

from spambayes.tokenizer import log2, Tokenizer, numeric_entity_re, \
     numeric_entity_replacer, crack_urls, breaking_entity_re, html_re, \
     tokenize_word

class Tokenizer(Tokenizer):
    def tokenize(self, text):
        maxword = 20

        # Replace numeric character entities (like &#97; for the
        # letter 'a').
        text = numeric_entity_re.sub(numeric_entity_replacer, text)

        # Normalize case.
        text = text.lower()

        # Get rid of uuencoded sections, embedded URLs,