From rob at hooft.net Sat Sep 2 08:36:06 2006
From: rob at hooft.net (Rob Hooft)
Date: Sat, 02 Sep 2006 08:36:06 +0200
Subject: [spambayes-dev] Domain expiration
Message-ID: <44F92656.6020702@hooft.net>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

It is that time of year again: I will have to renew the spambayes.org
domain name before 2006-11-02. Has anything changed in some relevant
organization somewhere that the domain name should be transferred, or
shall I just pay another year myself?

Regards,

Rob
- --
Rob W.W. Hooft || rob at hooft.net || http://www.hooft.net/people/rob/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.5 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFE+SZWH7J/Cv8rb3QRApyaAJ9P42fzvAxJULn4Kq0b1iLHqAc9GgCfRpMJ
nmQkDBP8jS63CLUdlh3311E=
=8x2T
-----END PGP SIGNATURE-----

From skip at pobox.com Sat Sep 2 14:04:16 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sat, 2 Sep 2006 07:04:16 -0500
Subject: [spambayes-dev] Domain expiration
In-Reply-To: <44F92656.6020702@hooft.net>
References: <44F92656.6020702@hooft.net>
Message-ID: <17657.29504.60914.902375@montanaro.dyndns.org>

Rob> It is that time of year again: I will have to renew the
Rob> spambayes.org domain name before 2006-11-02. Has anything changed
Rob> in some relevant organization somewhere that the domain name should
Rob> be transferred, or shall I just pay another year myself?

Maybe the InBoxer folks could be persuaded to pay for the domain?

Skip

From dreas at spamexperts.com Sat Sep 2 18:52:09 2006
From: dreas at spamexperts.com (Dreas van Donselaar)
Date: Sat, 2 Sep 2006 18:52:09 +0200
Subject: [spambayes-dev] Domain expiration
In-Reply-To: <17657.29504.60914.902375@montanaro.dyndns.org>
Message-ID: <002901c6ceb0$23ec6a20$3c05310a@DreasLaptop>

I guess we from SpamExperts can pay for it if that's preferred. Just
contact me to see how to best handle that.

___
Dreas van Donselaar
Director & co-founder
SpamExperts B.V.
Postbus 309
6200 AH Maastricht
The Netherlands
M: +31 (0)626202808
F: +31 (0)842203930
E: aj.vandonselaar at spamexperts.com
W: http://www.spamexperts.com/
MSN: dreas at emailaccount.nl
AIM: dreas1983
Y!: dreasvandonselaar
Skype: dreasvandonselaar
ICQ: 108756

_______________________________________________
spambayes-dev mailing list
spambayes-dev at python.org
http://mail.python.org/mailman/listinfo/spambayes-dev

From ta-meyer at ihug.co.nz Sun Sep 3 09:10:20 2006
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Sun, 3 Sep 2006 19:10:20 +1200
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
Message-ID:

I plan to do this myself at some point, but thought Skip (or someone
else) might want to beat me to it:

Google/UNLV have (re)released an open-source* OCR engine, which they
claim is better than any other open-source OCR engine. So it would be
interesting to compare the classification with this to that with ocrad.

http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html

=Tony.Meyer

* The license is a bit vague, unfortunately. They state it can be
freely used/distributed for research/development, and that for
commercial use you have to contact the authors. However, they don't
cover the middle ground (non-commercial non-research), which SpamBayes
falls under.
From skip at pobox.com Sun Sep 3 14:38:24 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 3 Sep 2006 07:38:24 -0500
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
In-Reply-To:
References:
Message-ID: <17658.52416.393047.873371@montanaro.dyndns.org>

Tony> I plan to do this myself at some point, but thought Skip (or
Tony> someone else) might want to beat me to it:

Tony> Google/UNLV have (re)released an open-source* OCR engine, which
Tony> they claim is better than any other open-source OCR engine. So it
Tony> would be interesting to compare the classification with this to
Tony> that with ocrad.

Tony> http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html

Thanks, I'll try to take a look when I get a chance. Alas, the SF link
is currently giving an Internal Server Error message. (Jeez, what are
the chances???)

Tony> * The license is a bit vague, unfortunately. They state it can be
Tony> freely used/distributed for research/development, and that for
Tony> commercial use you have to contact the authors. However, they
Tony> don't cover the middle ground (non-commercial non-research), which
Tony> SpamBayes falls under.

I suppose we ought to contact the authors, just to be on the safe side.

Skip

From skip at pobox.com Sun Sep 3 14:50:45 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 3 Sep 2006 07:50:45 -0500
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
In-Reply-To:
References:
Message-ID: <17658.53157.829774.576750@montanaro.dyndns.org>

skip> Alas, the SF link is currently giving an Internal Server Error
skip> message. (Jeez, what are the chances???)

Back now. Downloaded successfully.

Tony> * The license is a bit vague, unfortunately.

skip> I suppose we ought to contact the authors, just to be on the safe
skip> side.

Perhaps it's not necessary. The README file says:

This package contains the Tesseract Open Source OCR Engine.
Originally developed at Hewlett Packard Laboratories Bristol and at
Hewlett Packard Co, Greeley Colorado, the majority of the code in this
distribution is now licensed under the Apache License:

** Licensed under the Apache License, Version 2.0 (the "License");
** you may not use this file except in compliance with the License.
** You may obtain a copy of the License at
** http://www.apache.org/licenses/LICENSE-2.0
** Unless required by applicable law or agreed to in writing, software
** distributed under the License is distributed on an "AS IS" BASIS,
** WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
** See the License for the specific language governing permissions and
** limitations under the License.

The Apache license is fine for our use, right?

Skip

From skip at pobox.com Sun Sep 3 16:17:55 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 3 Sep 2006 09:17:55 -0500
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
In-Reply-To: <17658.53157.829774.576750@montanaro.dyndns.org>
References: <17658.53157.829774.576750@montanaro.dyndns.org>
Message-ID: <17658.58387.722774.767768@montanaro.dyndns.org>

skip> Alas, the SF link is currently giving an Internal Server Error
skip> message. (Jeez, what are the chances???)

skip> Back now. Downloaded successfully.

And built successfully, with a couple tweaks. After a bit of juggling, I
got the executable into the proper spot, ran it, then got a segfault.
Unfortunately, the README file includes this:

The C++ code makes heavy use of a list system using macros. This
predates stl, was portable before stl, and is more efficient than stl
lists, but has the big negative that if you do get a segmentation
violation, it is hard to debug.

It's certainly not ready for prime time.
Skip

From dreas at spamexperts.com Sun Sep 3 21:17:34 2006
From: dreas at spamexperts.com (Dreas van Donselaar)
Date: Sun, 3 Sep 2006 21:17:34 +0200
Subject: [spambayes-dev] Domain expiration
In-Reply-To: <44F92656.6020702@hooft.net>
Message-ID: <006b01c6cf8d$a1aebe80$6801a8c0@DreasLaptop>

Hi again,

After exchanging some emails with mister Hooft, the proposal is that we
move the domainname from the current registrar Gandi to eNom. There
SpamExperts will donate 5 years of registration.

We like the domain name whois information being adjusted so the
domainname is clearly owned by the SpamBayes community (instead of the
current situation where it is registered to mister Hooft). When any
official names or contact details are required, we use the founder of
the project Tim Peters.

SpamExperts will of course not have any ownership whatsoever. After we
added the years to the domain it can be pushed to any other eNom account
created by any of the current project maintainers (but preferably Tim
Peters as well).

Are there any objections to this procedure? Please also let the list (or
directly mister Hooft) know if you agree.

Kind regards,

___
Dreas van Donselaar
Director & co-founder
SpamExperts B.V.
Postbus 309
6200 AH Maastricht
The Netherlands
F: +31 (0)842203930
E: aj.vandonselaar at spamexperts.com
W: http://www.spamexperts.com/

From mb at symbolic.it Mon Sep 4 10:26:42 2006
From: mb at symbolic.it (Michele Belloli)
Date: Mon, 04 Sep 2006 10:26:42 +0200
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <17631.61424.222629.225936@montanaro.dyndns.org>
References: <17631.61424.222629.225936@montanaro.dyndns.org>
Message-ID: <44FBE342.4070604@symbolic.it>

skip at pobox.com wrote:
> I updated the OCR capabilities a bit more today. I added more intelligent
> assembly of split images into a single image after noticing that the
> spammers don't simply chop up multi-part GIF images horizontally. I also
> added a couple extra options (ocrad_scale and ocrad_charset) which control
> the image scaling factor (default is 2) and character set (default is
> "ascii") Ocrad uses.
> Scaling the image by a factor of 2 was a pretty
> obvious win:
>
>     false positive percentages
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>         0.000  0.000  tied
>
>     won   0 times
>     tied  5 times
>     lost  0 times
>
>     total unique fp went from 0 to 0 tied
>     mean fp % went from 0.0 to 0.0 tied
>
>     false negative percentages
>         4.213  4.213  tied
>         1.404  0.843  won  -39.96%
>         3.371  2.809  won  -16.67%
>         2.528  2.247  won  -11.12%
>         4.213  3.652  won  -13.32%
>
>     won   4 times
>     tied  1 times
>     lost  0 times
>
>     total unique fn went from 56 to 49 won -12.50%
>     mean fn % went from 3.14606741573 to 2.75280898876 won -12.50%
>
> Scaling by a factor of three was even better in the false negative
> department but regressed a bit in the false positive category so I checked
> Options.py in with a default scaling factor of 2. A couple things could
> stand to be further tested:
>
>   * I have no idea how good Ocrad's scaling algorithm is. It's possible
>     that PIL or NetPBM's scaling code is better. If so, it would make
>     sense to scale the images before feeding to Ocrad.
>
>   * The images I've seen so far were all plain English, so I blindly made
>     ascii the default charset. The other choices were iso-8859-9 and
>     iso-8859-15. I simply assumed ascii would be the most appropriate
>     default, but didn't test it.
>
> Finally, I put together a really simpleminded Ocrad-for-Windows release
> based upon the ocrad.exe binary that Tony built. Check the Files section of
> the SpamBayes project site:
>
>     http://sourceforge.net/project/showfiles.php?group_id=61702
>
> and grab ocrad-cygwin.
>
> There are a few caveats:
>
>   1. I don't do Windows. (No, really, I don't, strange as that may seem.)
>      This is no fancy-schmancy point-and-shoot Windows installer. It's
>      just a simple zip file with the Ocrad 0.15 distribution, Tony's .exe
>      file and the patch he applied to the source.
>
>   2. I don't do Windows. The code I've written so far has been done
>      entirely on my Mac.
>      I've made no obvious concessions to portability.
>      That said, I hope portability issues won't be daunting for any early
>      adopters.
>
>   3. I don't do Windows. If you have problems it won't do you any good to
>      mail me directly. Post about problems on the SpamBayes bug tracker:
>
>      http://sourceforge.net/tracker/?group_id=61702&atid=498103
>
>   4. If you do Windows you will need PIL to take advantage of the recent
>      changes:
>
>      http://www.pythonware.com/products/pil/
>
>      (unless you want to put hair on your chest and build NetPBM on
>      Windows). Fredrik Lundh provides prebuilt Windows versions of PIL.
>      Grab the one appropriate for the version of Python you have
>      installed.
>
>   5. If you do Windows (or any other platform for that matter), feedback
>      to the lists about successes and failures would be helpful.
>
> Cheers,
>
> Skip

Hi,
I'm very interested in this OCR and in the way SpamBayes analyzes image
spam.
Now there is a new kind of image spam using animated images and I've
received a lot of "animated spam" lately so it's possible they could be
very common in a brief period. Here you can find a brief description
about this:

http://www.viruslist.com/en/weblog?weblogid=196822613

I would like to ask you how your OCR manages this kind of images.
Thank you a lot for your time.
Regards

--
Michele Belloli
Research & Development Dept.
Symbolic - Network Security Distributor
http://www.symbolic.it

eXtensiveControl
The new Content Filtering solution for SMEs
http://www.extensivecontrol.it/

From skip at pobox.com Mon Sep 4 14:24:13 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 4 Sep 2006 07:24:13 -0500
Subject: [spambayes-dev] Latest CVS update, Ocrad for Windows
In-Reply-To: <44FBE342.4070604@symbolic.it>
References: <17631.61424.222629.225936@montanaro.dyndns.org> <44FBE342.4070604@symbolic.it>
Message-ID: <17660.6893.180353.829406@montanaro.dyndns.org>

Michele> Now there is a new kind of image spam using animated images and
Michele> I've received a lot of "animated spam" lately so it's possible
Michele> they could be very common in a brief period. Here you can find
Michele> a brief description about this:

Michele> http://www.viruslist.com/en/weblog?weblogid=196822613

Michele> I would like to ask you how your OCR manages this kind of
Michele> images.

Right now it doesn't. I'm aware of the shortcoming. I just haven't had
time to work on the problem.

Skip

From ta-meyer at ihug.co.nz Tue Sep 5 05:38:36 2006
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Tue, 5 Sep 2006 15:38:36 +1200
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
In-Reply-To: <17658.53157.829774.576750@montanaro.dyndns.org>
References: <17658.53157.829774.576750@montanaro.dyndns.org>
Message-ID: <15321AD2-72E0-4DAC-8F54-C0479292DF2C@ihug.co.nz>

> Tony> * The license is a bit vague, unfortunately.
>
> skip> I suppose we ought to contact the authors, just to be on the safe
> skip> side.
>
> Perhaps it's not necessary. The README file says:
>
> This package contains the Tesseract Open Source OCR Engine.
> Originally developed at Hewlett Packard Laboratories Bristol and
> at Hewlett Packard Co, Greeley Colorado, the majority of the code
> in this distribution is now licensed under the Apache License:
[...]
> The Apache license is fine for our use, right?

Sigh.
I don't know how I missed that (right at the top of the README), and
yet managed to read the bit later on.

> And built successfully, with a couple tweaks. After a bit of juggling, I
> got the executable into the proper spot, ran it, then got a segfault.
> Unfortunately, the README file includes this:
>
> The C++ code makes heavy use of a list system using macros. This
> predates stl, was portable before stl, and is more efficient than stl
> lists, but has the big negative that if you do get a segmentation
> violation, it is hard to debug.
>
> It's certainly not ready for prime time.

:( Ah, well, it was worth a shot. Thanks for doing the work! When I
find some time to do some proper evaluation of the new experimental
options, I might try it as well, and see how I go (out of curiosity).
Were you building on OS X?

=Tony.Meyer

From vilisch at wmw.com Tue Sep 5 09:01:31 2006
From: vilisch at wmw.com (Vilmos Schnedarek)
Date: Tue, 05 Sep 2006 10:01:31 +0300
Subject: [spambayes-dev] Integrate SpamBayes into a Win32 application2
Message-ID: <44FD20CB.4060506@wmw.com>

Hi,

I want to integrate SpamBayes into a Win32 application and I would like
to ask your help for it.
I read about something similar on the internet and there the sb_bnfilter
has been suggested as a solution. I downloaded the sources from CVS and
I tried to compile the C version of the sb_bnfilter but it isn't
possible because it has several Unix specific C commands, which don't
exist under Win32. But before I continue to work with sb_bnfilter I
would like to get some more information about SpamBayes.
Here are my questions, please answer them:
1, first of all what are the main modules of the SpamBayes? In the
sources I see several files, but I am sure only a few are related to the
main functionality and the rest is UI or for other things.
2, what is the architecture of the app?
it is a client-server application, meaning a server daemon is running
in the background and one or more clients are connecting to it to
request mail scanning services? Or it is a stand-alone application,
which is loaded in memory every time when a mail is scanned then is
unloaded when it finishes?
3, what storage is used to save the training information and where is
it located?
4, what are the minimum required source files to have Spambayes working?
Here I mean working without UI and use it in command line, by sb_filter.

PS: I apologize posting this message for second time but the first one
has been sent by mistake in HTML format.

Thanks,
Vilmos

From skip at pobox.com Tue Sep 5 15:24:11 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 5 Sep 2006 08:24:11 -0500
Subject: [spambayes-dev] Ocrad vs Tesseract OCR
In-Reply-To: <15321AD2-72E0-4DAC-8F54-C0479292DF2C@ihug.co.nz>
References: <17658.53157.829774.576750@montanaro.dyndns.org> <15321AD2-72E0-4DAC-8F54-C0479292DF2C@ihug.co.nz>
Message-ID: <17661.31355.561677.431796@montanaro.dyndns.org>

Tony> Were you building [tesseract] on OS X?

Yuppers...

S

From mhammond at skippinet.com.au Wed Sep 6 03:18:58 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 6 Sep 2006 11:18:58 +1000
Subject: [spambayes-dev] Integrate SpamBayes into a Win32 application2
In-Reply-To: <44FD20CB.4060506@wmw.com>
Message-ID: <2aa001c6d152$6de49c20$050a0a0a@enfoldsystems.local>

> I want to integrate SpamBayes into a Win32 application and I would like
> to ask your help for it.
> I read about something similar on the internet and there the sb_bnfilter
> has been suggested as a solution. I downloaded the sources from CVS and
> I tried to compile the C version of the sb_bnfilter but it isn't
> possible because it has several Unix specific C commands, which don't
> exist under Win32.
I'm not familiar with sb_bnfilter, but I'm fairly sure it is a 'helper
wrapper' for spambayes - it is not necessary for the core spambayes
functions, but used to make life simpler for certain users. You should
read the source to see exactly what it does.

> But before I continue to work with sb_bnfilter I would like to get
> some more information about SpamBayes.
> Here are my questions, please answer them:
> 1, first of all what are the main modules of the SpamBayes? In the
> sources I see several files, but I am sure only a few are related to
> the main functionality and the rest is UI or for other things.

The UI and message database are fairly key features - but the core of
spambayes itself is really the spambayes\tokenizer.py

> 2, what is the architecture of the app? it is a client-server
> application, meaning a server daemon is running in the background and
> one or more clients are connecting to it to request mail scanning
> services? Or it is a stand-alone application, which is loaded in memory
> every time when a mail is scanned then is unloaded when it finishes?

Please check out the docs. There are a few ways it operates - one is by
using a "proxy", which is a long-running process that your mail client
connects to instead of the real mail server. Another is integrated into
outlook, so that also is a long-running process. There are also scripts
to perform various operations from the command-line, and they tend to
perform what was requested and then terminate.

> 3, what storage is used to save the training information and where is
> it located?

There are a few formats supported - spambayes\storage.py has the
implementation.

> 4, what are the minimum required source files to have Spambayes working?
> Here I mean working without UI and use it in command line, by
> sb_filter.

I don't believe anyone knows the answer to that. Everyone here is
concerned with making spambayes work as a whole. There has been no
attempt to reduce it down to the minimum possible.
To look at it another way, what you see today *is* the minimum possible
to support all current and supported uses of spambayes. Other apps with
different requirements and different feature sets may well have a
smaller set possible.

I hope this helps,

Mark

From skip at pobox.com Wed Sep 6 03:34:01 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 5 Sep 2006 20:34:01 -0500
Subject: [spambayes-dev] Choosing which image to OCR
Message-ID: <17662.9609.899056.55708@montanaro.dyndns.org>

I took a few minutes to examine a couple (as in exactly two) multi-frame
GIF images from stock spams I received in the past couple days. I'd like
a cheap test to decide which frame is the best candidate for OCR without
OCRing every frame. The computational costs are high enough already.

I have two images, bogus-0.gif and bogus-1.gif (both attached to this
message). For each one I ran the following loop:

    >>> img = Image.open("bogus-0.gif")
    >>> for (i, frame) in enumerate(ImageSequence(img)):
    ...     bg = max(frame.histogram())
    ...     npixels = len([x for x in frame.histogram() if x])
    ...     print bg, npixels

For bogus-0.gif I got:

    220259 33
    217760 52
    213225 96
    182636 256
    222500 1

For bogus-1.gif I got:

    326518 5
    322180 9
    322817 7
    280174 11
    314741 10

It seems that the frame with the fewest white pixels (or the fewest
pixels in the most frequently used palette position) is a decent
indicator of the frame with the most useful pixels. I also tried this
more expensive test at the shell:

    % for f in bogus-1-?.png ; do
        echo "*** $f ***"
        pngtopnm $f | ocrad | wc -c
    done
    *** bogus-1-0.png ***
    8
    *** bogus-1-1.png ***
    31
    *** bogus-1-2.png ***
    18
    *** bogus-1-3.png ***
    1219
    *** bogus-1-4.png ***
    340

The fourth frame does indeed have the most text. This didn't work for
the bogus-0.png file because this save loop in Python didn't work
properly:

    >>> img = Image.open("bogus-0.gif")
    >>> for (i, frame) in enumerate(ImageSequence(img)):
    ...     frame.save(open("bogus-0-%d.png" % i, "wb"))

The first frame saved has the proper palette. The other saved frames are
just black-and-white. (Can someone with more PIL experience explain why
this is so and how to get around it?)

I can imagine a spammer putting together a palette where there are 248
not-quite-white palette entries making up an essentially white
background and a few entries devoted to displaying text. I'm sure PIL
has something we could use to quantize the palette down to 16 colors or
so, then use that palette to compute histograms, so I'm not all that
worried about that scheme.

The spammers do seem to be adapting very quickly (take a look at the
frames in bogus-1.gif to see what I mean). I find it hard to believe
it's in response to what we're doing here. I'm sure some other much
bigger groups must be doing OCR analysis of image-based spam these days.

BTW, don't worry too much if your mail program won't display the two
images properly. XEmacs didn't like them at all, but Mozilla displayed
them just fine.

Skip

-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogus-0.gif
Type: image/gif
Size: 46862 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060905/ae5bb8be/attachment-0002.gif
-------------- next part --------------
A non-text attachment was scrubbed...
Name: bogus-1.gif
Type: image/gif
Size: 37727 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20060905/ae5bb8be/attachment-0003.gif

From mhammond at skippinet.com.au Wed Sep 6 03:56:37 2006
From: mhammond at skippinet.com.au (Mark Hammond)
Date: Wed, 6 Sep 2006 11:56:37 +1000
Subject: [spambayes-dev] Choosing which image to OCR
In-Reply-To: <17662.9609.899056.55708@montanaro.dyndns.org>
Message-ID: <2abc01c6d157$b023bd50$050a0a0a@enfoldsystems.local>

> I find it hard to believe it's in response to what we're doing here.
> I'm sure some other much bigger groups must be doing OCR analysis of
> image-based spam these days.

Apparently SpamAssassin is developing OCR support. I found some vaguely
interesting info from this slashdot discussion:

http://it.slashdot.org/article.pl?sid=06/09/04/1712233

Mark

From skip at pobox.com Wed Sep 6 06:05:19 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 5 Sep 2006 23:05:19 -0500
Subject: [spambayes-dev] Choosing which image to OCR
In-Reply-To: <2abc01c6d157$b023bd50$050a0a0a@enfoldsystems.local>
References: <17662.9609.899056.55708@montanaro.dyndns.org> <2abc01c6d157$b023bd50$050a0a0a@enfoldsystems.local>
Message-ID: <17662.18687.777973.978544@montanaro.dyndns.org>

>> I find it hard to believe it's in response to what we're doing here.
>> I'm sure some other much bigger groups must be doing OCR analysis of
>> image-based spam these days.

Mark> Apparently SpamAssassin is developing OCR support. I found some
Mark> vaguely interesting info from this slashdot discussion:

Mark> http://it.slashdot.org/article.pl?sid=06/09/04/1712233

Thanks for the pointer. That led me to gocr and libgocr
(http://jocr.sf.net/). That seems to be what the SA folks are using for
OCR. At first blush, gocr seems to be about as good (or as bad) as
ocrad, though the text it generates appears to be bad in some dimension
orthogonal to ocrad's badness. I haven't looked at libgocr yet.

Skip

From vilisch at wmw.com Wed Sep 6 06:12:01 2006
From: vilisch at wmw.com (Vilmos Schnedarek)
Date: Wed, 06 Sep 2006 07:12:01 +0300
Subject: [spambayes-dev] Integrate SpamBayes into a Win32 application2
In-Reply-To: <2aa001c6d152$6de49c20$050a0a0a@enfoldsystems.local>
References: <2aa001c6d152$6de49c20$050a0a0a@enfoldsystems.local>
Message-ID: <44FE4A91.3040402@wmw.com>

Thank you for your answer.
Vilmos

From rmezzone at pjsolomon.com Wed Sep 6 09:54:07 2006
From: rmezzone at pjsolomon.com (Robert Mezzone)
Date: Wed, 6 Sep 2006 03:54:07 -0400
Subject: [spambayes-dev] Choosing which image to OCR
Message-ID:

This probably doesn't help you much but as an fyi:

We run Trend ScanMail 7.0 on our Exchange server and until a few months
ago it kept our mailboxes spam free. Being a long time user of
spambayes, even I was surprised how well their spam filter worked. Then
we started seeing a lot of spam, mainly the image based stuff, get past
their filter. I called them and they told me they were developing a new
scan engine. I haven't checked the logs to see when it was updated but
they figured out a way to detect this junk because our mailboxes are
once again spam free.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20060906/b58e2536/attachment.htm

From tim.peters at gmail.com Thu Sep 7 05:03:47 2006
From: tim.peters at gmail.com (Tim Peters)
Date: Wed, 6 Sep 2006 23:03:47 -0400
Subject: [spambayes-dev] Domain expiration
In-Reply-To: <006b01c6cf8d$a1aebe80$6801a8c0@DreasLaptop>
References: <44F92656.6020702@hooft.net> <006b01c6cf8d$a1aebe80$6801a8c0@DreasLaptop>
Message-ID: <1f7befae0609062003r59c6bd1s76c309c67be559c@mail.gmail.com>

[Dreas van Donselaar]
> After exchanging some emails with mister Hooft, the proposal is that we move
> the domainname from the current registrar Gandi to eNom. There SpamExperts
> will donate 5 years of registration.

That's very generous. Thank you! It all sounds fine to me.

> We like the domain name whois information being adjusted so the
> domainname is clearly owned by the SpamBayes community (instead
> of the current situation where it is registered to mister Hooft). When
> any official names or contact details are required, we use the founder
> of the project Tim Peters.

I intend to live another 5 years :-), so that's fine too.

> SpamExperts will of course not have any ownership whatsoever. After
> we added the years to the domain it can be pushed to any other
> eNom account created by any of the current project maintainers (but
> preferably Tim Peters as well).
>
> Are there any objections to this procedure? Please also let the list (or
> directly mister Hooft) know if you agree.

No objections here, and I really appreciate it (also the years Rob Hooft
contributed! thank you, Rob). If there are no objections from other
developers within the next 5 minutes :-), please proceed.
From john at corebuilds.com Fri Sep 8 22:55:58 2006
From: john at corebuilds.com (John Gagliardi)
Date: Fri, 8 Sep 2006 16:55:58 -0400
Subject: [spambayes-dev] Junk E Mail Folder Problem
Message-ID: <200609082055.k88KtXd7009709@ylpvm29.prodigy.net>

My Junk E Mail folder has a red circle with a line through it. I can't
delete it or move it. I can't even drag it to the Deleted Items folder.
What caused this and what can I do to correct the problem?

From skip at pobox.com Sun Sep 10 15:58:15 2006
From: skip at pobox.com (skip at pobox.com)
Date: Sun, 10 Sep 2006 08:58:15 -0500
Subject: [spambayes-dev] multi-frame GIFs
Message-ID: <17668.6647.614422.345485@montanaro.dyndns.org>

I checked in a change to spambayes/ImageStripper.py that purports to
handle multi-frame images. It selects the frame most likely to contain
text as the one with the fewest number of pixels in the background
color. That's easily circumvented, but I'll wait until the spammers
accomplish that feat.

I also removed netpbm support. I'll wager a beer that anyone who has
netpbm either has or can install PIL, making it the more logical choice.

Skip

From seant at inboxer.com Mon Sep 18 18:45:30 2006
From: seant at inboxer.com (Sean True)
Date: Mon, 18 Sep 2006 12:45:30 -0400
Subject: [spambayes-dev] Considering IronPython ...
Message-ID: <000001c6db41$e67b8650$0163a8c0@swapwizard.com>

We're considering sponsoring a project to port the Spambayes code base
to IronPython, and to then port the InBoxer product on to that code
base. We would be interested in funding the first part and contributing
the port back to the community, and funding the second part, while
greedily keeping the results of the second part for ourselves.

As you might imagine, I'd like to find someone capable of and interested
in doing _both_.

Comments, thoughts, or volunteers welcome -- in private or in forum.

-- Sean

Sean True
InBoxer, Inc.
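
[Editor's sketch] The frame-selection heuristic described in the
"multi-frame GIFs" message above (pick the frame with the fewest pixels
in the background color) might look roughly like the following. This is
not the code checked in to spambayes/ImageStripper.py; the names
background_count, pick_text_frame and frame_histograms are hypothetical,
and the conversion to mode "L" in the optional helper is an assumption
made here so that every frame yields one 256-bin histogram. Only the
helper needs Pillow.

```python
def background_count(histogram):
    """Return the pixel count of the fullest histogram bin.

    Assumption: the most frequently used value is the background.
    """
    return max(histogram)


def pick_text_frame(histograms):
    """Return the index of the frame with the fewest background pixels."""
    if not histograms:
        raise ValueError("no frames to choose from")
    counts = [background_count(h) for h in histograms]
    return counts.index(min(counts))


def frame_histograms(path):
    """Hypothetical helper: one greyscale histogram per frame of an image.

    Requires Pillow; it is imported lazily so the two functions above
    can be used (and tested) without it.
    """
    from PIL import Image, ImageSequence

    with Image.open(path) as img:
        return [frame.convert("L").histogram()
                for frame in ImageSequence.Iterator(img)]


if __name__ == "__main__":
    # Background counts reported for bogus-0.gif in the earlier
    # "Choosing which image to OCR" message. The remaining bins are
    # placeholder filler so that each list looks like a histogram.
    reported = [220259, 217760, 213225, 182636, 222500]
    histograms = [[count, 10, 5] for count in reported]
    print(pick_text_frame(histograms))  # index 3, i.e. the fourth frame
```

As noted in the thread, a palette with many near-white entries would
defeat this test. Reducing the palette first (Pillow's
Image.quantize(colors=16) is one candidate, untested here) may be worth
trying before computing the histograms.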
From skip at pobox.com Mon Sep 25 22:05:25 2006
From: skip at pobox.com (skip at pobox.com)
Date: Mon, 25 Sep 2006 15:05:25 -0500
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
Message-ID: <17688.13957.513443.35966@montanaro.dyndns.org>

I received four or five spambayes-dev messages today that were dated in 2004. Looking at the received headers:

    Received: from mail.mojam.com [198.49.126.96]
        by montanaro.dyndns.org with IMAP (fetchmail-6.3.4)
        for (single-drop); Mon, 25 Sep 2006 13:02:47 -0500 (CDT)
    Received: from fence.pobox.com (fence.pobox.com [208.210.124.76])
        by orca.mojam.com (Postfix) with ESMTP id EAABF1170024
        for ; Mon, 25 Sep 2006 12:01:04 -0600 (MDT)
    Received: from fence.pobox.com (localhost [127.0.0.1])
        by fence.pobox.com (Postfix) with ESMTP id 6A7CB72C
        for ; Mon, 25 Sep 2006 14:01:26 -0400 (EDT)
    Delivered-To: skip at pobox.com
    Received: from prodrts7.prod.spamlab.co.uk (unknown [213.208.70.139])
        by fence.pobox.com (Postfix) with ESMTP id EF14B755
        for ; Mon, 25 Sep 2006 14:01:21 -0400 (EDT)
    Received: from prodrts0 (prodrts0.prod.spamlab.co.uk [172.16.139.50])
        by prodrts7.prod.spamlab.co.uk (8.13.1/8.13.1) with SMTP
        id k8PIuu4g012727; Mon, 25 Sep 2006 19:56:56 +0100
    List-Archive:
    X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL, BAYES_00
        autolearn=ham version=2.70-cvs
    List-Post:
    In-Reply-To: Message from "Seth Goodman" of
        "Tue, 13 Jan 2004 17:40:29 CST."

it appears they originated somewhere outside of python.org, perhaps spamlab.co.uk. (cc'ing their postmaster just in case...)

Skip

From tameyer at ihug.co.nz Tue Sep 26 06:01:22 2006
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue, 26 Sep 2006 16:01:22 +1200
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
In-Reply-To: <17688.13957.513443.35966@montanaro.dyndns.org>
References: <17688.13957.513443.35966@montanaro.dyndns.org>
Message-ID: <282DA95E-5400-40ED-8474-70B221EA3BAA@ihug.co.nz>

> I received four or five spambayes-dev messages today that were
> dated in 2004.

Me, too.

> Looking at the received headers:
[...]
> it appears they originated somewhere outside of python.org, perhaps
> spamlab.co.uk. (cc'ing their postmaster just in case...)

    Received: (qmail 32536 invoked from network); 26 Sep 2006 02:15:21 -0000
    Received: from ironport4.ihug.co.nz (203.109.254.24)
        by mail7.ihug.co.nz with SMTP; 26 Sep 2006 02:15:21 -0000
    Received: from grunt6.ihug.co.nz ([203.109.254.46])
        by ironport4.ihug.co.nz with ESMTP; 26 Sep 2006 14:07:09 +1200
    Received: from ironport1.ihug.co.nz [203.109.254.19]
        by grunt6.ihug.co.nz with esmtp (Exim 3.35 #1 (Debian))
        id 1GS2Lk-0007z2-06; Tue, 26 Sep 2006 14:07:08 +1200
    Received: from unknown (HELO prodrts7.prod.spamlab.co.uk) ([213.208.70.139])
        by ironport1.ihug.co.nz with ESMTP; 26 Sep 2006 14:07:07 +1200
    Received: from prodrts0 (prodrts0.prod.spamlab.co.uk [172.16.139.50])
        by prodrts7.prod.spamlab.co.uk (8.13.1/8.13.1) with SMTP
        id k8Q2jgxE019041; Tue, 26 Sep 2006 03:45:43 +0100

Same goes for mine, FWIW.

=Tony.Meyer

From sethg at GoodmanAssociates.com Tue Sep 26 10:54:13 2006
From: sethg at GoodmanAssociates.com (Seth Goodman)
Date: Tue, 26 Sep 2006 03:54:13 -0500
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
In-Reply-To: <17688.13957.513443.35966@montanaro.dyndns.org>
Message-ID:

On Monday, September 25, 2006 3:05 PM -0500, skip at pobox.com wrote:

> I received four or five spambayes-dev messages today that were
> dated in 2004.
> Looking at the received headers:
>
> Received: from mail.mojam.com [198.49.126.96]
>     by montanaro.dyndns.org with IMAP (fetchmail-6.3.4)
>     for (single-drop);
>     Mon, 25 Sep 2006 13:02:47 -0500 (CDT)
> Received: from fence.pobox.com (fence.pobox.com
>     [208.210.124.76]) by orca.mojam.com (Postfix) with
>     ESMTP id EAABF1170024 for ; Mon,
>     25 Sep 2006 12:01:04 -0600 (MDT)
> Received: from fence.pobox.com (localhost [127.0.0.1])
>     by fence.pobox.com (Postfix) with ESMTP id 6A7CB72C for
>     ; Mon, 25 Sep 2006 14:01:26 -0400 (EDT)
> Delivered-To: skip at pobox.com
> Received: from prodrts7.prod.spamlab.co.uk (unknown
>     [213.208.70.139]) by fence.pobox.com (Postfix) with
>     ESMTP id EF14B755 for ; Mon, 25 Sep
>     2006 14:01:21 -0400 (EDT)
> Received: from prodrts0 (prodrts0.prod.spamlab.co.uk [172.16.139.50])
>     by prodrts7.prod.spamlab.co.uk (8.13.1/8.13.1) with SMTP
>     id k8PIuu4g012727; Mon, 25 Sep 2006 19:56:56 +0100
> List-Archive:
> X-Spam-Status: No, hits=-4.9 required=5.0 tests=AWL,
>     BAYES_00 autolearn=ham version=2.70-cvs
> List-Post:
> In-Reply-To: Message from "Seth Goodman" of
>     "Tue, 13 Jan 2004 17:40:29 CST."
>

This caught my attention as my name appears in the In-Reply-To: header and I did send a list message with that timestamp. Assuming it's legitimate, I wonder what else fell out of that mail queue from two years ago :)

--
Seth Goodman

From skip at pobox.com Tue Sep 26 13:22:07 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 26 Sep 2006 06:22:07 -0500
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
In-Reply-To:
References: <17688.13957.513443.35966@montanaro.dyndns.org>
Message-ID: <17689.3423.151914.636731@montanaro.dyndns.org>

Seth> This caught my attention as my name appears in the In-Reply-To:
Seth> header and I did send a list message with that timestamp.
Seth> Assuming it's legitimate, I wonder what else fell out of that mail
Seth> queue from two years ago :)

I suspect they are messages that actually made it to the list way back when and are being reinjected now. I sent another email to postmaster at spamlab.co.uk last night. Still no response from them.

Skip

From skip at pobox.com Tue Sep 26 13:24:16 2006
From: skip at pobox.com (skip at pobox.com)
Date: Tue, 26 Sep 2006 06:24:16 -0500
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
In-Reply-To:
References: <17688.13957.513443.35966@montanaro.dyndns.org>
Message-ID: <17689.3552.203513.87837@montanaro.dyndns.org>

    I sent another email to postmaster at spamlab.co.uk last night.
    Still no response from them.

Belay that. Found a response in my errors mailbox (where all postmaster mail goes).

Skip

From popiel at wolfskeep.com Tue Sep 26 16:44:31 2006
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Tue, 26 Sep 2006 07:44:31 -0700
Subject: [spambayes-dev] Ancient spambayes-dev messages popping up, maybe via spamlab.co.uk?
In-Reply-To:
References:
Message-ID: <20060926144431.0089D2DF80@cashew.wolfskeep.com>

In message: "Seth Goodman" writes:
>
>This caught my attention as my name appears in the In-Reply-To: header
>and I did send a list message with that timestamp. Assuming it's
>legitimate, I wonder what else fell out of that mail queue from two
>years ago :)

Well, so far I've noticed 3 mails originally from me among the echoes. My first reaction was "why is someone resending _MY_ emails" until I noticed that it wasn't just me. Gotta love mail filters (both mechanical and biological) that make your own words more noticeable than those of others... ;-)

But I second Skip's suggestion that these are things that made it all the way through in the first place. I've been able to find the originals of them in my archives. (Why, yes, I've got 5 gigabytes of mail archives. Why do you ask?)

- Alex
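
The Received traces quoted in this thread are read bottom-up: each relay prepends its own header, so the last Received line in a message is the earliest hop. A minimal sketch of that reading, using Python's standard email package, is below. The helper names received_hops and hop_sender are hypothetical, the sample headers are abbreviated from the ones quoted above, and the regular expression only picks out the host named after "from", which a misbehaving relay can forge.

```python
import re
from email import message_from_string


def received_hops(raw_message):
    """Return the Received headers in travel order, earliest hop first."""
    msg = message_from_string(raw_message)
    hops = msg.get_all("Received") or []
    # Headers are prepended in transit, so reverse them; also unfold whitespace.
    return [" ".join(hop.split()) for hop in reversed(hops)]


def hop_sender(hop):
    """Return the host a hop claims to have received the message from."""
    match = re.match(r"from\s+(\S+)", hop)
    return match.group(1) if match else None


if __name__ == "__main__":
    sample = (
        "Received: from prodrts7.prod.spamlab.co.uk (unknown [213.208.70.139])\n"
        "\tby fence.pobox.com (Postfix) with ESMTP id EF14B755;\n"
        "\tMon, 25 Sep 2006 14:01:21 -0400 (EDT)\n"
        "Received: from prodrts0 (prodrts0.prod.spamlab.co.uk [172.16.139.50])\n"
        "\tby prodrts7.prod.spamlab.co.uk (8.13.1/8.13.1) with SMTP\n"
        "\tid k8PIuu4g012727; Mon, 25 Sep 2006 19:56:56 +0100\n"
        "Subject: sample\n"
        "\n"
        "body\n"
    )
    for hop in received_hops(sample):
        print(hop_sender(hop))
```

Applied to the traces above, the earliest hop names a spamlab.co.uk host rather than python.org, which is the observation Skip and Tony make by eye.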