From tdickenson at geminidataloggers.com Thu Jul 1 08:08:08 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Thu Jul 1 08:08:12 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <1f7befae040630170030eef019@mail.gmail.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
    <200406301736.48606.tdickenson@geminidataloggers.com>
    <1f7befae040630170030eef019@mail.gmail.com>
Message-ID: <200407011308.08549.tdickenson@geminidataloggers.com>

On Thursday 01 July 2004 01:00, Tim Peters wrote:

> We have two anti-bad-correlation gimmicks now, driven by early testing
> results, and rationalized after the fact:
>
> 1. As mentioned last time, ignoring most header lines. If we didn't,
> virtually all spam on mailing lists would score unsure or FN (thanks
> to a large number of distinct but correlated "I came from a mailing
> list" header tokens).

Thanks for the reminder of this hack.... it was the hint I needed to push this idea into an overall win....

> Maybe another pure but personalized hack would be to add a list of
> specific tokens you want the classifier to pretend didn't exist.

That's exactly where I was digging....

I have a small database of list-id (etc) headers. If that header is present, it inserts a list-id token, and inhibits all the tokens from a list-dependent set.

I started generating this token set by finding all tokens that are common to all messages on each list. That was a dramatic loss. Almost every email contains a 'subject' header - the presence of that header became a strong spam clue when a large proportion of my ham has that token inhibited.

Removing all 'header' clues from the set of inhibited tokens makes this an overall win for me. The final set of inhibited tokens for zope-dev is listed below. normal8 is without this hack, common7 with.

I will polish this code enough for repeatable testing, and commit it on a branch tonight.
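
The inhibition scheme described above can be sketched roughly as follows. This is only an illustration, not the actual patch: it assumes the secondary database is a plain dict from list-id to a set of tokens, and the names `INHIBITED`, `apply_list_inhibition`, the `list-id:` token prefix and the example list-id value are all hypothetical.

```python
# Hypothetical secondary database: list-id -> tokens to suppress for that list.
# The token names are taken from the zope-dev set in this message; the
# list-id value itself is an assumption.
INHIBITED = {
    "zope-dev.zope.org": {"url:listinfo", "url:zope", "subject:Zope", "proto:http"},
}


def apply_list_inhibition(tokens, list_id, inhibited=INHIBITED):
    """Drop list-specific tokens and add a single list-id token.

    Messages whose list-id is unknown (or absent) pass through unchanged.
    """
    suppressed = inhibited.get(list_id)
    if suppressed is None:
        return list(tokens)
    kept = [tok for tok in tokens if tok not in suppressed]
    kept.append("list-id:" + list_id)  # hypothetical token format
    return kept
```

The point of the design is that the many correlated "I came from this list" clues collapse into one token, so they can no longer outvote the rest of the message.
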
filename:      normal8     common7
ham:spam:   20972:4500  20971:4501
fp total:            1           1
fp %:             0.00        0.00
fn total:           88          50
fn %:             1.96        1.11
unsure t:          309         262
unsure %:         1.21        1.03
real cost:     $159.80     $112.40
best cost:     $132.40     $107.80
h mean:           0.16        0.20
h sdev:           2.07        2.46
s mean:          93.59       95.25
s sdev:          17.91       14.65
mean diff:       93.43       95.05
k:                4.68        5.56

html url:listinfo url:zope (related posts subject:- sender:addr:zope.org
encoding! zope-dev subject:Zope url:zope-dev email name:zope-dev proto:http
url:org sender:no real name:2**0 url:mail subject:dev url:mailman
content-type:text/plain maillist cross email addr:zope.org lists
url:zope-announce

-- 
Toby Dickenson

From tameyer at ihug.co.nz Thu Jul 1 19:40:23 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Thu Jul 1 19:40:45 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306FA01C4@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467805C@its-xchg4.massey.ac.nz>

[...]
> I will polish this code
> enough for repeatable testing, and commit it on a branch tonight.
[...]

Rather than going to the effort of creating another branch and committing it, you might as well just post a patch here. That way other interested people can try it (I will, at least) and report on their results, without having to fiddle about with branches and so on (and we don't end up with another branch, which is only for this one thing).

If it does appear to be broadly useful, then we can add it (to the HEAD) as an experimental option, and try and convince people to try it out after the next release.

=Tony Meyer

From kennypitt at hotmail.com Fri Jul 2 09:48:19 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Jul 2 09:48:26 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <1f7befae040630170030eef019@mail.gmail.com>
Message-ID:

Tim Peters wrote:
> Maybe another pure but personalized hack would be to add a list of
> specific tokens you want the classifier to pretend didn't exist.

POPfile does exactly this. It has a default ignore list that includes common words that often appear in all types of messages (I won't say ham and spam since POPfile is a generalized, multi-bucket classifier), and it also uses the list to remove many of the common HTML tags. The user can then add and remove words to personalize the list.

As an unrelated aside, the latest version of POPfile has switched from BerkeleyDB to SQLite for its default database because of the reliability problems with Berkeley. Anyone have any experience with SQLite? Would it be worth implementing a SpamBayes storage option for it to test it out?

-- 
Kenny Pitt

From skip at pobox.com Fri Jul 2 10:34:03 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Jul 2 10:34:12 2004
Subject: [spambayes-dev] Re: [Spambayes] Observation
In-Reply-To:
References:
Message-ID: <16613.29275.311631.905625@montanaro.dyndns.org>

(redirecting to spambayes-dev...)

    David> Below is a variant of an email that has been getting through a
    David> lot recently (perhaps 8 or so variants of this email have gotten
    David> through). Usually repetitive emails don't get through for very
    David> long. The problem is when I mark it as spam, it latches on to
    David> the gibberish hapaxes on the end, so the next one is not well
    David> recognized.

Can you post the clues? It doesn't matter how much gibberish is in the message. If it's never been seen before it won't have an effect on the outcome.

I messed around a little with the message. When I first ran it through sb_filter.py the classification and clues left me scratching my head:

    X-Spambayes-Classification: titan-unsure; 0.52
    X-Spambayes-Evidence: '*H*': 0.51; '*S*': 0.55; 'subject:through': 0.16;
        'x-mailer:microsoft outlook express 6.00.2800.1409': 0.23;
        'header:Received:2': 0.80; 'subject:sun': 0.84

I then started poking around at the Python prompt:

    >>> len(msg.get_payload())
    4792
    >>> msg.get_payload()
    'zwmxfsrrp dvltfw dugdaeav wujir mjebdt\nrrvejn splkeiw- ...'
    >>> t = tokenizer.Tokenizer()
    >>> body = t.tokenize_body(msg)
    >>> body
    >>> list(body)
    []

That seems pretty damn odd. I don't see any massively long html tags. I think it's somehow related to the fact that the content-type is multipart/alternative but that no alternatives are given, at least in the version David posted.

David, can you zip the message up and mail it to me? (Maybe this is some Outlook damage to the message?)

Skip

From skip at pobox.com Fri Jul 2 10:43:29 2004
From: skip at pobox.com (Skip Montanaro)
Date: Fri Jul 2 10:43:40 2004
Subject: [spambayes-dev] Re: [Spambayes] Observation
In-Reply-To: <16613.29275.311631.905625@montanaro.dyndns.org>
References: <16613.29275.311631.905625@montanaro.dyndns.org>
Message-ID: <16613.29841.493486.333162@montanaro.dyndns.org>

    Skip> (Maybe this is some Outlook damage to the message?)

It does indeed seem to be Outlook breakage. I deleted the content-type header and ran it through sb_filter.py. Got these spambayes headers:

    X-Spambayes-Classification: titan-spam; 0.88
    X-Spambayes-Evidence: '*H*': 0.03; '*S*': 0.78; 'subject:through': 0.16;
        'x-mailer:microsoft outlook express 6.00.2800.1409': 0.23;
        'here.': 0.61; 'interest': 0.63; 'wait': 0.64; 'margaret': 0.64;
        'url:info': 0.72; 'header:Received:2': 0.80; 'late.': 0.81;
        'eophhrr': 0.84; 'subject:sun': 0.84; 'url:heokarda': 0.84

Not resoundingly spam, but way above my spam cutoff. I think you just need to train on one or two (or a few) of these.

Skip

From richie at entrian.com Fri Jul 2 14:51:40 2004
From: richie at entrian.com (Richie Hindle)
Date: Fri Jul 2 14:51:55 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To:
References: <1f7befae040630170030eef019@mail.gmail.com>
Message-ID: <4oabe0djht6jj2al4l6gmd847t30hndcqa@4ax.com>

[Kenny]
> the latest version of POPfile has switched from
> BerkeleyDB to SQLite for its default database because of the reliability
> problems with Berkeley. Anyone have any experience with SQLite?
> Would it
> be worth implementing a SpamBayes storage option for it to test it out?

I considered it recently for a project whose database requirements were similar to Spambayes - lots of small rows, lots of lookups required in quick succession. I learned that PySQLite can't use precompiled parameterised queries. That is, if you need to do this:

    select H, S from words where word='get';
    select H, S from words where word='your';
    select H, S from words where word='viagra';
    select H, S from words where word='here';

then it needs to parse and compile the SQL statement for each request. SQLite itself supports precompiled parameterised queries, but PySQLite doesn't wrap that API. That made it too slow for this project.

Perhaps it wouldn't be too hard to change classifier.Classifier so that the SQL could say:

    select H, S from words where word in ('get', 'your', 'viagra', 'here');

POPFile's database requirements are presumably the same, and they must be happy with the performance. And finding an alternative to Berkeley DB (other than pickle), would be a Good Thing.

Same question to any MetaKit/Mk4Py users out there...?

-- 
Richie Hindle
richie@entrian.com

From tdickenson at geminidataloggers.com Fri Jul 2 15:03:21 2004
From: tdickenson at geminidataloggers.com (Toby Dickenson)
Date: Fri Jul 2 15:03:26 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <200407011308.08549.tdickenson@geminidataloggers.com>
References: <20040628032821.E8A8A5A2@oberon.geminidataloggers.com>
    <1f7befae040630170030eef019@mail.gmail.com>
    <200407011308.08549.tdickenson@geminidataloggers.com>
Message-ID: <200407022003.22004.tdickenson@geminidataloggers.com>

On Thursday 01 July 2004 13:08, Toby Dickenson wrote:

> I have a small database of list-id (etc)
> headers. If that header is present, it inserts a list-id token, and inhibits
> all the tokens from a list-dependent set.

Attached is a proof-of-concept:

1. a patch to tokenizer.py, which uses this secondary database to detect list posts, suppress the relevant tokens for that list, and insert the list id token. The secondary database is stored in a directory of small files; you will need to hack the source to provide your directory name.

2. A tool to generate that secondary database. You will need to hack the source to give it the same directory name as above (which should probably start out empty before you run this tool). You will also need to give it a file containing a list of paths to mailboxes, one path per line. It scans every mail in each of those mailboxes looking for list posts, and calculates the intersection of their tokens.

-- 
Toby Dickenson
-------------- next part --------------
A non-text attachment was scrubbed...
Name: commontokens.py
Type: application/x-python
Size: 2732 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/commontokens.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tokenizer.diff
Type: text/x-diff
Size: 2054 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/09fa57db/tokenizer.bin

From g.waleed at kavalec.com Fri Jul 2 13:50:45 2004
From: g.waleed at kavalec.com (G. Waleed Kavalec)
Date: Fri Jul 2 15:08:13 2004
Subject: [spambayes-dev] Spam Clues: <>< STOP! Looking for anti christian christians
Message-ID:

Skipped content of type multipart/alternative
-------------- next part --------------
An embedded message was scrubbed...
From: "Rescyou"
Subject: <>< STOP! Looking for anti christian christians
Date: Fri, 2 Jul 2004 00:25:09 -0500
Size: 11273
Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/68a96940/attachment.mht

From kennypitt at hotmail.com Fri Jul 2 15:50:52 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Jul 2 15:51:01 2004
Subject: [spambayes-dev] Spam Clues: <>< STOP!
    Looking for anti christianchristians
In-Reply-To:
Message-ID:

What's most likely causing this is the imbalance in your training. SpamBayes is most accurate if you can train on approximately the same number of ham messages as you do spam messages. A ratio of up to 5 to 1 or so is probably fine, but your ratio is currently about 44 to 1 towards spam which will heavily bias all your results towards ham.

For example, the token "christianity" appears 10 times in ham and 7 times in spam, roughly the same number of times. However, the spam probability of that token is only .028 because the most basic component of the statistics on which SpamBayes is based is the percentage of messages that contain the token. This token appears in 10 out of 140 ham messages for a ham percentage of 7.14%, and it appears in 7 out of 6168 spam messages for a spam percentage of only 0.11%. The ham percentage is almost 63x larger than the spam percentage.

With an imbalance this large, your best bet is probably to delete your training data and train again from scratch. Try starting out without feeding SpamBayes any existing messages for initial training, and then train only on mistakes and unsures. If you see several spam messages in your unsure folder that look similar, try training on only one of them and deleting the rest to avoid training on too many spams.

-- 
Kenny Pitt

  _____

From: spambayes-dev-bounces@python.org [mailto:spambayes-dev-bounces@python.org] On Behalf Of G. Waleed Kavalec
Sent: Friday, July 02, 2004 1:51 PM
To: spambayes-dev@python.org
Subject: [spambayes-dev] Spam Clues: <>< STOP! Looking for anti christianchristians

This thing won't die. It doesn't even go to 'maybe'.

"What's up with that?"
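
The percentages in the reply above (10 of 140 ham, 7 of 6168 spam) can be turned into a spam probability with a short sketch. It assumes a Robinson-style adjustment with an unknown-word strength of 0.45 and an unknown-word probability of 0.5; those two values are not stated anywhere in this thread, and the function name `spamprob` is purely illustrative.

```python
def spamprob(ham_count, spam_count, nham, nspam, strength=0.45, prior=0.5):
    """Estimate the spam probability of a single token.

    strength and prior are assumed values, not taken from this thread.
    """
    ham_ratio = ham_count / nham      # fraction of ham messages with the token
    spam_ratio = spam_count / nspam   # fraction of spam messages with the token
    raw = spam_ratio / (ham_ratio + spam_ratio)
    n = ham_count + spam_count        # amount of evidence for this token
    # Pull the raw estimate towards the prior when there is little evidence.
    return (strength * prior + n * raw) / (strength + n)


# 'christianity': seen in 10 of 140 ham and 7 of 6168 spam messages
christianity = spamprob(10, 7, 140, 6168)
```

Under these assumptions the result for 'christianity' comes out at roughly 0.0281, and the same call with 9 ham and 5 spam gives roughly 0.0276 for 'religions', in line with the clue report. Because the counts are divided by the number of trained messages of each kind, a token seen about equally often in raw terms still looks overwhelmingly hammy when spam training outnumbers ham 44 to 1.
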

Combined Score: 0% (3.16545e-005)

Internal ham score (*H*): 1
Internal spam score (*S*): 6.3309e-005

# ham trained on: 140
# spam trained on: 6168

150 Significant Tokens

token             spamprob    #ham  #spam
'religions'       0.027636       9      5
'christianity'    0.0281306     10      7
'jesus,'          0.0281306     10      7
'religion,'       0.0282139     12     10
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040702/fc8a3afa/attachment.html

From tameyer at ihug.co.nz Fri Jul 2 23:30:05 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Fri Jul 2 23:30:19 2004
Subject: [spambayes-dev] correlated clues
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1306FA04D0@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467806D@its-xchg4.massey.ac.nz>

> As an unrelated aside, the latest version of POPfile has
> switched from BerkeleyDB to SQLite for its default database
> because of the reliability problems with Berkeley. Anyone
> have any experience with SQLite? Would it be worth
> implementing a SpamBayes storage option for it to test it out?

Note that there are already two SQL storage options that should work ok at the moment (mySQL and PostgreSQL). If you want to add a SQLite one it should be reasonably simple (subclass the base SQL storage class).

=Tony Meyer

From tim.peters at gmail.com Fri Jul 2 23:54:14 2004
From: tim.peters at gmail.com (Tim Peters)
Date: Fri Jul 2 23:54:29 2004
Subject: [spambayes-dev] Mental Musings on Spam Catching
In-Reply-To: <40DF6975.22835.1DAFD79@localhost>
References: <40DF6975.22835.1DAFD79@localhost>
Message-ID: <1f7befae0407022054326bfc06@mail.gmail.com>

[Stephen Anderson]
> ...
> How in the world can the system know that some highly correlated values
> should have more weight than other highly correlated values?

You're curious enough, and about enough things, that you should search the literature on text classification.
SpamBayes took an almost embarrassingly simple classification method, and tuned it to effectiveness via iterative testing and some clever twists (mostly due to Gary Robinson). Other approaches to classification exist that are quite different, although harder to get working, and sometimes much harder to train incrementally. For example, correlation isn't a problem for so-called support vector machines, regularized logistic regression, or boosting. I'd like to play with those too, but can't make time for it.

If you want something significantly better than SB on your father's email mix, exploring radically different approaches is likely necessary. Or your father could post details here, and we'll figure out how he's screwing up his SB training.

From spambayes at python.org Tue Jul 6 05:21:15 2004
From: spambayes at python.org (spambayes@python.org)
Date: Tue Jul 6 05:20:14 2004
Subject: [spambayes-dev] ello! =))
Message-ID:

Argh, i don't like the plaintext :)

password for archive: 57221

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Msg.zip
Type: application/octet-stream
Size: 42565 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040706/95a7e045/Msg-0001.obj

From bettinanass at alist.co.uk Tue Jul 6 11:07:25 2004
From: bettinanass at alist.co.uk (Bettina Nas)
Date: Tue Jul 6 12:07:31 2004
Subject: [spambayes-dev] Friday 9th July
Message-ID:

PM200011:07:25
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040706/1b25134e/attachment.html

From luislozano at interclan.net Tue Jul 6 18:14:07 2004
From: luislozano at interclan.net (Luis Roberto Lozano de los Santos)
Date: Tue Jul 6 18:14:43 2004
Subject: [spambayes-dev] A Problem Compiling Spambayes in Win32
Message-ID:

Hi there i dont know if this problem had already been asked, but im getting mad

This is the error:

running py2exe
*** generate typelib stubs ***
Traceback (most recent call last):
  File "setup_all.py", line 160, in ?
    zipfile = "lib/spambayes.zip",
  File "C:\cygwin\usr\local\Python23\lib\distutils\core.py", line 149, in setup
    dist.run_commands()
  File "C:\cygwin\usr\local\Python23\lib\distutils\dist.py", line 907, in run_commands
    self.run_command(cmd)
  File "C:\cygwin\usr\local\Python23\lib\distutils\dist.py", line 927, in run_command
    cmd_obj.run()
  File "C:\cygwin\usr\local\PYTHON23\Lib\site-packages\py2exe\build_exe.py", line 180, in run
    self.typelibs)
  File "C:\cygwin\usr\local\PYTHON23\Lib\site-packages\py2exe\build_exe.py", line 1058, in collect_win32com_genpy
    mod = gencache.GetModuleForTypelib(*info)
  File "C:\cygwin\usr\local\PYTHON23\lib\site-packages\win32com\client\gencache.py", line 250, in GetModuleForTypelib
    mod = _GetModule(modName)
  File "C:\cygwin\usr\local\PYTHON23\lib\site-packages\win32com\client\gencache.py", line 616, in _GetModule
    mod = __import__(mod_name)
ImportError: No module named 00062FFF-0000-0000-C000-000000000046x0x9x0

Im using:

spambayes-1.0rc2
Python-2.3.4.exe
win32all-163.exe (win32com)
py2exe-0.5.0.win32-py2.3.exe
and email-2.5.5

By the way, why spambayes does not have a SMTP proxy like POP3 , I like to filter the email before they arrive to my SMTP server (non-Unix) is there any reason?
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040706/b5e4628b/attachment.htm

From kennypitt at hotmail.com Tue Jul 6 19:22:30 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Tue Jul 6 19:22:39 2004
Subject: [spambayes-dev] A Problem Compiling Spambayes in Win32
In-Reply-To:
Message-ID:

Luis Roberto Lozano de los Santos wrote:
> Hi there i dont know if this problem had already been asked, but im
> getting mad
>
> This is the error:
[snip]
> ImportError: No module named 00062FFF-0000-0000-C000-000000000046x0x9x0

I'll bet that you've never had Outlook 2000 installed on your system.

The failure is occurring on the COM type library for Outlook. Py2exe has to import a specific version of the type library, and the "x9x0" at the end of the identifier represents version 9.0 which is Outlook 2000. We do this because newer versions are compatible with the Outlook 2000 interfaces, but Outlook 2000 may not be compatible with newer interfaces in more recent versions. If this type library is not registered on your system then you get this import error.

If you just need to compile SpamBayes for your own use, you can edit the "setup_all.py" file to change the version numbers. Look for the following lines that should start somewhere around line 40:

    typelibs = [
        ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 0),
        ('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 1),
        ('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0),
    ]

The first GUID (that begins with "00062FFF") is Outlook. Change "9, 0" to "9, 1" if you are using Outlook XP/2002, or to "9, 2" if you are using Outlook 2003.

You may also need to change the version of the second GUID (2DF8D04C-...) as well. Newer versions use "2, 2" instead of "2, 1".

> By the way, why spambayes does not have a SMTP proxy like POP3 , I like
> to filter the email before they arrive to my SMTP server (non-Unix) is
> there any reason?

See FAQ #6.2:

http://spambayes.sourceforge.net/faq.html#are-there-plans-to-develop-a-server-side-spambayes-solution
Or
http://tinyurl.com/2bc6w

-- 
Kenny Pitt

From johnd at alist.co.uk Wed Jul 7 09:55:46 2004
From: johnd at alist.co.uk (John - aList)
Date: Wed Jul 7 10:55:57 2004
Subject: [spambayes-dev] This week Pacha, Elysium, Rouge and Space at The Cross
Message-ID:

PM200009:55:46
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040707/4fbc7479/attachment.htm

From luislozano at interclan.net Wed Jul 7 16:54:44 2004
From: luislozano at interclan.net (Luis Roberto Lozano de los Santos)
Date: Wed Jul 7 16:55:19 2004
Subject: [spambayes-dev] A Problem Compiling Spambayes in Win32
In-Reply-To:
Message-ID:

Thanks a lot, it worked!!

By the way the second GUID was 2,3:

    typelibs = [
        ('{00062FFF-0000-0000-C000-000000000046}', 0, 9, 2),
        ('{2DF8D04C-5BFA-101B-BDE5-00AA0044DE52}', 0, 2, 3),
        ('{AC0714F2-3D04-11D1-AE7D-00A0C90F26F4}', 0, 1, 0),
    ]

>> By the way, why spambayes does not have a SMTP proxy like POP3 , I like
>> to filter the email before they arrive to my SMTP server (non-Unix) is
>> there any reason?

>See FAQ #6.2:
>
>http://spambayes.sourceforge.net/faq.html#are-there-plans-to-develop-a-server-side-spambayes-solution
>Or
>http://tinyurl.com/2bc6w

These solutions are for procmail, sendmail, qmail etc., Unix based. I need a SMTP Proxy (TCP based) for a SMTP Windows server (MailSite)

From tim.s.stevens at bt.com Fri Jul 9 10:23:48 2004
From: tim.s.stevens at bt.com (tim.s.stevens@bt.com)
Date: Fri Jul 9 10:23:20 2004
Subject: [spambayes-dev] Win32 DLL base addresses
Message-ID:

Hi folks -

I'm new to this list, so apologies if this is OT or inappropriate.

I've installed the compiled spambayes (RC1, I think) & it's great. However, I see that many of the DLLs have the same default load address 0x10000000 (the Visual studio linker sets this unless you specify a different value).
If two DLLs collide in the virtual memory map at load time, Windows must copy them to the page file & patch all exported symbols' load addresses - consuming page file space & slowing (fractionally) load time.

A simple fix is to use Visual Studio's rebase.exe to determine non-colliding address space for you:

    rebase -b 0x5a000000 pythoncom23.dll pywintypes23.dll
    ..\bin\outlook_addin.dll datetime.pyd exchange.pyd exchdapi.pyd
    mapi.pyd perfmon.pyd pyexpat.pyd select.pyd servicemanager.pyd
    shell.pyd timer.pyd unicodedata.pyd win32api.pyd win32clipboard.pyd
    win32event.pyd win32gui.pyd win32process.pyd win32service.pyd
    win32trace.pyd zlib.pyd _bsddb.pyd _socket.pyd _sre.pyd _ssl.pyd
    _winreg.pyd

I ran this from the directory that pythoncom23.dll was installed in. This will inspect all the specified DLLs, & patch their headers to load them from (in this case) 0x5a000000 upwards - it's best to keep the lower areas of VM free for the heap manager.

When I write windows code, I either write a similar batch file & invoke it as a post-link step, or if there are only a couple of DLLs, you can manually choose load addresses in the linker settings. I'd recommend www.sysinternals.com who have a free tool procexp.exe which will show this kind of thing up.

Regards, Tim.

From theller at python.net Fri Jul 9 16:34:46 2004
From: theller at python.net (Thomas Heller)
Date: Fri Jul 9 16:34:55 2004
Subject: [spambayes-dev] Re: Win32 DLL base addresses
References:
Message-ID:

writes:

> Hi folks -
>
> I'm new to this list, so apologies if this is OT or inappropriate.

Appropriate, sure, but please limit line length to 72 characters or so (imo, and since you asked).

>
> I've installed the compiled spambayes (RC1, I think) & it's great.
> However, I see that many of the DLLs have the same default load
> address 0x10000000 (the Visual studio linker sets this unless you
> specify a different value).
> If two DLLs collide in the virtual memory
> map at load time, Windows must copy them to the page file & patch all
> exported symbols' load addresses - consuming page file space & slowing
> (fractionally) load time.

This is really a pywin32 issue - the build process has switched to distutils in the last few releases and the load addresses obviously haven't been addressed yet.

> A simple fix is to use Visual Studio's rebase.exe to determine
> non-colliding address space for you:
>
> rebase -b 0x5a000000 pythoncom23.dll pywintypes23.dll
> ..\bin\outlook_addin.dll datetime.pyd exchange.pyd exchdapi.pyd
> mapi.pyd perfmon.pyd pyexpat.pyd select.pyd servicemanager.pyd
> shell.pyd timer.pyd unicodedata.pyd win32api.pyd win32clipboard.pyd
> win32event.pyd win32gui.pyd win32process.pyd win32service.pyd
> win32trace.pyd zlib.pyd _bsddb.pyd _socket.pyd _sre.pyd _ssl.pyd
> _winreg.pyd
>
> I ran this from the directory that pythoncom23.dll was installed in.
> This will inspect all the specified DLLs, & patch their headers to
> load them from (in this case) 0x5a000000 upwards - it's best to keep
> the lower areas of VM free for the heap manager.
>
> When I write windows code, I either write a similar batch file &
> invoke it as a post-link step, or if there are only a couple of DLLs,
> you can manually choose load addresses in the linker settings. I'd
> recommend www.sysinternals.com who have a free tool procexp.exe which
> will show this kind of thing up.

This is a good idea, and maybe something like that could be incorporated into py2exe (which is used to build the binary distribution, I think).

FWIW, it can also be done in pure Python with the ctypes module. I'm attaching a script which traverses a directory and its subdirectories for *.pyd and *.dll files, and either displays or changes the load address by calling the ReBaseImage windows function.

Thomas

-------------- next part --------------
import os
from fnmatch import fnmatch
from ctypes import *

def get_images(dirname):
    # collects all images *.pyd and *.dll contained in this directory
    # including subdirs
    images = []
    for root, dirs, files in os.walk(dirname):
        files = [f for f in files if fnmatch(f, "*.dll") or fnmatch(f, "*.pyd")]
        images.extend([os.path.join(root, f) for f in files])
    return images

def rebase(change, baseaddr, *images):
    # change: True - rebase, False - report only
    # baseaddr: starting base address
    # images: sequence of image file names
    oldsize = c_ulong()
    newsize = c_ulong()
    oldbase = c_ulong()
    newbase = c_ulong()
    imagehlp = windll.imagehlp

    for f in images:
        newbase.value = baseaddr
        if imagehlp.ReBaseImage(f,       # CurrentImageName
                                None,    # SymbolPath
                                change,  # fReBase
                                False,   # fReBaseSysfileOk
                                False,   # fGoingDown
                                0,       # CheckImageSize
                                byref(oldsize),
                                byref(oldbase),
                                byref(newsize),
                                byref(newbase),
                                0):      # timestamp
            if change:
                baseaddr += newsize.value
                print f, hex(oldbase.value), "=>", hex(newbase.value)
            else:
                print f, hex(oldbase.value)
        else:
            raise WinError()

if __name__ == "__main__":
    rebase(False, 0, *get_images(r"c:\sf\py2exe\py2exe\samples\advanced\dist"))
    rebase(True, 0x5a000000, *get_images(r"c:\sf\py2exe\py2exe\samples\advanced\dist"))

From davieboy28 at hotmail.com Fri Jul 9 17:06:34 2004
From: davieboy28 at hotmail.com (davieboy28)
Date: Fri Jul 9 17:06:47 2004
Subject: [spambayes-dev] (no subject)
Message-ID:

Great project. Two suggestions.

1) Use Word's built in spell checker on suspect spam, if it has more than five errors, move to Junk E-Mail folder.

2) Occasionally, ham is moved to suspect spam or definite spam by mistake, in each case, this mistake would have been avoided if spam bayes had interrogated my address book and noticed that the *spam* was from someone I know.

Even if you don't use these suggestions, it's still a great project.

Dave
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040709/e03d7c3d/attachment.html

From kennypitt at hotmail.com Fri Jul 9 17:36:59 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Fri Jul 9 17:37:05 2004
Subject: [spambayes-dev] (no subject)
In-Reply-To:
Message-ID:

davieboy28 wrote:
> 2) Occasionally, ham is moved to suspect spam or definite spam by
> mistake, in each case, this mistake would have been avoided if spam
> bayes had interrogated my address book and noticed that the *spam*
> was from someone I know.

See FAQ 6.6:

http://spambayes.sourceforge.net/faq.html#why-don-t-you-add-whitelisting-blacklisting-to-spambayes
or
http://tinyurl.com/34w5r

-- 
Kenny Pitt

From tameyer at ihug.co.nz Sun Jul 11 07:50:06 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Sun Jul 11 07:50:14 2004
Subject: [spambayes-dev] (no subject)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13070AE450@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02CF@its-xchg4.massey.ac.nz>

> 1) Use Word's built in spell checker on suspect
> spam, if it has more than five errors, move to
> Junk E-Mail folder.

Using Microsoft Word's spell checker is a simple no-go - it would be very expensive (time-wise) to do this via COM, which is probably the most straight-forward way. In addition, many people don't have Microsoft Windows, let alone Microsoft Word.

So rephrasing this to simply using *some* spell-checking: The vast majority of non-work email that I get would have more than five words that are not in Word's dictionary. (txt-speak, for example).

Despite this, it's been suggested before, and tested:

[ 817813 ] Consider bad spelling a sign of spam

Testing showed that it didn't help. I can dig out the code if you want to volunteer to test it as well - certainly if enough people do and can show that it helps, it will get added as an experimental option.

See also FAQ 6.1:

=Tony Meyer

From heli at helimodels.com Sun Jul 11 20:17:05 2004
From: heli at helimodels.com (John Moriarty)
Date: Sun Jul 11 20:47:44 2004
Subject: [spambayes-dev] spell checking
Message-ID: <005f01c46773$a205b700$3aa9a5c2@user>

Not a programmer.
I regularly see that spam tries to get around forbidden words with deliberate punctuation oddities and deliberate mis-spells.

So would it be possible to have a system that rejects messages with loads of punctuation errers, sp elling mishtakes and whaatever?

Ebviessly sum latttitude wuold have to be catered for with a threshold parameter for such errors

Kind regards,

John Moriarty
(+353) (0)87 2833 530
www.helimodels.com

From popiel at wolfskeep.com Sun Jul 11 20:51:11 2004
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sun Jul 11 20:52:35 2004
Subject: [spambayes-dev] spell checking
In-Reply-To: Message from "John Moriarty" of "Sun, 11 Jul 2004 19:17:05 BST." <005f01c46773$a205b700$3aa9a5c2@user>
References: <005f01c46773$a205b700$3aa9a5c2@user>
Message-ID: <20040711185111.2E0D72DFFA@cashew.wolfskeep.com>

In message: <005f01c46773$a205b700$3aa9a5c2@user> "John Moriarty" writes:
>Not a programmer.
>I regularly see that spam tries to get around forbidden words with
>deliberate punctuation oddities and deliberate mis-spells.
>
>So would it be possible to have a system that rejects messages with loads of
>punctuation errers, sp elling mishtakes and whaatever?
>
>Ebviessly sum latttitude wuold have to be catered for with a threshold
>parameter for such errors

Yes, one could... but previous testing has shown that it doesn't actually improve accuracy. The common misspellings and mistakes are quickly spotted by spambayes as spam indicators in their own right.

- Alex

From fernribeiro52004xz at mailcity.com Mon Jul 12 15:49:45 2004
From: fernribeiro52004xz at mailcity.com (Fernando Ribeiro)
Date: Mon Jul 12 15:49:29 2004
Subject: [spambayes-dev] Direct mail by e-mail - Updated lists: http://www.divulgamail.vze.com
Message-ID: <20040712134926.C35341E400D@bag.python.org>

Fully up-to-date e-mail addresses for direct mail, e-mail lists for promotion, email marketing, programs for sending and harvesting e-mails. Records segmented by state and activity http://www.gueb.de/divulgamail

Segmented and generic e-mail addresses for direct marketing (direct mail by e-mail), validation of e-mail lists, tips, bulk mail, e-list, electronic list, registration of home pages on search sites, publicity, advertising, marketing, bulk mail http://www.gueb.de/divulgamail

From webmaster at turnaprofit.com Tue Jul 13 17:21:52 2004
From: webmaster at turnaprofit.com (Rob McEwen)
Date: Tue Jul 13 17:22:02 2004
Subject: [spambayes-dev] Windows implementation using Command line (for SpamBayes)
Message-ID: <000001c468ed$208dfa90$0a75dc44@powerview>

RE: Windows implementation using Command line (for SpamBayes)

Is there an implementation for SpamBayes for Windows as a Command line Utility?

For example, some anti-virus programs (like McAfee VirusScan, F-Prot Antivirus for DOS, Grisoft AVG, etc.) can be invoked in a separate process, fed the path to the text file of the message to be processed, return an "exit code" based on the results of scanning that message file. (or, possibly, add the appropriate header to the message). Also, training could be done by occasionally passing this command line tool a known-good "ham" folder and/or a known-bad "spam" folder.

Couldn't this same concept be applied to spambayes?

(Remember, I'm specifically looking for something that will do this in Windows.)

Also, it is my understanding that PROCMAIL is ONLY available on the Unix platform??
(If so, another way to solve this problem might be to find a 3rd party app which emulates PROCMAIL in windows?? ...but this sounds far-fetched to me!) Finally, could this be done as a windows service so that the program doesn't have to initialize and die with each individual request? Any suggestions are appreciated! Thanks, Rob McEwen PowerView Systems rob@PowerViewSystems.com (478) 475-9032 From kennypitt at hotmail.com Tue Jul 13 17:44:32 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jul 13 17:44:43 2004 Subject: [spambayes-dev] Windows implementation using Command line (forSpamBayes) In-Reply-To: <000001c468ed$208dfa90$0a75dc44@powerview> Message-ID: Rob McEwen wrote: > Is there an implementation for SpamBayes for Windows as a Command > line Utility? > > For example, some anti-virus programs (like McAfee VirusScan, F-Prot > Antivirus for DOS, Grisoft AVG, etc.) can be invoked in a separate > process, fed the path to the text file of the message to be process, > return an "exit code" based on the results of scanning that message > file. (or, possibly, add the appropriate header to the message). > Also, training could be done by occasionally passing this command > line tool a known-good "ham" folder and/or a known-bad "spam" folder. > > Couldn't this same concept could be applied to spambayes? Take a look at the sb_filter.py script. We don't build it as a Windows executable so you'll have to install Python and run from source to get it, but I see no reason why it wouldn't work on Windows. > Finally, could this be done as a windows service so that the program > doesn't have to initialize and die with each individual request? You might be able to use the sb_bnfilter.py and sb_bnserver.py scripts, which I believe are intended to do something similar to this. 
Currently sb_bnserver appears to run as a background process, not a Windows service, but you could probably turn it into a service with a little Python hacking based on the pop3proxy_service.py script. -- Kenny Pitt From webmaster at turnaprofit.com Tue Jul 13 18:09:47 2004 From: webmaster at turnaprofit.com (Rob McEwen) Date: Tue Jul 13 18:09:56 2004 Subject: [spambayes-dev] followup: Windows implementation using Command line (forSpamBayes) In-Reply-To: Message-ID: <000001c468f3$d21f8840$0a75dc44@powerview> Followup: Windows implementation using Command line (forSpamBayes) Wow. Thanks Kenny. I'm going to try this. However, while I am a rather accomplished vb.net and Javascript programmer, Perl/Python looks like "greek" to me. Therefore, any additional help or suggestions are also appreciated. Regarding the "windows service" question, the stuff in the sb_bnserver script may do the trick. Most importantly, I just don't want the performance to be adversely impacted by the loading/unloading of the app for EVERY SINGLE message. Also, at the same time, another concern I have is that, if this single program stays in memory for this purpose, wouldn't I also need to make sure that this program is multi-threaded for possible concurrent operations... AND that THIS program "gets the call" (singleton model?) ...or are these things already taken care of? Thanks, Rob McEwen PowerView Systems rob@PowerViewSystems.com (478) 475-9032
From kennypitt at hotmail.com Tue Jul 13 18:57:46 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jul 13 18:57:56 2004 Subject: [spambayes-dev] followup: Windows implementation using Command line(forSpamBayes) In-Reply-To: <000001c468f3$d21f8840$0a75dc44@powerview> Message-ID: Rob McEwen wrote: > while I am a > rather accomplished vb.net and Javascript programmer, Perl/Python > looks like "greek" to me. Perl still looks like greek to me, too. Python, on the other hand, is very easy to learn. Since you already understand programming in general, you should have no problem picking up enough Python syntax to do some basic tweaking. Check out the "Beginner's Guide to Python" , which has some good links for getting you started. You'll probably be a convert in no time!
> Regarding the "windows service" question, the stuff in the > sb_bnserver script may do the trick. Most importantly, I just don't > want the performance to be adversely impacted by the > loading/unloading of the app for EVERY SINGLE message. Also, at the > same time, another concern I have is that, if this single program > stays in memory for this purpose, wouldn't I also need to make sure > that this program is multi-threaded for possible concurrent > operations... I'm not very familiar with the ins and outs of the sb_bnfilter/sb_bnserver scripts, so maybe someone else can jump in with more details. In general, any server program will need to handle concurrency issues. Python has full support for threading and synchronization, so sb_bnserver may already handle this but I can't personally vouch for that. > AND that THIS program "gets the call" (singleton model?) I believe that the server just listens on a TCP/IP socket, and the filter program makes a connection to that socket to send a message to be processed. Only one process can listen on that socket, so the requests should always go to the correct server process. -- Kenny Pitt From tameyer at ihug.co.nz Wed Jul 14 01:42:11 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jul 14 01:42:22 2004 Subject: [spambayes-dev] Windows implementation using Command line(forSpamBayes) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13071A3348@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C02E6@its-xchg4.massey.ac.nz> > Take a look at the sb_filter.py script. We don't build it as > a Windows executable so you'll have to install Python and run > from source to get it, but I see no reason why it wouldn't > work on Windows. I can confirm that it does. For training, you're after the sb_mboxtrain.py script. I don't recall whether I've run that or not, but it should be fine (assuming that your mail is in mbox format). 
If it fails to run, open a bug and include the traceback and assign to me (Anadelonbrin) and I should be able to fix it. > You might be able to use the sb_bnfilter.py and > sb_bnserver.py scripts, which I believe are intended to do > something similar to this. sb_bnserver.py is definitely *nix only at the moment. I don't see any reason why it would have to be, though, so it could probably be changed to be more cross platform. Note that there's a CVS branch that has these scripts in C (for speed), which I think is the intended final version. Toby is the one to talk to about them. =Tony Meyer From tdickenson at geminidataloggers.com Wed Jul 14 09:12:49 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Wed Jul 14 09:12:52 2004 Subject: [spambayes-dev] Windows implementation using Command line(forSpamBayes) In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13064C02E6@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13064C02E6@its-xchg4.massey.ac.nz> Message-ID: <200407140812.49667.tdickenson@geminidataloggers.com> On Tuesday 13 July 2004 16:21, Rob McEwen wrote: > Finally, could this be done as a windows service so that the program doesn't > have to initialize and die with each individual request? The natural solution to all this on Windows is to have spambayes in a COM server. If you really need command line access to this service, use a 2-line vbscript file. But you will hurt performance on WIndows by starting a seperate process to pass the filename, even if that process doesnt initialize spambayes itself each time. > Also, it is my understanding that PROCMAIL is ONLY available on the Unix > platform?? (If so, another way to solve this problem might be to find a 3rd > party app which emulates PROCMAIL in windows?? ...but this sounds > far-fetched to me!) Procmail on windows?!? Your proposed solution sounds deeply unixy. Have you considered cygwin? 
I believe it has procmail, and I dont see why the existing sb_bnfilter shouldnt run unchanged. On Wednesday 14 July 2004 00:42, Tony Meyer wrote: > > Take a look at the sb_filter.py script. We don't build it as > > a Windows executable so you'll have to install Python and run > > from source to get it, but I see no reason why it wouldn't > > work on Windows. > > I can confirm that it does. For training, you're after the sb_mboxtrain.py > script. I don't recall whether I've run that or not, but it should be fine > (assuming that your mail is in mbox format). If it fails to run, open a bug > and include the traceback and assign to me (Anadelonbrin) and I should be > able to fix it. sb_filter can do training too. > > You might be able to use the sb_bnfilter.py and > > sb_bnserver.py scripts, which I believe are intended to do > > something similar to this. > > sb_bnserver.py is definitely *nix only at the moment. I don't see any > reason why it would have to be, though, so it could probably be changed to > be more cross platform. sb_bnfilter currently uses Unix domain sockets, which is ideal from a security and performance point of view on unix. > Note that there's a CVS branch that has these > scripts in C (for speed), which I think is the intended final version. Toby > is the one to talk to about them. Thanks for the reminder - I will merge those onto the trunk today. -- Toby Dickenson From skip at pobox.com Wed Jul 14 15:13:06 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jul 14 15:13:16 2004 Subject: [spambayes-dev] AOL + IMAP? Message-ID: <16629.12642.46093.113946@montanaro.dyndns.org> Folks, I was leafing through the latest issue of MacAddict (what a silly name for a magazine...) on the train this morning and noticed tip #26: I Want My AOL Mail in Apple's Mail. It turns out you can use IMAP to at least check your AOL Mail. I added a q&a to the faq about this and solicited inputs from people who might be tempted to try it out. 
Skip

From skip at pobox.com Wed Jul 14 15:22:03 2004
From: skip at pobox.com (Skip Montanaro)
Date: Wed Jul 14 15:22:07 2004
Subject: [spambayes-dev] mkstemp problem updating website?
Message-ID: <16629.13179.440520.911599@montanaro.dyndns.org>

Anybody seen this error when executing "make install" to update the website?

    % make install
    cd apps; make
    cd outlook; make
    make[2]: Nothing to be done for `all'.
    cd download ; make
    make[1]: Nothing to be done for `all'.
    cd apps; make install
    cd outlook; make
    make[2]: Nothing to be done for `all'.
    cd outlook; make install
    Push to shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//apps/outlook ...
    rsync --rsh=ssh -v -r -l -t --update --exclude-from=../../scripts/rsync-excludes ./* shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//apps/outlook
    Enter passphrase for key '/Users/skip/.ssh/id_rsa':
    building file list ... done
    bugs.txt
    mkstemp .bugs.txt.edwWd6 failed: Permission denied
    wrote 5334 bytes read 84 bytes 516.00 bytes/sec
    total size is 24821 speedup is 4.58
    rsync error: some files could not be transferred (code 23) at main.c(620)
    make[2]: *** [install] Error 23
    make[1]: *** [local_install] Error 2
    make: [local_install] Error 2 (ignored)
    cd download ; make install
    Push to shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//download ...
    rsync --rsh=ssh -v -r -l -t --update --exclude-from=../scripts/rsync-excludes ./* shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//download
    Enter passphrase for key '/Users/skip/.ssh/id_rsa':
    Connection closed by 66.35.250.208
    rsync: connection unexpectedly closed (0 bytes read so far)
    rsync error: error in rsync protocol data stream (code 12) at io.c(165)
    make[1]: *** [install] Error 12
    make: [local_install] Error 2 (ignored)
    Push to shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//. ...
    rsync --rsh=ssh -v -r -l -t --update --exclude-from=./scripts/rsync-excludes ./* shell1.sourceforge.net:/home/groups/s/sp/spambayes/htdocs//.
    Enter passphrase for key '/Users/skip/.ssh/id_rsa':
    building file list ... done
    applications.ht
    applications.html
    apps/
    background.ht
    background.html
    developer.ht
    developer.html
    ...

Note the rsync errors. Looks like a permission problem, but when I checked on shell1.sf.net everything looked okay. Ah, wait a minute... In the .../apps/outlook directory the files are not group writable. Mark, those files have your name on them can you chmod g+w them and chmod g+ws the directory? I didn't see anything obviously amiss in the .../htdocs/download directory though.

Skip

From webmaster at powerviewsystems.com Wed Jul 14 15:59:20 2004
From: webmaster at powerviewsystems.com (Rob McEwen)
Date: Wed Jul 14 15:59:30 2004
Subject: [spambayes-dev] Windows implementation using Command line(forSpamBayes)
In-Reply-To: <200407140812.49667.tdickenson@geminidataloggers.com>
Message-ID: <000001c469aa$c3621b80$0a75dc44@powerview>

RE: [spambayes-dev] Windows implementation using Command line(forSpamBayes)

Thanks to everyone for their suggestions. I've made much progress so far. But MORE assistance is needed...

(1) I create a bayescustomize.ini file in the root of the project.

(2) I discovered that the scripts would not run unless I copied the particular script from the script folder to the base folder. Was that correct? If so, should I have deleted the original file for that script which I left in the scripts folder? (When I attempted to run the script from the scripts folder, I got an error message saying that stuff was not found.)

(3) I then created custom folders and referenced these in the bayescustomize.ini file which currently looks like this:

    ******
    [Storage]
    persistent_storage_file: C:\SpamBayesServerEdition\sb_storage\sb_BayesDB.db
    messageinfo_storage_file: C:\SpamBayesServerEdition\sb_storage\sb_MessageDB.db
    spam_cache: C:\SpamBayesServerEdition\sb_cache_spam
    ham_cache: C:\SpamBayesServerEdition\sb_cache_ham
    unknown_cache: C:\SpamBayesServerEdition\sb_cache_unknown
    ******

(but without the asterisks)

(4) At a command line, cued to the proper directory, I successfully ran "sb_filter.py -n" The database file was created successfully.

(5) Next, I copied some spam and ham messages from my server to the proper folders.

(6) At first sb_mboxtrain.py didn't work. It turns out, various assumptions existed which didn't apply in my case. The closest function, as far as I could tell, was the "maildir_train" function for "maildir" -type mail boxes. But even this wasn't compatible. My mail was simply a collection of text files in a folder. It didn't follow "maildir" conventions. Therefore, I had to rem out the following lines:

    # REM: # shutil.copystat(cfn, tfn)
    # XXX: This will raise an exception on Windows. Do any Windows
    # people actually use Maildirs?
    # REM: # shutil.copystat(cfn, tfn)
    # REM: # os.rename(tfn, cfn)
    # REM: # if (removetrained):
    # os.unlink(cfn)

(I hope that these were not important!!! ...Are they?)

Also, I made some other modifications regarding the path. Also, I'm assuming here that (a) "cfn" is important, but (b) "tfn" is arbitrary. Again, is that true?

Now, when I run:

    "sb_mboxtrain.py -s C:\SpamBayesServerEdition\sb_train_as_spam"

it SEEMS to work. It gives me the proper success message, and the date/time stamp on the database updates. (but I wish I could KNOW for sure that it worked. Could I be getting a false positive?)

(7) This is where I am stuck.
When I actually attempt to run a file through the filter, I've used the following: "sb_filter.py -f C:\SpamBayesServerEdition\inbox\20040624105403D1E2-00000002.tmp" (file name from my mail server) It returns with NO feedback and NO errors. Therefore, I started placing "print" statements in various parts of the code to see what actually gets processed. It never seems to get past the following line: for fname in args: print "fname: " % fname (notice my print line I eventually added. When this print line is place here, an error message occurs: File "C:\SpamBayesServerEdition\sb_filter.py", line 247, in main print "fname: " % fname TypeError: not all arguments converted during string formatting" (note that the line number I report may be different than yours because of modifications I've made) Any suggestions? Was my command well formed? Rob McEwen PowerView Systems rob@PowerViewSystems.com (478) 475-9032 From kennypitt at hotmail.com Wed Jul 14 16:45:27 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jul 14 16:45:47 2004 Subject: [spambayes-dev] Windows implementation using Commandline(forSpamBayes) In-Reply-To: <000001c469aa$c3621b80$0a75dc44@powerview> Message-ID: Rob McEwen wrote: > (2) I discovered that the scripts would not run unless I copied the > particular script from the script folder to the base folder. Was that > correct? If so, should I have deleted the original file for that > script which I left in the scripts folder? (When I attempted to run > the script from the scripts folder, I got an error message saying > that stuff was not found.) If you go to the root of your source tree and run the command "python setup.py install", it will copy all the required pieces into your Python installation directory. You can then run scripts from the Python\Scripts directory and everything should be found. If you want to run directly from the source directory, just set the environment variable PYTHONPATH= before running and you should be fine. 
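On the TypeError quoted in step (7): that one comes from the added print line itself, not from the filter. The % operator needs a conversion specifier such as %s in the string on its left; without one there is nowhere to put the argument. A minimal sketch follows, in current Python 3 syntax (the thread's Python 2.3 uses the print statement instead); both function names and the sample path are made up for illustration.

```python
def describe(fname):
    # "%s" gives the % operator a slot to substitute fname into.
    return "fname: %s" % fname


def broken_describe(fname):
    # No conversion specifier here, so % has an argument it cannot place
    # and raises TypeError, as in the message quoted in step (7).
    return "fname: " % fname


if __name__ == "__main__":
    sample = r"C:\mail\inbox\example.tmp"  # hypothetical file name
    print(describe(sample))
    try:
        broken_describe(sample)
    except TypeError as exc:
        print("TypeError:", exc)
```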
-- Kenny Pitt From adam.walker at rbwconsulting.com Wed Jul 14 17:12:10 2004 From: adam.walker at rbwconsulting.com (Adam Walker) Date: Wed Jul 14 17:12:16 2004 Subject: [spambayes-dev] followup: Windows implementation using Command line (forSpamBayes) In-Reply-To: <000001c468f3$d21f8840$0a75dc44@powerview> References: <000001c468f3$d21f8840$0a75dc44@powerview> Message-ID: <2D928CCB-D5A8-11D8-AC87-000A95E09D92@rbwconsulting.com> sb_bnserver is started by sb_bnfilter. sb_bnserver is a short lived server process that last a few seconds or so from the previous call to sb_bnfilter to save overhead because procmail calls sb_bnfilter once per message. As for thread-safe-ness, I'll let someone else answer that. On Jul 13, 2004, at 12:09 PM, Rob McEwen wrote: > Regarding the "windows service" question, the stuff in the sb_bnserver > script may do the trick. Most importantly, I just don't want the > performance > to be adversely impacted by the loading/unloading of the app for EVERY > SINGLE message. Also, at the same time, another concern I have is > that, if > this single program stays in memory for this purpose, wouldn't I also > need > to make sure that this program is multi-threaded for possible > concurrent > operations... AND that THIS program "gets the call" (singleton model?) From tdickenson at geminidataloggers.com Wed Jul 14 17:25:05 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Wed Jul 14 17:25:08 2004 Subject: [spambayes-dev] followup: Windows implementation using Command line (forSpamBayes) In-Reply-To: <2D928CCB-D5A8-11D8-AC87-000A95E09D92@rbwconsulting.com> References: <000001c468f3$d21f8840$0a75dc44@powerview> <2D928CCB-D5A8-11D8-AC87-000A95E09D92@rbwconsulting.com> Message-ID: <200407141625.05019.tdickenson@geminidataloggers.com> On Wednesday 14 July 2004 16:12, Adam Walker wrote: > As for thread-safe-ness, I'll let someone else answer > that. sb_bnserver is single-threaded. 
--
Toby Dickenson

From tameyer at ihug.co.nz Mon Jul 19 04:16:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jul 19 04:16:20 2004
Subject: [spambayes-dev] RE: [Spambayes-checkins] spambayes/spambayes Dibbler.py, 1.13, 1.14
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1307257BBE@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304678165@its-xchg4.massey.ac.nz>

> Update of /cvsroot/spambayes/spambayes/spambayes
> In directory sc8-pr-cvs1.sourceforge.net:/tmp/cvs-serv6773/spambayes
>
> Modified Files:
> Dibbler.py
> Log Message:
> Apply Richie's test for [ 990700 ] Changes to asyncore in
> Python 2.4 break ServerLineReader
> which seems to work for me w/ Python 2.4a1

And by "test", I mean "fix"...

=Tony Meyer

From malsburg at cl.uni-heidelberg.de Tue Jul 20 15:08:24 2004
From: malsburg at cl.uni-heidelberg.de (Titus von der Malsburg)
Date: Tue Jul 20 15:08:16 2004
Subject: [spambayes-dev] Bug?: assert hamcount <= nham
Message-ID: <20040720130823.GA19485@mother>

Hi,

I observed, that the following error happens sometimes when a message is being classified while sb_mboxtrain.py is running:

    procmail: Assigning "MAILDIR=/home/mitarb/malsburg/Maildir"
    procmail: Assigning "DEFAULT=/home/mitarb/malsburg/Maildir/"
    procmail: Assigning "ARCHIVDIR=/home/mitarb/malsburg/.procmail/archiv"
    procmail: Assigning "BAYESCUSTOMIZE=/home/mitarb/malsburg/.spambayesrc"
    procmail: Locking "hamlock"
    procmail: Executing "/home/mitarb/malsburg/usr/bin/sb_filter.py"
    Traceback (most recent call last):
      File "/home/mitarb/malsburg/usr/bin/sb_filter.py", line 257, in ?
        main()
      File "/home/mitarb/malsburg/usr/bin/sb_filter.py", line 248, in main
        action(msg)
      File "/home/mitarb/malsburg/usr/bin/sb_filter.py", line 180, in filter
        return self.h.filter(msg)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/hammie.py", line 109, in filter
        prob, clues = self._scoremsg(msg, True)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/hammie.py", line 38, in _scoremsg
        return self.bayes.spamprob(tokenize(msg), evidence)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/classifier.py", line 190, in chi2_spamprob
        clues = self._getclues(wordstream)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/classifier.py", line 493, in _getclues
        tup = self._worddistanceget(word)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/classifier.py", line 508, in _worddistanceget
        prob = self.probability(record)
      File "/home/mitarb/malsburg/usr/lib/python2.3/site-packages/spambayes/classifier.py", line 308, in probability
        assert hamcount <= nham
    AssertionError
    procmail: [19899] Tue Jul 20 14:43:10 2004
    procmail: Program failure (1) of "/home/mitarb/malsburg/usr/bin/sb_filter.py"
    procmail: Rescue of unfiltered data succeeded
    procmail: [19899] Tue Jul 20 14:43:10 2004
    procmail: Unlocking "hamlock"
    procmail: No match on "^X-SpamBayes-Classification: spam"
    procmail: No match on "^Subject: Floss2004"
    procmail: No match on "^Subject:.*HzG"
    procmail: No match on "From: newsalerts-noreply@google.com"
    procmail: No match on "^X-Mailman-Version:.*"
    procmail: Assigning "LASTFOLDER=/home/mitarb/malsburg/Maildir/new/1090327388.19899_1.janus"
    procmail: Notified comsat: "malsburg@0:/home/mitarb/malsburg/Maildir/new/1090327388.19899_1.janus"
    >From malsburg@cl.uni-heidelberg.de Tue Jul 20 14:43:08 2004
     Subject: Test5
      Folder: /home/mitarb/malsburg/Maildir/new/1090327388.19899_1.janus 639

I configured spambayes and procmail just like it is described in README.txt.
The error occured in spambayes-1.0a7 and spambayes-1.0rc2. Titus From skip at pobox.com Tue Jul 20 16:23:28 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jul 20 16:23:53 2004 Subject: [spambayes-dev] Bug?: assert hamcount <= nham In-Reply-To: <20040720130823.GA19485@mother> References: <20040720130823.GA19485@mother> Message-ID: <16637.10976.747158.937488@montanaro.dyndns.org> Titus> I observed, that the following error happens sometimes when a Titus> message is being classified while sb_mboxtrain.py is running: ... Titus> assert hamcount <= nham Titus> AssertionError You have a corrupt database. You need to provide exclusive access to the database, especially when sb_mboxtrain.py is running. I suggest you at least run it so that it operates on a separate database, then mv the new database it into place when it's finished. Skip From skip at pobox.com Fri Jul 23 07:19:37 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jul 23 15:21:42 2004 Subject: [spambayes-dev] sb_unheader.py bug? Message-ID: <16640.40937.710732.895356@montanaro.dyndns.org> Sb_unheader.py seems to completely lose messages when run with Python 2.4. When I run it as sb_unheader.py -p 'X-VM' -p 'X-Spam' -p 'X-Hammie' mbox all I get on stdout is a stream of lines like From nobody Fri Jul 23 00:13:14 2004 From nobody Fri Jul 23 00:13:14 2004 From nobody Fri Jul 23 00:13:14 2004 From nobody Fri Jul 23 00:13:14 2004 From nobody Fri Jul 23 00:13:14 2004 apparently one per input message. When run with Python 2.3 it works just fine. I thought I saw some checkins related to a change in Python 2.4. Could this be another manifestation of that problem? Any clues appreciated. Thanks, Skip From nas at arctrix.com Fri Jul 23 15:58:44 2004 From: nas at arctrix.com (Neil Schemenauer) Date: Fri Jul 23 15:58:47 2004 Subject: [spambayes-dev] sb_unheader.py bug? 
In-Reply-To: <16640.40937.710732.895356@montanaro.dyndns.org> References: <16640.40937.710732.895356@montanaro.dyndns.org> Message-ID: <20040723135844.GA24492@mems-exchange.org> On Fri, Jul 23, 2004 at 12:19:37AM -0500, Skip Montanaro wrote: > When run with Python 2.3 it works just fine. I thought I saw some > checkins related to a change in Python 2.4. Could this be another > manifestation of that problem? Any clues appreciated. I believe the 'email' package is significantly changed in 2.4. On mail.python.org I noticed that it did not parse messages that used '\r\n' as line endings. I notified Barry but I guess I should file a bug too. Neil
From skip at pobox.com Fri Jul 23 17:14:17 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jul 23 17:14:26 2004 Subject: [spambayes-dev] sb_unheader.py bug? In-Reply-To: <16640.40937.710732.895356@montanaro.dyndns.org> References: <16640.40937.710732.895356@montanaro.dyndns.org> Message-ID: <16641.11081.778208.731158@montanaro.dyndns.org> Skip> Sb_unheader.py seems to completely lose messages when run with Skip> Python 2.4. I believe I've tracked this down to a bug introduced in v 1.42 of the Python mailbox module. Backing out that change seems to solve the problem. I submitted a bug report to the Python SF project. Skip From tameyer at ihug.co.nz Sat Jul 24 02:44:02 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jul 24 02:44:10 2004 Subject: [spambayes-dev] sb_unheader.py bug? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1307258782@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046781C3@its-xchg4.massey.ac.nz> [Skip Montanaro] > When run with Python 2.3 it works just fine. I thought I saw some > checkins related to a change in Python 2.4. Could this be another > manifestation of that problem? Any clues appreciated. [Neil Schemenauer] > I believe the 'email' package is significantly changed in > 2.4. On mail.python.org I noticed that it did not parse > messages that used '\r\n' as line endings. I notified Barry > but I guess I should file a bug too. Was this pre a1? 2.4a1 seems to parse \r\n endings ok for simple messages, at least: Python 2.4a1 (#54, Jul 8 2004, 11:30:13) [MSC v.1310 32 bit (Intel)] on win32 Type "help", "copyright", "credits" or "license" for more information.
>>> import email >>> msg = email.message_from_string("Subject: Test\r\n\r\nTest Message.\r\n") >>> msg.as_string() 'Subject: Test\n\nTest Message.\r\n' >>> msg["Subject"] 'Test' >>> msg.get_payload() 'Test Message.\r\n' Although it has converted the header endings to '\n' (and since messages sent are meant to use \r\n, that means you need to do an appropriate replace before sending out a msg.as_string() - but sb_server and sb_imapfilter do this already, anyway). If there is a \r\n problem (with multipart, for example), then this would cause SpamBayes lots of strife, because we ensure that all the line endings *are* \r\n. I haven't noticed it with 2.4a1, but have done very limited testing. If you can find a problem with a1, then definitely +1 to filing a bug, too. =Tony Meyer From tameyer at ihug.co.nz Sat Jul 24 02:51:51 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jul 24 02:52:00 2004 Subject: [spambayes-dev] sb_unheader.py bug? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1307258773@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13064C0304@its-xchg4.massey.ac.nz> > I thought I saw some checkins related to > a change in Python 2.4. Could this be another manifestation > of that problem? Any clues appreciated. I realise that you've found the problem already, but in case it's of use to anyone: I've done a little bit of 2.4a1 testing with the Outlook plug-in and sb_server. I found a few problems (mostly with SpamBayes, but one with Python, which already has a fix checked in) with sb_server, but it appears to work ok now. I'll do a bit more testing when I get the chance - for one, I'm interested in seeing if the new FeedParser's invalid information (for malformed messages) can be used in the tokenizer, and whether this helps results. 
The only problems I've found so far have been with the web interface or with spambayes.Message, so if the script doesn't use that, then I have no idea :) =Tony Meyer From skip at pobox.com Mon Jul 26 00:59:20 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jul 26 15:21:02 2004 Subject: [spambayes-dev] untested idea for calculating message lengths Message-ID: <16644.15176.898072.978618@montanaro.dyndns.org> One of the problems Spambayes seems to still have on occasion is properly classifying very short messages. I came up with the attached simple way to compute a message's effective length, but have yet to test it. I think the number of tokens will be better than the actual size of the message simply because that's what the classifier munches on. I will try to test it this week, but I don't have a large corpus at this point and thought I'd toss it out there in case others have some time to look at it. It just counts tokens as they are generated and tacks on a "token length:N" token to the end of the stream where N is the base two log of the number of tokens in the message. Skip -------------- next part -------------- A non-text attachment was scrubbed... Name: tokenizer.diff Type: application/octet-stream Size: 598 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040725/1d84e8b9/tokenizer.obj From matt at mondoinfo.com Mon Jul 26 22:44:45 2004 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Mon Jul 26 22:44:51 2004 Subject: [spambayes-dev] untested idea for calculating message lengths In-Reply-To: <16644.15176.898072.978618@montanaro.dyndns.org> References: <16644.15176.898072.978618@montanaro.dyndns.org> Message-ID: <1090873853.3.1026@mint-julep.mondoinfo.com> > One of the problems Spambayes seems to still have on occasion is > properly classifying very short messages. I came up with the > attached simple way to compute a message's effective length, but > have yet to test it. 
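The scheme in the quoted text (count tokens as they are generated, then append one synthetic length token) can be sketched as below. The attached diff was scrubbed from the archive, so this is only a guess at its shape: the function name `with_length_token`, flooring the base-two log to an integer, and emitting nothing extra for an empty stream are all assumptions, and the code is Python 3 rather than the tokenizer's own.

```python
import math


def with_length_token(tokens):
    """Pass tokens through unchanged, then append a 'token length:N' token.

    N is the base-two log of the token count, floored to an integer
    (the rounding is an assumption; the original patch is not available).
    """
    count = 0
    for tok in tokens:
        count += 1
        yield tok
    if count:  # log2(0) is undefined, so an empty message gets no length token
        yield "token length:%d" % int(math.log2(count))


if __name__ == "__main__":
    words = "buy now cheap pills click here today only".split()
    print(list(with_length_token(words)))
```

Bucketing by the logarithm keeps the number of distinct length tokens small, so each bucket sees enough messages to pick up a meaningful score.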

The patch gives me a tiny improvement over all defaults:

-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams

filename:      normal       skip
ham:spam:   1000:1000  1000:1000
fp total:           1          1
fp %:            0.10       0.10
fn total:           5          5
fn %:            0.50       0.50
unsure t:          60         58
unsure %:        3.00       2.90
real cost:     $27.00     $26.60
best cost:     $16.60     $16.40
h mean:          0.32       0.32
h sdev:          3.78       3.80
s mean:         97.33      97.37
s sdev:         11.14      11.12
mean diff:      97.01      97.05
k:               6.50       6.50

But once I turn on mine_received_headers, the improvement is lost:

-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams
-> tested 200 hams & 200 spams against 800 hams & 800 spams

filename:   mine-received  mine-received-skip
ham:spam:       1000:1000           1000:1000
fp total:               0                   0
fp %:                0.00                0.00
fn total:               4                   4
fn %:                0.40                0.40
unsure t:              52                  52
unsure %:            2.60                2.60
real cost:         $14.40              $14.40
best cost:          $8.80               $9.20
h mean:              0.26                0.26
h sdev:              3.16                3.21
s mean:             98.15               98.15
s sdev:              9.02                9.04
mean diff:          97.89               97.89
k:                   8.04                7.99

Regards,
Matt

From skip at pobox.com Tue Jul
27 03:01:10 2004
From: skip at pobox.com (Skip Montanaro)
Date: Tue Jul 27 03:01:17 2004
Subject: [spambayes-dev] untested idea for calculating message lengths
In-Reply-To: <1090873853.3.1026@mint-julep.mondoinfo.com>
References: <16644.15176.898072.978618@montanaro.dyndns.org>
    <1090873853.3.1026@mint-julep.mondoinfo.com>
Message-ID: <16645.43350.757857.960960@montanaro.dyndns.org>

    Matt> It gives me a tiny improvement over all defaults:
    ...
    Matt> But once I turn on mine_received_headers, the improvement is lost:
    ...

Do you have any way of judging its performance on very small (and maybe
very large) messages? It only occurred to me to try it based upon the
known problems with very short messages. If it's a wash for normal-sized
messages that would be fine for me. It's extremely cheap to compute and
only adds a few tokens to the database.

I just ran tte over my current modest database (100 hams, 316 spams) with
the token length token enabled. Here's what turned up:

    token               nspam   nham
    token length:4          0      1
    token length:5          9      5
    token length:6         19     54
    token length:7         39    111
    token length:8         11     76
    token length:9          9     37
    token length:10         3      9
    token length:11         3      0
    token length:12         4      0
    token length:13         1      0
    token length:16         1      0

As you can see I haven't got many very short messages in my current
collection. Based upon what I do have, it actually seems to be a better
spam clue for large messages.

Skip

From matt at mondoinfo.com Wed Jul 28 20:27:30 2004
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Wed Jul 28 20:28:24 2004
Subject: [spambayes-dev] DNS TTL clues OK, but not very good
Message-ID: <1091036693.71.2326@mint-julep.mondoinfo.com>

A few days ago, someone on the NANOG list mentioned that spammers
routinely use very low time-to-live values for their DNS records. Once I'd
heard that, it seemed obvious; a spammer would want to make his
spamvertized URL resolve to a new server quickly if the original server
was shut down.

Since I'm already using the IP address of the host part of a spamvertized
URL as a synthetic token, it was easy enough to generate tokens for the
TTL value. As an easy way of putting the values into buckets, I used the
log (I tried both base 10 and base e) of the value, rounded to the nearest
integer.

Alas, it didn't work all that well. Still, I thought I'd post the results
here in case someone else had the same idea.

On my mail, it was somewhat better than all-defaults and almost as good as
turning on mine_received_headers. But it wasn't as good as using the IP
address of the server (the address itself and the value masked at /8, /16,
/24). Using both the address and the TTL wasn't any better than using the
address alone.

Regards,
Matt
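
The two kinds of synthetic token compared above can be sketched as
follows. The token prefixes (url-ttl:, url-ip:), both function names, and
the handling of a zero TTL are invented for illustration and are not taken
from Matt's code. Only IPv4 addresses are assumed, and the DNS lookup that
would supply the TTL and the address is left out.

```python
import ipaddress
import math


def ttl_token(ttl_seconds, base=10):
    """Bucket a DNS TTL by the log of its value, rounded to nearest.

    Hypothetical helper; the "url-ttl:" prefix is made up.
    """
    if ttl_seconds <= 0:
        # log() is undefined here, so use a separate bucket (assumption).
        return "url-ttl:zero"
    return "url-ttl:%d" % round(math.log(ttl_seconds, base))


def address_tokens(ip):
    """Tokens for an IPv4 address itself and its /8, /16 and /24 networks.

    Hypothetical helper; the "url-ip:" prefix is made up.
    """
    tokens = ["url-ip:%s/32" % ip]
    for prefix in (8, 16, 24):
        # strict=False masks off the host bits instead of raising.
        network = ipaddress.ip_network("%s/%d" % (ip, prefix), strict=False)
        tokens.append("url-ip:%s" % network)
    return tokens


if __name__ == "__main__":
    print(ttl_token(300))      # a five-minute TTL
    print(ttl_token(86400))    # a one-day TTL
    print(address_tokens("192.0.2.55"))
```

With base 10 the buckets are orders of magnitude, so TTLs of a few minutes
and TTLs of a day end up in different buckets; passing math.e as the base
gives the finer-grained natural-log variant mentioned above.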