From tameyer at ihug.co.nz Thu Jan 1 19:21:14 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 1 19:21:19 2004 Subject: [spambayes-dev] converToMbox broken In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C307@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A28@its-xchg4.massey.ac.nz> > Further to my insane ramblings the other day about trying to > train on dbx files with the web interface, the only way I could get > oe_mailbox.convertToMbox() to work was by adding: > > from time import * > > at the top of the file. Otherwise you get "global symbol strftime not > found" Sorry, this is my fault. I've checked in a fix (oe_mailbox.py 1.6). Thanks for pointing it out! =Tony Meyer From tim.one at comcast.net Thu Jan 1 20:18:39 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Jan 1 20:18:41 2004 Subject: [spambayes-dev] Strange performance dip and DBRunRecoveryErrorretreat In-Reply-To: Message-ID: [Richie Hindle] > As part of trying to reproduce the DBRunRecoveryError problems (a task > that I'm giving up on for now - see below) I've written a script to > hammer the core SpamBayes code, repeatedly training and classifying > using faked-up messages. It manages about 40 train-and-classify > loops per second on my 2.4GHz P4, *except* between about 100 and 400 > messages, when the performance drops to about a tenth of that and > then recovers. > > I've done enough investigation to know that the time is being spent > in the core SpamBayes code and not my script, Is that a true dichotomy? That is, do you know, for example, that the time is being spent in the core spambayes code as distinct from the Berkeley database library, or distinct from random network traffic other programs are engaging in? Or is it that you just know it's not in your script, and you divide the universe into "my script" and "the core SpamBayes code" here? > that it's only the occasional message that takes a long time (around > a second in a few cases) and that it can be either training or > classifying that slows down. > > I've committed the script as testtools/hammer.py, and I offer this as > a curiosity to anyone interested. I'm not going to pursue this myself > because I've never seen a similar complaint about real-world SpamBayes > use. Well, Python certainly doesn't make any real-time guarantees, and I doubt Sleepycat, or even your OS, do either. So long as it recovers, I don't think there's anything worth investigating. It could be Python resizing a large dict, or an all-generations garbage collection cycle, or Sleepycat rearranging its memory allocation, or the OS rearranging swap space, ..., there's just no limit on what it *could* be. > .... > I don't think the script is going to be a lot of use in tracking down > DBRunRecoveryErrors - it *will* reproduce them as it is, but only by > mimicking a bug that was fixed in 1.0a6, and people have still been > complaining about DBRunRecoveryErrors in 1.0a6 and 1.0a7. Thanks for the effort! Maybe somebody else can complicate it now in a way that does provoke DBRunRecoveryErrors. It's never what you expect . > Having read up on full-mode bsddb, and bsddb-backed ZODB (including > the phrases "The underlying Berkeley database technology requires > maintenance, careful system resource planning, and tuning for > performance." and "BerkeleyDB never deletes "old" log files. 
> Eventually, if you do not maintain your Berkeley database by deleting > "old" log files, you will run out of disk space") I've given up - for > the moment at least - on trying to use full-mode bsddb (with or > without ZODB). That's par for the course for "a real" database. Even plain FileStorage-backed ZODB requires ongoing maintenance, including periodic "packing" to prevent unbounded growth, and religiously observed backups. It's all this extra hair that makes a real database robust against most of the things that can go wrong. But also for that reason, it's unusual to see "a real database" solution in consumer-grade applications. We could write our own database specialized to our project's specific needs, and probably get that working faster and better than any general-purpose beast. But my interest in that was fully satisifed by pickling a giant dict . > sb_server users should use a pickle and be done with it. I've been saying that for a decade . Before you get too sick of it, you might also want to investigate Neil Schemenauer's adaptation of spambayes for cdb. cdb is an efficient and essentially worry-free disk-based database. It buys this at the cost of *not* being incrementally updatable: you can replace the whole thing atomically, in one giant gulp, but that's it. If you don't need incremental training with instantly-visible effects, I bet it's an excellent approach. There are no worries about synchronizing concurrent reads and writes, simply because there are no writes. Looks like there *are* worries about concurrent reads, though: http://cr.yp.to/cdb/reading.html Beware that these functions may rely on non-atomic operations on the fd ofile, such as seeking to a particular position and then reading. Do not attempt two simultaneous database reads using a single ofile. Robust all-purpose database implementation is damned hard. > Maybe we should change the default. Maybe it's five to two and I > should be in bed. It's a pit, isn't it? If it's any consolation, even Unix mboxes get corrupted, and nothing is simpler than "append at the end". From tameyer at ihug.co.nz Thu Jan 1 21:58:42 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 1 21:58:50 2004 Subject: [spambayes-dev] Strange performance dip andDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C4B2@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A29@its-xchg4.massey.ac.nz> [Richie Hindle] > It manages about 40 train-and-classify loops > per second on my 2.4GHz P4, *except* between about 100 and 400 > messages, when the performance drops to about a tenth of that and then > recovers. > > I've done enough investigation to know that the time is being spent in > the core SpamBayes code and not my script, [Tim Peters] > Is that a true dichotomy? That is, do you know, for example, > that the time is being spent in the core spambayes code as > distinct from the Berkeley database library, or distinct from > random network traffic other programs are engaging in? Or is > it that you just know it's not in your script, and you divide > the universe into "my script" and "the core SpamBayes code" here? I see it too, roughly in the same place. I also see it if I get hammer.py to use a pickle (although the drop isn't as big), which presumably means it's not Berkeley related. Tim's guess of something Python is doing is probably most likely. It doesn't seem significant, though. 
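For anyone who wants to poke at this themselves, here is a minimal sketch of the kind of per-call timing instrumentation being discussed. It is not hammer.py itself; the core classifier's learn()/spamprob() calls and the `messages` list of (tokens, is_spam) pairs are stand-ins for whatever the real harness drives.

    import time

    def hammer_with_timing(classifier, messages, threshold=0.1):
        """Train on, then score, each faked-up message, and report any
        single call that takes longer than `threshold` seconds, so a slow
        stretch can be localised to training vs. classifying."""
        for i, (tokens, is_spam) in enumerate(messages):
            t0 = time.time()
            classifier.learn(tokens, is_spam)
            t1 = time.time()
            classifier.spamprob(tokens)
            t2 = time.time()
            if t1 - t0 > threshold:
                print "message %d: train took %.2fs" % (i, t1 - t0)
            if t2 - t1 > threshold:
                print "message %d: classify took %.2fs" % (i, t2 - t1)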
[Richie] > I've committed the script as testtools/hammer.py, Actually utilities/hammer.py :) [Tim] > Thanks for the effort! Maybe somebody else can complicate it > now in a way that does provoke DBRunRecoveryErrors. It's > never what you expect . Richie - around about how many messages could you do before it crashed? It crashes early (~1700) for me with the 1.0a6 reopen-before-closing bug (sometimes a DB_RUNRECOVERY, sometimes a ham/spam count) but if I get it to always close or save the db before reopening, it just goes and goes - I got to 57900 before Python crashed *. If I get it to close before reopening, then interrupt (ctrl-c) it at some point after its done its first save, and restart it (without it deleting the existing db file) it will chug along for a while, and then at some later point die with a RUNRECOVERY. Restarting it then will provoke an immediate RUNRECOVERY. This suggests (I think) that reopening the db without closing it can cause a RUNRECOVERY error at some later point - even several reopenings later - rather than immediately like I expected. Maybe this is one of the causes for the RUNRECOVERY errors - the user doesn't close sb_server properly, so the db isn't properly closed/saved. Some time later the error occurs. It would explain why they are less frequent with the plug-in, because the plug-in saves the db much more often (after every "delete as spam"/"recover from spam" event, and IIRC after any incremental training). Does anyone see how it could hurt to have sb_server save the db after doing a page of training? (This would just be a one line addition to onReview in ProxyUI.py). =Tony Meyer * Everything crashes sooner or later on this machine - python.exe, gcc.exe, IE, ... I'm sure that it's unrelated to spambayes or python. From tim.one at comcast.net Thu Jan 1 22:41:03 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Jan 1 22:41:04 2004 Subject: [spambayes-dev] Strange performance dip andDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A29@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > ... > Maybe this is one of the causes for the RUNRECOVERY errors - the user > doesn't close sb_server properly, so the db isn't properly > closed/saved. Some time later the error occurs. It would explain why > they are less frequent with the plug-in, because the plug-in saves > the db much more often (after every "delete as spam"/"recover from > spam" event, and IIRC after any incremental training). Well, yes and no . It depends on the backend database. If it's a giant pickled dict, the addin almost never saves it (just at "full retrain" and "shutdown" times). But if it's a Berkeley backend, the db is saved after every training event, via DBDictClassifier.store(), which calls the db's sync() method at the end. I think you may be on to something here! It's always baffled me that I couldn't provoke a DB corruption problem from Outlook even when I deliberately power-cycled the box *while* spambayes was scoring new messages. I damned near lost my main .pst file doing crap like that, but the Berkeley DB was never bothered. But the db is never in an unsync'ed state during scoring, it's only in an unsync'ed state after DBDictClassifier._wordinfoset() writes out a hapax, or __wordinfodel() removes a key from the database, during learning or unlearning, and then the addin syncs again right after the (single) message is learned or unlearned. > Does anyone see how it could hurt to have sb_server save the db after > doing a page of training? 
(This would just be a one line addition to > onReview in ProxyUI.py). +1 on trying it. The corruption problems are critical, and this may well help. Hell, sync it after each message gets trained. > ... > * Everything crashes sooner or later on this machine - python.exe, > gcc.exe, IE, ... I'm sure that it's unrelated to spambayes or python. Ya, that's a well-known Linux bug . Computers suck, you know. From mhammond at skippinet.com.au Thu Jan 1 23:12:36 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Jan 1 23:12:48 2004 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A21@its-xchg4.massey.ac.nz> Message-ID: <141601c3d0e6$a8361d00$2c00a8c0@eden> {Tony] > > I really don't like the code in Options.py that handles the > > default values for these storage items. I'm not sure it is > > to blame, but it did cause me to see a new .db file created > > in the cwd, rather than the data directory - as my INI file > > already existed, it didn't get the default FQN for the new option. > > The idea was that if you already had an ini file, then you > had already set > things up (you were an existing user), and so we wouldn't > want to fiddle > with your setup, whatever it was, because that might mean > that we lose track > of your databases. That sounds like a worthy goal, but I don't see how the existing code solves it. Indeed, whenever we add a new option, things go quite wrong - the INI file exists, so the new option is not written to an existing user's ini file - hence, a semi-"ramdom" file is chosen (where the randomness comes from whatever the CWD happens to be). > (This could even be with the default "hammie.db" in the > cwd for persistent_storage_file, if the user always runs the > script from the > same directory). Doesn't that still work? Even when there is no INI file on disk, the INI file has a logical path - the one where we would write it when asked - so the same concept still applies. > Part of the problem is that when consolidating the storage > name options, I > picked "hammie.db" over "~/.hammiedb", which was by far the > better option. I think here you are only deciding between making the default for a specific option a relative or absolute filename (where ~/whatever obviously expands absolutely). The scheme still works regardless of the decision made for a specific option. > I didn't realise that it would expand quite nicely on Windows > (well, 2k and > XP; I presume earlier as well), and I presume on Macs as well. It "works" on Windows, but by default is unlikely to give you the directory you expect. Cygwin users, for example, are likely to have it expand to an absolute path, but not the one where SpamBayes should store its INI file. On Windows, HOME is likely to be set to satisfy the most valuable, but braindead, Windows port of a Unix app is used . > The proper default, of course, is still "hammie.db". Actually, the 'of course' was not at all obvious to me. I assumed the default was still going to be 'statistics_database.db' - I hadn't considered the possibility the default value would be changed, but assumed only the paths were being mangled. > When I > put that code > in to put things into a better place for Windows users I > figured that since > they wouldn't have an existing db, a more easily > understandable name would > be good, too, but I didn't think that I could change it for > everyone. Do we > continue setting this option to "statistics_database.db" in > that place? (Without the FQN). 
-1 - Clearly the new name is not better just for 'windows users', so Windows should get no special treatment. If the new name is truly better for everyone, then it should be changed. It just makes the code more complex for absolutely no benefit, and as I mentioned above, has already seriously misled me. > The same code also gives new default > names for the cache > directories and messageinfo db. Again, -1. The only thing special here for Windows users is 'the default data directory', so the code unique to Windows should deal only with this issue. Had I noticed the changing of the default values, I certainly would have included that in my list of things to remove. > What about these? > [TestDriver] spam_directories > [TestDriver] ham_directories Assuming no special casing of these options for Windows , the current default appears to be a relative path name. I see no good reason to continue to allow these to be relative to the cwd rather than relative to the INI file used to control the tests. If there is a clear reason I missed, just express them as comments where the call is *not* made. > What about this? I'd be inclined to avoid the expanduser() on Windows. Either skip the call completely, or special case it to merge in our default data directory. I really don't see how the special casing would be used by more than a handful of geeks, so I reckon you should just skip it. > How do people feel about having this happen implicitly when > one of these > options is used, rather than explicitly? (I worry that we'll miss an > occurrence of one of them, or that someone (maybe me!) will > add new code and > forget to use the get_pathname_option function). Something like this:

>     [Current code in OptionsClass.py]
>     def get(self):
>         '''Get option value.'''
>         return self.value
>
>     [Proposed]
>     def expand_path(value):
>         filename = os.path.expanduser(value)
>         if os.path.isabs(filename):
>             return filename
>         return os.path.join(os.path.dirname(optionsPathname),  # existing global

Certainly don't want to use a module global here. I'm -0 on this, unless the proposed patch really would be much better and clearer with new magic. Testing will show up failure to call this function pretty quickly, as the file will be created in the CWD - exactly the issue I had, which led us to this point. Mark. From tameyer at ihug.co.nz Fri Jan 2 00:56:31 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri Jan 2 00:56:38 2004 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C4E7@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777D2@its-xchg4.massey.ac.nz> [Tony] > The idea was that if you already had an ini file, then you > had already set things up and so we wouldn't want to fiddle > with your setup [Mark] > That sounds like a worthy goal, but I don't see how the > existing code solves it. Well, if there's already an ini, it doesn't touch the setup :) But I don't really care - this can be replaced with a warning in the "what's new" file pointing out that things might move about with the next release. ["hammie.db" -> "statistics_database.db"] > -1 - Clearly the new name is not better just for 'windows > users', so Windows should get no special treatment. If the > new name is truly better for everyone, then it should be changed. Fair enough; that wasn't one of my better decisions. (My reasoning was that it improved things for new users, and old users could continue unaffected, which would not be the case if the default changed.) 
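For reference, here is a runnable sketch of that proposed expand_path() helper with the line that was cut off in the quote filled in (the join target is assumed to be the expanded filename). The optionsPathname value shown is purely illustrative; in the real module it stands for the existing global recording where the options file lives.

    import os

    # Illustrative stand-in for the existing module-level global that
    # records where the loaded INI file lives.
    optionsPathname = os.path.expanduser("~/.spambayesrc")

    def expand_path(value):
        """Expand ~, and anchor relative paths at the INI file's
        directory instead of whatever the CWD happens to be."""
        filename = os.path.expanduser(value)
        if os.path.isabs(filename):
            return filename
        return os.path.join(os.path.dirname(optionsPathname), filename)

    # e.g. expand_path("hammie.db") -> <directory of the INI file>/hammie.db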
[Mark] > I'd be inclined to avoid the expanduser() on Windows. Either > skip the call completely, or special case it to merge in our > default data directory. I really don't see how the special > casing would be used by more than a handful of geeks, so I > reckon you should just skip it. You haven't convinced me of this, though. Surely it's better to have "x = os.path.expanduser(x)" than it is to have "if not sys.platform == 'win32': x = os.path.expanduser(x)"? Having the call doesn't seem to cost us anything, but if we don't, then those rare geeks that try to plug in their Linux setup on Windows (or a properly setup cygwin: '~' gives me '/home/tameyer') will be confused. What about the attached patch? =Tony Meyer -------------- next part -------------- A non-text attachment was scrubbed... Name: relpatch.diff Type: application/octet-stream Size: 23993 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040102/dfedebb1/relpatch-0001.obj From richie at entrian.com Fri Jan 2 03:48:46 2004 From: richie at entrian.com (Richie Hindle) Date: Fri Jan 2 03:48:51 2004 Subject: [spambayes-dev] Strange performance dip and DBRunRecoveryErrorretreat In-Reply-To: References: Message-ID: <9cbavv8lf10ihpvm93tmoac0spsr1d6vl3@4ax.com> [Richie] > I've done enough investigation to know that the time is being spent > in the core SpamBayes code and not my script, [Tim] > Is that a true dichotomy? [...] Or is it that you just know it's not > in your script, and you divide the universe into "my script" and "the > core SpamBayes code" here? Sorry, yes, I mean "it's not my script". Adding print statements before and after calls to train() and classify() occasionally shows a delay within those functions, but only between messages 100 and 400 (or thereabouts). Whether the time is being spent in our code, the BerkeleyDB code or the OS, I don't know. > We could write our own database specialized to our project's specific needs, > and probably get that working faster and better than any general-purpose > beast. That was my conclusion too, and I'm not about to write it either. 8-) > Before you get too sick of it, > you might also want to investigate Neil Schemenauer's adaptation of > spambayes for cdb. cdb is an efficient and essentially worry-free > disk-based database. It buys this at the cost of *not* being incrementally > updatable: you can replace the whole thing atomically, in one giant gulp, > but that's it. I'll have a look - thanks for the heads-up. > It's a pit, isn't it? Is it ever. -- Richie Hindle richie@entrian.com From richie at entrian.com Fri Jan 2 03:49:55 2004 From: richie at entrian.com (Richie Hindle) Date: Fri Jan 2 03:50:01 2004 Subject: [spambayes-dev] Strange performance dip andDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A29@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130499C4B2@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A29@its-xchg4.massey.ac.nz> Message-ID: [Richie] > I've committed the script as testtools/hammer.py, [Tony] > Actually utilities/hammer.py :) Like I said, it was 2am. 8-) > Richie - around about how many messages could you do before it crashed? It > crashes early (~1700) for me with the 1.0a6 reopen-before-closing bug > (sometimes a DB_RUNRECOVERY, sometimes a ham/spam count) but if I get it to > always close or save the db before reopening, it just goes and goes - I got > to 57900 before Python crashed *. 
With the bug it would get to a few tens of thousands, same as you. Without it, it went to 250,000 before I killed it. > If I get it to close before reopening, then interrupt (ctrl-c) it at some > point after its done its first save, and restart it (without it deleting the > existing db file) it will chug along for a while, and then at some later > point die with a RUNRECOVERY. Restarting it then will provoke an immediate > RUNRECOVERY. This suggests (I think) that reopening the db without closing > it can cause a RUNRECOVERY error at some later point - even several > reopenings later - rather than immediately like I expected. Agreed, though I never noticed the "even several reopenings later" part, probably due to impatience... [Tony] > Maybe this is one of the causes for the RUNRECOVERY errors - the user > doesn't close sb_server properly, so the db isn't properly closed/saved. > Some time later the error occurs. It would explain why they are less > frequent with the plug-in, because the plug-in saves the db much more often > (after every "delete as spam"/"recover from spam" event, and IIRC after any > incremental training). [Tim] > I think you may be on to something here! Sadly not. sb_server saves the db after every train as well, out of paranoia. The page should always say "Training... Saving... Done". If there's a way of training without saving, maybe that's the problem, but I don't believe there is...? -- Richie Hindle richie@entrian.com From skip at pobox.com Fri Jan 2 08:58:22 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 2 08:58:37 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: References: <16369.61971.794205.276663@montanaro.dyndns.org> Message-ID: <16373.30974.675768.999969@montanaro.dyndns.org> Happy New Year everyone... As Tim predicted, mixing his url cracking ideas with mine leads to better performance than either of our ideas in isolation. 
Using the attached patch, I get this summary output for a 10x10 timcv run: stds.txt -> pickurlss.txt -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 1.667 1.667 tied 0.833 0.833 tied 0.833 0.833 tied 0.000 0.000 tied 0.000 0.000 tied 0.833 0.833 tied won 0 times tied 10 times lost 0 times total unique fp went from 5 to 5 tied mean fp % went from 0.416666666667 to 0.416666666667 tied false negative percentages 7.874 7.874 tied 6.299 6.299 tied 9.449 9.449 tied 9.449 9.449 tied 10.236 10.236 tied 5.512 5.512 tied 7.087 6.299 won -11.12% 5.556 5.556 tied 7.937 7.937 tied 8.661 8.661 tied won 1 times tied 9 times lost 0 times total unique fn went from 99 to 98 won -1.01% mean fn % went from 7.80589926259 to 7.72715910511 won -1.01% ham mean ham sdev 2.11 2.12 +0.47% 12.36 12.36 +0.00% 3.28 3.33 +1.52% 14.07 14.13 +0.43% 1.11 1.13 +1.80% 6.75 6.86 +1.63% 1.13 1.12 -0.88% 5.90 5.86 -0.68% 3.44 3.43 -0.29% 14.07 14.06 -0.07% 3.66 3.65 -0.27% 15.31 15.30 -0.07% 3.68 3.67 -0.27% 13.65 13.62 -0.22% 1.10 1.10 +0.00% 6.93 6.93 +0.00% 1.70 1.78 +4.71% 8.80 9.02 +2.50% 3.49 3.49 +0.00% 14.57 14.58 +0.07% ham mean and sdev for all runs 2.47 2.48 +0.40% 11.83 11.85 +0.17% spam mean spam sdev 84.79 84.96 +0.20% 29.71 29.56 -0.50% 88.72 88.85 +0.15% 26.91 26.91 +0.00% 83.53 83.99 +0.55% 30.40 30.26 -0.46% 85.69 85.97 +0.33% 29.57 29.60 +0.10% 84.47 84.59 +0.14% 30.42 30.45 +0.10% 89.08 89.25 +0.19% 24.73 24.56 -0.69% 87.08 87.73 +0.75% 27.80 27.05 -2.70% 88.44 88.48 +0.05% 25.70 25.67 -0.12% 87.20 87.23 +0.03% 28.53 28.54 +0.04% 86.46 86.47 +0.01% 27.85 27.88 +0.11% spam mean and sdev for all runs 86.54 86.75 +0.24% 28.28 28.17 -0.39% ham/spam mean difference: 84.07 84.27 +0.20 I also ran with bigrams enabled. 
That helped more: stds.txt -> pickbis.txt -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 126 spams against 1080 hams & 1142 spams -> tested 120 hams & 127 spams against 1080 hams & 1141 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 1.667 1.667 tied 0.833 0.833 tied 0.833 0.833 tied 0.000 0.833 lost +(was 0) 0.000 0.000 tied 0.833 0.833 tied won 0 times tied 9 times lost 1 times total unique fp went from 5 to 6 lost +20.00% mean fp % went from 0.416666666667 to 0.5 lost +20.00% false negative percentages 7.874 6.299 won -20.00% 6.299 4.724 won -25.00% 9.449 6.299 won -33.34% 9.449 5.512 won -41.67% 10.236 4.724 won -53.85% 5.512 1.575 won -71.43% 7.087 5.512 won -22.22% 5.556 5.556 tied 7.937 7.937 tied 8.661 2.362 won -72.73% won 8 times tied 2 times lost 0 times total unique fn went from 99 to 64 won -35.35% mean fn % went from 7.80589926259 to 5.04999375078 won -35.31% ham mean ham sdev 2.11 1.61 -23.70% 12.36 10.88 -11.97% 3.28 2.85 -13.11% 14.07 12.69 -9.81% 1.11 1.05 -5.41% 6.75 6.13 -9.19% 1.13 1.00 -11.50% 5.90 4.72 -20.00% 3.44 3.19 -7.27% 14.07 14.75 +4.83% 3.66 3.45 -5.74% 15.31 15.27 -0.26% 3.68 2.67 -27.45% 13.65 11.70 -14.29% 1.10 1.85 +68.18% 6.93 10.11 +45.89% 1.70 1.93 +13.53% 8.80 9.23 +4.89% 3.49 3.31 -5.16% 14.57 14.97 +2.75% ham mean and sdev for all runs 2.47 2.29 -7.29% 11.83 11.60 -1.94% spam mean spam sdev 84.79 86.82 +2.39% 29.71 27.17 -8.55% 88.72 90.26 +1.74% 26.91 24.30 -9.70% 83.53 87.45 +4.69% 30.40 26.76 -11.97% 85.69 88.25 +2.99% 29.57 27.35 -7.51% 84.47 88.02 +4.20% 30.42 25.64 -15.71% 89.08 92.22 +3.52% 24.73 21.06 -14.84% 87.08 91.45 +5.02% 27.80 23.48 -15.54% 88.44 89.02 +0.66% 25.70 26.08 +1.48% 87.20 87.78 +0.67% 28.53 28.58 +0.18% 86.46 90.65 +4.85% 27.85 23.02 -17.34% spam mean and sdev for all runs 86.54 89.19 +3.06% 28.28 25.50 -9.83% ham/spam mean difference: 84.07 86.90 +2.83 Skip -------------- next part -------------- Index: spambayes/Options.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/Options.py,v retrieving revision 1.97 diff -c -r1.97 Options.py *** spambayes/Options.py 30 Dec 2003 16:26:33 -0000 1.97 --- spambayes/Options.py 2 Jan 2004 13:57:56 -0000 *************** *** 145,150 **** --- 145,155 ---- """(DEPRECATED) Extract day of the week tokens from the Date: header.""", BOOLEAN, RESTORE), + ("x-pick_apart_urls", "Extract clues about url 
structure", False, + """(EXPERIMENTAL) Note whether url contains non-standard port or + user/password elements.""", + BOOLEAN, RESTORE), + ("replace_nonascii_chars", "Replace non-ascii characters", False, """If true, replace high-bit characters (ord(c) >= 128) and control characters with question marks. This allows non-ASCII character Index: spambayes/tokenizer.py =================================================================== RCS file: /cvsroot/spambayes/spambayes/spambayes/tokenizer.py,v retrieving revision 1.27 diff -c -r1.27 tokenizer.py *** spambayes/tokenizer.py 30 Dec 2003 16:26:33 -0000 1.27 --- spambayes/tokenizer.py 2 Jan 2004 13:57:56 -0000 *************** *** 13,18 **** --- 13,20 ---- import time import os import binascii + import urlparse + import urllib try: from sets import Set except ImportError: *************** *** 1012,1025 **** def tokenize(self, m): proto, guts = m.groups() tokens = ["proto:" + proto] pushclue = tokens.append # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. # or # I found it at http://mystuff.org/there/. Thanks! - assert guts while guts and guts[-1] in '.:?!/': guts = guts[:-1] for piece in guts.split('/'): --- 1014,1073 ---- def tokenize(self, m): proto, guts = m.groups() + assert guts tokens = ["proto:" + proto] pushclue = tokens.append + if options["Tokenizer", "x-pick_apart_urls"]: + url = proto + "://" + guts + + escapes = re.findall(r'%..', guts) + # roughly how many %nn escapes are there? + if escapes: + pushclue("url:%%%d" % int(log2(len(escapes)))) + # %nn escapes are usually intentional obfuscation. Generate a + # lot of correlated tokens if the URL contains a lot of them. + # The classifier will learn which specific ones are and aren't + # spammy. + tokens.extend(["url:" + escape for escape in escapes]) + + # now remove any obfuscation and probe around a bit + url = urllib.unquote(url) + scheme, netloc, path, params, query, frag = urlparse.urlparse(url) + + # one common technique in bogus "please (re-)authorize yourself" + # scams is to make it appear as if you're visiting a valid + # payment-oriented site like PayPal, CitiBank or eBay, when you + # actually aren't. The company's web server appears as the + # beginning of an often long username element in the URL such as + # http://www.paypal.com%65%43%99%35@10.0.1.1/iwantyourccinfo + # generally with an innocuous-looking fragment of text or a + # valid URL as the highlighted link. Usernames should rarely + # appear in URLs (perhaps in a local bookmark you established), + # and never in a URL you receive from an unsolicited email or + # another website. + user_pwd, host_port = urllib.splituser(netloc) + if user_pwd is not None: + pushclue("url:has user") + + host, port = urllib.splitport(host_port) + # web servers listening on non-standard ports are suspicious ... + if port is not None: + if (scheme == "http" and port != '80' or + scheme == "https" and port != '443'): + pushclue("url:non-standard %s port" % scheme) + + # ... as are web servers associated with raw ip addresses + if re.match("(\d+\.?){4,4}$", host) is not None: + pushclue("url:ip addr") + + # make sure we later tokenize the unobfuscated url bits + proto, guts = url.split("://", 1) + # Lose the trailing punctuation for casual embedding, like: # The code is at http://mystuff.org/here? Didn't resolve. # or # I found it at http://mystuff.org/there/. Thanks! 
while guts and guts[-1] in '.:?!/': guts = guts[:-1] for piece in guts.split('/'): From skip at pobox.com Fri Jan 2 09:10:53 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 2 09:11:11 2004 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046777BD@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304985F9F@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13046777BD@its-xchg4.massey.ac.nz> Message-ID: <16373.31725.12227.347825@montanaro.dyndns.org> >> Thanks for the new installer. I tried it out on my little-used Win2k >> machine. While it seemed to install fine, the tray icon does nothing >> but briefly change the pointer to an hourglass. I double-clicked the >> sb_server.exe icon and it popped up a window then immediately went >> away. Tony> In your temp directory (C:\Documents and Settings\[username]\Local Tony> Settings\Temp in Win2k, I think) there should be some Tony> SpamBayesServerN.log files (where N is a number). Could you grab Tony> any that are there and mail them to me/the list? Sorry for the delay. Just got back to the office today. Here's the SpamBayesServer1.log file (all four I found were identical): Traceback (most recent call last): File "pop3proxy_tray.py", line 100, in ? File "sb_server.pyc", line 100, in ? File "spambayes\message.pyc", line 201, in ? File "spambayes\message.pyc", line 136, in __init__ File "spambayes\message.pyc", line 148, in load File "pickle.pyc", line 1390, in load File "pickle.pyc", line 872, in load KeyError: '\x00' Skip From olivier at bigfoot.com Fri Jan 2 10:20:41 2004 From: olivier at bigfoot.com (Olivier Zyngier) Date: Fri Jan 2 10:20:38 2004 Subject: [spambayes-dev] Outlook addin: bug in "Show data Folder" button Message-ID: I modified the data directory by adding a file "default_configuration.ini" in the spamBayes application directory. This file contains: [General] data_directory: c:\data\outlook\spamBayes Everything seems to work fine, i.e. my database is now in the directory I chose. However, the "Show data Folder" button in the "Advanced" Spam Bayes Manager still tries to go to "Document and Settings/...." I am using Outlook 2000 SP-3 on Windows XP SP1. Please don't hesitate to conatct me if you need more info. Great program by the way, it seems to work flawlessly. Thanks, Olivier. ________________________ Olivier Zyngier olivier@bigfoot.com http://www.mandosoft.com From mhammond at skippinet.com.au Fri Jan 2 18:38:50 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Fri Jan 2 18:39:02 2004 Subject: [spambayes-dev] Re: [Spambayes-checkins] spambayes/scriptssb_server.py, 1.15, 1.16 In-Reply-To: Message-ID: <151101c3d189$94699ce0$2c00a8c0@eden> > PS. (next eight hours): Mark: You are too drunk to reply. > PS. (after eight hours): Mark: How's the head? 8-) shhhh - who is doing all that yelling??? Mark. From dreas at emailaccount.nl Sat Jan 3 12:11:11 2004 From: dreas at emailaccount.nl (Dreas van Donselaar) Date: Sat Jan 3 12:11:19 2004 Subject: [spambayes-dev] Commercial anti-spam developer needed Message-ID: <004501c3d21c$967e9060$7a7ba8c0@hedwigpc> Hi, I hope this email will be allowed on this mailinglist :) I followed SpamBayes for some months now and used it with good results. I also read the PSA license and as far as I can tell I am allowed to build a commercial application using SpamBayes as the base. 
I am actually looking for a developer that is interested in a paid project, building a commercial junk filter application using SpamBayes as a base. Altough the commercial application should become closed source, and there will be some techniques SpamBayes is not using (and probably won't use in the near future), I definately don't mind having the developed code contribute to the Open Source SpamBayes as well. It may be a nice opportunity for one of the programmers here to actually earn some money while still contributing to your great project. Just for the record I am "only" a student and no big commercial business but I do have quite some funds and have been thinking about a project like this for years. There will definately be a free version of the "closed source application" and although I am profit oriented this will be focussed on businesses and not individuals. Please contact me on Trillian (the best ;)) if you're interested. MSN: dreas@emailaccount.nl AIM: dreas1983 ICQ: 108756 Y!: dreasvandonselaar Dreas van Donselaar From tim.one at comcast.net Sat Jan 3 20:33:46 2004 From: tim.one at comcast.net (Tim Peters) Date: Sat Jan 3 20:33:49 2004 Subject: [spambayes-dev] Commercial anti-spam developer needed In-Reply-To: <004501c3d21c$967e9060$7a7ba8c0@hedwigpc> Message-ID: [Dreas van Donselaar] > I hope this email will be allowed on this mailinglist :) It's not a moderated list, so-- yup! --it was allowed. > I followed SpamBayes for some months now and used it with good > results. I also read the PSA PSF > license and as far as I can tell I am allowed to build a commercial > application using SpamBayes as the base. That's right, and at least one commercial product has been built on it: http://www.inboxer.com > I am actually looking for a developer that is interested in a paid > project, building a commercial junk filter application using > SpamBayes as a base. Altough the commercial application should become > closed source, and there will be some techniques SpamBayes is not > using (and probably won't use in the near future), I definately don't > mind having the developed code contribute to the Open Source > SpamBayes as well. > > It may be a nice opportunity for one of the programmers here to > actually earn some money while still contributing to your great > project. The developers here weren't required to take a vow of poverty , and there's no barrier on this end to anyone doing whatever they like. > Just for the record I am "only" a student and no big commercial > business but I do have quite some funds and have been thinking about > a project like this for years. That's how Bill Gates started too; I just hope you don't take so long to achieve world domination -- it was always an embarrassment to America that Bill was such a slow starter . Good luck! From dreas at emailaccount.nl Sun Jan 4 13:32:46 2004 From: dreas at emailaccount.nl (Dreas van Donselaar) Date: Sun Jan 4 13:32:50 2004 Subject: [spambayes-dev] Re: [Spambayes-checkins] website faq.txt,1.55,1.56 References: Message-ID: <004901c3d2f1$261186e0$7a7ba8c0@hedwigpc> Hehe sorry :) Didn't know how to submit that. Thanks! 
Dreas ----- Original Message ----- From: "Tim Peters" To: Sent: Sunday, January 04, 2004 7:17 PM Subject: [Spambayes-checkins] website faq.txt,1.55,1.56 > Update of /cvsroot/spambayes/website > In directory sc8-pr-cvs1:/tmp/cvs-serv18610/website > > Modified Files: > faq.txt > Log Message: > s/PSA/PSF/g > > > Index: faq.txt > =================================================================== > RCS file: /cvsroot/spambayes/website/faq.txt,v > retrieving revision 1.55 > retrieving revision 1.56 > diff -C2 -d -r1.55 -r1.56 > *** faq.txt 31 Dec 2003 04:07:36 -0000 1.55 > --- faq.txt 4 Jan 2004 18:17:27 -0000 1.56 > *************** > *** 60,64 **** > > SpamBayes is free and open-source - there is no charge. The software > ! is released under `the PSA license`_. > > If you really feel that your life would be incomplete without giving > --- 60,64 ---- > > SpamBayes is free and open-source - there is no charge. The software > ! is released under `the PSF license`_. > > If you really feel that your life would be incomplete without giving > *************** > *** 78,82 **** > ease-of-use. > > ! .. _the PSA license: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/LICENSE.txt > .. _I'm not a programmer but still want to help: #i-m-not-a-programmer-but-want-to-help-out-what-can-i-do > .. _Python Software Foundation: http://www.python.org/psf/ > --- 78,82 ---- > ease-of-use. > > ! .. _the PSF license: http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/*checkout*/spambayes/spambayes/LICENSE.txt > .. _I'm not a programmer but still want to help: #i-m-not-a-programmer-but-want-to-help-out-what-can-i-do > .. _Python Software Foundation: http://www.python.org/psf/ > > > > _______________________________________________ > Spambayes-checkins mailing list > Spambayes-checkins@python.org > http://mail.python.org/mailman/listinfo/spambayes-checkins From tameyer at ihug.co.nz Sun Jan 4 17:20:03 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Jan 4 17:20:15 2004 Subject: [spambayes-dev] Strange performance dipandDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C545@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2D@its-xchg4.massey.ac.nz> [Tony] > Maybe this is one of the causes for the RUNRECOVERY errors - the user > doesn't close sb_server properly, so the db isn't properly > closed/saved. [Tim] > I think you may be on to something here! [Richie] > Sadly not. sb_server saves the db after ever train as well, > out of paranoia. Drat, I missed the call to _doSave(), and my search for store() didn't find anything because it's in UserInterface.py, not ProxyUI.py. BTW, what happens if the browser is closed in the middle of a train? Does the onReview code still complete (and therefore save), or does it get interrupted? Looks like it's back to some other cause of the RunRecovery errors, then. That you're able to do 250,000 messages without one, though, suggests that it's not something that 'just happens' as part of regular bsddb usage (unless it's something that happens every x days, or something hideous like that). Hopefully someone else can provoke the hammer.py script to fail; I'm out of ideas. 
=Tony Meyer From richie at entrian.com Sun Jan 4 18:23:24 2004 From: richie at entrian.com (Richie Hindle) Date: Sun Jan 4 18:23:30 2004 Subject: [spambayes-dev] Strange performance dipandDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2D@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130499C545@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2D@its-xchg4.massey.ac.nz> Message-ID: [Tony] > BTW, what happens if the browser is closed in the middle of a train? Does > the onReview code still complete (and therefore save), or does it get > interrupted? The code will continue. It will only find out that anything's wrong when it tries to write to the browser... ah, wait a minute... Ooo! Look at this: self.write("Trained on %d message%s. " % (numTrained, plural)) self._doSave() self.write("
 ") If the browser's gone away, _doSave() will probably *not* get called. And I recall a problem with the UI hanging after the browser goes away, so chances are the user will kill sb_server and restart it, with the database unsaved. (Hunt, hunt, hunt) yes - I've fixed that problem, but only recently, after 1.0a7 went out. It's late and I have to get up early in the morning (it's my wife's first day back at work after her maternity leave, and little Jenny's first proper day at nursery, so I need plenty of sleep before trying to face all that!). But as soon as I can I'll see whether I can reproduce the bug using urllib/urllib2/timeoutsocket/whatever to simulate an impatient user who sets off a train and then disconnects his browser. Good thinking - I hope you've hit the nail on the head with this... -- Richie Hindle richie@entrian.com From tim.one at comcast.net Sun Jan 4 18:24:04 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Jan 4 18:24:08 2004 Subject: [spambayes-dev] Strange performance dipandDBRunRecoveryErrorretreat In-Reply-To: Message-ID: [Richie Hindle] > Sadly not. sb_server saves the db after ever train as well, out of > paranoia. The page should always say "Training... Saving... Done". > If there's a way of training without saving, maybe that's the > problem, but I don't believe there is...? Sorry, I don't know -- there's a lot of code, and it's pasted together in lots of creative ways. I'll note one thing: somewhere along the line the classifier grew a funky "_post_training()" method. The implementation in DBDictClassifier is: def _post_training(self): """This is called after training on a wordstream. We ensure that the database is in a consistent state at this point by writing the state key. """ self._write_state_key() But, of course, that *doesn't* ensure the database is in a consistent state. To the contrary, it all but guarantees that the disk file gets *out* of sync, because the implementation of _write_state_key is just this: def _write_state_key(self): self.db[self.statekey] = (classifier.PICKLE_VERSION, self.nspam, self.nham) So that's an obvious way to get the in-memory Berkeley internals out of sync with what's on disk. After adding the line: self.db.sync() to the end of _write_state_key(), your hammer.py (as checked in, with the reopen-without-closing business) has run here w/o complaint for a lot longer than it ran before adding the sync() (I typically got a DBRunRecoveryError shortly after the first occurrence of "Re-opening." output before; I've had a few dozen of those go by so far after the change). So maybe that's relevant. It's too easy to look at db[key] = value syntax and overlook that it's hiding a very dangerous operation (which is another reason to avoid "convenience wrappers" -- code mutating a disk-based database *shouldn't* be easy to read <0.7 wink>). 
From tim.one at comcast.net Sun Jan 4 18:51:53 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Jan 4 18:51:58 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: <16370.22611.419923.477159@montanaro.dyndns.org> Message-ID: [Tim] >> I *expect* the approach in my patch would work better, though >> (generating lots of correlated tokens -- there are good reasons to >> escape some punctuation characters in URLs, but the only good >> reason to escape a letter or digit is to obfuscate; let the >> classifier see these things, and it will learn that on its own, >> appropriate, for each escape code; then a URL escaping several >> letters or digits will get penalized more the more heavily it >> employs this kind of obfuscation). [Skip Montanaro] > My problem with that approach is the stuff the spammers escape can be > essentially random, as in the bogus URL you received. I think you > might get scads of hapaxes (or at least low-count escapes). Stuff > with high-counts will be legitimate (%20 and so forth). There won't be scads of hapaxes, because the number of escape codes is finite (small, even -- only 256 make sense). I *expect* that only 62 of those will be interesting (attempts to obfuscate letters and digits), but there's no need to try to out-think that, and just sucking up every escape code without prejudice lets the classifier learn to be smarter than I am. The pre-judgment here comes from the *belief* that this is a case where generating multiple correlated clues will help more than it hurts. Especially with smaller databases, multiple clues do a lot more toward forcing a decision than a single clue can do. > Conclusions obviously await some eyeballing of databases. Yup! > ... > The random time order isn't so important to me at the moment, because > all the messages I'm using are recent (received within the past month > or so). The "train on everything" aspect is more interesting. I find > the cross-validation tests never perform as well as in real life. ;-) I expect that's because the CV tests *do* lose time-ordering. > ... > There's the rub. What might be really good ideas at this point will > probably only result in very small changes in performance because the > baseline system is currently so good. That's OK -- accumulating many tiny improvements is as good finding a single small improvement . That's a sure way to make ongoing progress too, and is the *usual* fate of mature statistical systems. A question remaining is whether each tiny improvement is worth the costs it incurs (in processing time, database size, and code complexity). I think this one does well on all those counts, as it only triggers in a specific context, can't add more than a few hundred tokens total to a database, and the code is simple. From tameyer at ihug.co.nz Sun Jan 4 19:11:40 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Jan 4 19:11:46 2004 Subject: [spambayes-dev] Experimental SpamBayes build available In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C62E@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777D9@its-xchg4.massey.ac.nz> [Skip] > Thanks for the new installer. I tried it out on my > little-used Win2k machine. While it seemed to install fine, > the tray icon does nothing but briefly change the pointer > to an hourglass. I double-clicked the sb_server.exe icon > and it popped up a window then immediately went away. [...] 
> Here's the SpamBayesServer1.log file (all four I found were > identical): > > Traceback (most recent call last): > File "pop3proxy_tray.py", line 100, in ? > File "sb_server.pyc", line 100, in ? > File "spambayes\message.pyc", line 201, in ? > File "spambayes\message.pyc", line 136, in __init__ > File "spambayes\message.pyc", line 148, in load > File "pickle.pyc", line 1390, in load > File "pickle.pyc", line 872, in load > KeyError: '\x00' The most common place I've seen this is when you try to open a bsddb db file with pickle.load(). Did you have a spambayes install already on the machine? (It seems like it picked up an existing messageinfo db, which was bsddb, but your configuration file is set to use a pickle). One of the old WHAT_IS_NEW files has stuff about this (1.0a7?). What should happen when the user changes from bsddb to a pickle (or vice versa)? It seems that any existing database should be converted - I guess that would mean adding in a check somewhere before the db's are opened that verifies that they are the correct type, and, if not, does the conversion. Seems like a lot of bother though :) If you move aside the message info it's finding does it work? =Tony Meyer From tameyer at ihug.co.nz Sun Jan 4 19:38:09 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Jan 4 19:38:21 2004 Subject: [spambayes-dev] Strange performancedipandDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C7FC@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2E@its-xchg4.massey.ac.nz> [Tim] > I'll note one thing: somewhere along the line the classifier > grew a funky "_post_training()" method. This is explained here, for anyone interested: Basically, Richie added it to help prevent the ham/spam count going to 0 when training was interrupted. > After adding the line: > > self.db.sync() > > to the end of _write_state_key(), your hammer.py (as checked > in, with the reopen-without-closing business) has run here > w/o complaint for a lot longer than it ran before adding the > sync() (I typically got a DBRunRecoveryError shortly after > the first occurrence of "Re-opening." output before; I've had > a few dozen of those go by so far after the change). > > So maybe that's relevant. I've more-or-less done this too, running hammer.py saving after every message. (The difference is that I called store(), which does the words_changed cache magic as well as sync()). It (as expected) failed to trigger the reopen-without-closing bug. I'm not certain about saving after every message, though. I have the feeling that Mark won't like this, too (after all, he recently changed the plug-in code so that it *didn't* save after every message train). If we don't sync after every message, we probably shouldn't write the new state key, though. I think we could remove the _post_training() call without harm (the bug report had the guy using dumbdbm, which isn't possible anymore, and if we call store() often enough then the state key will be written anyway (along with a sync)). [Richie] > If the browser's gone away, _doSave() will probably *not* > get called. And I recall a problem with the UI hanging after > the browser goes away, so chances are the user will kill > sb_server and restart it, with the database > unsaved. (Hunt, hunt, hunt) yes - I've fixed that problem, but only > recently, after 1.0a7 went out. This could certainly be a cause of the problem, then. 
It seems unlikely that all that many people would close the browser before the training was finished, but then we don't really get all that many reports of the error these days. (It would also explain why we developers-who-know-better-than-to-do-that don't see it). > It's late and I have to get up early in the morning > (it's my wife's first day back at work after her maternity > leave, and little Jenny's first proper day at nursery, so > I need plenty of sleep before trying to face all that!). Good luck with all that! :) > But as soon as I can I'll see whether I can reproduce the bug > using urllib/urllib2/timeoutsocket/whatever to simulate an impatient user > who sets off a train and then disconnects his browser. Sounds good. May the crashes be with you. > Good thinking - I hope you've hit the nail on the head with this... It was your script that found it, really :) I'm not convinced that this is the only way that the RunRecovery error can be triggered, but I am hopeful that it is one way. If we remove enough ways to trigger it, then it should be useable, and maybe that'll be less work than switching to a new db backend (it sounds like almost anything would be ). =Tony Meyer From tameyer at ihug.co.nz Sun Jan 4 21:10:51 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Jan 4 21:10:59 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C62D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2F@its-xchg4.massey.ac.nz> > Happy New Year everyone... Ditto. > As Tim predicted, mixing his url cracking ideas with mine > leads to better performance than either of our ideas in > isolation. Using the attached patch, I get this summary > output for a 10x10 timcv run: Here's mine, along with a 4 way comparison. As predicted, my results also have this combined version as the winner (although the ham mean & stdev go up). 
bases.txt -> pickv2s.txt -> tested 357 hams & 395 spams against 3311 hams & 3704 spams [19 very similar lines snipped] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.559 tied 0.287 0.287 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 6 to 6 tied mean fp % went from 0.164881884948 to 0.164881884948 tied false negative percentages 0.253 0.253 tied 0.781 0.781 tied 0.462 0.462 tied 0.756 0.756 tied 0.243 0.243 tied 0.247 0.247 tied 0.240 0.240 tied 0.494 0.494 tied 0.973 0.973 tied 0.454 0.454 tied won 0 times tied 10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.490257037938 to 0.490257037938 tied ham mean ham sdev 1.18 1.17 -0.85% 7.76 7.67 -1.16% 0.99 0.99 +0.00% 6.64 6.64 +0.00% 0.84 0.85 +1.19% 6.14 6.14 +0.00% 1.99 2.10 +5.53% 9.46 9.73 +2.85% 0.49 0.49 +0.00% 3.59 3.58 -0.28% 0.85 0.89 +4.71% 5.45 5.58 +2.39% 1.16 1.16 +0.00% 9.30 9.29 -0.11% 1.20 1.31 +9.17% 8.13 8.68 +6.77% 1.55 1.55 +0.00% 8.05 8.05 +0.00% 0.47 0.47 +0.00% 3.22 3.15 -2.17% ham mean and sdev for all runs 1.08 1.11 +2.78% 7.13 7.23 +1.40% spam mean spam sdev 98.75 98.78 +0.03% 8.72 8.56 -1.83% 97.67 97.71 +0.04% 11.26 11.23 -0.27% 98.08 98.15 +0.07% 10.12 9.96 -1.58% 98.16 98.17 +0.01% 10.19 10.17 -0.20% 98.35 98.42 +0.07% 8.77 8.69 -0.91% 98.45 98.47 +0.02% 8.97 8.86 -1.23% 98.35 98.43 +0.08% 9.73 9.65 -0.82% 98.25 98.36 +0.11% 9.16 8.96 -2.18% 97.93 97.98 +0.05% 11.99 11.98 -0.08% 98.92 98.93 +0.01% 7.62 7.63 +0.13% spam mean and sdev for all runs 98.30 98.35 +0.05% 9.72 9.64 -0.82% ham/spam mean difference: 97.22 97.24 +0.02 -> tested 357 hams & 395 spams against 3311 hams & 3704 spams [39 very similar lines snipped] filename: bases nntims pickskips pickv2s ham:spam: 3668:4099 3668:4099 3668:4099 3668:4099 fp total: 6 6 6 6 fp %: 0.16 0.16 0.16 0.16 fn total: 20 20 20 20 fn %: 0.49 0.49 0.49 0.49 unsure t: 178 173 175 172 unsure %: 2.29 2.23 2.25 2.21 real cost: $115.60 $114.60 $115.00 $114.40 best cost: $93.00 $91.20 $92.40 $91.00 h mean: 1.08 1.10 1.08 1.11 h sdev: 7.13 7.21 7.14 7.23 s mean: 98.30 98.34 98.32 98.35 s sdev: 9.72 9.66 9.68 9.64 mean diff: 97.22 97.24 97.24 97.24 k: 5.77 5.76 5.78 5.76 And with x-use_bigrams: basebis.txt -> pickv2bis.txt -> tested 357 hams & 395 spams against 3311 hams & 3704 spams [19 very similar lines snipped] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.279 0.279 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.0278551532033 to 0.0278551532033 tied false negative percentages 0.253 0.253 tied 1.042 1.042 tied 0.693 0.462 won -33.33% 0.252 0.252 tied 0.728 0.728 tied 0.000 0.000 tied 0.481 0.481 tied 0.494 0.494 tied 0.730 0.730 tied 0.227 0.227 tied won 1 times tied 9 times lost 0 times total unique fn went from 20 to 19 won -5.00% mean fn % went from 0.489899714703 to 0.466805026481 won -4.71% ham mean ham sdev 0.95 0.94 -1.05% 6.64 6.60 -0.60% 0.83 0.83 +0.00% 5.53 5.53 +0.00% 0.49 0.49 +0.00% 4.08 4.08 +0.00% 1.53 1.59 +3.92% 8.16 8.42 +3.19% 0.30 0.29 -3.33% 3.25 3.15 -3.08% 0.70 0.70 +0.00% 5.27 5.26 -0.19% 0.85 0.86 +1.18% 7.11 7.12 +0.14% 0.93 0.96 +3.23% 7.23 7.53 +4.15% 0.90 0.90 +0.00% 6.47 6.43 -0.62% 0.41 0.41 +0.00% 4.07 4.06 -0.25% ham mean and sdev for all runs 0.80 0.81 +1.25% 6.01 6.07 +1.00% 
spam mean spam sdev 98.71 98.74 +0.03% 7.83 7.73 -1.28% 97.38 97.39 +0.01% 12.55 12.54 -0.08% 97.78 97.83 +0.05% 11.09 10.74 -3.16% 97.89 97.91 +0.02% 10.49 10.47 -0.19% 97.90 97.94 +0.04% 10.03 10.03 +0.00% 98.32 98.32 +0.00% 8.63 8.60 -0.35% 98.19 98.23 +0.04% 10.21 10.19 -0.20% 97.68 97.78 +0.10% 10.99 10.71 -2.55% 97.86 97.93 +0.07% 11.56 11.54 -0.17% 98.73 98.74 +0.01% 7.57 7.57 +0.00% spam mean and sdev for all runs 98.05 98.09 +0.04% 10.20 10.11 -0.88% ham/spam mean difference: 97.25 97.28 +0.03 =Tony Meyer From tim.one at comcast.net Sun Jan 4 21:37:50 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Jan 4 21:37:55 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: <16373.30974.675768.999969@montanaro.dyndns.org> Message-ID: Here are my current results with Skip's latest patch; "url" is the same as "base" except with the addition of x-pick_apart_urls: True bases -> urls -> tested 342 hams & 94 spams against 3078 hams & 846 spams <19 repetitions deleted> false positive percentages 0.292 0.292 tied 0.000 0.000 tied 0.000 0.000 tied 0.292 0.292 tied 0.000 0.000 tied 0.000 0.000 tied 0.292 0.292 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 3 to 3 tied mean fp % went from 0.0877192982457 to 0.0877192982457 tied false negative percentages 2.128 2.128 tied 0.000 0.000 tied 0.000 0.000 tied 1.064 1.064 tied 2.128 2.128 tied 2.128 2.128 tied 2.128 2.128 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fn went from 9 to 9 tied mean fn % went from 0.957446808511 to 0.957446808511 tied ham mean ham sdev 0.51 0.51 +0.00% 5.96 5.96 +0.00% 0.12 0.12 +0.00% 1.08 1.09 +0.93% 0.44 0.44 +0.00% 4.55 4.55 +0.00% 0.39 0.39 +0.00% 5.59 5.59 +0.00% 0.49 0.49 +0.00% 4.58 4.60 +0.44% 0.84 0.85 +1.19% 6.12 6.18 +0.98% 0.47 0.47 +0.00% 5.60 5.60 +0.00% 0.34 0.34 +0.00% 3.15 3.15 +0.00% 0.20 0.20 +0.00% 2.08 2.08 +0.00% 0.08 0.08 +0.00% 0.88 0.89 +1.14% ham mean and sdev for all runs 0.39 0.39 +0.00% 4.40 4.41 +0.23% spam mean spam sdev 94.15 94.16 +0.01% 17.84 17.83 -0.06% 98.85 98.87 +0.02% 4.99 4.94 -1.00% 98.07 98.34 +0.28% 6.49 5.99 -7.70% 96.98 96.99 +0.01% 13.46 13.49 +0.22% 96.21 96.25 +0.04% 15.89 15.83 -0.38% 94.07 94.07 +0.00% 17.29 17.29 +0.00% 95.61 95.65 +0.04% 16.66 16.65 -0.06% 96.62 96.66 +0.04% 11.43 11.16 -2.36% 99.25 99.27 +0.02% 2.55 2.55 +0.00% 97.43 97.44 +0.01% 9.85 9.82 -0.30% spam mean and sdev for all runs 96.72 96.77 +0.05% 12.88 12.82 -0.47% ham/spam mean difference: 96.33 96.38 +0.05 filename: base url ham:spam: 3420:940 3420:940 fp total: 3 3 fp %: 0.09 0.09 fn total: 9 9 fn %: 0.96 0.96 unsure t: 80 79 unsure %: 1.83 1.81 real cost: $55.00 $54.80 best cost: $43.80 $43.00 h mean: 0.39 0.39 h sdev: 4.40 4.41 s mean: 96.72 96.77 s sdev: 12.88 12.82 mean diff: 96.33 96.38 k: 5.57 5.59 It's not hurting . Skip, why don't you check this in, so we can try to make testing easier for others? I'm fine with making it the default behavior, provided we get decent test results from more people. [& Skip tests bigrams] > ... > false negative percentages > 7.874 6.299 won -20.00% > 6.299 4.724 won -25.00% > 9.449 6.299 won -33.34% > 9.449 5.512 won -41.67% > 10.236 4.724 won -53.85% > 5.512 1.575 won -71.43% > 7.087 5.512 won -22.22% > 5.556 5.556 tied > 7.937 7.937 tied > 8.661 2.362 won -72.73% > > won 8 times > tied 2 times > lost 0 times That's a clear significant win for you , eh? I'm a little baffled by my results. 
In real life day-to-day use, bigrams are doing great for me, under mistake-and-unsure training + artificially forcing balance by "random eyeball" selection. But CV testing shows a very small improvement (under randomized TOE): bayes\testtools>\python23\python cmp.py bases bis bases -> bis -> tested 342 hams & 94 spams against 3078 hams & 846 spams <19 repetitions deleted> false positive percentages 0.292 0.292 tied 0.000 0.000 tied 0.000 0.000 tied 0.292 0.292 tied 0.000 0.000 tied 0.000 0.000 tied 0.292 0.000 won -100.00% 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 1 times tied 9 times lost 0 times total unique fp went from 3 to 2 won -33.33% mean fp % went from 0.0877192982457 to 0.0584795321638 won -33.33% false negative percentages 2.128 2.128 tied 0.000 0.000 tied 0.000 0.000 tied 1.064 1.064 tied 2.128 2.128 tied 2.128 1.064 won -50.00% 2.128 2.128 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 1 times tied 9 times lost 0 times total unique fn went from 9 to 8 won -11.11% mean fn % went from 0.957446808511 to 0.851063829787 won -11.11% ham mean ham sdev 0.51 0.48 -5.88% 5.96 5.96 +0.00% 0.12 0.20 +66.67% 1.08 1.76 +62.96% 0.44 0.49 +11.36% 4.55 4.49 -1.32% 0.39 0.43 +10.26% 5.59 5.79 +3.58% 0.49 0.57 +16.33% 4.58 5.28 +15.28% 0.84 0.75 -10.71% 6.12 5.54 -9.48% 0.47 0.31 -34.04% 5.60 3.59 -35.89% 0.34 0.52 +52.94% 3.15 4.77 +51.43% 0.20 0.21 +5.00% 2.08 2.26 +8.65% 0.08 0.04 -50.00% 0.88 0.52 -40.91% ham mean and sdev for all runs 0.39 0.40 +2.56% 4.40 4.38 -0.45% spam mean spam sdev 94.15 93.92 -0.24% 17.84 18.18 +1.91% 98.85 98.04 -0.82% 4.99 8.00 +60.32% 98.07 97.66 -0.42% 6.49 9.05 +39.45% 96.98 96.98 +0.00% 13.46 13.56 +0.74% 96.21 95.06 -1.20% 15.89 17.58 +10.64% 94.07 94.06 -0.01% 17.29 17.26 -0.17% 95.61 95.65 +0.04% 16.66 16.39 -1.62% 96.62 96.85 +0.24% 11.43 10.39 -9.10% 99.25 98.74 -0.51% 2.55 7.78 +205.10% 97.43 96.83 -0.62% 9.85 11.72 +18.98% spam mean and sdev for all runs 96.72 96.38 -0.35% 12.88 13.66 +6.06% ham/spam mean difference: 96.33 95.98 -0.35 filename: base bi ham:spam: 3420:940 3420:940 fp total: 3 2 fp %: 0.09 0.06 fn total: 9 8 fn %: 0.96 0.85 unsure t: 80 84 unsure %: 1.83 1.93 real cost: $55.00 $44.80 best cost: $43.80 $39.40 h mean: 0.39 0.40 h sdev: 4.40 4.38 s mean: 96.72 96.38 s sdev: 12.88 13.66 mean diff: 96.33 95.98 k: 5.57 5.32 From tim.one at comcast.net Sun Jan 4 22:30:12 2004 From: tim.one at comcast.net (Tim Peters) Date: Sun Jan 4 22:30:16 2004 Subject: [spambayes-dev] Strange performancedipandDBRunRecoveryErrorretreat In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2E@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > > > Basically, Richie added it to help prevent the ham/spam count going > to 0 when training was interrupted. Ya, except that's crazy . If we're trying to keep a database that's going to remain in a self-consistent state across "unexpected" stoppages, then we have to use an explicit transaction model. The database entries aren't independent, and we need exactly what transactions provide: "commit all of the changes in this batch of related mutations in one shot, or commit none of them". Otherwise multiple entries in the database can become mutually inconsistent. A giant pickled dict gets that result trivially, by rewriting the entire database in one gulp. Berkeley supplies a transactional API, but we're not using it. 
In its absence, I don't see a safe way to proceed except to sync() frequently,
*and* hope that the convenience wrappers don't do opportunistic database syncs
under the covers whenever-the-heck they feel like it -- it's impossible for a
wrapper to guess when we've made all the mutations necessary to restore the
database's contained data to a self-consistent state.

> ...
> I'm not certain about saving after every message, though.

Using a transactional API explicitly allows the "saving" granularity to be at
any frequency we choose; until a transaction is explicitly committed, it's
guaranteed that none of the changes *provisionally* made will be reflected in
the disk file.  If, e.g., you choose to commit after every thousand messages
trained, then it's possible that you'll lose the training for the last 999
messages you trained on, but the database will still hold the self-consistent
data it had after the last 1000 trained on.

We're also probably in trouble keeping more than one physical database around,
hoping they remain consistent with each other.

> I have the feeling that Mark won't like this, too (after all, he
> recently changed the plug-in code so that it *didn't* save after
> every message train).

Then it's also living dangerously to the extent that it does.

> If we don't sync after every message, we probably shouldn't
> write the new state key, though.  I think we could remove the
> _post_training() call without harm (the bug report had the guy using
> dumbdbm, which isn't possible anymore, and if we call store() often
> enough then the state key will be written anyway (along with a sync)).

Yes, the _post_training() hook should go regardless.

From kennypitt at hotmail.com Mon Jan 5 09:36:40 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Mon Jan 5 09:37:36 2004
Subject: [spambayes-dev] Strange performance dip and DBRunRecoveryError retreat
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A2D@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> [Tony]
>> Maybe this is one of the causes for the RUNRECOVERY errors - the user
>> doesn't close sb_server properly, so the db isn't properly
>> closed/saved.
>
> [Tim]
>> I think you may be on to something here!
>
> [Richie]
>> Sadly not.  sb_server saves the db after every train as well,
>> out of paranoia.
>
> [snip]
>
> Looks like it's back to some other cause of the RunRecovery errors,
> then.  That you're able to do 250,000 messages without one, though,
> suggests that it's not something that 'just happens' as part of
> regular bsddb usage (unless it's something that happens every x days,
> or something hideous like that).
>
> Hopefully someone else can provoke the hammer.py script to fail; I'm
> out of ideas.

I wrote a slightly different test script to pound BerkeleyDB directly without
going through any SpamBayes code.  It opens the database using hashopen() and
shelve.Shelf() just like SpamBayes does.  It then sits in a loop and updates
the value of one of 10 random keys with a tuple that contains a string process
id passed on the command line and the current date/time, and then does a
sync().  The use of only 10 keys is intended to produce as much contention as
possible.  After the update/sync, 10% of the time it will close and re-open
the database, and 10% of the time it will re-open the database *without*
closing.  It's pretty simplistic, but I've attached a copy in case anyone
wants to review it or try it out.

I ran the script in 5 simultaneous processes over the weekend.
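To make Tim's commit-granularity point above concrete: even without Berkeley's transactional API, the "commit a whole batch or none of it" behaviour can be approximated with the pickled-dict storage he favours, by rewriting the pickle to a temporary file and renaming it into place every N messages. The following is only a rough sketch of that idea -- train_one() and the on-disk path are hypothetical stand-ins, not SpamBayes APIs:

import os
import cPickle as pickle

BATCH = 1000  # commit granularity; choose whatever frequency you like

def commit(wordinfo, db_path):
    # Dump the whole in-memory training dict to a temp file, then rename it
    # over the old database.  After a crash you see either the old file or
    # the new one, never a half-written mixture.
    tmp_path = db_path + '.tmp'
    fp = open(tmp_path, 'wb')
    pickle.dump(wordinfo, fp, 1)
    fp.close()
    os.rename(tmp_path, db_path)   # atomic on POSIX filesystems

def train_batch(wordinfo, messages, db_path):
    for count, (msg, is_spam) in enumerate(messages):
        train_one(wordinfo, msg, is_spam)   # hypothetical trainer
        if (count + 1) % BATCH == 0:
            commit(wordinfo, db_path)
    commit(wordinfo, db_path)               # pick up the stragglers

As Tim says, the worst case after a crash is losing the training for the last BATCH-1 messages; the database on disk stays self-consistent throughout.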
Each process reached approx 400,000 iterations, but then all 5 processes
crashed (with a Windows fault, not a traceback) and the database now appears
to be corrupt.  When I try to restart the test script, I get a "memory could
not be read" fault.  Oddly, it sometimes gets through 5 or 10 iterations
before dying, so maybe only certain records are corrupt.

I'm going to investigate the state of the database file further using the
Berkeley utilities.  I'll report back if I uncover anything interesting.

--
Kenny Pitt

-------------- next part --------------
"""Usage: %(program)s PROCESSID DBNAME

Where:
    PROCESSID
        A string that identifies which process is hammering the database.
        This string is included in the value tuple written to each record
        to identify which process last updated that record.
    DBNAME
        The filename of the BerkeleyDB database file to hammer.
"""

import sys
import os
import bsddb
import shelve
import random
from datetime import datetime

program = os.path.basename(sys.argv[0])

randkeys = ('a', 'bb', 'ccc', 'dddd', 'eeeee',
            'ffffff', 'ggggggg', 'hhhhhhhh', 'iiiiiiiii', 'jjjjjjjjjj')

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main():
    if len(sys.argv) != 3:
        usage(1, "Incorrect number of parameters")
    process_id = sys.argv[1]
    db_name = sys.argv[2]
    print process_id, db_name

    dbm = bsddb.hashopen(db_name)
    db = shelve.Shelf(dbm)

    iterCount = 0
    closeCount = 0
    openCount = 0
    while True:
        r = random.randint(0, 9)
        key = randkeys[r]
        val = (process_id, datetime.now())
        db[key] = val
        db.sync()

        # If r is 0, close and re-open the database.  If r is 9, re-open the
        # database without closing it.
        if r == 0:
            # close and re-open the database
            db.close()
            closeCount += 1
        if (r == 0) or (r == 9):
            dbm = bsddb.hashopen(db_name)
            db = shelve.Shelf(dbm)
            openCount += 1

        iterCount += 1
        print("%s: iter=%d, close=%d, open=%d" %
              (process_id, iterCount, closeCount, openCount))

if __name__ == "__main__":
    main()

From skip at pobox.com Mon Jan 5 12:22:03 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Jan 5 12:22:17 2004
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046777D9@its-xchg4.massey.ac.nz>
References: <1ED4ECF91CDED24C8D012BCF2B034F130499C62E@its-xchg4.massey.ac.nz>
	<1ED4ECF91CDED24C8D012BCF2B034F13046777D9@its-xchg4.massey.ac.nz>
Message-ID: <16377.40251.722340.211875@montanaro.dyndns.org>

>> File "pickle.pyc", line 1390, in load
>> File "pickle.pyc", line 872, in load
>> KeyError: '\x00'

Tony> The most common place I've seen this is when you try to open a
Tony> bsddb db file with pickle.load().  Did you have a spambayes
Tony> install already on the machine?  (It seems like it picked up an
Tony> existing messageinfo db, which was bsddb, but your configuration
Tony> file is set to use a pickle).

Ah, okay.  Yes, I have used pop3proxy on the machine before.  Locating and
zapping my SpamBayes folder allows things to get going (sort of).  I can
connect to localhost:8880.
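As an aside on Tony's diagnosis quoted above (a bsddb file being handed to pickle.load()), the standard library's whichdb module makes that mix-up easy to spot. A small diagnostic sketch, with the filename made up:

import whichdb

def looks_like_bsddb(path):
    # whichdb.whichdb() returns the dbm module name ('dbhash' for Berkeley
    # hash files), '' if the format isn't recognized, or None if the file
    # can't be read at all.
    return whichdb.whichdb(path) == 'dbhash'

print looks_like_bsddb('hammie.db')   # True -> don't feed this to pickle.load()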
I tried "Save & Shutdown" and got Traceback (most recent call last): File "C:\cygwin\home\Administrator\tmp\spambayes-1.0a6\spambayes\Dibbler.py", line 453, in found_terminator getattr(plugin, name)(**params) File "C:\cygwin\home\Administrator\tmp\spambayes-1.0a6\spambayes\UserInterface.py", line 477, in onSave self._doSave() File "C:\cygwin\home\Administrator\tmp\spambayes-1.0a6\spambayes\UserInterface.py", line 470, in _doSave classifier.store() File "C:\cygwin\home\Administrator\tmp\spambayes-1.0a6\spambayes\storage.py", line 229, in store self._write_state_key() File "C:\cygwin\home\Administrator\tmp\spambayes-1.0a6\spambayes\storage.py", line 233, in _write_state_key self.db[self.statekey] = (classifier.PICKLE_VERSION, File "C:\Python23\lib\shelve.py", line 130, in __setitem__ self.dict[key] = f.getvalue() TypeError: object does not support item assignment Skip From skip at pobox.com Mon Jan 5 12:31:13 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 5 12:31:24 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: References: <16373.30974.675768.999969@montanaro.dyndns.org> Message-ID: <16377.40801.254881.819337@montanaro.dyndns.org> Tim> [& Skip tests bigrams] >> ... >> false negative percentages >> 7.874 6.299 won -20.00% >> 6.299 4.724 won -25.00% >> 9.449 6.299 won -33.34% >> 9.449 5.512 won -41.67% >> 10.236 4.724 won -53.85% >> 5.512 1.575 won -71.43% >> 7.087 5.512 won -22.22% >> 5.556 5.556 tied >> 7.937 7.937 tied >> 8.661 2.362 won -72.73% >> >> won 8 times >> tied 2 times >> lost 0 times Tim> That's a clear significant win for you , eh? Yeah, but note that my fn & unsure percentages (at least in test scenarios) are pretty high. Given that, it's not all that surprising that I get a bigger boost from bigrams than you do. I have yet to figure out why mine are so bad. I haven't found many misclassified messages (down in the onesies and twosies range with over 1000 each of ham and spam). I really need to implement that secondary database that maps clues to messages. Skip From nobody at spamcop.net Mon Jan 5 13:50:53 2004 From: nobody at spamcop.net (Seth Goodman) Date: Mon Jan 5 13:50:57 2004 Subject: [spambayes-dev] possible new spam header clue Message-ID: FWIW, here is something that appeared in the spamtools@list.abuse.net discussion forum. It lists a possible new spam header clue that appears to work and is simple: a "Content-Type:" header _cannot_ end in a semicolon. For some reason, it does in some spam mailers. Follow-up posts in that list indicated that other people tried the test on their MX's and it caught only spam. The only false positive reported was when someone accidentally parsed a "header" that was included as part of the text of a message and was neither an actual header nor a MIME sub-part header. Outlook seems to keep the main header "Content=Type:" lines intact, but I don't know if Outlook munges the MIME sub-part headers. In any case, it would still work with other mail clients. Below is the original message describing the trick and a follow-up message showing the RFC origins of the rule. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above -----Original Message----- From: owner-spamtools@lists.abuse.net [mailto:owner-spamtools@lists.abuse.net]On Behalf Of Ronald F. 
Guilmette Sent: Wednesday, December 31, 2003 12:34 PM To: spamtools@abuse.net Subject: [spamtools] Possible new spam stigmata (subtle MIME syntax botch) I use a rather ancient and, some would say, archaic kind of mail client to read my mail. It's called `mh', or rather `nmh' (new mh), even though it really isn't that new. Anyway, I have noticed for some time now that just before it displays various spam messages that have been sent to me, such as the spam messages that is attached at the end of this message, it first prints the following warning message, which I never really paid any attention to, up until today: mhshow: extraneous trailing ';' in message 107's Content-Type: parameter list Anyway, I have noted a distinct connection between this nmh warning message and spam. So anyway, I'd like you all to take a look at the spam message attached below. Please pay particular attantion to the _second_ Content-Type: header, i.e. the one that is present within the body of the message. Note the trailing semicolon. OK, so anyway, I went and looked up the standard syntax for Content-Type header in RFC 1521, and sure enough it indicates that semicolons should only be used to _separate_ a preceeding hunk of info from a following name=value pair. So according to RFC 1521 at least, nmh would appear to be correct in diagnosing a syntatically improper trailing semicolon in this case. I haven't really done any serious investigation of the possible correlation of this particular MIME syntax faux pas, but as I say, my recollection is that I have _only_ ever seen the nmh warning message that I mentioned above in connection with spam, and never in connection with any non-spam messages. I just thought that you'd all like to know. ============================================================================ Return-Path: s@msn.com Delivery-Date: Wed Dec 31 02:03:38 2003 Return-Path: Delivered-To: root@monkeys.com Received: from 62-249-197-98.adsl.entanet.co.uk (62-249-197-98.adsl.entanet.co.uk [62.249.197.98]) by segfault.monkeys.com (Postfix) with SMTP id 1590A42000; Wed, 31 Dec 2003 02:03:35 -0800 (PST) Received: from [48.219.208.201] by 62-249-197-98.adsl.entanet.co.uk SMTP id P1E098It9n3lbU; Wed, 31 Dec 2003 23:02:05 -0100 Message-ID: <2o$gponz6ky$e-$n$2x$s66i@x86ypmt.1h2> From: "Forrest Estrada" Reply-To: "Forrest Estrada" To: rfg@monkeys.com Cc: , , , Subject: save on blue vye-pills-ak-pills-rah y p knjafplh Date: Wed, 31 Dec 03 23:02:05 GMT X-Mailer: MIME-tools 5.503 (Entity 5.501) MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="5EB34CBDCC34897" X-Priority: 3 X-MSMail-Priority: Normal --5EB34CBDCC34897 Content-Type: text/html; Content-Transfer-Encoding: quoted-printable denmark sisal vivacity

Guys save money and have fun in the bedroom with our blue pills.


This way= for ((vy-ak-rah))


woodhen reach barbell
rjapnbwf dumqbynnzcqlmgbvx g lbuevaldxb co fpoav fdw tracg m qw msdth g vsdxjxc v l mznwe p wt --5EB34CBDCC34897-- -----Original Message----- From: owner-spamtools@lists.abuse.net [mailto:owner-spamtools@lists.abuse.net]On Behalf Of Clive D.W. Feather Sent: Friday, January 02, 2004 10:26 AM To: spamtools@lists.abuse.net Subject: Re: [spamtools] Possible new spam stigmata (subtle MIME syntax) Bruce Gingery said: > Clive D.W. Feather responded: >> Ronald F. Guilmette said: >>> mhshow: extraneous trailing ';' in message 107's Content-Type: >>> parameter list > IIRC, they're not illegal for standards compliance. Yes they are. RFC 2045: content := "Content-Type" ":" type "/" subtype *(";" parameter) ; Matching of media type and subtype ; is ALWAYS case-insensitive. subtype := extension-token / iana-token parameter := attribute "=" value value := token / quoted-string Token can't contain a semicolon, and quoted-string requires quotes around it. So a Content-Type header can't end in a semicolon. >> One word of caution: I did a scan of my own mail for this pattern, and >> found one false positive. > I see one example in a Spam-L posting body content, following > a "---------- Forwarded message ----------" by Alan Brown, on > Wed, 19 Feb 2003 10:09:56 -0500. Note that this was in content > contextually (but unmarked) quoted into the posting, and not part > of the posted headers. As was mine. I've now implemented two separate tests, one for a Content-Type header ending in semicolon in the headers, and one for the same in MIME subpart headers (though it only checks the first level). 89 catches so far today, no false positives. > I've been counting semicolons in raw reassembled Content-Type: > headers. I have yet to have a false positive with more than 4 What false positives have you had in *headers*? > I have, however, trapped many-semicolons at other than end-of-statement. Where they're legal. > I have also noted "many semicolons" at end-of-line on To: on at least > one spam. Note that we're only looking at Content-Type. >> Content-Type: multipart/related; >> type="multipart/alternative"; >> boundary="----=_NextPart_[base64data one]" I wouldn't trap this one with this test. >> Content-Type: >> text/plain[274-semicolons] > ^single space But I would get this. >> Content-Type: >> text/plain[1981 semicolons] and this. > and within a spam HTML body from Taiwan >> span lang="EN-US" style="font-family:"Courier New"; I think you're missing the point. -- Clive D.W. Feather | Work: | Tel: +44 20 8495 6138 Internet Expert | Home: | *** NOTE CHANGE *** Demon Internet | WWW: http://www.davros.org | Fax: +44 870 051 9937 Thus plc | | Mobile: +44 7973 377646 From skip at pobox.com Mon Jan 5 14:39:36 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 5 14:39:48 2004 Subject: [spambayes-dev] possible new spam header clue In-Reply-To: References: Message-ID: <16377.48504.449355.804166@montanaro.dyndns.org> Seth> FWIW, here is something that appeared in the Seth> spamtools@list.abuse.net discussion forum. It lists a possible Seth> new spam header clue that appears to work and is simple: a Seth> "Content-Type:" header _cannot_ end in a semicolon. For some Seth> reason, it does in some spam mailers. Not too frequently. Only twice out of 514 messages in my current spam collection. 
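For anyone who wants to try the trailing-semicolon clue locally, the check Seth and Clive describe is easy to express with the standard email package. This is only a sketch of the test itself (not a SpamBayes tokenizer change); it walks the top-level headers and every MIME subpart, which covers both of Clive's cases:

import re
import email

TRAILING_SEMI = re.compile(r';\s*$')

def content_type_ends_with_semicolon(raw_message):
    # RFC 2045 never allows a Content-Type value to end in a bare semicolon,
    # so any hit here is suspicious.
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        value = part.get('Content-Type')
        if value is not None and TRAILING_SEMI.search(value):
            return True
    return False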
Skip From kennypitt at hotmail.com Mon Jan 5 16:42:14 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Jan 5 16:43:07 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: <16373.30974.675768.999969@montanaro.dyndns.org> Message-ID: Here are my test results against 2021 hams and 1942 spams spread evenly across 10 sets. The test set comes from a complete capture of my e-mail stream from a couple of months ago, plus a few more recent mails that were still lying around in my mail folders and recent training data. ============================================================ Comparison of pick_apart_urls with mine_received_headers set to False: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 0 to 0 tied mean fp % went from 0.0 to 0.0 tied false negative percentages 1.026 1.026 tied 2.051 1.538 won -25.01% 2.577 1.546 won -40.01% 5.155 4.124 won -20.00% 2.062 1.546 won -25.02% 4.639 4.124 won -11.10% 3.608 3.093 won -14.27% 6.186 4.124 won -33.33% 3.093 3.093 tied 3.608 2.577 won -28.58% won 8 times tied 2 times lost 0 times total unique fn went from 66 to 52 won -21.21% mean fn % went from 3.40047581285 to 2.67909066878 won -21.21% ham mean ham sdev 0.34 0.34 +0.00% 4.72 4.78 +1.27% 0.03 0.03 +0.00% 0.38 0.38 +0.00% 0.17 0.19 +11.76% 1.79 1.82 +1.68% 0.08 0.08 +0.00% 0.73 0.75 +2.74% 0.06 0.06 +0.00% 0.64 0.65 +1.56% 0.10 0.10 +0.00% 1.45 1.47 +1.38% 0.02 0.02 +0.00% 0.32 0.32 +0.00% 0.28 0.28 +0.00% 3.93 3.93 +0.00% 0.05 0.05 +0.00% 0.75 0.75 +0.00% 0.00 0.00 +(was 0) 0.00 0.00 +(was 0) ham mean and sdev for all runs 0.11 0.12 +9.09% 2.12 2.14 +0.94% spam mean spam sdev 93.87 94.76 +0.95% 16.36 15.16 -7.33% 95.16 95.67 +0.54% 16.65 15.28 -8.23% 93.93 94.92 +1.05% 18.64 16.68 -10.52% 90.62 91.60 +1.08% 24.57 22.95 -6.59% 93.95 94.55 +0.64% 18.31 17.23 -5.90% 91.06 92.13 +1.18% 22.59 21.43 -5.14% 91.77 92.38 +0.66% 21.80 21.14 -3.03% 91.32 92.28 +1.05% 24.35 22.21 -8.79% 92.67 93.66 +1.07% 20.41 19.35 -5.19% 92.45 93.44 +1.07% 21.54 20.09 -6.73% spam mean and sdev for all runs 92.68 93.54 +0.93% 20.76 19.39 -6.60% ham/spam mean difference: 92.57 93.42 +0.85 ============================================================ Comparison of pick_apart_urls with mine_received_headers set to True: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 0 to 0 tied mean fp % went from 0.0 to 0.0 tied false negative percentages 1.026 0.513 won -50.00% 1.026 0.000 won -100.00% 0.515 0.000 won -100.00% 3.608 2.577 won -28.58% 1.546 1.546 tied 2.577 2.577 tied 3.093 3.093 tied 3.608 2.062 won -42.85% 1.546 1.031 won -33.31% 2.062 1.546 won -25.02% won 7 times tied 3 times lost 0 times total unique fn went from 40 to 29 won -27.50% mean fn % went from 2.06079830822 to 1.49458102035 won -27.48% ham mean ham sdev 0.33 0.34 +3.03% 4.72 4.78 +1.27% 0.00 0.00 +(was 0) 0.03 0.03 +0.00% 0.11 0.12 +9.09% 1.42 1.43 +0.70% 0.00 0.00 +(was 0) 0.03 0.03 +0.00% 0.00 0.00 +(was 0) 0.04 0.04 +0.00% 0.02 0.02 +0.00% 0.21 0.22 +4.76% 0.00 0.00 +(was 0) 0.00 0.00 +(was 0) 0.37 0.37 +0.00% 5.20 5.20 +0.00% 0.00 0.00 +(was 0) 0.00 0.00 +(was 0) 0.00 0.00 +(was 0) 0.00 0.00 +(was 0) ham mean and sdev for 
all runs 0.08 0.08 +0.00% 2.27 2.28 +0.44% spam mean spam sdev 95.88 96.44 +0.58% 13.46 12.43 -7.65% 96.85 97.24 +0.40% 12.37 10.69 -13.58% 96.07 96.71 +0.67% 13.65 12.16 -10.92% 93.32 94.08 +0.81% 20.36 18.68 -8.25% 95.54 95.80 +0.27% 15.56 14.91 -4.18% 94.20 94.72 +0.55% 18.30 17.73 -3.11% 93.52 93.83 +0.33% 19.72 19.28 -2.23% 93.51 94.31 +0.86% 19.99 18.23 -8.80% 94.99 95.46 +0.49% 17.11 16.41 -4.09% 94.95 95.42 +0.49% 17.05 16.01 -6.10% spam mean and sdev for all runs 94.88 95.40 +0.55% 17.02 15.95 -6.29% ham/spam mean difference: 94.80 95.32 +0.52 ============================================================ And finally, here is the table.py comparison of all four option combinations: filename: base pick_apart_urls received+urls mine_received ham:spam: 2021:1942 2021:1942 2021:1942 2021:1942 fp total: 0 0 0 0 fp %: 0.00 0.00 0.00 0.00 fn total: 66 52 40 29 fn %: 3.40 2.68 2.06 1.49 unsure t: 200 187 159 155 unsure %: 5.05 4.72 4.01 3.91 real cost: $106.00 $89.40 $71.80 $60.00 best cost: $53.60 $50.00 $41.60 $39.60 h mean: 0.11 0.12 0.08 0.08 h sdev: 2.12 2.14 2.27 2.28 s mean: 92.68 93.54 94.88 95.40 s sdev: 20.76 19.39 17.02 15.95 mean diff: 92.57 93.42 94.80 95.32 k: 4.05 4.34 4.91 5.23 -- Kenny Pitt From pmarion at comcast.net Mon Jan 5 18:58:47 2004 From: pmarion at comcast.net (Pete) Date: Mon Jan 5 18:58:49 2004 Subject: [spambayes-dev] Outlook / winfax integration issues and bug #833346 Message-ID: <000601c3d3e7$dc39afa0$0e02a8c0@Belial> It's togh for me, as a non-programmer, to determine what might be pertinant to spambayes - so here are links re: winfax and Outlook. The first is specific to Outlook XP and Winfax 10.02. Perhaps obtaining file versions would be useful. http://service1.symantec.com/SUPPORT/faxprod.nsf/docid/2001110911391704 http://support.microsoft.com/default.aspx?scid=kb;en-us;196366 &Product=out http://service1.symantec.com/SUPPORT/faxprod.nsf/a74513c210251d318525688d004 c147a/7f3ca173ff5062ca88256ae600658c11 if this is more of a nuisance that of help, let me know. If you believe that this does pertain directly to the "bug", then I can post this to the bug report. If you want to do so, please be my guest. -- Peter D. Marion "Never attribute to malice that which can be adequately explained by stupidity." Hanlon's Razor THIS MESSAGE (INCLUDING ANY ATTACHMENTS) CONTAINS CONFIDENTIAL INFORMATION INTENDED FOR A SPECIFIC INDIVIDUAL, ENTITY OR PURPOSE THAT IS PRIVILEGED, CONFIDENTIAL AND EXEMPT FROM DISCLOSURE UNDER APPLICABLE LAWS. If you are not the intended recipient and/or you have received this e-mail in error or without authorization, you should delete this message immediately. Any disclosure, copying, or distribution of this message, or the taking of any action based on it, is strictly prohibited. -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040105/973dd1c1/attachment-0001.html From ta-meyer at ihug.co.nz Mon Jan 5 19:21:59 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Mon Jan 5 19:22:04 2004 Subject: [spambayes-dev] Welcome Kenny! Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777E8@its-xchg4.massey.ac.nz> Hi, I'm sure you've noticed plenty of messages, patches and so on from Kenny Pitt - it seemed to me that it was about time he got check-in rights (if nothing else, so that he can close bug reports and check in those typo bugfixes). So I convinced Mark to do the admin mumbo-jumbo and kpitt is now one of the family. Welcome from us all! 
=Tony Meyer From spambayes at whateley.com Mon Jan 5 19:26:57 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Mon Jan 5 19:28:15 2004 Subject: [spambayes-dev] Strange ProxyUI problem? Message-ID: <200401051628.12493.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 Hi Geniuses, I've occasionally complained (or at least alluded to) a problem when training sb_server using the web interface. The problem has the symptom of getting more than one days defects listed in the review page, mixed with finding no messages to review and similar strange date related problems. The reason I classified the problem as strange was that I've not seen any other reports of this type of problem... So, I finally borrowed 10 minutes to learn python and dig into the problem: In _buildReviewKeys in ProxyUI.py, the array returned by state.unknownCorpus.keys() is not sorted in any particular order. This leads to allKeys[-1] returning unpredictable starting dates instead of the latest date. It also violates the assumptions required to use bisect() on the keys. I solve my problem by adding an "allKeys.sort()" after getting the keys. I assume that the keys are expected to be in order? This also probably explains why I can't (usually) use any of the query functionality on the home page? It often returned the "message may have been deleted" message. The things I am doing that may qualify as "not typical": 1) I'm running from CVS. 2) I'm running on Linux. 3) I expire messages in the cache after a VERY long time. 4) I almost never "forget" any messages AND only train on unsures/mistakes and none extreme correctly classified stuff. Hence I have over a months worth of web pages available for review. Any thoughts before I create a defect and/or patch, etc. Thanks, Brendon. -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP/oA0ZuupqACStRwEQIxjACgw3E3ozBIlflyp91FlBPik7j/Az8AoOHE mrROfV7j+WCv8jqeu3pR3x1s =41zR -----END PGP SIGNATURE----- From tim.one at comcast.net Mon Jan 5 20:39:05 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 5 20:39:09 2004 Subject: [spambayes-dev] Welcome Kenny! In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046777E8@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > ... > So I convinced Mark to do the admin mumbo-jumbo and kpitt is now one > of the family. Welcome from us all! Indeed. Welcome, Kenny! Use your awesome new powers only for good, and nobody will get hurt . From tim.one at comcast.net Mon Jan 5 21:27:25 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 5 21:27:27 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: Message-ID: [Kenny Pitt, tests x-pick_apart_urls and mine_received_headers] Remarkable results, Kenny! I want your email mix . No effect on FP, major reductions in FN and Unsure rates, and > won 7 times > tied 3 times > lost 0 times was your weakest outcome. These options are as close to a pure win on your email as we've seen since The Early Days. Anyone else? Skip and Kenny both reported (surprisingly, to me) strong benefits from these gimmicks, and nobody yet has reported anything bad from them. If nobody does, there's no reason not to make them default behaviors (and to promote pick_apart_urls up from experimental status). From popiel at wolfskeep.com Mon Jan 5 22:37:37 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Mon Jan 5 22:37:42 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: Message from "Tim Peters" of "Mon, 05 Jan 2004 21:27:25 EST." 
References: Message-ID: <20040106033737.D4AC72DDB1@cashew.wolfskeep.com> In message: "Tim Peters" writes: >[Kenny Pitt, tests x-pick_apart_urls and mine_received_headers] >Anyone else? My mine_received_headers results (from last year) are up at http://www.wolfskeep.com/~popiel/spambayes/headers I haven't tested the new x-pick_apart_urls, but will do so if my machine lets me. - Alex From tim.one at comcast.net Mon Jan 5 23:31:53 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 5 23:32:01 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: <20040106033737.D4AC72DDB1@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > My mine_received_headers results (from last year) are up at > http://www.wolfskeep.com/~popiel/spambayes/headers Right. We're doing this again because, quite recently, by-eyeball staring at spam clues showed that the IP octets were being pasted together "from the wrong end", so the code changed significantly. Skip also added code to synthesize a "may be forged" token if the Received headers say so. > I haven't tested the new x-pick_apart_urls, but will do so > if my machine lets me. Well, we used to do death matches with 4,000 msgs total. If you can try just your most-recent 2K ham and 2K spam, that would be good enough. Since the 10-fold CV runs so far have shown universal "no harm" on each run, and some remarkably signficant wins, I'm really just wondering whether anyone has an email mix for which it's a disaster. From fheile at pacbell.net Tue Jan 6 01:25:30 2004 From: fheile at pacbell.net (Frank Heile) Date: Tue Jan 6 01:25:34 2004 Subject: [spambayes-dev] How to forward mail after filtering automatically Message-ID: <200401060625.i066PQHp014670@mta7.pltn13.pbi.net> First, THANKS SO MUCH FOR SPAMBAYES! It has made email usable again. I have told all my friends about it and I know several have used it and have thanked me for recommending it to them. Second, my question: Is there anyway I can get the wonderful benefits of SPAMBAYES on my handheld Palm based Kyocera 7135? Currently, both my handheld and home computer fetch the messages from my fheile@pacbell.net account. My home computer uses SPAMBAYES integrated into Outlook 2003 so everything is fine there. However when my handheld receives messages from the fheile@pacbell.net account I get all the SPAM mixed in with my HAM. I am willing to leave my home computer on all the time if there is some way to have my home computer forward only the HAM to the handheld. I would setup another email account and forward from home to that account so my handheld can download only HAM. Alternatively, is there anyway to have SPAMBAYES download all mail but only delete the SPAM from the pacbell.net server? Any other suggestions? Thanks for any help you can give me and thanks again for SPAMBAYES. Frank Heile -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040105/3e3dcb7e/attachment.html From kennypitt at hotmail.com Tue Jan 6 08:49:51 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 6 08:50:34 2004 Subject: [spambayes-dev] Welcome Kenny! In-Reply-To: Message-ID: Tim Peters wrote: > [Tony Meyer] >> ... >> So I convinced Mark to do the admin mumbo-jumbo and kpitt is now one >> of the family. Welcome from us all! > > Indeed. Welcome, Kenny! Use your awesome new powers only for good, > and nobody will get hurt . Luke, don't give in to the dark side! But seriously, thanks for the vote of confidence. 
Before I discovered SpamBayes, I was trying to lay the groundwork to write my own Outlook spam filter in C++. Fortunately, SpamBayes and Python have saved me from that miserable existance! I really appreciate all the work you guys have put in, and I'm happy to provide my meager contributions to help support the effort. -- Kenny Pitt From kennypitt at hotmail.com Tue Jan 6 09:01:46 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 6 09:02:28 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt, tests x-pick_apart_urls and mine_received_headers] > > Remarkable results, Kenny! I want your email mix . No effect > on FP, major reductions in FN and Unsure rates, and > >> won 7 times >> tied 3 times >> lost 0 times > > was your weakest outcome. These options are as close to a pure win > on your email as we've seen since The Early Days. I have the distinct advantage that 99% of my ham is either messages from a small set of mailing lists such as spambayes-dev, or company mail sent to the same list of 20 or so people. My training data quickly develops the "defacto whitelist" effect. Hopefully that doesn't skew my testing results too much, but I suppose it's the variety of e-mail mixes that makes testing useful anyway. -- Kenny Pitt From skip at pobox.com Tue Jan 6 11:12:25 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 6 11:14:08 2004 Subject: [spambayes-dev] Mapping tokens to messages Message-ID: <16378.56937.257023.834563@montanaro.dyndns.org> I just checked in two new scripts to the utilities directory. mkreversemap.py builds a map file which maps features to mail files and message-id's. extractmessage.py takes a set of features and generates one or two Unix mbox format files containing the messages which have those features. You generate such a map file with something like mkreversemap.py -d features.db -t spam Data/Spam mkreversemap.py -d features.db -t ham Data/Ham newspam and newham can be any sort of mail sources acceptable to spambayes.mboxutils.getmbox(). Each key maps a feature to a two-element tuple. Each tuple contains a dict which maps filenames to sets of message-id's. Features are typically specified explicitly using the -f flag, but a message source containing messages with X-Spambayes-Evidence headers can also be given as a feature source. Use it like so: extractmessages.py -d features.db -f "list-post:" -H msgs.ham to generate an mbox file called msgs.ham containing all messages referenced in features.db which contain the feature "list-post:". You can give both -H and -S flags to generate both ham and spam mbox files. You can give multiple -f flags as well. You can also give a suitable mail source instead of -f flags. All features which appear in any X-Spambayes-Evidence headers will be used: python extractmessages.py -d newmap.db -H msgids.ham mailbox This isn't as useful I don't think because it's like drinking from a firehose. You generally have to deal with far too many messages. Have fun with it. Skip From kennypitt at hotmail.com Tue Jan 6 11:51:13 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 6 11:57:32 2004 Subject: [spambayes-dev] Binary setting for website images Message-ID: The png files in the website/images directory are not marked as binary. No problem for the Unix folks who usually do the website updates, but they don't check out properly on Windows. Anyone opposed to my doing a "cvs admin -kb" on them? 
-- Kenny Pitt From tim.one at comcast.net Tue Jan 6 13:08:08 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 13:08:21 2004 Subject: [spambayes-dev] Binary setting for website images In-Reply-To: Message-ID: [Kenny Pitt] > The png files in the website/images directory are not marked as > binary. No problem for the Unix folks who usually do the website > updates, but they don't check out properly on Windows. > > Anyone opposed to my doing a "cvs admin -kb" on them? +1. We haven't gotten in trouble from this yet, but we will eventually -- binary files should always be marked -kb in CVS. From skip at pobox.com Tue Jan 6 13:21:11 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 6 13:21:18 2004 Subject: [spambayes-dev] problems locating messages with bigrams Message-ID: <16378.64663.779631.331956@montanaro.dyndns.org> After adding bigram generation (that bloats the reverse map pickle to a gargantuan size, btw), I used extractmessages.py to locate messages containing a strongly hammy bigram (prob 0.043) which seemed odd to me: "bi:nov 2003" and which contributed to a false negative in my test database. Turns out it's common in mailing list digests, like so: ... ------------------------------ Date: Sun, 23 Nov 2003 16:08:45 -0500 From: Alan Rowoth Subject: misdirected postings ... where the beginning of another section of an RFC 934 digest has a few headers. They are treated as message body. Accordingly, if you train on such digests as ham, you get a flurry of unigrams and bigrams which would be avoided if they were in the actual headers. Does the email Parser do the right thing with MIME digests? Maybe it needs to be trained to recognize RFC 934 digests (or I need to remove most digests from my ham database). Another apparently strongly hammy token (prob 0.092) had me confused for a bit. When I ran extractmessages.py to identify the messages containing 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up. That should have resulted in a spamprob close to 0.5, not 0.1. I eventually figured out that the way I generate bigrams: for t in Classifier()._enhance_wordstream(tokenize(msg)): ... uses the current training database to decide which tokens should be generated, the leading & trailing unigrams or the bigram of the two. All possible bigrams are not generated. I can change that easily enough. Of course, that will bloat the pickle file even further, and not really improve the chances of identifying the actual messages which contribute to the score in a given message. I guess generating bigram info to use with cross-validation results will be an approximation to reality at best. Any suggestions for improving this situation? Skip From tdickenson at devmail.geminidataloggers.co.uk Tue Jan 6 13:38:45 2004 From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Tue Jan 6 13:38:49 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: <16378.64663.779631.331956@montanaro.dyndns.org> References: <16378.64663.779631.331956@montanaro.dyndns.org> Message-ID: <200401061838.45379.tdickenson@devmail.geminidataloggers.co.uk> On Tuesday 06 January 2004 18:21, Skip Montanaro wrote: > Another apparently strongly hammy token (prob 0.092) had me confused for a > bit. When I ran extractmessages.py to identify the messages containing > 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up. I cant help with any bigram problem, but I recognise those "skip:w 20" tokens. 
The tokenizer only performs its special URL handling if a URL includes the http: prefix. For a URL that omits that prefix and starts with www, all we get is one skip token. I have a patch that fixes this in the url-detecting regular expression: http://sourceforge.net/tracker/?func=detail&aid=830290&group_id=61702&atid=498105 -- Toby Dickenson From tim.one at comcast.net Tue Jan 6 13:42:52 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 13:43:06 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: <16378.64663.779631.331956@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... > I eventually figured out that the way I generate bigrams: > > for t in Classifier()._enhance_wordstream(tokenize(msg)): > ... > > uses the current training database to decide which tokens should be > generated, I think you're hallucinating here -- _enhance_wordstream() doesn't make any use of training data. Whenever tokenize() yields a stream of N tokens, _enhance_wordstream() yields a derived stream of 2*N-1 tokens. > the leading & trailing unigrams or the bigram of the two. All > possible bigrams are not generated. A specific example would clarify what you think you mean by these phrases. By the definition of bigrams *intended* by the code, only adjacent token pairs can be pasted together into bigrams. If the 4 incoming tokens are a b c d, the 2*4-1 = 7 output tokens are a b bi:a b c bi:b c d bi:c d and it doesn't matter whether a, b, c, and/or d have or haven't been trained on previously. From kennypitt at hotmail.com Tue Jan 6 13:52:05 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 6 13:52:48 2004 Subject: [spambayes-dev] Binary setting for website images In-Reply-To: Message-ID: Tim Peters wrote: > [Kenny Pitt] >> The png files in the website/images directory are not marked as >> binary. No problem for the Unix folks who usually do the website >> updates, but they don't check out properly on Windows. >> >> Anyone opposed to my doing a "cvs admin -kb" on them? > > +1. We haven't gotten in trouble from this yet, but we will > eventually -- binary files should always be marked -kb in CVS. Done. -- Kenny Pitt From skip at pobox.com Tue Jan 6 13:56:25 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 6 13:56:33 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: <200401061838.45379.tdickenson@devmail.geminidataloggers.co.uk> References: <16378.64663.779631.331956@montanaro.dyndns.org> <200401061838.45379.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <16379.1241.694739.181427@montanaro.dyndns.org> >> Another apparently strongly hammy token (prob 0.092) had me confused >> for a bit. When I ran extractmessages.py to identify the messages >> containing 'bi:skip:w 20 skip:w 10', only two hams and two spams >> turned up. Toby> I have a patch ... Thanks. It looks useful. I assigned that to myself and will see how it does. Seems like it can't hurt. Skip From skip at pobox.com Tue Jan 6 13:58:21 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 6 13:58:33 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: References: <16378.64663.779631.331956@montanaro.dyndns.org> Message-ID: <16379.1357.151388.187711@montanaro.dyndns.org> >> for t in Classifier()._enhance_wordstream(tokenize(msg)): >> ... 
>> >> uses the current training database to decide which tokens should be >> generated, Tim> I think you're hallucinating here -- _enhance_wordstream() doesn't Tim> make any use of training data. Hmmm... I thought _enhance_wordstream() was the thing which "tiled" the token space. Why isn't this code in tokenize.py if it doesn't rely on training data? Skip From tim.one at comcast.net Tue Jan 6 14:15:22 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 14:15:35 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: <16379.1357.151388.187711@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Hmmm... I thought _enhance_wordstream() was the thing which "tiled" > the token space. No, tiling requires knowledge of spamprobs, and __getclues() does the tiling. _enhance_wordstream() generates the universe of possible tiles (all individual tokens and all pairs of adjacent tokens); a subset of those is selected by _getclues(), based on the tiles' spamprobs, to create a tiling (a partitioning of the token stream into non-overlapping features). > Why isn't this code in tokenize.py if it doesn't rely on training > data? It's a transformation of tokenize's output, so it doesn't really "belong" in tokenize either. It's certainly more convenient to do it in the classifier, and _getclues() (which inarguably belongs in the classifier) requires intimate knowledge of how the universe of tiles was generated in order to guarantee non-overlap among the tiles it chooses. From tim.one at comcast.net Tue Jan 6 19:58:34 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 19:58:39 2004 Subject: [spambayes-dev] problems locating messages with bigrams In-Reply-To: <16378.64663.779631.331956@montanaro.dyndns.org> Message-ID: [Skip] > ... > Another apparently strongly hammy token (prob 0.092) had me confused > for a bit. It still has me confused. > When I ran extractmessages.py to identify the messages containing > 'bi:skip:w 20 skip:w 10', only two hams and two spams turned up. > That should have resulted in a spamprob close to 0.5, not 0.1. That *may* be true, but you haven't revealed enough to say whether it should be true. If, for example, you've trained on much more spam than ham, a feature that appears twice in each kind will have a spamprob less than 0.5, and closer to 0 the greater the training imbalance. > I eventually figured out that the way I generate bigrams: > > for t in Classifier()._enhance_wordstream(tokenize(msg)): > ... > > uses the current training database to decide which tokens should be > generated, the leading & trailing unigrams or the bigram of the two. > All possible bigrams are not generated. We covered most of this before, but I'll add that the same code is used to generate features for training (which is a different process than scoring): a feature is in your training database if and only if _enhance_wordstream() generated the feature during training of some message you trained on (there's no concept of "tiling" during training, only during scoring). From tim.one at comcast.net Tue Jan 6 20:21:37 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 20:21:40 2004 Subject: [spambayes-dev] A URL experiment In-Reply-To: Message-ID: [Kenny Pitt] > I have the distinct advantage that 99% of my ham is either messages > from a small set of mailing lists such as spambayes-dev, or company > mail sent to the same list of 20 or so people. My training data > quickly develops the "defacto whitelist" effect. 
Hopefully that > doesn't skew my testing results too much, but I suppose it's the > variety of e-mail mixes that makes testing useful anyway. Indeed it is -- email mix varies *a lot* across people, and this is one of the only projects I know of that tests with multiple real-life personal corpora. Your mix may not be like mine, but I bet it's like thousands of others (as is mine, but a different bunch of thousands) -- all testing is appreciated here, and the greater the variety the more sure we can be that changes are truly winners. From tim.one at comcast.net Tue Jan 6 22:20:43 2004 From: tim.one at comcast.net (Tim Peters) Date: Tue Jan 6 22:20:51 2004 Subject: [spambayes-dev] How to forward mail after filtering automatically In-Reply-To: <200401060625.i066PQHp014670@mta7.pltn13.pbi.net> Message-ID: [Frank Heile] > ... > Second, my question: Is there anyway I can get the wonderful > benefits of SPAMBAYES on my handheld Palm based Kyocera 7135? > Currently, both my handheld and home computer fetch the messages from > my fheile@pacbell.net account. My home computer uses SPAMBAYES > integrated into Outlook 2003 so everything is fine there. However > when my handheld receives messages from the fheile@pacbell.net > account I get all the SPAM mixed in with my HAM. I am willing to > leave my home computer on all the time if there is some way to have > my home computer forward only the HAM to the handheld. I would setup > another email account and forward from home to that account so my > handheld can download only HAM. > > Alternatively, is there anyway to have SPAMBAYES download all mail > but only delete the SPAM from the pacbell.net server? Any other > suggestions? If you're a programmer, you could try adapting Andrew Dalke's script: http://www.entrian.com/sbwiki/SpamBayesCuller A no-programming strategy using Outlook may, or may not, work -- I do this with Outlook 2000, but don't know anything about OL2003: First set the properties on your pacbell account to leave a copy of messages on the server, to delete items from the server when you delete them from "Deleted Items", and to delete items regardless after (say) 5 days. pacbell may or may not *allow* you to leave copies of msgs on the server. Most ISPs do. If pacbell doesn't, this won't work. Next create a brand new .pst file, create a Spam folder inside it, and tell SpamBayes to move spam into that Spam folder. Whether this is a feature or bug in Outlook 2000 I haven't been able to find out: messages moved to a .pst file other than the .pst file containing the Inbox are deleted from the server when the account options are set up as above, the same as if you had deleted them from "Deleted Items", and despite that they've neither been deleted nor ever been in "Deleted Items". If OL2003 has this feature/bug too, you're almost there. All that remains is to tell Outlook to download new messages every (say) 10 minutes, and leave Outlook running all the time. Every 10 minutes, then, the spam that arrived in the last 10 minutes will get deleted from the server, so you won't see them again from any other way of accessing that account. 
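For the programming route, the general shape of a Dalke-style culler is roughly the following. This is only a sketch: the host, account details, and is_spam() scorer are placeholders (wire in your own trained SpamBayes database), and it is not the SpamBayesCuller script itself:

import poplib

def is_spam(message_text):
    # Placeholder: score message_text with your trained classifier here.
    raise NotImplementedError

def cull(host, user, password):
    pop = poplib.POP3(host)
    pop.user(user)
    pop.pass_(password)
    count, size = pop.stat()
    for i in range(1, count + 1):
        response, lines, octets = pop.retr(i)
        if is_spam("\n".join(lines)):
            pop.dele(i)     # only marked for deletion here...
    pop.quit()              # ...the server deletes the marked messages at QUIT

Run from a scheduler every ten minutes on the always-on home machine, this gives much the same effect as the Outlook setup described above, without needing Outlook itself to do the deleting.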
From tameyer at ihug.co.nz Wed Jan 7 00:10:31 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Jan 7 00:10:37 2004
Subject: [spambayes-dev] Binary setting for website images
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CC22@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777EC@its-xchg4.massey.ac.nz>

[Kenny Pitt]
> The png files in the website/images directory are not marked as
> binary.  No problem for the Unix folks who usually do the website
> updates, but they don't check out properly on Windows.

I guess that's why they always appear to need updating here (I just ignore
them).  Wow, making you a developer paid off quick .

=Tony Meyer

From tameyer at ihug.co.nz Wed Jan 7 00:50:55 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Jan 7 00:51:03 2004
Subject: [spambayes-dev] Experimental SpamBayes build available
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C998@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A34@its-xchg4.massey.ac.nz>

> Ah, okay.  Yes, I have used pop3proxy on the machine before.
> Locating and zapping my SpamBayes folder allows things to get
> going (sort of).  I can connect to localhost:8880.

This included the bayescustomize.ini file, I presume?  (You appear to be
back from a pickle to the bsddb default).

> I tried "Save & Shutdown" and got
>
> Traceback (most recent call last):
[...]
> File "C:\Python23\lib\shelve.py", line 130, in __setitem__
> self.dict[key] = f.getvalue()
>
> TypeError: object does not support item assignment

Are you sure you aren't using 1.0a6? .  This, at least when I've seen it
before, says that the DBDictClassifier's db was None - this does happen
after closing, but it shouldn't be closed before it shuts down (obviously).
What had you done between starting it up and doing save & shutdown?  Change
options, review messages?  I wondered if it might be if nothing at all was
done, but I just tried that here and it worked fine.

=Tony Meyer

From tdickenson at devmail.geminidataloggers.co.uk Wed Jan 7 06:15:16 2004
From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson)
Date: Wed Jan 7 06:15:24 2004
Subject: [spambayes-dev] subjective assessment of bigrams
Message-ID: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk>

I've been using bigrams since 2003-12-18, and thought you may be interested
in some subjective feedback.  I am using my overnight-train-on-everything
regime, with 14000 hams and 2000 spams.

* My database size grew from 10M to 80M.  Overnight training runs extended
  from 5 minutes to 20 minutes

* A much larger proportion of spams now score 0.99 or over (I filter these
  into a folder that I never normally look at).  Spams that score 0.98 or
  lower I filter into a 'probable spam' folder and check manually every
  week; I am seeing a much smaller proportion of messages in this category.

* I have seen a qualitative change in the type of spam that gets classified
  as unsure.  Most of my unsures used to be very small messages, spams
  selling something I might otherwise be interested in, or other ones where
  'unsure' made sense.  It had never missed a Nigerian or porn spam for many
  months.... until I enabled bigrams.  With bigrams, a few have scored
  between 0.50 and 0.55.  I tried untraining some of them, then
  reclassifying with bigrams turned off; they all scored above 0.90.

I am happy to experiment if anyone has any suggestions.
-- Toby Dickenson From skip at pobox.com Wed Jan 7 08:05:12 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 7 08:05:20 2004 Subject: [spambayes-dev] subjective assessment of bigrams In-Reply-To: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> References: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <16380.1032.336329.978242@montanaro.dyndns.org> Toby> Ive been using bigrams since 2003-12-18, and thought you may be Toby> interested in some subjective feedback. I am using my Toby> overnight-train-on-everything regime, with 14000 hams and 2000 Toby> spams. Wow! Any chance you could whack off the oldest 12,000 or so hams to bring your ham:spam ratio back into balance? Toby> * My database size grew from 10M to 80M. Overnight training runs Toby> extended from 5 minutes to 20 minutes This isn't surprising given the number of messages in your database. Bigrams *will* bloat your database. I think that to use them effectively, you should probably run with a fairly small training database. I have a bit over 500 each of ham and spam at this point (I've been experimenting with some automatic training, so my database grew considerably until I figured some things out) and currently have a DBDictClassifier database of 10.6MB. The pickle file grows in proportion roughly linear to the number of keys in the dictionary, while the DBDictClassifier file grows in marked jumps, roughly doubling when it needs to resize, then remaining nearly constant in size until a fairly large number of new keys are added. Toby> * A much larger proportion of spams now score 0.99 or over (I Toby> filters these into a folder that I never normally look Toby> at). Spams that score 0.98 or lower I filter into a 'probable Toby> spam' folder and check manually every week; I am seeing a much Toby> smaller proportion of messages in this category. I have been using bigrams for awhile as well and find a lot more spam winds up with an 0.99/1.00 score (which after a few days of checking I reroute to /dev/null). I've been lazy the past couple days though, and haven't paid any attention to my unsures or probable spam files. (I have enough other "good" mail to read to keep me busy, thank-you-very-much.) Toby> * I have seen a qualitative change in the type of spam that gets Toby> classified as unsure. Most of my unsures used to be very small Toby> messages, spams selling something I might otherwise be Toby> interested in, or other ones where 'unsure' made sense. It had Toby> never missed a nigerian or porn spam for many months.... until Toby> I enabled bigrams. With bigrams, a few have scored between 0.50 Toby> and 0.55. I tried untraining some of them, then reclassifying Toby> with bigrams turned off; they all scored above 0.90. This hasn't been a problem with me, but it's not entirely surprising. The Nigerian spams tend to be a lot more chatty than most sales pitches. I think they have would tend to have many bigrams that would turn up in regular text, but not in standard late-night-auto-sales-commercials type of text most spam uses. 
Skip From tdickenson at devmail.geminidataloggers.co.uk Wed Jan 7 09:06:39 2004 From: tdickenson at devmail.geminidataloggers.co.uk (Toby Dickenson) Date: Wed Jan 7 09:06:42 2004 Subject: [spambayes-dev] subjective assessment of bigrams In-Reply-To: <16380.1032.336329.978242@montanaro.dyndns.org> References: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> <16380.1032.336329.978242@montanaro.dyndns.org> Message-ID: <200401071406.39265.tdickenson@devmail.geminidataloggers.co.uk> On Wednesday 07 January 2004 13:05, Skip Montanaro wrote: > Toby> Ive been using bigrams since 2003-12-18, and thought you may be > Toby> interested in some subjective feedback. I am using my > Toby> overnight-train-on-everything regime, with 14000 hams and 2000 > Toby> spams. > > Wow! Any chance you could whack off the oldest 12,000 or so hams to bring > your ham:spam ratio back into balance? Im not sure if you intended a ;-) in there. I did try that a while ago (before bigrams) with no subjective difference. -- Toby Dickenson From skip at pobox.com Wed Jan 7 09:48:42 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 7 09:48:57 2004 Subject: [spambayes-dev] subjective assessment of bigrams In-Reply-To: <200401071406.39265.tdickenson@devmail.geminidataloggers.co.uk> References: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> <16380.1032.336329.978242@montanaro.dyndns.org> <200401071406.39265.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <16380.7242.950128.273023@montanaro.dyndns.org> Toby> I am using my overnight-train-on-everything regime, with 14000 Toby> hams and 2000 spams. >> Wow! Any chance you could whack off the oldest 12,000 or so hams to >> bring your ham:spam ratio back into balance? Toby> Im not sure if you intended a ;-) in there. I did try that a while Toby> ago (before bigrams) with no subjective difference. Maybe. Maybe not. ;-) I didn't think of it as smiley territory, but I suppose it could be interpreted that way. Most folks running with a very unbalanced training database experience problems. Can you easily run the cross-validation tests? If so, you might try breaking your ham and spam up into a structure suitable for running timcv. That would be one message per file in the Data/Ham/SetN and Data/Spam/SetN format. You can build this structure easily using the splitndirs.py script in the utilities directory. Then run timcv.py giving it the --HamTrain option to restrict the number of hams per set from 200 to 1400 (with ten sets your spam dir will have about 200 messages). Compare the results as you vary the arg to HamTrain and see how (if) things change as the number of hams used per set changes. (Also, use the -s flag to use a constant random number seed so if you run twice with the same --HamTrain arg you select the same subset of messages and get the same results.) Skip From kennypitt at hotmail.com Wed Jan 7 09:56:39 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jan 7 09:57:22 2004 Subject: [spambayes-dev] RE: [Spambayes] Spam Baes Question In-Reply-To: Message-ID: I'm moving this to spambayes-dev since we're starting to get into implementation issues. Coe, Bob wrote: > [snip some discussion of deleted Spam folders] > > Entirely aside from the above, it should be easier to go with the > Outlook flow than to buck it. If you do go to keeping track by name, > I'll bet you end up having to wring some subtle bugs out of the code. 
Speaking of the Outlook flow, notice that Outlook does not allow you to delete the "Deleted Items" folder (or the "Junk E-mail" folder if running Outlook 2003). I'm pretty sure we can't prevent the user from deleting the SpamBayes spam folder, but I'd love to see if we could at least warn them when they delete it and maybe offer to restore it. We already track items added to our watched folders. I'm wondering if we could track items added to the Deleted Items folder in the same way, and watch for the id of the Spam and Unsure folders. That might allow us to detect when the user deletes one of those folders before they've emptied Deleted Items and lost the folder entirely. -- Kenny Pitt From kennypitt at hotmail.com Wed Jan 7 12:20:15 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jan 7 12:21:04 2004 Subject: [spambayes-dev] setup_all: dialogs.resources.dialogs missing Message-ID: setup_all.py contains the line: includes = "dialogs.resources.dialogs" to load the Outlook dialog stuff. Unfortunately, Outlook2000\dialogs\resources\dialogs.py isn't generated until you register and load the add-in from that source tree, so it comes up missing when I try to run setup_all.py on a clean source tree. Any way to get setup_all.py to generate this file before attempting to include it? -- Kenny Pitt From popiel at wolfskeep.com Wed Jan 7 12:40:21 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Jan 7 12:40:25 2004 Subject: [spambayes-dev] subjective assessment of bigrams In-Reply-To: Message from Toby Dickenson of "Wed, 07 Jan 2004 11:15:16 GMT." <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> References: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> Message-ID: <20040107174021.48F372DF19@cashew.wolfskeep.com> In message: <200401071115.16790.tdickenson@devmail.geminidataloggers.co.uk> writes: >Ive been using bigrams since 2003-12-18, and thought you may be interested in >some subjective feedback. I am using my overnight-train-on-everything regime, >with 14000 hams and 2000 spams. Cool! There _was_ someone else doing that (overnight TOE)! FWIW, I've converted my overnight rebuilds to TOAE, taking slightly longer to make the database, but making a half-size database. No perceptible change in accuracy. I'm still using unigrams, though. >* My database size grew from 10M to 80M. Overnight training runs extended >from 5 minutes to 20 minutes Ugh. I suspect that this will keep me from using bigrams for my real mail for a while; the retraining is already taking me 2 hours. >* A much larger proportion of spams now score 0.99 or over (I filters these >into a folder that I never normally look at). Spams that score 0.98 or lower >I filter into a 'probable spam' folder and check manually every week; I am >seeing a much smaller proportion of messages in this category. I don't make this distinction for review, so that wouldn't be a big deal for me. (My review distinction is based on whether SpamAssassin said it was spam, too.) >* I have seen a qualitative change in the type of spam that gets classified >as unsure. Most of my unsures used to be very small messages, spams selling >something I might otherwise be interested in, or other ones where 'unsure' >made sense. It had never missed a nigerian or porn spam for many months.... >until I enabled bigrams. With bigrams, a few have scored between 0.50 and >0.55. I tried untraining some of them, then reclassifying with bigrams turned >off; they all scored above 0.90. 
Nigerian style spams have been the one thing that consistently gets through my filter instance; the pseudo-business/legal/confidence language of them is a fairly good match for the combination of design discussions on the pennmush-developers list and my personal finance correspondence. It would be interesting to see if bigrams helped that any, but I'm not holding any great hopes. >I am happy to experiment if anyone has any suggestions. Not really. Sorry. - Alex From spambayes at whateley.com Wed Jan 7 13:48:42 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Wed Jan 7 13:48:55 2004 Subject: [spambayes-dev] Rebuilding Resources Question Message-ID: <200401071048.52469.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I suppose this is a silly question, but I'm out of time trying to figure it out... Can anyone tell me how to rebuild the "ui_html.py" resource from "ui.html" after modifying it? Thanks, Brendon. -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP/xUi5uupqACStRwEQLJ/gCgtaKGyq6fouJW1ETP7sBSx/wrawIAniZQ 4jvRDfxq9+M6KvgREatXGW5p =lLMW -----END PGP SIGNATURE----- From kennypitt at hotmail.com Wed Jan 7 15:18:59 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jan 7 15:19:44 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <200401071048.52469.spambayes@whateley.com> Message-ID: Brendon Whateley wrote: > Can anyone tell me how to rebuild the "ui_html.py" resource from > "ui.html" after modifying it? You need to install "resourcepackage" from here: http://resourcepackage.sourceforge.net/ IIRC, the __init__.py in spambayes/resources is already set up to use resourcepackage. If ui.html is newer than ui_html.py, then ui_html.py file should be regenerated automatically when you run SpamBayes. -- Kenny Pitt From spambayes at whateley.com Wed Jan 7 16:12:46 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Wed Jan 7 16:12:54 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: References: Message-ID: <200401071312.49897.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wednesday 07 January 2004 12:18 pm, Kenny Pitt wrote: > Brendon Whateley wrote: > > Can anyone tell me how to rebuild the "ui_html.py" resource from > > "ui.html" after modifying it? > > You need to install "resourcepackage" from here: > http://resourcepackage.sourceforge.net/ > > IIRC, the __init__.py in spambayes/resources is already set up to use > resourcepackage. If ui.html is newer than ui_html.py, then ui_html.py > file should be regenerated automatically when you run SpamBayes. Thanks Kenny, That is what I was expecting to happen. I had installed the resource package, but can't seem to be able to get it to build automatically. Is there a manual way to do the resource build? I am probably missing something very simple. Thanks again, Brendon. -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP/x2TpuupqACStRwEQI7uACffRVsAxQHRtLLa1CpKLPW74VUScgAoIMk 6JG8c3kOBtULmIZVbGAEZaKb =0jjd -----END PGP SIGNATURE----- From tameyer at ihug.co.nz Wed Jan 7 16:44:21 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 16:44:26 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CEE9@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046777FD@its-xchg4.massey.ac.nz> > That is what I was expecting to happen. I had installed the > resource package, but can't seem to be able to get it to > build automatically. 
Is resourcepackage definitely installed? If you run python and type "import resourcepackage", do you get an ImportError? It *should* just work, if it's installed. > Is there a manual way to do the resource > build? I am probably missing something very simple. You can force the rebuild (instead of only doing it when the date is newer) by passing force=1 in the call to scan. There's a commented out version of this in __init__.py, so you could just modify that. This would fix the problem if it's something to do with dates. If it's something else, then I'm not sure what you'll have to do - the resourcepackage documentation would have more details. I'm sure it would be too much hassle to bother with, though . =Tony Meyer From kennypitt at hotmail.com Wed Jan 7 16:46:15 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jan 7 16:47:01 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <200401071312.49897.spambayes@whateley.com> Message-ID: Brendon Whateley wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > On Wednesday 07 January 2004 12:18 pm, Kenny Pitt wrote: >> Brendon Whateley wrote: >>> Can anyone tell me how to rebuild the "ui_html.py" resource from >>> "ui.html" after modifying it? >> >> You need to install "resourcepackage" from here: >> http://resourcepackage.sourceforge.net/ >> >> IIRC, the __init__.py in spambayes/resources is already set up to use >> resourcepackage. If ui.html is newer than ui_html.py, then >> ui_html.py file should be regenerated automatically when you run >> SpamBayes. > > Thanks Kenny, > > That is what I was expecting to happen. I had installed the resource > package, but can't seem to be able to get it to build automatically. > Is there a manual way to do the resource build? I am probably > missing something very simple. I'm afraid I'm not qualified to diagnose why resourcepackage isn't working. It always seems to work here, unless there is a step I do by rote that I'm not thinking of at the moment. The only thing I can suggest is to try deleting both ui_html.py and ui_html.pyc and then running again. Since the web ui can't run without those files, you should get an error if they don't get generated. This would also force it to generate the files if for some reason it isn't detecting the change to ui.html. -- Kenny Pitt From ta-meyer at ihug.co.nz Wed Jan 7 18:13:30 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 18:13:37 2004 Subject: [spambayes-dev] [ 817813 ] Consider bad spelling a sign of spam Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A38@its-xchg4.massey.ac.nz> The feature request says: """ Add a spelling checker and reasonable sized dictionary. If more than xx% of the message is misspelled (esp the subject), consider it to be spam. Many emails have gotten past Spam Bayes recently because their spelling is like "bfuqclvfphz". """ [http://sourceforge.net/tracker/?group_id=61702&atid=498106&func=detail&aid= 817813] "consider it to be spam" isn't something we do, of course :) I created a patch to generate a token that reflects the percentage of words in the message that are in a particular (English) dictionary. So one extra token per message, guaranteed, with a maximum of 100 new tokens. 
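A sketch of the kind of hook described above (not the actual diff; the regex, the "english-pct:" prefix and the names are invented, and `words` is assumed to be a set of lower-cased dictionary words loaded elsewhere):

    import re

    word_re = re.compile(r"[a-zA-Z']+")

    def dictionary_token(text, words, bucket=1):
        # One synthetic token giving the percentage of words in `text`
        # found in the dictionary.  bucket=1 keeps the full 0-100 value;
        # bucket=10 truncates to the nearest 10%, as in the second test.
        found = [w.lower() for w in word_re.findall(text)]
        if not found:
            return "english-pct:none"
        known = len([w for w in found if w in words])
        pct = (100 * known // len(found)) // bucket * bucket
        return "english-pct:%d" % pct

The tokenizer would emit this single extra token per message alongside everything it already generates.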
Results: -> tested 357 hams & 395 spams against 3311 hams & 3704 spams [all other stat lines are more-or-less the same as this, and have been snipped] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.279 won -50.09% 0.287 0.287 tied 0.000 0.000 tied won 1 times tied 9 times lost 0 times total unique fp went from 6 to 5 won -16.67% mean fp % went from 0.164881884948 to 0.136948924055 won -16.94% false negative percentages 0.253 0.253 tied 0.781 0.781 tied 0.462 0.462 tied 0.756 0.756 tied 0.243 0.243 tied 0.247 0.247 tied 0.240 0.240 tied 0.494 0.494 tied 0.973 0.973 tied 0.454 0.454 tied won 0 times tied 10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.490257037938 to 0.490257037938 tied ham mean ham sdev 1.18 1.18 +0.00% 7.76 7.72 -0.52% 0.99 0.99 +0.00% 6.64 6.62 -0.30% 0.84 0.85 +1.19% 6.14 6.22 +1.30% 1.99 2.03 +2.01% 9.46 9.61 +1.59% 0.49 0.51 +4.08% 3.59 3.64 +1.39% 0.85 0.85 +0.00% 5.45 5.45 +0.00% 1.16 1.17 +0.86% 9.30 9.29 -0.11% 1.20 1.20 +0.00% 8.13 8.00 -1.60% 1.55 1.56 +0.65% 8.05 8.07 +0.25% 0.47 0.48 +2.13% 3.22 3.28 +1.86% ham mean and sdev for all runs 1.08 1.09 +0.93% 7.13 7.14 +0.14% spam mean spam sdev 98.75 98.77 +0.02% 8.72 8.59 -1.49% 97.67 97.66 -0.01% 11.26 11.24 -0.18% 98.08 98.07 -0.01% 10.12 10.09 -0.30% 98.16 98.17 +0.01% 10.19 10.11 -0.79% 98.35 98.35 +0.00% 8.77 8.79 +0.23% 98.45 98.47 +0.02% 8.97 8.83 -1.56% 98.35 98.36 +0.01% 9.73 9.62 -1.13% 98.25 98.22 -0.03% 9.16 9.25 +0.98% 97.93 97.93 +0.00% 11.99 11.90 -0.75% 98.92 98.92 +0.00% 7.62 7.58 -0.52% spam mean and sdev for all runs 98.30 98.30 +0.00% 9.72 9.66 -0.62% ham/spam mean difference: 97.22 97.21 -0.01 I wondered whether 100 tokens was too many and bucketing this would help, so I changed it to truncate to the nearest 10%. The cmp.py results are basically the same, but here's a table.py of the three - note that with the 100 the number of unsures went up, but with 10 there was still the minor gain with the same number of unsures. filename: bases engs eng10s ham:spam: 3668:4099 3668:4099 3668:4099 fp total: 6 5 5 fp %: 0.16 0.14 0.14 fn total: 20 20 20 fn %: 0.49 0.49 0.49 unsure t: 178 183 178 unsure %: 2.29 2.36 2.29 real cost: $115.60 $106.60 $105.60 best cost: $93.00 $94.20 $93.60 h mean: 1.08 1.09 1.09 h sdev: 7.13 7.14 7.15 s mean: 98.30 98.30 98.30 s sdev: 9.72 9.66 9.68 mean diff: 97.22 97.21 97.21 k: 5.77 5.79 5.78 Bigram results aren't so great. 
Results with the original 100 buckets: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.279 0.279 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.0278551532033 to 0.0278551532033 tied false negative percentages 0.253 0.253 tied 1.042 1.042 tied 0.693 0.693 tied 0.252 0.252 tied 0.728 0.728 tied 0.000 0.000 tied 0.481 0.481 tied 0.494 0.494 tied 0.730 0.730 tied 0.227 0.227 tied won 0 times tied 10 times lost 0 times total unique fn went from 20 to 20 tied mean fn % went from 0.489899714703 to 0.489899714703 tied ham mean ham sdev 0.95 0.98 +3.16% 6.64 6.86 +3.31% 0.83 0.82 -1.20% 5.53 5.49 -0.72% 0.49 0.47 -4.08% 4.08 4.10 +0.49% 1.53 1.55 +1.31% 8.16 8.29 +1.59% 0.30 0.31 +3.33% 3.25 3.26 +0.31% 0.70 0.70 +0.00% 5.27 5.27 +0.00% 0.85 0.83 -2.35% 7.11 7.06 -0.70% 0.93 0.90 -3.23% 7.23 7.02 -2.90% 0.90 0.88 -2.22% 6.47 6.36 -1.70% 0.41 0.41 +0.00% 4.07 4.07 +0.00% ham mean and sdev for all runs 0.80 0.79 -1.25% 6.01 6.01 +0.00% spam mean spam sdev 98.71 98.74 +0.03% 7.83 7.78 -0.64% 97.38 97.36 -0.02% 12.55 12.54 -0.08% 97.78 97.77 -0.01% 11.09 11.06 -0.27% 97.89 97.87 -0.02% 10.49 10.49 +0.00% 97.90 97.94 +0.04% 10.03 9.97 -0.60% 98.32 98.29 -0.03% 8.63 8.74 +1.27% 98.19 98.21 +0.02% 10.21 10.12 -0.88% 97.68 97.56 -0.12% 10.99 11.18 +1.73% 97.86 97.88 +0.02% 11.56 11.56 +0.00% 98.73 98.72 -0.01% 7.57 7.65 +1.06% spam mean and sdev for all runs 98.05 98.04 -0.01% 10.20 10.21 +0.10% ham/spam mean difference: 97.25 97.25 +0.00 And with 10 buckets: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.279 0.279 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.0278551532033 to 0.0278551532033 tied false negative percentages 0.253 0.253 tied 1.042 1.042 tied 0.693 0.693 tied 0.252 0.252 tied 0.728 0.728 tied 0.000 0.000 tied 0.481 0.721 lost +49.90% 0.494 0.741 lost +50.00% 0.730 0.730 tied 0.227 0.227 tied won 0 times tied 8 times lost 2 times total unique fn went from 20 to 22 lost +10.00% mean fn % went from 0.489899714703 to 0.538629534266 lost +9.95% ham mean ham sdev 0.95 0.98 +3.16% 6.64 6.86 +3.31% 0.83 0.81 -2.41% 5.53 5.48 -0.90% 0.49 0.47 -4.08% 4.08 4.07 -0.25% 1.53 1.55 +1.31% 8.16 8.24 +0.98% 0.30 0.31 +3.33% 3.25 3.26 +0.31% 0.70 0.70 +0.00% 5.27 5.28 +0.19% 0.85 0.84 -1.18% 7.11 7.07 -0.56% 0.93 0.90 -3.23% 7.23 7.14 -1.24% 0.90 0.91 +1.11% 6.47 6.50 +0.46% 0.41 0.42 +2.44% 4.07 4.15 +1.97% ham mean and sdev for all runs 0.80 0.80 +0.00% 6.01 6.03 +0.33% spam mean spam sdev 98.71 98.74 +0.03% 7.83 7.82 -0.13% 97.38 97.39 +0.01% 12.55 12.57 +0.16% 97.78 97.80 +0.02% 11.09 11.10 +0.09% 97.89 97.91 +0.02% 10.49 10.42 -0.67% 97.90 97.96 +0.06% 10.03 9.92 -1.10% 98.32 98.29 -0.03% 8.63 8.80 +1.97% 98.19 98.20 +0.01% 10.21 10.23 +0.20% 97.68 97.58 -0.10% 10.99 11.30 +2.82% 97.86 97.90 +0.04% 11.56 11.41 -1.30% 98.73 98.73 +0.00% 7.57 7.63 +0.79% spam mean and sdev for all runs 98.05 98.05 +0.00% 10.20 10.22 +0.20% ham/spam mean difference: 97.25 97.25 +0.00 And a table for the unsures: filename: basebis eng_bis eng_bi10s ham:spam: 3668:4099 3668:4099 3668:4099 fp total: 1 1 1 fp %: 0.03 0.03 0.03 fn total: 20 20 22 fn %: 0.49 0.49 0.54 unsure t: 207 209 206 unsure %: 2.67 2.69 2.65 real cost: $71.40 $71.80 $73.20 best 
cost: $65.60 $64.00 $64.00 h mean: 0.80 0.79 0.80 h sdev: 6.01 6.01 6.03 s mean: 98.05 98.04 98.05 s sdev: 10.20 10.21 10.22 mean diff: 97.25 97.25 97.25 k: 6.00 6.00 5.98 If you want to test this and use the same dictionary I did, then you can get it here: . (1.5Mb). It was just a random one I found, though - I'm not claiming that it's fantastic or anything :) =Tony Meyer From ta-meyer at ihug.co.nz Wed Jan 7 18:27:24 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 18:27:30 2004 Subject: [spambayes-dev] [ 817813 ] Consider bad spelling a sign of spam Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677805@its-xchg4.massey.ac.nz> Opps - forgot the diff :) =Tony Meyer -------------- next part -------------- A non-text attachment was scrubbed... Name: dict.diff Type: application/octet-stream Size: 3294 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040108/a39816ba/dict.obj From ta-meyer at ihug.co.nz Wed Jan 7 18:46:57 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 18:47:02 2004 Subject: [spambayes-dev] [ 830290 ] url detection Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3A@its-xchg4.massey.ac.nz> With Skip's latest patch he observed no change (see the tracker ). Here are my results - with bigrams it's a slight loss, without, it's a slight win. -> tested 357 hams & 395 spams against 3311 hams & 3704 spams -> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams -> tested 357 hams & 395 spams against 3311 hams & 3704 spams -> tested 397 hams & 384 spams against 3271 hams & 3715 spams -> tested 385 hams & 433 spams against 3283 hams & 3666 spams -> tested 407 hams & 397 spams against 3261 hams & 3702 spams -> tested 350 hams & 412 spams against 3318 hams & 3687 spams -> tested 338 hams & 405 spams against 3330 hams & 3694 spams -> tested 359 hams & 416 spams against 3309 hams & 3683 spams -> tested 358 hams & 405 spams against 3310 hams & 3694 spams -> tested 348 hams & 411 spams against 3320 hams & 3688 spams -> tested 369 hams & 441 spams against 3299 hams & 3658 spams false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.559 tied 0.287 0.287 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 6 to 6 tied mean fp % went from 0.164881884948 to 0.164881884948 tied false negative percentages 0.253 0.253 tied 0.781 0.781 tied 0.462 0.462 tied 0.756 0.756 tied 0.243 0.243 tied 0.247 0.494 lost +100.00% 0.240 0.240 tied 0.494 0.494 tied 0.973 0.973 tied 0.454 0.454 tied won 0 times tied 9 times lost 1 times total unique fn went from 20 to 21 lost +5.00% mean fn % went from 0.490257037938 to 0.514948395963 lost +5.04% ham mean ham sdev 1.18 1.18 +0.00% 7.76 7.70 -0.77% 0.99 0.98 -1.01% 6.64 6.59 -0.75% 0.84 0.85 +1.19% 6.14 6.14 +0.00% 1.99 1.95 -2.01% 9.46 9.27 -2.01% 0.49 0.49 +0.00% 3.59 3.57 -0.56% 0.85 0.84 -1.18% 5.45 5.42 -0.55% 1.16 1.16 +0.00% 9.30 9.29 -0.11% 1.20 1.20 +0.00% 8.13 
8.11 -0.25% 1.55 1.50 -3.23% 8.05 7.89 -1.99% 0.47 0.46 -2.13% 3.22 3.10 -3.73% ham mean and sdev for all runs 1.08 1.07 -0.93% 7.13 7.06 -0.98% spam mean spam sdev 98.75 98.75 +0.00% 8.72 8.73 +0.11% 97.67 97.68 +0.01% 11.26 11.19 -0.62% 98.08 98.10 +0.02% 10.12 10.13 +0.10% 98.16 98.15 -0.01% 10.19 10.20 +0.10% 98.35 98.37 +0.02% 8.77 8.75 -0.23% 98.45 98.44 -0.01% 8.97 9.03 +0.67% 98.35 98.36 +0.01% 9.73 9.69 -0.41% 98.25 98.32 +0.07% 9.16 9.01 -1.64% 97.93 97.95 +0.02% 11.99 11.95 -0.33% 98.92 98.94 +0.02% 7.62 7.63 +0.13% spam mean and sdev for all runs 98.30 98.31 +0.01% 9.72 9.69 -0.31% ham/spam mean difference: 97.22 97.24 +0.02 And with bigrams: [same stat lines as above snipped] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.279 0.279 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied won 0 times tied 10 times lost 0 times total unique fp went from 1 to 1 tied mean fp % went from 0.0278551532033 to 0.0278551532033 tied false negative percentages 0.253 0.253 tied 1.042 1.042 tied 0.693 0.693 tied 0.252 0.252 tied 0.728 0.485 won -33.38% 0.000 0.000 tied 0.481 0.481 tied 0.494 0.494 tied 0.730 0.730 tied 0.227 0.227 tied won 1 times tied 9 times lost 0 times total unique fn went from 20 to 19 won -5.00% mean fn % went from 0.489899714703 to 0.465627870043 won -4.95% ham mean ham sdev 0.95 0.94 -1.05% 6.64 6.60 -0.60% 0.83 0.82 -1.20% 5.53 5.50 -0.54% 0.49 0.49 +0.00% 4.08 4.08 +0.00% 1.53 1.51 -1.31% 8.16 8.04 -1.47% 0.30 0.30 +0.00% 3.25 3.25 +0.00% 0.70 0.70 +0.00% 5.27 5.28 +0.19% 0.85 0.83 -2.35% 7.11 7.11 +0.00% 0.93 0.92 -1.08% 7.23 7.19 -0.55% 0.90 0.88 -2.22% 6.47 6.44 -0.46% 0.41 0.41 +0.00% 4.07 4.03 -0.98% ham mean and sdev for all runs 0.80 0.79 -1.25% 6.01 5.97 -0.67% spam mean spam sdev 98.71 98.72 +0.01% 7.83 7.81 -0.26% 97.38 97.39 +0.01% 12.55 12.49 -0.48% 97.78 97.78 +0.00% 11.09 11.10 +0.09% 97.89 97.88 -0.01% 10.49 10.50 +0.10% 97.90 97.93 +0.03% 10.03 10.02 -0.10% 98.32 98.34 +0.02% 8.63 8.57 -0.70% 98.19 98.20 +0.01% 10.21 10.20 -0.10% 97.68 97.77 +0.09% 10.99 10.72 -2.46% 97.86 97.87 +0.01% 11.56 11.53 -0.26% 98.73 98.75 +0.02% 7.57 7.57 +0.00% spam mean and sdev for all runs 98.05 98.07 +0.02% 10.20 10.15 -0.49% ham/spam mean difference: 97.25 97.28 +0.03 And a table.py for the unsures: filename: bases fancy_urls basebis fancy_url_bis ham:spam: 3668:4099 3668:4099 3668:4099 3668:4099 fp total: 6 6 1 1 fp %: 0.16 0.16 0.03 0.03 fn total: 20 21 20 19 fn %: 0.49 0.51 0.49 0.46 unsure t: 178 176 207 207 unsure %: 2.29 2.27 2.67 2.67 real cost: $115.60 $116.20 $71.40 $70.40 best cost: $93.00 $92.20 $65.60 $63.80 h mean: 1.08 1.07 0.80 0.79 h sdev: 7.13 7.06 6.01 5.97 s mean: 98.30 98.31 98.05 98.07 s sdev: 9.72 9.69 10.20 10.15 mean diff: 97.22 97.24 97.25 97.28 k: 5.77 5.81 6.00 6.03 =Tony Meyer From tameyer at ihug.co.nz Wed Jan 7 18:50:11 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 18:52:14 2004 Subject: [spambayes-dev] [ 830290 ] url detection In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CF3B@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677808@its-xchg4.massey.ac.nz> > Here are my results - with bigrams it's a slight loss, without, it's a > slight win. Opps. I mean, with bigrams it's a slight win, and without, a slight loss. 
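Skip's patch lives on the tracker rather than in this thread, but for anyone wondering what fancier URL handling looks like in general, the flavour is to break each URL into scheme, host and path pieces and emit them as separate clues. A rough sketch only - the "url-*" prefixes and the regexes here are invented and aren't necessarily what the patch does:

    import re
    from urlparse import urlsplit   # urllib.parse.urlsplit in later Pythons

    url_re = re.compile(r"https?://[^\s<>]+", re.IGNORECASE)

    def url_clues(text):
        for match in url_re.finditer(text):
            scheme, netloc, path = urlsplit(match.group(0))[:3]
            yield "url-scheme:" + scheme.lower()
            for piece in netloc.lower().split("."):
                yield "url-host:" + piece
            for piece in re.split(r"[/\-_.+=&?]+", path.lower()):
                if piece:
                    yield "url-path:" + piece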
=Tony Meyer From ta-meyer at ihug.co.nz Wed Jan 7 19:10:25 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 19:10:31 2004 Subject: [spambayes-dev] Mine_received_headers Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467780A@its-xchg4.massey.ac.nz> While I was testing, I figured I should do this one, too. This gives me terrible results! I haven't tested with the old ('reverse') method, so I don't know if that would have made a difference or not. Without bigrams: -> tested 357 hams & 395 spams against 3311 hams & 3704 spams [all other stat lines in the message snipped - they're all basically this] false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.246 0.246 tied 0.000 0.000 tied 0.000 0.000 tied 0.557 0.557 tied 0.559 0.279 won -50.09% 0.287 0.287 tied 0.000 0.000 tied won 1 times tied 9 times lost 0 times total unique fp went from 6 to 5 won -16.67% mean fp % went from 0.164881884948 to 0.136948924055 won -16.94% false negative percentages 0.253 0.506 lost +100.00% 0.781 1.042 lost +33.42% 0.462 1.155 lost +150.00% 0.756 0.756 tied 0.243 0.243 tied 0.247 0.494 lost +100.00% 0.240 1.202 lost +400.83% 0.494 1.235 lost +150.00% 0.973 1.217 lost +25.08% 0.454 0.680 lost +49.78% won 0 times tied 2 times lost 8 times total unique fn went from 20 to 35 lost +75.00% mean fn % went from 0.490257037938 to 0.852825140424 lost +73.95% ham mean ham sdev 1.18 0.93 -21.19% 7.76 7.22 -6.96% 0.99 0.91 -8.08% 6.64 6.88 +3.61% 0.84 0.61 -27.38% 6.14 5.32 -13.36% 1.99 2.05 +3.02% 9.46 9.97 +5.39% 0.49 0.44 -10.20% 3.59 3.43 -4.46% 0.85 0.89 +4.71% 5.45 6.08 +11.56% 1.16 1.28 +10.34% 9.30 9.84 +5.81% 1.20 0.99 -17.50% 8.13 7.13 -12.30% 1.55 1.72 +10.97% 8.05 9.18 +14.04% 0.47 0.43 -8.51% 3.22 3.53 +9.63% ham mean and sdev for all runs 1.08 1.03 -4.63% 7.13 7.26 +1.82% spam mean spam sdev 98.75 98.79 +0.04% 8.72 8.79 +0.80% 97.67 96.87 -0.82% 11.26 13.66 +21.31% 98.08 96.97 -1.13% 10.12 13.46 +33.00% 98.16 97.75 -0.42% 10.19 11.56 +13.44% 98.35 98.28 -0.07% 8.77 9.05 +3.19% 98.45 98.10 -0.36% 8.97 10.13 +12.93% 98.35 97.70 -0.66% 9.73 12.49 +28.37% 98.25 97.91 -0.35% 9.16 11.41 +24.56% 97.93 97.17 -0.78% 11.99 13.97 +16.51% 98.92 98.66 -0.26% 7.62 9.34 +22.57% spam mean and sdev for all runs 98.30 97.82 -0.49% 9.72 11.55 +18.83% ham/spam mean difference: 97.22 96.79 -0.43 With bigrams: false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.246 lost +(was 0) 0.000 0.000 tied 0.000 0.000 tied 0.279 0.279 tied 0.000 0.000 tied 0.000 0.287 lost +(was 0) 0.000 0.000 tied won 0 times tied 8 times lost 2 times total unique fp went from 1 to 3 lost +200.00% mean fp % went from 0.0278551532033 to 0.0811608099572 lost +191.37% false negative percentages 0.253 0.506 lost +100.00% 1.042 1.302 lost +24.95% 0.693 1.155 lost +66.67% 0.252 0.504 lost +100.00% 0.728 0.243 won -66.62% 0.000 0.247 lost +(was 0) 0.481 0.962 lost +100.00% 0.494 1.235 lost +150.00% 0.730 1.217 lost +66.71% 0.227 0.227 tied won 1 times tied 1 times lost 8 times total unique fn went from 20 to 31 lost +55.00% mean fn % went from 0.489899714703 to 0.759596596728 lost +55.05% ham mean ham sdev 0.95 0.81 -14.74% 6.64 6.30 -5.12% 0.83 0.75 -9.64% 5.53 5.65 +2.17% 0.49 0.36 -26.53% 4.08 3.74 -8.33% 1.53 1.62 +5.88% 8.16 8.47 +3.80% 0.30 0.29 -3.33% 3.25 3.01 -7.38% 0.70 0.64 -8.57% 5.27 5.09 -3.42% 0.85 0.84 -1.18% 7.11 7.25 +1.97% 0.93 0.68 -26.88% 7.23 5.83 -19.36% 0.90 1.04 +15.56% 6.47 7.26 +12.21% 0.41 0.38 -7.32% 4.07 4.00 -1.72% ham mean and sdev for all runs 0.80 0.75 -6.25% 
6.01 5.93 -1.33% spam mean spam sdev 98.71 98.67 -0.04% 7.83 8.75 +11.75% 97.38 96.86 -0.53% 12.55 14.27 +13.71% 97.78 96.99 -0.81% 11.09 13.46 +21.37% 97.89 97.61 -0.29% 10.49 11.33 +8.01% 97.90 97.78 -0.12% 10.03 10.13 +1.00% 98.32 97.83 -0.50% 8.63 10.20 +18.19% 98.19 97.89 -0.31% 10.21 11.44 +12.05% 97.68 97.61 -0.07% 10.99 11.94 +8.64% 97.86 97.13 -0.75% 11.56 13.79 +19.29% 98.73 98.71 -0.02% 7.57 8.14 +7.53% spam mean and sdev for all runs 98.05 97.71 -0.35% 10.20 11.51 +12.84% ham/spam mean difference: 97.25 96.96 -0.29 Table: filename: bases mine_recs basebis mine_rec_bis ham:spam: 3668:4099 3668:4099 3668:4099 3668:4099 fp total: 6 5 1 3 fp %: 0.16 0.14 0.03 0.08 fn total: 20 35 20 31 fn %: 0.49 0.85 0.49 0.76 unsure t: 178 198 207 205 unsure %: 2.29 2.55 2.67 2.64 real cost: $115.60 $124.60 $71.40 $102.00 best cost: $93.00 $119.80 $65.60 $78.20 h mean: 1.08 1.03 0.80 0.75 h sdev: 7.13 7.26 6.01 5.93 s mean: 98.30 97.82 98.05 97.71 s sdev: 9.72 11.55 10.20 11.51 mean diff: 97.22 96.79 97.25 96.96 k: 5.77 5.15 6.00 5.56 =Tony Meyer From tameyer at ihug.co.nz Wed Jan 7 19:47:13 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 19:47:19 2004 Subject: [spambayes-dev] setup_all: dialogs.resources.dialogs missing In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CE8B@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467780D@its-xchg4.massey.ac.nz> > setup_all.py contains the line: > > includes = "dialogs.resources.dialogs" > > to load the Outlook dialog stuff. Unfortunately, > Outlook2000\dialogs\resources\dialogs.py isn't generated > until you register and load the add-in from that source tree, > so it comes up missing when I try to run setup_all.py on a > clean source tree. > > Any way to get setup_all.py to generate this file before > attempting to include it? I've come across this too. I've checked in a fix - Adam/Mark, if this is not the right fix, please fix the fix ;) =Tony Meyer From spambayes at whateley.com Wed Jan 7 21:59:59 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Wed Jan 7 22:00:18 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13046777FD@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13046777FD@its-xchg4.massey.ac.nz> Message-ID: <200401071900.11680.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wednesday 07 January 2004 01:44 pm, Tony Meyer wrote: > > That is what I was expecting to happen. I had installed the > > resource package, but can't seem to be able to get it to > > build automatically. > > Is resourcepackage definitely installed? If you run python and type > "import resourcepackage", do you get an ImportError? It *should* just > work, if it's installed. > Yes, resource package was correctly installed. I got it to do what I wanted by typing python __init__.py in the resource directory. I was expecting the "setup.py build" or "setup.py install" to do the dirty work for me. I'll assume I was wrong unless somebody tells me different (then I may try to figure out what is going wrong. > I'm not sure what you'll have to do - the resourcepackage documentation > would have more details. I'm sure it would be too much hassle to bother > with, though . I tried that... my 10 minutes of python experience didn't lead to any great revelations when consulting the manual ! Thanks both of you for the help, Brendon. 
-----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP/zHsJuupqACStRwEQIkiwCgynpmUe93FQUHJf03QIC8kh/KL10AoOAz UsIyWvw7MM9bvTSIZt17h3M+ =qvdK -----END PGP SIGNATURE----- From ta-meyer at ihug.co.nz Wed Jan 7 22:01:35 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 22:01:41 2004 Subject: [spambayes-dev] Incremental training results Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> I finally got around to having a go at the incremental training setup today. I *think* I got it working, and I think I kinda understand what the results are telling me. The graphs are here: If someone (Alex?) would like to quickly eyeball them and say whether they look like they might be right that would be cool :) I also had a stab at creating a regime, which might possibly be all wrong :) The idea is to do the same as the 'self train' corrected regime, but keep in balance. If I have the idea right, the corrected regime trains all incoming mail as whatever the classifier thinks it is, with corrections at the end of the day. This does the same, except that it doesn't train messages if it would result in an imbalance of more than 2::1. The code for the regime is also on the webpage. (All the testing is with the default option settings). =Tony Meyer From tameyer at ihug.co.nz Wed Jan 7 22:04:37 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 7 22:04:42 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CF80@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467780F@its-xchg4.massey.ac.nz> > Yes, resource package was correctly installed. I got it to > do what I wanted by typing python __init__.py in the resource > directory. I was expecting the "setup.py build" or > "setup.py install" to do the dirty work for me. I'll > assume I was wrong unless somebody tells me different (then I > may try to figure out what is going wrong. You were wrong :) It gets called whenever it's imported, but setup.py doesn't import it, it just copies it. If you ran anything that used it (sb_server, for example) then __init__.py should get imported, and so the file will get recreated. This is much more useful than having to run setup.py, of course, since you almost never need to do that - this way it just updates whenever it's needed. Good to hear that it's working, anyway :) =Tony Meyer From spambayes at whateley.com Wed Jan 7 22:15:15 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Wed Jan 7 22:15:54 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467780F@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130467780F@its-xchg4.massey.ac.nz> Message-ID: <200401071915.23904.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 On Wednesday 07 January 2004 07:04 pm, Tony Meyer wrote: > You were wrong :) I was kind of expecting that :-) > It gets called whenever it's imported, but setup.py > doesn't import it, it just copies it. If you ran anything that used it > (sb_server, for example) then __init__.py should get imported, and so the > file will get recreated. > > This is much more useful than having to run setup.py, of course, since you > almost never need to do that - this way it just updates whenever it's > needed. This is where my confusion came from. The install does not copy the resource _source_ so ui_html.py never gets built. 
Perhaps the message is for me to stop messing around with my live installation? :-) Brendon. -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBP/zLQ5uupqACStRwEQIE9gCgpVSixdKtY4Tj+CvwvXCYFqnTah0AnR44 muUV7LWf64cMooKQr8OG2jFF =ZLJs -----END PGP SIGNATURE----- From skip at pobox.com Wed Jan 7 22:16:15 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 7 22:16:37 2004 Subject: [spambayes-dev] Mine_received_headers In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130467780A@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130467780A@its-xchg4.massey.ac.nz> Message-ID: <16380.52095.921091.696319@montanaro.dyndns.org> Tony> While I was testing, I figured I should do this one, too. This Tony> gives me terrible results! Hmmm... Same for me. I'm pretty sure it used to do better: filename: std mine bigrams minebi ham:spam: 1200:1268 1200:1268 1200:1268 1200:1268 fp total: 2 1 4 4 fp %: 0.17 0.08 0.33 0.33 fn total: 64 86 51 66 fn %: 5.05 6.78 4.02 5.21 unsure t: 345 319 267 255 unsure %: 13.98 12.93 10.82 10.33 real cost: $153.00 $159.80 $144.40 $157.00 best cost: $119.60 $115.40 $103.80 $107.20 h mean: 2.36 1.79 2.32 1.70 h sdev: 9.72 8.74 10.34 8.80 s mean: 85.55 84.44 89.09 87.80 s sdev: 26.06 27.85 23.48 25.33 mean diff: 83.19 82.65 86.77 86.10 k: 2.33 2.26 2.57 2.52 Skip From skip at pobox.com Wed Jan 7 22:35:52 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 7 22:36:14 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> Message-ID: <16380.53272.656479.530301@montanaro.dyndns.org> Tony> I finally got around to having a go at the incremental training Tony> setup today. I *think* I got it working, and I think I kinda Tony> understand what the results are telling me. Well, I don't... I guess I wasn't paying attention that day. :-( I don't understand what's different between each of the graphs or what they purport to measure. Can you provide a little background? Thx, Skip From popiel at wolfskeep.com Thu Jan 8 00:36:31 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Jan 8 00:36:38 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: Message from "Tony Meyer" of "Thu, 08 Jan 2004 16:01:35 +1300." <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> Message-ID: <20040108053631.16A792DF1A@cashew.wolfskeep.com> In message: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3B@its-xchg4.massey.ac.nz> "Tony Meyer" writes: >I finally got around to having a go at the incremental training setup >today. Huzzah! >I *think* I got it working, and I think I kinda understand what the >results are telling me. > >The graphs are here: > > Hrm. You don't have the X axis labeled; what units is it using? Days (or rather, groups) as I did? What happens at about 250 to pull it out of what looks like an approximation of an inverse function (with all data lines overlapping) to a very distinct set of separate lines? Can you post the changes to mkgraph.py? >If someone (Alex?) would like to quickly eyeball them and say whether >they look like they might be right that would be cool :) They look a bit bizarre to me, with that dramatic behaviour change at 250. >I also had a stab at creating a regime, which might possibly be all >wrong :) Your regime looks fine to me. 
>(All the testing is with the default option settings). OK. The incremental harness is built to do all 10 classifiers at once (for the input sans each set) by default. There's a command line option to do just one classifier (excluding a specified set), which I always use (my machine doesn't have the memory to hold all 10 classifiers at once). I'm guessing that you used the former (default) behaviour... and it's been long enough since I wrote it that I have no idea what that would do in conjunction with mkgraph.py. That might be what's making the graphs look odd to me. The 10-day span thing means 'Taking the data (ham count, spam count, unsure, fp, fn, etc.) for 10 days at once, compute the total value within the window and plot it. Use a sliding window, so that for each day, drop out the data 10 days old as you add in the data for the next day.' Using a 3-day span (to make the equations smaller), if you had the data: day: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 ham: 1 3 2 4 2 5 6 2 3 2 3 2 1 5 spam: 8 9 8 3 9 8 9 8 9 8 3 7 8 9 fn: 2 1 2 1 2 1 2 3 2 3 2 1 3 1 Then you'd get the following values plotted: day 1: fn % = (0 + 0 + 2) / ((0 + 0 + 1) + (0 + 0 + 8)) = 22.2% day 2: fn % = (0 + 2 + 1) / ((0 + 1 + 3) + (0 + 8 + 9)) = 14.3% day 3: fn % = (2 + 1 + 2) / ((1 + 3 + 2) + (8 + 9 + 8)) = 16.1% day 4: fn % = (1 + 2 + 1) / ((3 + 2 + 4) + (9 + 8 + 3)) = 13.8% etc. The span plots give some idea of 'what is the performance at this time, as the user would experience it', whereas the cumulative plots show, well, the overall numbers as they mature. - Alex From rcharbon at mitre.org Thu Jan 8 09:22:29 2004 From: rcharbon at mitre.org (Ray Charbonneau) Date: Thu Jan 8 09:22:35 2004 Subject: [spambayes-dev] Doc error Message-ID: <001a01c3d5f2$d9255410$1b375381@MITRE.ORG> In the Outlook plug-in 0081, the about page has a bad link: file:///e:/src/spambayes/Outlook2000/about.html#Field "If you want to add the Spam field to your Outlook views, follow these instructions." Tha same bad link is in: C:\Program Files\Spambayes Outlook Addin\docs\welcome.html -- Ray Charbonneau R107 - Enterprise Desktop Solutions The MITRE Corporation From kennypitt at hotmail.com Thu Jan 8 10:15:32 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Jan 8 10:16:17 2004 Subject: [spambayes-dev] Doc error In-Reply-To: <001a01c3d5f2$d9255410$1b375381@MITRE.ORG> Message-ID: Ray Charbonneau wrote: > In the Outlook plug-in 0081, the about page has a bad link: > file:///e:/src/spambayes/Outlook2000/about.html#Field > "If you want to add the Spam field to your Outlook views, follow these > instructions." > Tha same bad link is in: > C:\Program Files\Spambayes Outlook Addin\docs\welcome.html Thanks for the report. It's already been fixed in the source, so should be corrected in the next release. -- Kenny Pitt From rcharbon at mitre.org Thu Jan 8 13:42:31 2004 From: rcharbon at mitre.org (Ray Charbonneau) Date: Thu Jan 8 13:42:48 2004 Subject: [spambayes-dev] Program notes... Message-ID: <00ec01c3d617$2c719920$1b375381@MITRE.ORG> Outlook 2002, plug-in v 0081: 1) When I "Recover from Spam" messages scored as "unsure" or "spam", the messages return to my Personal Folders Inbox. Since I'm using Outlook as an IMAP client, I'd like messages to return to the IMAP Inbox. Since an IMAP PST can't be the default message store, I suppose the request should read "return to a specified folder". 2) My plug-in is configured to watch both the IMAP and the local Inbox. 
I'd like to remove the local Inbox, but I don't see any way to do so (other than by resetting everything). 3) Please advise as to the most appropriate place to send comments like these, if it's not this address. -- Ray Charbonneau R107 - Enterprise Desktop Solutions The MITRE Corporation From tim.one at comcast.net Thu Jan 8 16:29:34 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Jan 8 16:29:41 2004 Subject: [spambayes-dev] RE: [Spambayes] SpamBayes old and new In-Reply-To: Message-ID: [followups to spambayes-dev@python.org please, since it would get increasingly technical beyond this point] [Simone Piunno] >> Just out of curiosity, I've read this essay by Greg Louis: >> >> http://www.bgl.nu/bogofilter/bayes.html >> >> I find it has interesting considerations on the balance problem. >> Did you know this essay? Have you ever tried how it works? [Tim Peters] > ... > Alex here did a relevant experiment, but the report is lacking some > needed detail: http://mail.python.org/pipermail/spambayes-dev/2003-November/001592.html I ran a test on my own recent email mix, using current Outlook addin defaults. "base" is the current code. "bycount" replaces one line in classifier.py, from prob = spamratio / (hamratio + spamratio) to prob = float(spamcount) / (spamcount + hamcount) Results are certainly ... remarkable. Since my incoming email is naturally unbalanced in a 4::1 ham::spam ratio lately, it's a more interesting test than Greg's nearly-balanced test: base -> bycount -> tested 528 hams & 130 spams against 4752 hams & 1170 spams <19 repetitions deleted> false positive percentages 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.000 0.000 tied 0.189 0.000 won -100.00% 0.189 0.000 won -100.00% 0.379 0.000 won -100.00% 0.000 0.000 tied 0.000 0.000 tied won 3 times tied 7 times lost 0 times total unique fp went from 4 to 0 won -100.00% mean fp % went from 0.0757575757576 to 0.0 won -100.00% false negative percentages 0.769 16.154 lost +2000.65% 0.769 23.077 lost +2900.91% 0.000 19.231 lost +(was 0) 0.769 23.077 lost +2900.91% 0.769 23.846 lost +3000.91% 0.769 16.154 lost +2000.65% 1.538 26.923 lost +1650.52% 0.000 12.308 lost +(was 0) 1.538 20.000 lost +1200.39% 1.538 17.692 lost +1050.33% won 0 times tied 0 times lost 10 times total unique fn went from 11 to 258 lost +2245.45% mean fn % went from 0.846153846153 to 19.8461538462 lost +2245.45% ham mean ham sdev 0.38 0.00 -100.00% 3.57 0.00 -100.00% 0.34 0.00 -100.00% 3.70 0.09 -97.57% 0.07 0.00 -100.00% 0.85 0.00 -100.00% 0.03 0.00 -100.00% 0.43 0.00 -100.00% 0.34 0.00 -100.00% 4.08 0.01 -99.75% 0.26 0.00 -100.00% 4.36 0.00 -100.00% 0.28 0.00 -100.00% 4.32 0.00 -100.00% 0.55 0.00 -100.00% 6.44 0.00 -100.00% 0.28 0.00 -100.00% 3.40 0.00 -100.00% 0.29 0.00 -100.00% 3.24 0.00 -100.00% ham mean and sdev for all runs 0.28 0.00 -100.00% 3.81 0.03 -99.21% spam mean spam sdev 96.12 63.99 -33.43% 14.01 32.86 +134.55% 97.15 58.20 -40.09% 12.56 35.04 +178.98% 97.58 58.34 -40.21% 8.75 34.93 +299.20% 97.72 58.61 -40.02% 10.38 36.75 +254.05% 97.07 57.33 -40.94% 11.68 35.77 +206.25% 97.00 61.26 -36.85% 13.01 33.07 +154.19% 95.36 55.46 -41.84% 15.45 37.77 +144.47% 97.54 67.03 -31.28% 10.86 31.88 +193.55% 96.34 60.80 -36.89% 14.94 34.05 +127.91% 95.81 60.84 -36.50% 14.94 33.66 +125.30% spam mean and sdev for all runs 96.77 60.19 -37.80% 12.86 34.77 +170.37% ham/spam mean difference: 96.49 60.19 -36.30 filename: base bycount ham:spam: 5280:1300 5280:1300 fp total: 4 0 fp %: 0.08 0.00 fn total: 11 258 fn %: 0.85 19.85 unsure t: 101 660 
unsure %: 1.53 10.03 real cost: $71.20 $390.00 best cost: $53.00 $147.60 h mean: 0.28 0.00 h sdev: 3.81 0.03 s mean: 96.77 60.19 s sdev: 12.86 34.77 mean diff: 96.49 60.19 k: 5.79 1.73 Overall, since I have a lot more ham than spam now, when computing initial spamprob guess by raw counts instead of by corpus-relative ratios everything ends up looking hammier; if I had a lot more spam than ham instead, everything would end up looking spammier. As a result of everything looking hammier, the ham and spam means both plummet, the spam variance skyrockets, there are fewer false positives, almost-astonishingly more false negatives, and about half the spam scored as unsure: Ham: 5280 (100.00%) ok, 0 (0.00%) unsure, 0 (0.00%) fp Spam: 382 (29.38%) ok, 660 (50.77%) unsure, 258 (19.85%) fn Every ham was classed as ham (no FP, no unsures), but that was at the expense of only 30% of the spam getting classed as spam, and 20% of it getting classed as ham. So, in all, this experiment agreed with what Alex reported earlier: > basing the prob on the raw counts instead of the ratios is > an incredibly clearcut loss. Only won twice on the false positives > (by relatively small margins), but lost EVERY time on the false > negatives by large amounts. I should note that this test was run against *all* the email I've received recently, so it's not that the ham::spam ratio used in the test differed from what I see in real life. From popiel at wolfskeep.com Thu Jan 8 16:51:01 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Jan 8 16:51:06 2004 Subject: [spambayes-dev] RE: [Spambayes] SpamBayes old and new In-Reply-To: Message from "Tim Peters" of "Thu, 08 Jan 2004 16:29:34 EST." References: Message-ID: <20040108215101.942FD2DF17@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >[Tim Peters] >> ... >> Alex here did a relevant experiment, but the report is lacking some >> needed detail: > >http://mail.python.org/pipermail/spambayes-dev/2003-November/001592.html Sorry, yes, I should have included the comparison report just for completeness; however, since prior discussion had suggested the change was a bad idea and because the results were quite so dramatic (and I was in a hurry), I didn't bother. In any case, my result looked remarkably like Tim's, so I'll just let his be the canonical example. ;-) - Alex From tim.one at comcast.net Thu Jan 8 18:04:05 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Jan 8 18:04:11 2004 Subject: [spambayes-dev] RE: [Spambayes] SpamBayes old and new In-Reply-To: <20040108215101.942FD2DF17@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > Sorry, yes, I should have included the comparison report just for > completeness; however, since prior discussion had suggested the change > was a bad idea and because the results were quite so dramatic (and I > was in a hurry), I didn't bother. > > In any case, my result looked remarkably like Tim's, so I'll just let > his be the canonical example. ;-) You don't get off that easy, Alex . In http://mail.python.org/pipermail/spambayes-dev/2003-November/001581.html you said: I'm currently testing against my RL data, which is between 60% and 70% spam overall (rising to about 90% spam in recent weeks). so your ratio favored spam, at about 1::2 ham::spam, while my ratio strongly favored ham, at about 4::1 ham::spam. It's fishy then that we would both see reduction in FP and massive increase in FN -- our biases were in different directions. 
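Plugging made-up numbers into the two one-liners quoted earlier in the thread shows where at least the ham-heavy half of that comes from (this ignores everything else the classifier does to the initial guess):

    def prob_by_ratio(spamcount, hamcount, nspam, nham):
        # current classifier.py behaviour: normalise by corpus size first
        spamratio = spamcount / float(nspam)
        hamratio = hamcount / float(nham)
        return spamratio / (hamratio + spamratio)

    def prob_by_count(spamcount, hamcount):
        # the "bycount" one-line replacement
        return float(spamcount) / (spamcount + hamcount)

    # A word appearing in 10% of ham and 10% of spam, trained on a 4::1
    # ham::spam mix (4000 ham, 1000 spam) -- numbers invented:
    # prob_by_ratio(100, 400, 1000, 4000) -> 0.5: neutral, as it should be
    # prob_by_count(100, 400)             -> 0.2: hammy just because ham dominates

Swap the corpus sizes and the same word comes out looking spammy instead.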
However, since the basic idea appears to suck no matter who tests it, I'll forgive you if you don't want to beat it to death . From tameyer at ihug.co.nz Thu Jan 8 18:23:16 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 8 18:23:21 2004 Subject: [spambayes-dev] Program notes... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7C98A@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677811@its-xchg4.massey.ac.nz> > 1) When I "Recover from Spam" messages scored as "unsure" or > "spam", the messages return to my Personal Folders Inbox. > Since I'm using Outlook as an IMAP client, I'd like messages > to return to the IMAP Inbox. Since an IMAP PST can't be the > default message store, I suppose the request should read > "return to a specified folder". "Recover from spam" tries to return messages to wherever they came from. The problem is that to do this it needs to be able to get & save certain information and it sometimes has troubles with this - particularly with IMAP and Hotmail accounts. I suspect this is what is happening here, and, if so, we don't really have a solution at present. IIRC, your log file would indicate that this is what is happening (maybe at a higher verbosity than the default?). > 2) My plug-in is configured to watch both the IMAP and the > local Inbox. I'd like to remove the local Inbox, but I don't > see any way to do so (other than by resetting everything). Open up the SpamBayes Manager dialog, go to the Filtering tab, click the top Browse button, and untick the box next to the local Inbox. > 3) Please advise as to the most appropriate place to send > comments like these, if it's not this address. spambayes@python.org. spambayes-dev is for discussion of SpamBayes development only, not troubleshooting. =Tony Meyer --- Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy. From tameyer at ihug.co.nz Thu Jan 8 18:35:27 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 8 18:36:22 2004 Subject: [spambayes-dev] Rebuilding Resources Question In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CF8A@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A3C@its-xchg4.massey.ac.nz> > This is where my confusion came from. The install does not > copy the resource _source_ so ui_html.py never gets built. This is deliberate - there's no need for the ui.html file to be installed, because in normal circumstances the ui_html.py file never needs to get rebuilt. Same with all the images. > Perhaps the message is for me to stop messing around with > my live installation? :-) There's nothing wrong with messing around with your live installation! (I'm sure many of us here do that - I'm using so many x- options with my Outlook install that I have no idea which ones are doing me any good ). If you are planning on 'messing around' with stuff, though, it's probably better to forget about using setup.py install. Instead, use a cvs checkout - you can checkout the spambayes/spambayes directory right into your Python site-packages directory if you like. That way you'll have all the files. Same sort of thing goes for the scripts, or you can just run them from somewhere else. 
Personally, I don't have any spambayes stuff in the Python directories - instead I have 4 (at the moment) different spambayes directories (all cvs checkouts) - one for running timcv tests (so it's identical to cvs apart from whatever is being tested at the time, plus has a data directory), one for running incremental tests (so is identical to cvs plus has a data directory), one for general development (which also is the one in day-to-day use), and one that's unchanged from cvs for me to diff against.  To use, I just set PYTHONPATH in whatever console window I'm accessing them from, before I start doing anything - that way they're all nicely separate (and by not having any in the site-packages directory, I avoid accidentally importing from there).

=Tony Meyer

From popiel at wolfskeep.com Thu Jan 8 18:58:26 2004
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Thu Jan 8 18:58:30 2004
Subject: [spambayes-dev] RE: [Spambayes] SpamBayes old and new
In-Reply-To: Message from "Tim Peters" of "Thu, 08 Jan 2004 18:04:05 EST."
References:
Message-ID: <20040108235826.DACB02DF17@cashew.wolfskeep.com>

In message:
"Tim Peters" writes:
>
>You don't get off that easy, Alex .

Well, drat.

OK, here's the output file that I dug out of mothballs.  I don't have the dataset that this was generated from anymore, though...  I've rebuilt since then.

- Alex

output/newnormals.txt -> output/probbycounts.txt

-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2151 hams & 5576 spams against 19367 hams & 50183 spams
-> tested 2151 hams & 5575 spams against 19367 hams & 50184 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams
-> tested 2152 hams & 5576 spams against 19366 hams & 50183 spams

false positive percentages
    0.000  0.000  tied
    0.000  0.000  tied
    0.186  0.000  won   -100.00%
    0.093  0.000  won   -100.00%
    0.000  0.000  tied
    0.000  0.000  tied

won   2 times
tied  4 times
lost  0 times

total unique fp went from 10 to 0  won  -100.00%
mean fp % went from 0.0464727221194 to 0.0  won  -100.00%

false negative percentages
    0.287  65.082  lost  +22576.66%
    0.412  64.311  lost  +15509.47%
    0.287  66.159  lost  +22951.92%
    0.305  65.531  lost  +21385.57%
    0.251  64.778  lost  +25707.97%
    0.377  64.527  lost  +17015.92%

won   0 times
tied  0 times
lost  6 times

total unique fn went from 171 to 21768  lost  +12629.82%
mean fn % went from 0.306676274359 to 65.0645624103  lost  +21116.04%

[info about ham & spam means & sdevs not available in both files]

From ta-meyer at ihug.co.nz Thu Jan 8 20:37:42 2004
From: ta-meyer at ihug.co.nz (Tony Meyer)
Date: Thu Jan 8 20:37:51 2004
Subject: [spambayes-dev] Incremental training results
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CFBF@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677813@its-xchg4.massey.ac.nz>

> Hrm.
You don't have the X axis labeled; what units is it > using? Days (or rather, groups) as I did? I think so, almost. I think that I made a mistake in the graphing and days (groups) without any mail are skipped. At a guess, the mail isn't all that consistent in the 0000 to 0600 groups, and (see below) is almost all ham, and then from 0600 to 1000 is pretty 'normal'. If about two-thirds of the 0000 to 0600 groups aren't graphed, then the graph makes sense (in it's own confusing way :) again. I'll fix this for the next attempt. > What happens at > about 250 to pull it out of what looks like an approximation > of an inverse function (with all data lines overlapping) to a > very distinct set of separate lines? I wondered that too, but didn't have a chance to investigate yesterday. I see now that it's an artefact of the data I was using - the ham I was using goes back further than my spam. Spam filenames start at about 0600 (except for one oddball at 0354), while ham is from 0000. So until that day, I was training and classifying ham only :) I'll fix this so that the data starts with a ham/spam mix. (I've been archiving ham much longer than spam, so this just means not going back as far, although if I use all the mail that I've kept from that period it will have much more spam than ham). > Can you post the changes to mkgraph.py? Will do. (BTW I tried to find a copy of plotmtv that I could run on Windows or under cygwin, or something else that would and could read mtv files, but was unsuccessful - does anyone know of anything?). > >I also had a stab at creating a regime, which might possibly be all > >wrong :) > > Your regime looks fine to me. As least I understood something! I guess that means that, for me at least, that's not the training regime to use (assuming that things don't change in my next, more informed, attempt). > OK. The incremental harness is built to do all 10 > classifiers at once (for the input sans each set) by default. > There's a command line option to do just one classifier > (excluding a specified set), which I always use (my machine > doesn't have the memory to hold all 10 classifiers at once). > I'm guessing that you used the former (default) behaviour... Yes, I did. I'll change this, too, since it makes the running easier. This is the -s option, I presume. > The 10-day span thing means [...] > The span plots give some idea of 'what is the performance at > this time, as the user would experience it', whereas the > cumulative plots show, well, the overall numbers as they mature. Thanks for that. I had something like this in my head, I think, but I understand it now. I'll post new versions when I get them done. =Tony Meyer From tameyer at ihug.co.nz Fri Jan 9 01:32:08 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Fri Jan 9 01:32:15 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> > Well, I don't... I guess I wasn't paying attention that day. > :-( I don't understand what's different between each of the > graphs or what they purport to measure. Can you provide a > little background? Hopefully Alex's post did some of this. My new and improved results are at the same place the old ones were: There's some more explanation there, too, so it should be somewhat clearer what the graphs are telling me (and hopefully what they are telling me is what they mean ). 
=Tony Meyer From skip at pobox.com Fri Jan 9 09:38:36 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 9 09:38:58 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> Message-ID: <16382.48364.409795.370110@montanaro.dyndns.org> >> Well, I don't... I guess I wasn't paying attention that day. :-( I >> don't understand what's different between each of the graphs or what >> they purport to measure. Can you provide a little background? Tony> Hopefully Alex's post did some of this. My new and improved Tony> results are at the same place the old ones were: Tony> Tony> There's some more explanation there, too... Thanks for the extra info. Where do I find understandable definitions of the different training regimes ("perfect", "nonedge", "expire4months", "corrected", etc)? Even after reading incremental.HOWTO.txt and regimes.py in the testtools directory I don't understand what the different regimes mean. For instance, what is "perfect" training? How is it different from "nonedge"? What does "properly classified with extreme confidence" mean? Skip From popiel at wolfskeep.com Fri Jan 9 12:58:27 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Jan 9 12:58:31 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: Message from Skip Montanaro of "Fri, 09 Jan 2004 08:38:36 CST." <16382.48364.409795.370110@montanaro.dyndns.org> References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> Message-ID: <20040109175827.16BA02DE88@cashew.wolfskeep.com> In message: <16382.48364.409795.370110@montanaro.dyndns.org> Skip Montanaro writes: > >Thanks for the extra info. Where do I find understandable definitions of >the different training regimes ("perfect", "nonedge", "expire4months", >"corrected", etc)? Even after reading incremental.HOWTO.txt and regimes.py >in the testtools directory I don't understand what the different regimes >mean. For instance, what is "perfect" training? How is it different from >"nonedge"? What does "properly classified with extreme confidence" mean? Argh. Most of the confusion arises from a complete lack of documentation on the interface to the regimes: what their parameters mean, what the return code means, etc. I'll try to get to that soon... unless someone beats me to it. Reading incremental.py is pretty much required until such docs get written. 'perfect' and 'corrected' are both train-on-everything regimes. With 'perfect', the trainer is given perfect and immediate knowledge of the proper classification (as defined by location in the Data directory tree). With 'corrected', the trainer trusts the classifier result until end-of-group, at which point all mistrained (or non-trained) items (fp, fn, and unsure) are corrected to be trained with their proper classification. 'expire4months' is like 'perfect', except that messages are untrained after 120 groups have passed. 'nonedge', 'fpfnunsure', and 'fnunsure' are all partial-training regimes, where some messages are never trained on at all. 'nonedge' trains only on messages which are not properly classified with scores of 1.00 or 0.00 (rounded). False positives at 1.00 and false negatives at 0.00 _are_ trained. 
'fpfnunsure' only trains on fp, fn, and unsure. 'fnunsure' only trains on fn and unsure. - Alex From popiel at wolfskeep.com Fri Jan 9 13:07:21 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Jan 9 13:07:24 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: Message from "Tony Meyer" of "Fri, 09 Jan 2004 19:32:08 +1300." <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> Message-ID: <20040109180721.A4AC22DE88@cashew.wolfskeep.com> In message: <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> "Tony Meyer" writes: >> Well, I don't... I guess I wasn't paying attention that day. >> :-( I don't understand what's different between each of the >> graphs or what they purport to measure. Can you provide a >> little background? > >Hopefully Alex's post did some of this. My new and improved results are at >the same place the old ones were: > > > >There's some more explanation there, too, so it should be somewhat clearer >what the graphs are telling me (and hopefully what they are telling me is >what they mean ). Looks like good analysis to me. However, I'm still slightly confused by only one apparent run for each regime; I'd anticipate one run for each set excluded (and thus multiple instances of each of the lines on the corrected, nonedge, etc. graphs). For your data, it might be valuable to make balanced_corrected allow a 3:1 ratio in favor of ham, but only a 2:3 ratio in favor of spam. Further, it might be easier to isolate the balancing effects from the mistake-correction effects if you made a balanced_perfect regime, too. - Alex From kennypitt at hotmail.com Fri Jan 9 13:34:57 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Fri Jan 9 13:35:42 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <20040109175827.16BA02DE88@cashew.wolfskeep.com> Message-ID: T. Alexander Popiel wrote: > 'fpfnunsure' only trains on fp, fn, and unsure. 'fnunsure' only > trains on fn and unsure. OK, so 'fpfnunsure' sounds like what most people would consider "mistake-based" training, and certainly matches the style that the Outlook plugin uses if you only train with the toolbar. Just out of curiosity, what would 'fnunsure' translate to in terms of real-life training? Since SpamBayes' first priority is to avoid false positives, I can't imagine a training regime where you wouldn't correct it when it generates one. If it's only there for comparison purposes in testing, maybe adding 'fpunsure' would also be useful as the other side of the comparison to full 'fpfnunsure' training. -- Kenny Pitt From popiel at wolfskeep.com Fri Jan 9 13:43:42 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Jan 9 13:43:50 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: Message from "Kenny Pitt" of "Fri, 09 Jan 2004 13:34:57 EST." References: Message-ID: <20040109184342.1AC062DE88@cashew.wolfskeep.com> In message: "Kenny Pitt" writes: > >Just out of curiosity, what would 'fnunsure' translate to in terms of >real-life training? That would be mistake-based training where you never looked carefully enough at the spam folder to find anything... which was a scenario proposed about a year ago (particularly with auto-deletion of suspected spam). 
- Alex From anthony at interlink.com.au Fri Jan 9 13:43:47 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Fri Jan 9 13:44:14 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <20040109175827.16BA02DE88@cashew.wolfskeep.com> Message-ID: <200401091843.i09IhlSP010689@localhost.localdomain> >>> "T. Alexander Popiel" wrote > 'nonedge' trains only on messages which are not properly classified > with scores of 1.00 or 0.00 (rounded). False positives at 1.00 and > false negatives at 0.00 _are_ trained. One variant that has occurred to me for this would be to make sure that it always trains the same numbers of ham and spam. Since for most people nonedge ends up training far more spam, you pick a couple of random 0.00 hams to train with as well. -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Fri Jan 9 13:59:50 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 9 14:00:09 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <20040109175827.16BA02DE88@cashew.wolfskeep.com> References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> <20040109175827.16BA02DE88@cashew.wolfskeep.com> Message-ID: <16382.64038.33800.358466@montanaro.dyndns.org> >> Where do I find understandable definitions of the different training >> regimes ("perfect", "nonedge", "expire4months", "corrected", etc)? ... Alex> 'perfect' and 'corrected' are ... ... Alex, Thanks for the explanations. I'll try to update the how-to when I have a moment. I'd like to grab a bite of lunch first. Skip From skip at pobox.com Fri Jan 9 14:01:13 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 9 14:01:33 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <200401091843.i09IhlSP010689@localhost.localdomain> References: <20040109175827.16BA02DE88@cashew.wolfskeep.com> <200401091843.i09IhlSP010689@localhost.localdomain> Message-ID: <16382.64121.682654.683271@montanaro.dyndns.org> Anthony> One variant that has occurred to me for this would be to make Anthony> sure that it always trains the same numbers of ham and Anthony> spam. Since for most people nonedge ends up training far more Anthony> spam, you pick a couple of random 0.00 hams to train with as Anthony> well. Or bring the training threshold for spam down far enough so you are training on roughly the same numbers. You might start your thresholds at 0.01 and 0.99, then have the training regime automatically lower or raise the spam training threshold to keep the training sets roughly in balance. Skip From listsub at wickedgrey.com Fri Jan 9 17:39:12 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Fri Jan 9 17:42:56 2004 Subject: [spambayes-dev] Incremental training results References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> <20040109175827.16BA02DE88@cashew.wolfskeep.com> Message-ID: <3FFF2D90.1010805@wickedgrey.com> T. Alexander Popiel wrote: > > Argh. Most of the confusion arises from a complete lack of > documentation on the interface to the regimes: what their > parameters mean, what the return code means, etc. I'll try > to get to that soon... unless someone beats me to it. Reading > incremental.py is pretty much required until such docs get > written. 
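As a rough illustration of the balance-keeping ideas Anthony and Skip describe above, here is a hypothetical sketch of a nonedge-style training decision with an adaptive spam-side edge.  It is not code from incremental.py or any existing regime; the class and method names are invented.

    class BalancedNonEdge(object):
        """Train on a message unless it was already classified correctly
        with a rounded score of 0.00 (ham) or 1.00 (spam), and nudge the
        spam-side edge to keep the trained ham/spam counts roughly equal."""

        def __init__(self):
            self.ham_edge = 0.01    # ham scoring below this is "obvious"
            self.spam_edge = 0.99   # spam scoring above this is "obvious"
            self.trained_ham = 0
            self.trained_spam = 0

        def should_train(self, is_spam, score):
            if is_spam:
                train = score < self.spam_edge
            else:
                train = score > self.ham_edge
            if train:
                if is_spam:
                    self.trained_spam += 1
                else:
                    self.trained_ham += 1
                # If spam training is outpacing ham training, lower the
                # spam edge so fewer spam qualify next time (and raise it
                # back toward 0.99 when ham catches up).
                if self.trained_spam > self.trained_ham:
                    self.spam_edge = max(0.5, self.spam_edge - 0.01)
                elif self.trained_ham > self.trained_spam:
                    self.spam_edge = min(0.99, self.spam_edge + 0.01)
            return train

Misclassified messages (a false positive at 1.00, a false negative at 0.00) always qualify, as in the nonedge description above; only the already-obvious edge cases are skipped.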
Somewhat tangential, but... Last night I set up the default Data/{Ham,Spam}/SetN testing structure and was able to run incremental.py (with the balance_corrected regime added) on the lot of it. I have 164 * 10 ham and 54 * 10 spam. The spam rate has increased steadily since I started collecting - the first 10 spam took 100 days to come in (ahh the joys of a private domain name and practicing safe computing! Alas, those days are no more). I used a modified version of the dotest.sh script to run each set against each regime, which produced 70 graphs that, while nice, don't allow for easy comparative analysis*. The docs in the timtest.py and timcv.py don't imply any easy/automatic way to change .ini settings or regimes (I haven't gone through the code yet, however), but seem to be the standard for assessing the impact of a change to the tokenizer, etc. I'm wanting to cook up something that will take a list of .ini files (or Option objects, if I understand correctly - they are equivalent?) and a list of regimes and run all the combinations, outputting a few pretty graphs. The end goal is to produce a suite that easily tells a) what effect a regime change has on a range of .ini settings (or the reverse, an .ini change has on the various regimes) and more pragmatically b) what the "best" .ini options and regime are for my mail stream. We'll see how much happens this weekend. :) Any suggestions, ideas for features, pointers, etc.? Eli [*] - Though a few spikes in the FP line did lead me to find a few spam in my ham corpus that I had missed previously. ;) From skip at pobox.com Fri Jan 9 18:11:00 2004 From: skip at pobox.com (Skip Montanaro) Date: Fri Jan 9 18:11:32 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <3FFF2D90.1010805@wickedgrey.com> References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> <20040109175827.16BA02DE88@cashew.wolfskeep.com> <3FFF2D90.1010805@wickedgrey.com> Message-ID: <16383.13572.59723.544919@montanaro.dyndns.org> Eli> The docs in the timtest.py and timcv.py don't imply any Eli> easy/automatic way to change .ini settings or regimes (I haven't Eli> gone through the code yet, however), but seem to be the standard Eli> for assessing the impact of a change to the tokenizer, etc. You can set the BAYESCUSTOMIZE environment variable. Take a look at testtools/Makefile. Skip From popiel at wolfskeep.com Fri Jan 9 18:33:40 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri Jan 9 18:33:44 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: Message from "Eli Stevens (WG.c)" of "Fri, 09 Jan 2004 14:39:12 PST." <3FFF2D90.1010805@wickedgrey.com> References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> <20040109175827.16BA02DE88@cashew.wolfskeep.com> <3FFF2D90.1010805@wickedgrey.com> Message-ID: <20040109233340.55BC32DE88@cashew.wolfskeep.com> In message: <3FFF2D90.1010805@wickedgrey.com> "Eli Stevens (WG.c)" writes: > >I used a modified version of the dotest.sh script to run each set against >each regime, which produced 70 graphs that, while nice, don't allow for >easy comparative analysis*. Aye, there are not yet published tools for building the summary graphs. 
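To make Skip's BAYESCUSTOMIZE suggestion above concrete, here is a rough sketch of the environment-variable-plus-for-loop approach.  The .ini file names and the timcv.py arguments are placeholders; testtools/Makefile shows the real invocation.

    import os

    candidates = ["base.ini", "bigrams.ini", "newdefaults.ini"]  # made-up names

    for ini in candidates:
        # The spambayes options are read at import time, so each run needs
        # to be a fresh process to pick up the new BAYESCUSTOMIZE value.
        os.environ["BAYESCUSTOMIZE"] = ini
        os.system("python timcv.py -n 10 > %s.txt" % ini[:-4])

A second loop over regimes (or nesting the two) gives the all-combinations run described above.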
I've thrown together snippets of shell scripts to merge graphs, but that's not generally useful (since most of our comrades are running that blighted OS known as Windows, which doesn't even have a decent tail command!).  The simplest thing for merging the graphs (for me) is concatenating the result of tail +2... possibly with a bit of grep -v to remove redundant legend labels.

>I'm wanting to cook up something that will take a list of .ini files (or
>Option objects, if I understand correctly - they are equivalent?) and a
>list of regimes and run all the combinations, outputting a few pretty
>graphs.

As Skip mentioned, the BAYESCUSTOMIZE environment variable is your friend, here.  That and a couple of for loops should help a lot.

Enjoy!

- Alex

From listsub at wickedgrey.com Fri Jan 9 19:28:10 2004
From: listsub at wickedgrey.com (Eli Stevens (WG.c))
Date: Fri Jan 9 19:31:41 2004
Subject: [spambayes-dev] Incremental training results
References: <1ED4ECF91CDED24C8D012BCF2B034F130499CF95@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677816@its-xchg4.massey.ac.nz> <16382.48364.409795.370110@montanaro.dyndns.org> <20040109175827.16BA02DE88@cashew.wolfskeep.com> <3FFF2D90.1010805@wickedgrey.com> <20040109233340.55BC32DE88@cashew.wolfskeep.com>
Message-ID: <3FFF471A.4030000@wickedgrey.com>

T. Alexander Popiel wrote:
>
> Aye, there are not yet published tools for building the summary graphs.
> I've thrown together snippets of shell scripts to merge graphs, but
> that's not generally useful (since most of our comrades are running
> that blighted OS known as Windows, which doesn't even have a decent
> tail command!).

All the more reason to do as much as possible in Python. ;)  I must admit that I'm a bit of a polyplat(?) - at work and home, I run Windows and Linux side-by-side, so I don't usually think of those kinds of things.

Tony - the graphs on your web site looked suspiciously like Excel. ;)  Would having .csv output make life easy in terms of making graphs?

> As Skip mentioned, the BAYESCUSTOMIZE environment variable is your
> friend, here.  That and a couple of for loops should help a lot.

Yes, that was exactly the kind of thing I was looking for, my thanks to you both. :)

Eli

From listsub at wickedgrey.com Sat Jan 10 02:06:27 2004
From: listsub at wickedgrey.com (Eli Stevens)
Date: Sat Jan 10 02:05:34 2004
Subject: [spambayes-dev] testtools/mkgraph.py
Message-ID: <017f01c3d748$60aacf90$6401a8c0@kane>

I forgot to mention this earlier.  While playing with the mkgraph.py script last evening, I got the following error:

<>> python mkgraph.py < output/perfect1.out
$ Data=Curve2d
Traceback (most recent call last):
  File "mkgraph.py", line 172, in ?
    main()
  File "mkgraph.py", line 169, in main
    outputset()
  File "mkgraph.py", line 77, in outputset
    print '% toplabel="%s Error Rates"' % (title)
ValueError: unsupported format character 't' (0x74) at index 2

When I made the following simple change, things started working:

<>> cvs diff mkgraph.py
Index: mkgraph.py
===================================================================
RCS file: /cvsroot/spambayes/spambayes/testtools/mkgraph.py,v
retrieving revision 1.3
diff -r1.3 mkgraph.py
77c77
< print '% toplabel="%s Error Rates"' % (title)
---
> print '%% toplabel="%s Error Rates"' % (title)

I'd assume it's a typo, but look at how it's spelled... ;)  This line was introduced in revision 1.2.

Eli

From popiel at wolfskeep.com Sat Jan 10 15:49:13 2004
From: popiel at wolfskeep.com (T.
Alexander Popiel) Date: Sat Jan 10 15:49:18 2004 Subject: [spambayes-dev] testtools/mkgraph.py In-Reply-To: Message from "Eli Stevens" of "Fri, 09 Jan 2004 23:06:27 PST." <017f01c3d748$60aacf90$6401a8c0@kane> References: <017f01c3d748$60aacf90$6401a8c0@kane> Message-ID: <20040110204913.BD8BC2DF16@cashew.wolfskeep.com> In message: <017f01c3d748$60aacf90$6401a8c0@kane> "Eli Stevens" writes: > >Index: mkgraph.py >=================================================================== >RCS file: /cvsroot/spambayes/spambayes/testtools/mkgraph.py,v >retrieving revision 1.3 >diff -r1.3 mkgraph.py >77c77 >< print '% toplabel="%s Error Rates"' % (title) >--- >> print '%% toplabel="%s Error Rates"' % (title) > > >I'd assume it's a typo, but look at how it's spelled... ;) This line was >introduced in revision 1.2. Yes, that's a bug that I fixed in my local version, and forgot to check in. Sorry. - Alex From tameyer at ihug.co.nz Sat Jan 10 20:27:59 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jan 10 20:28:06 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CBF1@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467781C@its-xchg4.massey.ac.nz> [Alex] > Argh. Most of the confusion arises from a complete > lack of documentation on the interface to the regimes: > what their parameters mean, what the return code means, > etc. I'll try to get to that soon... unless someone > beats me to it. As I went along I made some local changes to the docstrings, which I'll check in in a moment. This is based on what things seem like from reading the scripts, and (more heavily) from Alex's posts here, so hopefully it's right (and if not, I'm sure Alex will fix it for us :). Hopefully that helps somewhat, although I haven't touched the readme (which still has the "someone please rewrite this" message at the end :) =Tony Meyer From tameyer at ihug.co.nz Sat Jan 10 20:44:16 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jan 10 20:44:21 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CC1E@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A40@its-xchg4.massey.ac.nz> > All the more reason to do as much as possible in Python. ;) I must > admit that I'm a bit of a polyplat(?) - at work and home, I > run Windows and Linux side-by-side, so I don't usually think of those > kinds of things. > > Tony - the graphs on your web site looked suspiciously like > Excel. ;) I really must try and disguise it more ;) I did try and get hold of plotmtv first, but then fell back on Excel (OpenOffice is much slower at graphing, on this machine at least). > Would having .csv output make life easy in terms of making graphs? Outputting to .csv is one of the modifications I've made to mkgraph.py (diff attached). I'm not ready to check this in, though, because it makes the script pretty ugly :) I'll tidy it up somewhat (moving things into separate functions rather than passing 'if' parameters) and then probably will. There are three other changes that I made, which I will include in some form when I check it in: 1. Added a docstring. 2. Added a -f command line arg to pass the file in ('mkgraph.py < file' wouldn't work here, for some reason). 3. Output each set in one set of rows, with each line in a separate column, rather than one line at a time. The other way was workable, but more effort to create a graph in Excel. 
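For anyone wondering what the column-per-line .csv layout might look like, here is a minimal sketch using the csv module.  This is not the actual mkgraph.py change; the series names and numbers are invented.

    import csv

    # One column per data line, one row per group.
    series = {
        "fp%":     [0.0, 0.1, 0.1, 0.0],
        "fn%":     [1.2, 0.9, 0.8, 0.7],
        "unsure%": [5.0, 4.2, 3.8, 3.5],
    }
    labels = ["fp%", "fn%", "unsure%"]

    writer = csv.writer(open("regime.csv", "wb"))
    writer.writerow(["group"] + labels)
    for i in range(len(series[labels[0]])):
        writer.writerow([i] + [series[name][i] for name in labels])

Excel (or OpenOffice) can then chart each column directly, without any transposing.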
=Tony Meyer From tameyer at ihug.co.nz Sat Jan 10 20:46:34 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jan 10 20:46:40 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CD31@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467781E@its-xchg4.massey.ac.nz> [Tony Meyer] > Outputting to .csv is one of the modifications I've made to > mkgraph.py (diff attached). No it wasn't. Here it is. =Tony Meyer -------------- next part -------------- A non-text attachment was scrubbed... Name: mkgraph.diff Type: application/octet-stream Size: 12276 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040111/61cf6b82/mkgraph.obj From tameyer at ihug.co.nz Sat Jan 10 22:26:22 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jan 10 22:26:27 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CC15@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467781F@its-xchg4.massey.ac.nz> > The docs in the timtest.py and timcv.py [...] > but seem to be the standard for assessing the > impact of a change to the tokenizer, etc. BTW, you'll almost certainly want to stick to timcv.py and ignore timtest.py. > I'm wanting to cook up something that will take a list of > .ini files (or Option objects, if I understand correctly > - they are equivalent?) More-or-less equivalent, yes. An Option object is what gets used, and an .ini file gets read into the Option object that's used by default. > and a list of regimes and run all the combinations, outputting > a few pretty graphs. The end goal is to produce a suite > that easily tells a) what effect a regime change has on > a range of .ini settings (or the reverse, an .ini change > has on the various regimes) and more pragmatically b) > what the "best" .ini options and regime are for my mail > stream. We'll see how much happens this weekend. :) I considered doing something like this at one point, but never got around to it. It'd be great to see the code if you do manage to get it done. Note that this sort of testing will take quite some time :) (Not that that's a reason not to do it!). =Tony Meyer From tameyer at ihug.co.nz Sat Jan 10 22:34:33 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sat Jan 10 22:34:58 2004 Subject: [spambayes-dev] Incremental training results In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CBF1@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A41@its-xchg4.massey.ac.nz> [Alex] > Looks like good analysis to me. > However, I'm still slightly confused by only one > apparent run for each regime; I'd anticipate one > run for each set excluded (and thus multiple > instances of each of the lines on the corrected, > nonedge, etc. graphs). I averaged them all out into a single line. For some reason I thought this was what the graphs on your website were like, and it made making the graph easy :), but looking more closely (the stuff all makes more sense now that I understand the incremental testing setup more!) I see that it's not. I'll do separate lines in future. > For your data, it might be valuable to make > balanced_corrected allow a 3:1 ratio in favor > of ham, but only a 2:3 ratio in favor of spam. Ok, I'll try that. > Further, it might be easier to isolate the > balancing effects from the mistake-correction > effects if you made a balanced_perfect regime, too. I'll try this, too. Thanks! 
=Tony Meyer From spambayes at whateley.com Sun Jan 11 02:02:39 2004 From: spambayes at whateley.com (Brendon Whateley) Date: Sun Jan 11 02:02:46 2004 Subject: [spambayes-dev] Upgraded proxy UI... Message-ID: <200401102302.42711.spambayes@whateley.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I've made some user interface changes for web training of spambayes. I like them and can prepare a patch if there is general interest... 1) I added two check boxes to the review page, next to the "Previous, Refresh, Next" buttons. The first is called "Trim Extremes" and the second is called "Rescore Messages". 2) I added two additional parameters to "html_ui" section of the options, they are 'ham_trim_level' and 'spam_trim_level'. They have defaults of 0.02 and 0.98 respectively. 3) "Trim Extremes", if checked removes messages that score closer to perfect than the respective trim level settings. This "declutters" the review for training on unsures and "non-well" scored ham/spam. 4) "Rescore messages" displays recalculated scores for all the messages instead of the original scores. This is useful for finding non-extreme messages in older untrained messages. It's particularly valuable when you score periodically. It means you don't score on many examples of the same spam received over a period of days... I welcome any comments. Thanks, Brendon. -----BEGIN PGP SIGNATURE----- Version: PGP 6.5.8 iQA/AwUBQAD1D5uupqACStRwEQL5tACg1pXqtHGQ8+AAl+dN7CYSXmCmSm4AoOxr c5Kh7ylOx1X6vRHpdS0JBSjU =xBLa -----END PGP SIGNATURE----- From ian at ibygrave.no-ip.org Sun Jan 11 19:35:15 2004 From: ian at ibygrave.no-ip.org (ian) Date: Sun Jan 11 19:47:33 2004 Subject: [spambayes-dev] Dibbler.py digest auth splitting fix Message-ID: <20040112003515.GA16459@wisp.ibygrave.no-ip.org> Hello, I'm new to spambayes. I've been spam-free for just a week :) I did have one problem with the web interface. Here is a patch I made to version 1.0a7 I found that the showclues pages failed with digest authentication. Where the browser sent an authorization lines like this for /home Authorization: Digest username="admin", realm="SpamBayes Web Interface", nonce="TW9uIEphbiAxMiAwMDoxMjo0MiAyMDA0", uri="/helmet.gif", algorithm=MD5, response="6cfc0f78933be05c07022772fcba4a5b", opaque="0000000000000000", qop=auth, nc=00000001, cnonce="4661a408d8400972". A line like this was sent for the failing pages Authorization: Digest username="admin", realm="SpamBayes Web Interface", nonce="TW9uIEphbiAxMiAwMDoyMDoyMCAyMDA0", uri="/showclues?key=1073651941-2&subject=spam,Desire%20more%20confidence?", algorithm=MD5, response="2c5c42fcd3d633e394d7d0c1bb1e8af3", opaque="0000000000000000", qop=auth, nc=00000001, cnonce="7e0a86e43b19e87b". The commas inside the uri value caused an exception in _HTTPHandler._digestAuthentication() when it tried to split the line on commas. --IAN -------------- next part -------------- --- /usr/lib/python2.2/site-packages/spambayes/Dibbler.py 2003-11-04 10:02:42.000000000 +0000 +++ spambayes/Dibbler.py 2004-01-11 23:34:52.000000000 +0000 @@ -340,6 +340,10 @@ for each incoming request, and does the job of decoding the HTTP traffic and driving the plugins.""" + # RE to extract option="value" fields from + # digest auth login field + _login_splitter = re.compile('([a-zA-Z])+=(".*?"|.*?),?') + def __init__(self, clientSocket, server, context): # Grumble: asynchat.__init__ doesn't take a 'map' argument, # hence the two-stage construction. 
@@ -609,7 +613,7 @@ def stripQuotes(s): return (s[0] == '"' and s[-1] == '"') and s[1:-1] or s - options = dict([s.split('=') for s in login.split(", ")]) + options = dict(self._login_splitter.findall(login)) userName = stripQuotes(options["username"]) password = self._server.getPasswordForUser(userName) nonce = stripQuotes(options["nonce"]) From tameyer at ihug.co.nz Sun Jan 11 23:04:20 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Sun Jan 11 23:04:27 2004 Subject: [spambayes-dev] Upgraded proxy UI... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7CDCF@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A45@its-xchg4.massey.ac.nz> > I've made some user interface changes for web training of > spambayes. I like them and can prepare a patch if there is > general interest... A patch would be great, thanks. It's probably best if you open a tracker on sourceforge and put it there, since it's easy to keep track of it. I'm not sure if everyone here reads the bugs list, so it's probably also wise to post a note pointing out the link to the tracker. > 1) I added two check boxes to the review page, next to the > "Previous, Refresh, Next" buttons. The first is called > "Trim Extremes" and the second is called "Rescore Messages". I'm not sure about these being checkboxes. Couldn't "Rescore Messages" just be a button that refreshes the current page, but recalculates the scores as well? That seems more intuitive to me, and simpler to use, too. Same sort of thing for "Trim Extremes" as well. > 3) "Trim Extremes", if checked removes messages that score > closer to perfect than the respective trim level settings. > This "declutters" the review for training on unsures and > "non-well" scored ham/spam. I'm not sure about this. How well would the method proposed (on spambayes@python.org, not here) by Anthony suit you instead? My concern is that if this is on the page by default, it'll get used when it shouldn't and people will miss false positives that score really high and false negatives that score really low (which are messages screaming out for training). Seeing the code might convince me, maybe. > 4) "Rescore messages" displays recalculated scores for all > the messages instead of the original scores. This is useful > for finding non-extreme messages in older untrained messages. > It's particularly valuable when you score periodically. > It means you don't score on many examples of the same spam > received over a period of days... This sounds like a useful addition, and was also just suggested by Anthony. (I should have suggested this in my reply to Anthony, too, but didn't think of it - hopefully he's reading this). Would an option, rather than a button, be better? i.e. a boolean option called something like "always_rescore". If False, everything is like now, if True, then the score displayed is always the one that the message would score at the time the page is loaded. I like this more than a button, but then if you're thinking that you'd sometimes want the current score and sometimes the (faster!) original score, then a button would be better. =Tony Meyer From listsub at wickedgrey.com Mon Jan 12 05:20:32 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Mon Jan 12 05:19:18 2004 Subject: [spambayes-dev] Pure Python CV comparison proggy Message-ID: <00eb01c3d8f5$d6293dc0$6401a8c0@kane> Hey all, I've got a multi- .ini frontend/wrapper to timcv.py attached as elicv.py (all the diffs are needed for it to work; they are against CVS as of about 1:30am PST). 
A quick summary of the diffs*:

- Options.py - changed the load_options function to take the "alternate" parameter, set to default to None.  This allows load_options to be called after loading the module to use a new .ini.  I didn't notice any adverse side effects from doing this.

- timcv.py - added an "-i" option that allows a non-default, non-envvar .ini file to be specified.  This file is loaded when options are parsed (see above).  Also changed how main is called to allow calls like timcv.main(("-n 10 -i file.ini").split()) to work.  The basic pattern is:

    def main(sys_argv):
        opts, args = getopt.getopt(sys_argv, 'abc:d:')
        # etc.

    if __name__ == "__main__":
        main(sys.argv[1:])

- rates.py - also added the same def main(sys_argv) convention as above and made the module no longer just doit ;) when loaded.

- cmp.py - much the same as rates.py.

The new elicv.py file takes a list of .ini files and runs all the timcv comparison combinations on them in alphabetical order (ie. for a.ini and b.ini, a-b.txt is always the comparison file).  All of the file names that it creates match the output of the Makefile, and so should be able to be used interchangeably (but see the caveats below).

Is there an algorithmic way to determine if a certain CV run is better than another?  I suspect that there isn't a hard'n'fast answer (more "I'll know it when I see it"); consider this an opportunity to educate me. :P

I haven't touched anything in the incremental / regime testing yet.

Things I don't like/find unsettling:

- I had to touch a lot of files that aren't "mine" to get what I wanted to do done (easily).  Due to my relative unfamiliarity with the project, I'm not sure if this is to be expected or how it will be received.

- The timestamp detection and conditional .txt file rebuilding aspects of the Makefile solution aren't present.  My data is relatively small and I favor the "run everything from scratch, just to make sure" approach, but this could end up wasting a lot of CPU time.

- The resulting .txt files end up cluttering the testtools directory.  Easy to fix, but for now I am duplicating the output of the Makefile exactly.

- I'm not sure exactly what should go in the call to main in:

    if __name__ == "__main__":
        main(sys.argv[1:])

  I like [1:] because whenever I call main(sys_argv) from outside, I don't have to cook up a fake leading program name that would most likely be discarded.  However, it means that I have to change how the sys.argv list is used, as I did in cmp.py:

    def main(sys_argv):
        f1n, f2n = sys_argv[0:2]  # Used to be sys.argv[1:3]

  I'm pretty new to Python, so I am unsure if there is a standard way of doing things like this (I'd love discussion on these kinds of things, but this probably isn't the forum).

- Along the same lines, the usage calls of the sub-programs aren't correct, due to """Usage: %(program)s..."""

- cmp is also the name of a builtin.

- elicv.py isn't very descriptive or creative. ;)

I welcome comments and criticisms!  Thanks, and now to bed...

Eli

[*] - I just used "cvs diff foo.py" to generate the diffs; are there other arguments I could use that would be easier for others to work with?

--
Give a man some mud, and he plays for a day.
Teach a man to mud, and he plays for a lifetime.
WickedGrey.com uses SpamBayes on incoming email:
http://spambayes.sourceforge.net/
--

-------------- next part --------------
#!/usr/bin/env python

"""Usage: %(program)s [options] -n nsets file1.ini [file2.ini ...]

Where:
    -h
        Show usage and exit.
    -c
        Cleans files generated from file1.ini [and file2.ini ...]
    fileN.ini
        An alternate .ini file to load Options from; each will be
        compared to the others.

Note, all other CLI arguments are passed through to timcv.py after
being checked.

    -n int
        Number of Set directories (Data/Spam/Set1, ... and
        Data/Ham/Set1, ...).  This is required.

If you only want to use some of the messages in each set,

    --HamTrain int
        The maximum number of msgs to use from each Ham set for training.
        The msgs are chosen randomly.  See also the -s option.
    --SpamTrain int
        The maximum number of msgs to use from each Spam set for training.
        The msgs are chosen randomly.  See also the -s option.
    --HamTest int
        The maximum number of msgs to use from each Ham set for testing.
        The msgs are chosen randomly.  See also the -s option.
    --SpamTest int
        The maximum number of msgs to use from each Spam set for testing.
        The msgs are chosen randomly.  See also the -s option.
    --ham-keep int
        The maximum number of msgs to use from each Ham set for testing
        and training.  The msgs are chosen randomly.  See also the -s
        option.
    --spam-keep int
        The maximum number of msgs to use from each Spam set for testing
        and training.  The msgs are chosen randomly.  See also the -s
        option.
    -s int
        A seed for the random number generator.  Has no effect unless
        at least one of {--ham-keep, --spam-keep} is specified.  If -s
        isn't specified, the seed is taken from current time.

In addition, an attempt is made to merge bayescustomize.ini into the
options.  If that exists, it can be used to change the settings in
Options.options.
"""

import os
import sys
import getopt

import timcv
import rates
import cmp

program = sys.argv[0]

def usage(code, msg=''):
    """Print usage message and sys.exit(code)."""
    if msg:
        print >> sys.stderr, msg
        print >> sys.stderr
    print >> sys.stderr, __doc__ % globals()
    sys.exit(code)

def main(sys_argv):
    list_alternateOptions = []
    str_timcvSysArgv = ""
    bool_cleanAllFiles = False

    try:
        opts, args = getopt.getopt(sys_argv, 'hcn:s:', ['HamTrain=', 'SpamTrain=', 'HamTest=', 'SpamTest=', 'ham-keep=', 'spam-keep='])
    except getopt.error, msg:
        usage(1, msg)

    for opt, arg in opts:
        if opt == '-h':
            usage(0)
        elif opt == '-c':
            bool_cleanAllFiles = True
        else:
            str_timcvSysArgv += opt + " " + arg + " "

    for arg in args:
        list_alternateOptions.append(arg)

    if bool_cleanAllFiles:
        list_previousOptions = []
        list_alternateOptions.sort()
        for str_alternateOptionFile in list_alternateOptions:
            # note that this assumes files end with .ini - should be checked, but I'm lazy
            str_alternateBaseName = str_alternateOptionFile[:-4]
            try:
                os.remove(str_alternateBaseName + ".txt")
            except:
                pass
            try:
                os.remove(str_alternateBaseName + "s.txt")
            except:
                pass
            for str_previousOption in list_previousOptions:
                try:
                    os.remove(str_previousOption + "-" + str_alternateBaseName + ".txt")
                except:
                    pass
            list_previousOptions.append(str_alternateBaseName)
    else:
        list_previousOptions = []
        file_originalSysStdout = sys.stdout
        list_alternateOptions.sort()
        for str_alternateOptionFile in list_alternateOptions:
            # note that this assumes files end with .ini - should be checked, but I'm lazy
            str_alternateBaseName = str_alternateOptionFile[:-4]

            print >> file_originalSysStdout, "Calling timcv.py: " + str_timcvSysArgv + " -i " + str_alternateOptionFile
            sys.stdout = open(str_alternateBaseName + ".txt", 'w')
            timcv.main( (str_timcvSysArgv + " -i " + str_alternateOptionFile).split() )

            print >> file_originalSysStdout, "Calling rates.py: " + str_alternateBaseName + ".txt"
            #sys.stdout = open(str_alternateBaseName + "s.txt", 'w')
            sys.stdout = open("rates-junk.txt", 'w')
            rates.main((str_alternateBaseName + ".txt").split())

            for str_previousOption in list_previousOptions:
                print >> file_originalSysStdout, "Calling cmp.py: " + str_previousOption + "s.txt " + str_alternateBaseName + "s.txt"
                sys.stdout = open(str_previousOption + "-" + str_alternateBaseName + ".txt", 'w')
                cmp.main((str_previousOption + "s.txt " + str_alternateBaseName + "s.txt").split())

            list_previousOptions.append(str_alternateBaseName)

        os.remove("rates-junk.txt")

if __name__ == "__main__":
    main(sys.argv[1:])

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Options.py.diff
Type: application/octet-stream
Size: 435 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/Options.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: rates.py.diff
Type: application/octet-stream
Size: 378 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/rates.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: timcv.py.diff
Type: application/octet-stream
Size: 759 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/timcv.py.obj
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cmp.py.diff
Type: application/octet-stream
Size: 3788 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040112/6c980eec/cmp.py.obj

From skip at pobox.com Mon Jan 12 09:18:33 2004
From: skip at pobox.com (Skip Montanaro)
Date: Mon Jan 12 09:18:45 2004
Subject: [spambayes-dev] Dibbler.py digest auth splitting fix
In-Reply-To: <20040112003515.GA16459@wisp.ibygrave.no-ip.org>
References: <20040112003515.GA16459@wisp.ibygrave.no-ip.org>
Message-ID: <16386.44217.277144.468470@montanaro.dyndns.org>

    ian> I'm new to spambayes.  I've been spam-free for just a week :)

Welcome...

    ian> I found that the showclues pages failed with digest authentication.
    ...

Thanks.  Your change looked reasonable to me, so I went ahead and checked it in.  I'll leave it to Richie to touch things up if need be.

Skip

From barry at python.org Mon Jan 12 09:30:46 2004
From: barry at python.org (Barry Warsaw)
Date: Mon Jan 12 09:30:51 2004
Subject: [spambayes-dev] two small patches checked in
Message-ID: <1073917846.17326.3.camel@anthem>

I just checked in two small patches for the head of cvs.  One fixed an obvious typo leading to a SyntaxError in hammiebulk.py.  The other was to sb_imapfilter.py so that it will ignore messages that can't be parsed.

-Barry

From tim.one at comcast.net Mon Jan 12 13:48:30 2004
From: tim.one at comcast.net (Tim Peters)
Date: Mon Jan 12 13:48:36 2004
Subject: [spambayes-dev] Re: ZODB for spambayes server-side filter?
Message-ID:

FYI, an interesting relevant thread started on the zodb-dev mailing list today:

    ZODB for spambayes server-side filter?
    http://mail.zope.org/pipermail/zodb-dev/2004-January/006456.html

From tameyer at ihug.co.nz Mon Jan 12 17:04:10 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jan 12 17:04:16 2004
Subject: [spambayes-dev] two small patches checked in
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D13C@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467782B@its-xchg4.massey.ac.nz>

[Barry]
> I just checked in two small patches for the head of cvs.  One
> fixed an obvious typo leading to a SyntaxError in
> hammiebulk.py.
Damn, sorry - I really must figure out some way of testing hammie[bulk] when I check stuff in (although a syntax check would have got that one). > The other was to sb_imapfilter.py so that it > will ignore messages that can't be parsed. Cool - thanks. Eventually, it'd be nice (IMO) if imapfilter handled this like sb_server - catch exceptions and add an "X-SpamBayes-Exception" header to the message, rather than a classification one. But for the moment, at least things are more stable. =Tony Meyer From T.A.Meyer at massey.ac.nz Mon Jan 12 21:48:13 2004 From: T.A.Meyer at massey.ac.nz (Meyer, Tony) Date: Mon Jan 12 21:48:39 2004 Subject: [spambayes-dev] FW: [BAYES-NEWS] Call for Papers -- First Conference on Email and Anti-Spam Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304AD136B@its-xchg4.massey.ac.nz> A colleague forwarded this to me, I figured that others here might also be interested if you haven't heard of it already. Enjoy :) -----Original Message----- From: David Heckerman [mailto:heckerma@microsoft.com] Sent: Sat 10/01/2004 1:20 p.m. To: bayes-news@stat.cmu.edu Cc: Subject: [BAYES-NEWS] Call for Papers -- First Conference on Email and Anti-Spam The First Conference on Email and Anti-Spam (CEAS) Preliminary Call for Papers July 30, 31 and August 1, 2004 Mountain View, CA Immediately Follows AAAI 2004 http://www.ceas.cc In Cooperation with AAAI and IEEE Technical Committee on Security and Privacy General Conference Chair: David Heckerman (Microsoft Research) Program Co-Chairs: Tom Berson (Anagram Laboratories) Joshua Goodman (Microsoft Research) Andrew Ng (Stanford University) The Conference on Email and Anti-Spam invites the submission of papers for its first meeting, held in cooperation with AAAI (the American Association for Artificial Intelligence). Papers are invited on all aspects of email and spam, including research papers, industry reports, and law and policy papers. Research: Computer science oriented academic-style research Industry: Descriptions of important or innovative products Law and Policy: Legal and policy papers Research papers include experimental or theoretical, academic-style papers on all aspects of email and spam, including but not limited to: Techniques for stopping spam, including Machine learning techniques Postage techniques (HIPs or computation, possibly in response to a challenge) Disposable email addresses Protocols for sender authentication and verification Digital signatures Proof of group membership Role and significance of spam as a malware vector Spam traceback New features for email systems Automatic foldering Sorting, clustering, or searching email, including both machine learning techniques and user interface research. Advanced calendaring and scheduling Digital rights management research as applied to email Public Key Infrastructure in an email environment Industry papers describe products or systems (commercial or open source) and matters of commercial or practical interest. Papers claiming excellent results should include good experimental or theoretical evidence supporting the claims. 
Example topics include Industry cooperation for stopping spam New standards and interoperability For spam For calendaring and scheduling Public key infrastructure for encryption and identity Digital rights management New products, especially those with novel features Legal and policy papers focus on topics such as What new laws or social institutions are most appropriate for spam or other email topics Legal strategies for stopping spam The CAN-SPAM act and potential FTC regulations International legal approaches What can/should be done about Phisher scams and other email scams The economics of spam Email and identity: who should control it? Email and privacy, email at work. In all three areas, submissions closely related to email, such as instant messaging, chat rooms, usenet groups, and mailing lists will also be given full consideration. KEY DATES: ----------- Paper Submission Deadline: April 16 Notification of acceptance: June 1 Final camera-ready version of papers: July 1 Main Conference: July 30 and 31 Workshops: August 1 REQUIREMENTS: Papers may be of one of two types: extended abstracts (two pages) or full papers (at most 8 pages, including appendices and bibliography). Work may not have been previously published in any conference or journal, and simultaneous submissions are not allowed. Papers will be reviewed by a committee from academic and industrial research centers. Papers should be 11 point in single column format. Accepted papers will be made freely available on the web, and will be published on CD-ROM. Authors will retain copyright of their work. A call for workshop proposals will follow this call for papers. Suggestions for panel discussions are also welcome, and should be sent to the Program Chairs at information@ceas.cc. PROGRAM COMMITTEE Martmn Abadi (University of California, Santa Cruz) Josh Alspector (AOL) Richard Clayton (University of Cambridge) Cynthia Dwork (Microsoft Research) Tom Fawcett (HP Labs) Eric Horvitz (Microsoft Research) John Lafferty (CMU) David D. Lewis (Ornarose, Inc. and David D. Lewis Consulting) Miles Libbey (Yahoo! Inc.) Andrew McCallum (U. Mass. Amherst) Kevin McCurley (IBM Almaden) Ralph Merkle (Georgia Tech) John Platt (Microsoft Research) Jon Praed (Internet Law Group) Mehran Sahami (Google and Stanford) Neil Schwartzman (PeteMoss SpamNews) Diana Smetters (PARC) Ian Smith (Intel Research Seattle) William S. Yerazunis (MERL) CONTACT: information@ceas.cc - To unsubscribe from bayes-news send email to majordomo@stat.cmu.edu with message body: unsubscribe bayes-news or visit http://lib.stat.cmu.edu/cgi-bin/mj_wwwusr -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20040113/bae2dfdc/attachment.html From tameyer at ihug.co.nz Mon Jan 12 22:13:17 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Jan 12 22:13:25 2004 Subject: [spambayes-dev] Call for Papers -- FirstConference on Email and Anti-Spam In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D2CA@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A49@its-xchg4.massey.ac.nz> Would anyone be interested in co-authoring a paper for this with me? I'm thinking a full 8-page 'industry' paper including lots of testing results, although I'm not certain what the exact focus would be (3 months is plenty of time to do it, anyway). 
It's unlikely that I'll be able to go (it's just too far from NZ), although I'd really like to go to IAAI'04, which is just before it (as is AAAI'04), and if I manage that somehow I'd certainly go (whether presenting or not). It's more than likely we'd have a version 1.0 out by the submission deadline, which would also be nice. =Tony Meyer From skip at pobox.com Tue Jan 13 09:40:42 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 09:40:57 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea Message-ID: <16388.874.873856.853118@montanaro.dyndns.org> This is completely anecdotal, but my recent experience with automatic training in the context of a small training database (perhaps fewer than 30-50 each of hams and spams) suggests that this is not a good idea. I've been trying a regime of training on fp/fn/unsure and using non-edge auto-training (where the "edge" is at 0.01 and 0.99). In my case at least, I don't use a slick interface for untraining/retraining misclassified messages, so it's even more trouble than if I did. With a small training database, a couple mistakes can really screw things up. Skip From anthony at interlink.com.au Tue Jan 13 09:59:04 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Jan 13 09:59:19 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea In-Reply-To: <16388.874.873856.853118@montanaro.dyndns.org> Message-ID: <20040113145904.E9FBD25AF47@bonanza.off.ekorp.com> My anecdotal experience (based on a couple of days now) is that small DB/non-edge training has done wonderful things for the accuracy of the results. Mind you, if I hadn't made the changes to the training API, it'd be a pain in the arse . Anthony -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Tue Jan 13 10:07:16 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 10:07:38 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea In-Reply-To: <20040113145904.E9FBD25AF47@bonanza.off.ekorp.com> References: <16388.874.873856.853118@montanaro.dyndns.org> <20040113145904.E9FBD25AF47@bonanza.off.ekorp.com> Message-ID: <16388.2469.712.670258@montanaro.dyndns.org> Anthony> My anecdotal experience (based on a couple of days now) is that Anthony> small DB/non-edge training has done wonderful things for the Anthony> accuracy of the results. Mind you, if I hadn't made the changes Anthony> to the training API, it'd be a pain in the arse . "small DB/non-edge training" may very well be a great idea. In fact, I started from scratch yesterday based upon your email praising the idea. Auto-training with small databases where you are likely to get more false positives seems like a bad idea though. Skip P.S. Aren't you up kinda late? ;-) From listsub at wickedgrey.com Tue Jan 13 14:00:29 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Tue Jan 13 14:03:59 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea References: <16388.874.873856.853118@montanaro.dyndns.org> <20040113145904.E9FBD25AF47@bonanza.off.ekorp.com> <16388.2469.712.670258@montanaro.dyndns.org> Message-ID: <4004404D.8050705@wickedgrey.com> Skip Montanaro wrote: > Anthony> My anecdotal experience (based on a couple of days now) is that > Anthony> small DB/non-edge training has done wonderful things for the > Anthony> accuracy of the results. Mind you, if I hadn't made the changes > Anthony> to the training API, it'd be a pain in the arse . 
> > "small DB/non-edge training" may very well be a great idea. In fact, I > started from scratch yesterday based upon your email praising the idea. > Auto-training with small databases where you are likely to get more false > positives seems like a bad idea though. Is there a regime that simulates what you had been doing manually? *nudge*wink*nudge* I've actually been wanting to dig into the incremental.py stuff more; if you could provide more detail about how you choose your inital training set, etc. I'd be happy to try whipping something up when more free time rolls around (likely not until this weekend). Eli From skip at pobox.com Tue Jan 13 15:13:53 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 15:14:03 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea In-Reply-To: <4004404D.8050705@wickedgrey.com> References: <16388.874.873856.853118@montanaro.dyndns.org> <20040113145904.E9FBD25AF47@bonanza.off.ekorp.com> <16388.2469.712.670258@montanaro.dyndns.org> <4004404D.8050705@wickedgrey.com> Message-ID: <16388.20865.427392.150122@montanaro.dyndns.org> >> "small DB/non-edge training" may very well be a great idea. In fact, >> I started from scratch yesterday based upon your email praising the >> idea. Auto-training with small databases where you are likely to get >> more false positives seems like a bad idea though. Eli> Is there a regime that simulates what you had been doing manually? Dunno. I haven't really looked at the incremental training stuff. Eli> *nudge*wink*nudge* I've actually been wanting to dig into the Eli> incremental.py stuff more; if you could provide more detail about Eli> how you choose your inital training set, etc. I'd be happy to try Eli> whipping something up when more free time rolls around (likely not Eli> until this weekend). I choose my initial training set by whatever strikes me at the moment. Generally that means picking a few hams and spams, as few as one of each, often from the most recent hams and spams which I am about to throw away. Other times I pick a couple Python messages and a recent spam or two from one of my spam mailboxes. As far as I can tell, there's no obvious best way to pick those initial few messages. Skip From rob at hooft.net Tue Jan 13 15:50:22 2004 From: rob at hooft.net (Rob Hooft) Date: Tue Jan 13 15:51:51 2004 Subject: [spambayes-dev] Upgraded proxy UI... In-Reply-To: <200401102302.42711.spambayes@whateley.com> References: <200401102302.42711.spambayes@whateley.com> Message-ID: <40045A0E.1020304@hooft.net> Brendon Whateley wrote: > 2) I added two additional parameters to "html_ui" section of the options, they > are 'ham_trim_level' and 'spam_trim_level'. They have defaults of 0.02 and > 0.98 respectively. Not using this at all, but following the discussions that appeared on this list, you might want this feature to auto-balance the amount of ham and spam left, so that the user doesn't end up with an unbalanced database. Rob -- Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/ From skip at pobox.com Tue Jan 13 17:45:11 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 17:45:44 2004 Subject: [spambayes-dev] Another incremental training idea... Message-ID: <16388.29943.121292.675974@montanaro.dyndns.org> For some reason, my ham/spam ratio is getting out-of-whack faster that it seemed to in the past. 
Another trick I'm experimenting with to keep things in closer balance is to rescore my spam mailbox and delete some of those which now score a rounded 1.00 (they didn't when I first scored them). There are probably all sorts of holes in that idea, but I figured I'd toss it out there for anyone interested. Skip From nobody at spamcop.net Tue Jan 13 18:03:19 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 18:03:20 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16388.29943.121292.675974@montanaro.dyndns.org> Message-ID: Knowing that cross-posting is poor netiquette, here is a copy of a post on incremental training I made in an unrelated thread on sb_server in the SpamBayes forum. ---------------------------------------------------- [Anthony Baxter] > I have to wonder if making non-edge the default option in the next > release of the code (with advice to toss the training database) isn't > a bad plan. Probably not a bad plan at all. FWIW, I have been running a related regime manually with the plug-in for a while and it works very well. Here's my setup: I originally trained with an incrementally selected reduced training set based on a corpus of around 7800 spam and 2500 ham. The incremental selection was as follows: add the 5 worst-scoring messages of each type to the training set, rescore the corpus, repeat until all non-trained spam in the corpus scored at least 90% (the hams quickly approached 0.0%). The resulting training set was 640 spam/640 ham (about 12% of the total corpus). With thresholds of 80/5, there were only three unsures in the entire corpus of 10K+ messages (the unsures were all trained messages: one ham, two spam). I then start with this training set and "train on almost everything" daily with asymmetric training thresholds (90/0.1) to partly mimic the original training scheme. Using classification thresholds of 80/5, my composite stats with this manual regime since 12/16/2003 have been:

total spam   2889
total ham     376
fn        8   0.28%
fp        0   0.00%
unsure  133   4.07%

The ham number is so low because I use Outlook rules to siphon off all the mailing list traffic before the classifier starts. Virtually all unsures were spam (I haven't tracked it, but the number of unsure ham was certainly less than 5). The only issue I've had with this regime previously is that after a while, the performance goes down (unsures increase). Presumably, this is because reduced training sets are hapax-driven and are very sensitive to exactly which messages are trained, but that's just a guess. You then have to go back to the original spam corpus with the new messages added and tweak the training set to get performance back up. A larger training set based on train on almost everything would probably have fewer unsures. These fp and fn results are encouraging, while the unsure rate is mediocre. With nham so low, we can't have much confidence in the measured fp rate (I don't know the distribution, but the ham scores are tightly grouped around zero; does anyone know how to calculate the SD of the fp estimate based on this number of nham?). I am fooling around with variable expiration times based on how "wrong" a particular message classification was to see if I can possibly keep a reduced training set up-to-date automatically and if that is, in fact, a reasonable thing to do at all. Maybe the pure train on almost everything regime with a long time expiration (like Alex's four months) will be the ultimate?
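The selection loop described above (add the five worst-scoring messages of each class, retrain, rescore, repeat) is easy to write down for anyone who wants to try it against their own corpus. This is only a sketch: score() and train() are stand-ins for whatever classifier interface you use, not SpamBayes' actual API, and the batch size and 0.90 floor are just the numbers quoted above.

    def build_reduced_set(hams, spams, score, train, batch=5, spam_floor=0.90):
        # Repeatedly train on the worst-scoring untrained messages of each
        # class until every untrained spam already scores at least spam_floor.
        trained = set()
        while True:
            ham_left = [(score(m), m) for m in hams if id(m) not in trained]
            spam_left = [(score(m), m) for m in spams if id(m) not in trained]
            if not spam_left or min(s for s, m in spam_left) >= spam_floor:
                break
            # Worst ham = highest score; worst spam = lowest score.
            ham_left.sort(key=lambda t: t[0])
            spam_left.sort(key=lambda t: t[0])
            for s, m in ham_left[-batch:]:
                train(m, False)
                trained.add(id(m))
            for s, m in spam_left[:batch]:
                train(m, True)
                trained.add(id(m))
        return trained

Whether chasing the worst five at a time beats some other selection rule is an open question, but it at least makes the procedure repeatable.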
-- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From popiel at wolfskeep.com Tue Jan 13 18:09:26 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 13 18:09:33 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from Skip Montanaro of "Tue, 13 Jan 2004 16:45:11 CST." <16388.29943.121292.675974@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> Message-ID: <20040113230926.E282B2DF1A@cashew.wolfskeep.com> In message: <16388.29943.121292.675974@montanaro.dyndns.org> Skip Montanaro writes: >For some reason, my ham/spam ratio is getting out-of-whack faster that it >seemed to in the past. Another trick I'm experimenting with to keep things >in closer balance is to rescore my spam mailbox and delete some of those >which now score a rounded 1.00 (they didn't when I first scored them). >There are probably all sorts of holes in that idea, but I figured I'd toss >it out there for anyone interested. This is related to behaviour of TOAE with expiry; when exipry started to come into effect, the number of spams that got expired out of the database was significantly higher than the number of new spams getting trained... for about two weeks. After that, things got even worse. Graphs at: http://www.wolfskeep.com/~popiel/spambayes/plots/expire.html. Look at the cumulative trained counts for nonedgeexpire vs. those for plain TOAE at: http://www.wolfskeep.com/~popiel/spambayes/nonedge. I have not yet tried TOAE with balance maintenance. - Alex From kennypitt at hotmail.com Tue Jan 13 18:10:40 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 13 18:11:25 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16388.29943.121292.675974@montanaro.dyndns.org> Message-ID: Skip Montanaro wrote: > For some reason, my ham/spam ratio is getting out-of-whack faster > that it seemed to in the past. This is just an unsubstantiated guess based on my experience with my own e-mail mix. I get ham scores near 0.00 a lot more than I get spam scores near 1.00. Maybe the non-edge training is discarding a higher percentage of hams than it is spams. I suppose you could correct for that by setting different edge thresholds, but maybe you've already done that? I've also been kicking around some auto-training ideas hoping for time to try them. One idea I had was based on a "sliding non-edge" scale. You would set a max imbalance, say 2:1, beyond which you would train on everything on the low side. As your imbalance falls back below the maximum, auto-train would start skipping the "edge" messages with near perfect classification scores. The closer you get to a perfect 1:1 balance, the closer to the cutoff score the message would need to be before it would get auto-trained. Anyone see any obvious holes in this idea? -- Kenny Pitt From skip at pobox.com Tue Jan 13 18:18:52 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 18:19:06 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <20040113230926.E282B2DF1A@cashew.wolfskeep.com> References: <16388.29943.121292.675974@montanaro.dyndns.org> <20040113230926.E282B2DF1A@cashew.wolfskeep.com> Message-ID: <16388.31964.874443.334246@montanaro.dyndns.org> >> For some reason, my ham/spam ratio is getting out-of-whack faster >> that it seemed to in the past. 
Alex> This is related to behaviour of TOAE with expiry; when exipry Alex> started to come into effect, the number of spams that got expired Alex> out of the database was significantly higher than the number of Alex> new spams getting trained... for about two weeks. Note that I'm talking about training on almost everything over the course of about a day. I wound up quickly with a 3:1 or worse ratio. I don't think I have time to wait for stuff to expire. ;-) Skip From skip at pobox.com Tue Jan 13 18:21:32 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 18:21:38 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: <16388.29943.121292.675974@montanaro.dyndns.org> Message-ID: <16388.32124.118869.341365@montanaro.dyndns.org> >> For some reason, my ham/spam ratio is getting out-of-whack faster >> that it seemed to in the past. Kenny> This is just an unsubstantiated guess based on my experience with Kenny> my own e-mail mix. I get ham scores near 0.00 a lot more than I Kenny> get spam scores near 1.00. Maybe the non-edge training is Kenny> discarding a higher percentage of hams than it is spams. I Kenny> suppose you could correct for that by setting different edge Kenny> thresholds, but maybe you've already done that? No doubt. I made a change to my procmailrc file to not save spams scoring > 0.97 for training. We'll see how it goes. This of course jives pretty well with many peoples' observation (and my experience) that most unsures are actually spam. I think I need to adjust some thresholds to try and reduce the number of spams which get trained on. Skip From popiel at wolfskeep.com Tue Jan 13 18:21:45 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 13 18:21:48 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Seth Goodman" of "Tue, 13 Jan 2004 17:03:19 CST." References: Message-ID: <20040113232145.074CA2DF1A@cashew.wolfskeep.com> In message: "Seth Goodman" writes: > >I am fooling around with variable expiration times based on how "wrong" a >particular message classification was to see if I can possibly keep a >reduced training set up-to-date automatically and if that is, in fact, a >reasonable thing to do at all. Maybe the pure train on almost everything >regime with a long time expiration (like Alex's four months) will be the >ultimate? It occurs to be that we need to start being careful about how we talk about expiry. The expiry that I've tested with the harness is based on taking trained messages back out of the database after a certain length of time. However, in real life usage, I'm completely rebuilding the database every night with a 4 month horizon (and likely training on a noticably different collection of messages each night). My testing has shown that the former approach is worse than not expiring at all for my data... but I haven't even attempted to rigorously test the latter approach (and I shudder at the amount of CPU it would take to test, too!). I'm _guessing_ that the latter approach would perform better than the former, however. Argh! - Alex From nobody at spamcop.net Tue Jan 13 18:26:47 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 18:26:47 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: [Kenny Pitt] > I've also been kicking around some auto-training ideas hoping for time > to try them. One idea I had was based on a "sliding non-edge" scale. 
> You would set a max imbalance, say 2:1, beyond which you would train on > everything on the low side. As your imbalance falls back below the > maximum, auto-train would start skipping the "edge" messages with near > perfect classification scores. The closer you get to a perfect 1:1 > balance, the closer to the cutoff score the message would need to be > before it would get auto-trained. Anyone see any obvious holes in this > idea? No obvious problems to me. Another related idea is to dynamically move the edge thresholds until the training ratio averages 1:1. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From popiel at wolfskeep.com Tue Jan 13 18:29:37 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 13 18:29:41 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from Skip Montanaro of "Tue, 13 Jan 2004 17:18:52 CST." <16388.31964.874443.334246@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> <20040113230926.E282B2DF1A@cashew.wolfskeep.com> <16388.31964.874443.334246@montanaro.dyndns.org> Message-ID: <20040113232937.AA24C2DF50@cashew.wolfskeep.com> In message: <16388.31964.874443.334246@montanaro.dyndns.org> Skip Montanaro writes: > >Note that I'm talking about training on almost everything over the course of >about a day. I wound up quickly with a 3:1 or worse ratio. I don't think I >have time to wait for stuff to expire. ;-) Ook. Actually, 3:1 isn't all that bad... looking at the ratio plot at the bottom of http://cashew.wolfskeep.com/~popiel/spambayes/nonedge, my tests started out between 3:1 and 4:1 for the first few days. They dropped back down to 2:1 after about 70-100 days. Error rates didn't look nice for the first week or two, either... but they settled out, too. - Alex From skip at pobox.com Tue Jan 13 18:32:05 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 18:32:10 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: Message-ID: <16388.32757.18796.428639@montanaro.dyndns.org> Seth> Another related idea is to dynamically move the edge thresholds Seth> until the training ratio averages 1:1. I think you'll quickly wind up moving the ham edge threshold to 0.00 and the spam edge threshold would wind up very near to your spam cutoff. That's not necessarily a bad thing, but it has to be considered. Skip From barry at python.org Tue Jan 13 18:33:43 2004 From: barry at python.org (Barry Warsaw) Date: Tue Jan 13 18:33:53 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16388.32124.118869.341365@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> Message-ID: <1074036822.21051.11.camel@anthem> On Tue, 2004-01-13 at 18:21, Skip Montanaro wrote: > This of course jives pretty well with many peoples' observation (and my > experience) that most unsures are actually spam. I think I need to adjust > some thresholds to try and reduce the number of spams which get trained on. I'm still training on errors and had very good results, with an occasional reset of my spam train folder. I see everything, including mailing list traffic and admin notifications. I just started to train admin notices that contained attached spam (i.e. auto-discards and hold messages), so now my unsures have started to go up as have false negatives. 
It's starting to stabilize though because I often see the held messages as pure spam too, so as I train on those, I'm guessing the differences between the wrapped and unwrapped spam is becoming more evident. In any case, fp rate is extremely low -- I haven't seen one in the several weeks since I blew away my database and retrained. All in all, train-on-error /seems/ good enough for me. It does the two things I really want it to do: moves almost all my spam to a separate folder which I can check much less often, and give me no false positives. -Barry From skip at pobox.com Tue Jan 13 18:36:50 2004 From: skip at pobox.com (Skip Montanaro) Date: Tue Jan 13 18:36:55 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1074036822.21051.11.camel@anthem> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> Message-ID: <16388.33042.586828.842912@montanaro.dyndns.org> Barry> I'm still training on errors and had very good results, with an Barry> occasional reset of my spam train folder. So you don't train on unsures? Skip From nobody at spamcop.net Tue Jan 13 18:40:29 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 18:40:34 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <20040113232145.074CA2DF1A@cashew.wolfskeep.com> Message-ID: [Alex Popiel] > It occurs to be that we need to start being careful about how we talk > about expiry. The expiry that I've tested with the harness is based on > taking trained messages back out of the database after a certain length > of time. However, in real life usage, I'm completely rebuilding the > database every night with a 4 month horizon (and likely training on a > noticably different collection of messages each night). I guess I don't understand why the two expiry approaches should be different, unless the individual messages expired at precise times of the day exactly 120 days after they were trained rather than all at once at 12:00:01 AM. I would think the differences to be rather small. If the four-month expiry degrades the performance, as your data shows, would a longer expiry do better? I am at a bit of a loss, since we can't keep adding to the training database forever. At some point, and that might be different for every mail stream, I am guessing that very old messages are no longer contributing as much as the newer ones to accurate classification. No? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From barry at python.org Tue Jan 13 18:41:39 2004 From: barry at python.org (Barry Warsaw) Date: Tue Jan 13 18:41:48 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16388.33042.586828.842912@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> <16388.33042.586828.842912@montanaro.dyndns.org> Message-ID: <1074037298.21051.20.camel@anthem> On Tue, 2004-01-13 at 18:36, Skip Montanaro wrote: > Barry> I'm still training on errors and had very good results, with an > Barry> occasional reset of my spam train folder. > > So you don't train on unsures? Oh sorry, yes I do move unsure to my ham or spam train folders as I deal with them. 
Those numbers are going down (and now skew heavily toward spam) as I've started to train on both the wrapped and unwrapped spam messages (wrapped as ham, unwrapped as spam). -Barry From nobody at spamcop.net Tue Jan 13 19:01:18 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 19:01:21 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16388.32757.18796.428639@montanaro.dyndns.org> Message-ID: > Seth> Another related idea is to dynamically move the edge thresholds > Seth> until the training ratio averages 1:1. > [Skip Montanaro] > I think you'll quickly wind up moving the ham edge threshold to > 0.00 and the > spam edge threshold would wind up very near to your spam cutoff. > That's not > necessarily a bad thing, but it has to be considered. Good point. Given an unbalanced input mail flow, like most of us seem to have, if you want to have 1:1 training, this is inevitable on one side or the other (unless you want to use a random sampling method to select a subset of nonedge spam to train - that scares me as well). We might as well list it as another variation on nonedge: I suggest calling it one-edge. It doesn't give me a particularly good feeling to train on all ham but only non-edge spam, but maybe the 1:1 training ratio will allow it to perform despite the unsatisfying way the balance is achieved? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From popiel at wolfskeep.com Tue Jan 13 19:13:56 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 13 19:14:00 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Seth Goodman" of "Tue, 13 Jan 2004 17:40:29 CST." References: Message-ID: <20040114001356.3CC442DF1A@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >[Alex Popiel] >> It occurs to be that we need to start being careful about how we talk >> about expiry. The expiry that I've tested with the harness is based on >> taking trained messages back out of the database after a certain length >> of time. However, in real life usage, I'm completely rebuilding the >> database every night with a 4 month horizon (and likely training on a >> noticably different collection of messages each night). > >I guess I don't understand why the two expiry approaches should be >different, unless the individual messages expired at precise times of the >day exactly 120 days after they were trained rather than all at once at >12:00:01 AM. The two methods are different because in the 'take stuff out of the database' method, the selection of messages trained from day 2 doesn't change and remains affected by which messages were trained on day 1, even after day 1 has been taken out... but in the 'rebuild from scratch' method, the selection of messages trained from day 2 (potentially) changes when day 1 disappears over the horizon (and thus the scores for the day 2 messages are presumably closer to .5). >I would think the differences to be rather small. For messages at the later end of the window, the differences probably are small, but the differences at the earlier side of the window are likely to be profound. >If the four-month expiry degrades the performance, as your data shows, would >a longer expiry do better? I am at a bit of a loss, since we can't keep >adding to the training database forever. 
At some point, and that might be >different for every mail stream, I am guessing that very old messages are no >longer contributing as much as the newer ones to accurate classification. >No? This is an open question, and I don't think we even have a concept on how to measure which _messages_ in a database are contributing more than others. I suppose you could do a 2-d scatter plot, where one axis was the ordinal of the message being classified and the other axis was the ordinal of any message which contained a token that was used in the classification... lots of tiny dots, and see if it's evenly spread through the triangle or biased to on side or another... Why can't we just keep adding to the database forever? My mail is accumulating much more slowly than Moore's law, even with the exponential growth in spam... I can't imagine the DB growing faster than the dataset it's based on. - Alex From nobody at spamcop.net Tue Jan 13 19:15:41 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 19:15:44 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: [Seth Goodman] > I guess I don't understand why the two expiry approaches should be > different ... Oooops, sorry for the dumb question. The result from creating a new training set every night does depend very much on how you score the four months of messages to determine which ones are nonedge, though. How do you do this in your regime? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From popiel at wolfskeep.com Tue Jan 13 19:49:10 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Tue Jan 13 19:49:14 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Seth Goodman" of "Tue, 13 Jan 2004 18:15:41 CST." References: Message-ID: <20040114004910.5C9B32DF1A@cashew.wolfskeep.com> In message: "Seth Goodman" writes: >[Seth Goodman] >> I guess I don't understand why the two expiry approaches should be >> different ... > >Oooops, sorry for the dumb question. The result from creating a new >training set every night does depend very much on how you score the four >months of messages to determine which ones are nonedge, though. How do you >do this in your regime? In my real life usage, the rebuilding process scores messages based on the database it's built so far. Having it score messages for the new database based on the old database might be another interesting scenario, but I have absolutely no clue how effective that would be. In any case, these complete-rebuilding scenarios are predicated on keeping at least the token list for every single message (not just those that have been trained) for as long as we might want to train on them. There is some indication that a significant portion of the userbase is unwilling to keep that much mail data lying around (for months, presumably)... which makes the other form of expiry of more practical interest. - Alex From nobody at spamcop.net Tue Jan 13 20:09:22 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 13 20:09:20 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <20040114004910.5C9B32DF1A@cashew.wolfskeep.com> Message-ID: [T. Alex Popiel] > In any case, these complete-rebuilding scenarios are predicated on > keeping at least the token list for every single message (not just > those that have been trained) for as long as we might want to train > on them. 
There is some indication that a significant portion of the > userbase is unwilling to keep that much mail data lying around > (for months, presumably)... which makes the other form of expiry > of more practical interest. I keep that much mail around, but I certainly agree that most people do not like to save spam. It's either back to expiration, then, or just keep on training, as you suggested. I do have a question on your incremental harness with expiry, since it's surprising how much worse it performs as soon as it starts expiring messages. For classification purposes, you obviously use the training set from the last 120 days of nonedge messages. Do you then use those same scores for the current day's messages to determine which are the nonedge messages? I ask this because you would get a different set of messages to train on, and perhaps compensate better for the particular messages you expire, if you first expired the 120-day old messages, then rescored the current day's messages to determine the nonedge messages to train on. Does this make any sense? -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From anthony at interlink.com.au Wed Jan 14 00:17:22 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Wed Jan 14 00:17:29 2004 Subject: [spambayes-dev] beta release time? Message-ID: <20040114051722.592F325AF47@bonanza.off.ekorp.com> So, is there any reason we shouldn't do a first beta release in the next week or so? If people have bugs/features they want fixed/added before then, can we assemble a list? -- Anthony Baxter It's never too late to have a happy childhood. From skip at pobox.com Wed Jan 14 06:22:29 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 06:22:34 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1074037298.21051.20.camel@anthem> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> <16388.33042.586828.842912@montanaro.dyndns.org> <1074037298.21051.20.camel@anthem> Message-ID: <16389.9845.984427.65966@montanaro.dyndns.org> >> So you don't train on unsures? Barry> Oh sorry, yes I do move unsure to my ham or spam train folders as Barry> I deal with them. Those numbers are going down (and now skew Barry> heavily toward spam) as I've started to train on both the wrapped Barry> and unwrapped spam messages (wrapped as ham, unwrapped as spam). There's another new term (at least to me). What do "wrapped" and "unwrapped" mean? I'm going to guess that "wrapped spam" is a false negative and "unwrapped spam" is a correctly scored spam, but the connection between what the terms (apparently) mean and choice of "wrapped" to describe that meaning escapes me. Maybe I need more sleep... Skip From skip at pobox.com Wed Jan 14 06:29:11 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 06:29:13 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: <16388.32757.18796.428639@montanaro.dyndns.org> Message-ID: <16389.10247.928382.554729@montanaro.dyndns.org> Seth> I suggest calling it one-edge. It doesn't give me a particularly Seth> good feeling to train on all ham but only non-edge spam, but maybe Seth> the 1:1 training ratio will allow it to perform despite the Seth> unsatisfying way the balance is achieved? 
I think the dissatisfaction comes in part from the rather arbitrary choice that a message which scores 0.00 or 0.80 is somehow more important to the overall results than one which scores 0.98. It does seem a bit arbitrary, but the system seems to suggest we need to be slaves to balance and that's one way to get it. You could throw out spams using some other criteria. Two that come to mind are by age or random choice. Skip From skip at pobox.com Wed Jan 14 06:31:57 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 06:31:59 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <20040114001356.3CC442DF1A@cashew.wolfskeep.com> References: <20040114001356.3CC442DF1A@cashew.wolfskeep.com> Message-ID: <16389.10413.183005.448720@montanaro.dyndns.org> Alex> Why can't we just keep adding to the database forever? My mail is Alex> accumulating much more slowly than Moore's law, even with the Alex> exponential growth in spam... I can't imagine the DB growing Alex> faster than the dataset it's based on. For me, performance suffers as the database grows too large and it becomes much harder to find and eliminate training mistakes. Skip From barry at python.org Wed Jan 14 08:09:33 2004 From: barry at python.org (Barry Warsaw) Date: Wed Jan 14 08:09:37 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16389.9845.984427.65966@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> <16388.33042.586828.842912@montanaro.dyndns.org> <1074037298.21051.20.camel@anthem> <16389.9845.984427.65966@montanaro.dyndns.org> Message-ID: <1074085772.21051.108.camel@anthem> On Wed, 2004-01-14 at 06:22, Skip Montanaro wrote: > >> So you don't train on unsures? > > Barry> Oh sorry, yes I do move unsure to my ham or spam train folders as > Barry> I deal with them. Those numbers are going down (and now skew > Barry> heavily toward spam) as I've started to train on both the wrapped > Barry> and unwrapped spam messages (wrapped as ham, unwrapped as spam). > > There's another new term (at least to me). What do "wrapped" and > "unwrapped" mean? I'm going to guess that "wrapped spam" is a false > negative and "unwrapped spam" is a correctly scored spam, but the connection > between what the terms (apparently) mean and choice of "wrapped" to describe > that meaning escapes me. Ah sorry. Anyone who's been a Mailman admin knows that you get notifications when messages are held for approval or when they are auto-discarded. Mailman includes the original message as an attachment to the notification. It's that attachment-in-a-notification that I'm calling a wrapped spam. An unwrapped spam might be the same original message that makes it through to the list, or you get directly. Wrapped spams are very spammy but they are actually ham because of the notification part. -Barry From simone.piunno at wseurope.com Wed Jan 14 08:27:05 2004 From: simone.piunno at wseurope.com (Simone Piunno) Date: Wed Jan 14 08:27:16 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1074085772.21051.108.camel@anthem> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16389.9845.984427.65966@montanaro.dyndns.org> <1074085772.21051.108.camel@anthem> Message-ID: <200401141427.05696.simone.piunno@wseurope.com> Alle 14:09, mercoled? 
14 gennaio 2004, Barry Warsaw ha scritto: > > Barry> Oh sorry, yes I do move unsure to my ham or spam train folders as > > Barry> I deal with them. Those numbers are going down (and now skew > > Barry> heavily toward spam) as I've started to train on both the wrapped > > Barry> and unwrapped spam messages (wrapped as ham, unwrapped as spam). > Mailman includes the original message as an attachment > to the notification. It's that attachment-in-a-notification that I'm > calling a wrapped spam. An unwrapped spam might be the same original > message that makes it through to the list, or you get directly. Wrapped > spams are very spammy but they are actually ham because of the > notification part. My experience is that, in the long run, training on these wrapped spam messages kills performance, raising the likeliness of fn and unsure. I don't train them, because anyway I don't need to be fast discarding held spam, so checking the daily report is enough. I just want to react immediatly to held ham. Some possible improvement for list admins would be automatically recognize that a message is a Mailman notification and: - just train on payload or just train on the external message. - only score payload or only score the external message. Of course this would be a for-mailman-list-admins-only patch. -- Simone Piunno, chief architect Wireless Solutions SPA - DADA group Europe HQ, via Castiglione 25 Bologna web:www.wseurope.com tel:+390512966811 fax:+390512966800 God is real, unless declared integer From skip at pobox.com Wed Jan 14 08:28:11 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 08:28:23 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1074085772.21051.108.camel@anthem> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> <16388.33042.586828.842912@montanaro.dyndns.org> <1074037298.21051.20.camel@anthem> <16389.9845.984427.65966@montanaro.dyndns.org> <1074085772.21051.108.camel@anthem> Message-ID: <16389.17387.535878.470277@montanaro.dyndns.org> Barry> An unwrapped spam might be the same original message that makes Barry> it through to the list, or you get directly. Wrapped spams are Barry> very spammy but they are actually ham because of the notification Barry> part. Ah, okay. I generally (lately never) train on such mails. It seems like all it might do is confuse the classifier. Skip From skip at pobox.com Wed Jan 14 08:34:34 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 08:34:40 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16389.17387.535878.470277@montanaro.dyndns.org> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16388.32124.118869.341365@montanaro.dyndns.org> <1074036822.21051.11.camel@anthem> <16388.33042.586828.842912@montanaro.dyndns.org> <1074037298.21051.20.camel@anthem> <16389.9845.984427.65966@montanaro.dyndns.org> <1074085772.21051.108.camel@anthem> <16389.17387.535878.470277@montanaro.dyndns.org> Message-ID: <16389.17770.257554.968248@montanaro.dyndns.org> Skip> Ah, okay. I generally (lately never) train on such mails. It seems like ^don't^ Skip> all it might do is confuse the classifier. From barry at python.org Wed Jan 14 09:22:26 2004 From: barry at python.org (Barry Warsaw) Date: Wed Jan 14 09:22:33 2004 Subject: [spambayes-dev] Another incremental training idea... 
In-Reply-To: <200401141427.05696.simone.piunno@wseurope.com> References: <16388.29943.121292.675974@montanaro.dyndns.org> <16389.9845.984427.65966@montanaro.dyndns.org> <1074085772.21051.108.camel@anthem> <200401141427.05696.simone.piunno@wseurope.com> Message-ID: <1074090145.21051.118.camel@anthem> On Wed, 2004-01-14 at 08:27, Simone Piunno wrote: > My experience is that, in the long run, training on these wrapped spam > messages kills performance, raising the likeliness of fn and unsure. That's my suspicion too, although I figure I'm conducting a real-world experiment to see if that's true. So far fns and unsures are not unmanageable, after a brief period of instability. > Some possible improvement for list admins would be automatically recognize > that a message is a Mailman notification and: > - just train on payload or just train on the external message. > - only score payload or only score the external message. > Of course this would be a for-mailman-list-admins-only patch. A generalization might be to score each attachment (or possibly just each message/rfc822 type attachment) separately. Then choose an algorithm for combining the scores, e.g. outer-only, inner-only, combined, etc. -Barry From kennypitt at hotmail.com Wed Jan 14 09:36:02 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Wed Jan 14 09:36:49 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: Seth Goodman wrote: > [Kenny Pitt] >> I've also been kicking around some auto-training ideas hoping for >> time to try them. One idea I had was based on a "sliding non-edge" > > Another related idea is to dynamically move the edge thresholds until > the training ratio averages 1:1. My description applies to auto-balancing of "train on mistakes and unsures" instead of "train on everything" or "train on almost everything". The algorithm could easily be reversed to do TOE where there are no configured edge thresholds. Doing TOAE effectively would probably require your adjustment. For mistake-based training, the idea is that as long as my balance is very close to 1:1, I'm happy to train only on the messages that I manually reclassify because of mistakes and unsures. If that mistake-based training causes an imbalance then the auto-balancer kicks in with an edge threshold very close to the classifier cutoff so that only the worst-scoring messages are trained. As the imbalance worsens, the edge threshold is dynamically adjusted as needed to train on more messages and try to push the balance back towards 1:1. For TOE it would be the exact opposite. I train on all ham and spam as long as the balance remains at 1:1. If I start to get an imbalance, then the edge threshold of the high side is adjusted so that the best-scoring messages are no longer trained. As the imbalance gets worse, the edge threshold is adjusted so that fewer and fewer messages are trained. TOAE could be accomplished the same way as TOE simply by obeying the configured static edge thresholds as limits for the auto-adjusted thresholds, but this doesn't account for the case where the configured thresholds discard too many messages for proper balancing. This is where you might want to dynamically adjust the configured thresholds, at least until you get back in balance. -- Kenny Pitt From popiel at wolfskeep.com Wed Jan 14 13:07:06 2004 From: popiel at wolfskeep.com (T. 
Alexander Popiel) Date: Wed Jan 14 13:07:13 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence Message-ID: <20040114180706.EBF342DF16@cashew.wolfskeep.com> In the last week or so, I've been noticing a higher rate of false negatives in my mail. Looking at the clues indicates that I've got a spam or two mis-trained, but I haven't bothered to find it, yet (I'm currently in the middle of restructuring my archives so that I don't have a single directory with over 100,000 files in it). On the other hand, it appears that this mis-training is the only reason I'm getting such a high rate of false negatives, despite a spam:ham training ratio of 50:1. That's right. 50:1. More specifically, for the last four months, I have:

Total:   4694 ham, 39913 spam (89.48% spam)
Trained:  204 ham, 10994 spam (98.18% spam)

Having such a high imbalance does seem to make me particularly susceptible to training errors... but doesn't seem to hurt otherwise. - Alex From popiel at wolfskeep.com Wed Jan 14 13:22:56 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Jan 14 13:23:01 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Seth Goodman" of "Tue, 13 Jan 2004 19:09:22 CST." References: Message-ID: <20040114182256.937CF2DE7D@cashew.wolfskeep.com> In message: "Seth Goodman" writes: > >I do have a question on your incremental harness with expiry, since it's >surprising how much worse it performs as soon as it starts expiring >messages. For classification purposes, you obviously use the training set >from the last 120 days of nonedge messages. Do you then use those same >scores for the current day's messages to determine which are the nonedge >messages? I ask this because you would get a different set of messages to >train on, and perhaps compensate better for the particular messages you >expire, if you first expired the 120-day old messages, then rescored the >current day's messages to determine the nonedge messages to train on. Does >this make any sense? Well, here's the regime code:

###
### This is a training regime for the incremental.py harness.
### It does perfect training for all messages not already
### properly classified with extreme confidence.
###

class nonedgeexpire:
    def __init__(self):
        self.ham = [[]]
        self.spam = [[]]

    def group_action(self, which, test):
        if len(self.ham) >= 120:
            test.untrain(self.ham[119], self.spam[119])
            self.ham = self.ham[:119]
            self.spam = self.spam[:119]
        self.ham.insert(-1, [])
        self.spam.insert(-1, [])

    def guess_action(self, which, test, guess, actual, msg):
        if guess[0] != actual:
            if actual < 0:
                self.spam[0].append(msg)
            else:
                self.ham[0].append(msg)
            return actual
        if 0.005 < guess[1] and guess[1] < 0.995:
            if actual < 0:
                self.spam[0].append(msg)
            else:
                self.ham[0].append(msg)
            return actual
        return 0

This code trains immediately on the non-edge stuff, and expires at the end of each day. It does not choose the messages to train for the day after expiring, as you suggest. Your suggestion is interesting, though it would be a bit expensive to do (doubling the number of classifications done).
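Seth's variant (expire the oldest day first, rescore the current day's messages, and only then train the ones that are still non-edge) could be expressed in the same shape. The sketch below is only illustrative: it reuses the group_action/guess_action structure and test.untrain() visible in the regime above, but test.score() and test.train() are assumed hooks rather than anything incremental.py is known to provide, and for brevity it keeps only the non-edge part of the original rule.

    class nonedgeexpirerescore:
        # Sketch only: hold each day's messages, expire the oldest day
        # first, then rescore and train the survivors that are still
        # non-edge.
        def __init__(self):
            self.ham = [[]]
            self.spam = [[]]
            self.pending = []

        def group_action(self, which, test):
            if len(self.ham) >= 120:
                test.untrain(self.ham[119], self.spam[119])
                self.ham = self.ham[:119]
                self.spam = self.spam[:119]
            # Decide what to train only after the old day is gone.
            for actual, msg in self.pending:
                prob = test.score(msg)              # assumed hook
                if 0.005 < prob < 0.995:
                    if actual < 0:
                        self.spam[0].append(msg)
                    else:
                        self.ham[0].append(msg)
                    test.train(msg, actual < 0)     # assumed hook
            self.pending = []
            self.ham.insert(-1, [])
            self.spam.insert(-1, [])

        def guess_action(self, which, test, guess, actual, msg):
            # Defer all training decisions to the next group_action.
            self.pending.append((actual, msg))
            return 0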
- Alex From skip at pobox.com Wed Jan 14 13:57:02 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 13:57:06 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: <20040114180706.EBF342DF16@cashew.wolfskeep.com> References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> Message-ID: <16389.37118.304189.514738@montanaro.dyndns.org> Alex> Total: 4694 ham, 39913 spam (89.48% spam) Alex> Trained: 204 ham, 10994 spam (98.18% spam) Alex> Having such a high imbalance does seem to make me particularly Alex> susceptible to training errors... but doesn't seem to hurt Alex> otherwise. How do you plan to find those mistrained messages? Skip From popiel at wolfskeep.com Wed Jan 14 14:05:08 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Jan 14 14:05:13 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: Message from Skip Montanaro of "Wed, 14 Jan 2004 12:57:02 CST." <16389.37118.304189.514738@montanaro.dyndns.org> References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> <16389.37118.304189.514738@montanaro.dyndns.org> Message-ID: <20040114190508.8E9E32DE7D@cashew.wolfskeep.com> In message: <16389.37118.304189.514738@montanaro.dyndns.org> Skip Montanaro writes: > > Alex> Total: 4694 ham, 39913 spam (89.48% spam) > Alex> Trained: 204 ham, 10994 spam (98.18% spam) > > Alex> Having such a high imbalance does seem to make me particularly > Alex> susceptible to training errors... but doesn't seem to hurt > Alex> otherwise. > >How do you plan to find those mistrained messages? As part of my nightly retrain, I'm going to make it score each message (with the fully trained DB) and sort them into 6 directories for each month: {ham,spam}{positive,unsure,negative}. Flipping through the hampositive directory for each month should make it fairly easy to spot the problems... - Alex From nobody at spamcop.net Wed Jan 14 14:05:41 2004 From: nobody at spamcop.net (Seth Goodman) Date: Wed Jan 14 14:05:44 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <20040114182256.937CF2DE7D@cashew.wolfskeep.com> Message-ID: [T. Alex Popiel] > This code trains immediately on the non-edge stuff, and expires > at the end of each day. It does not choose the messages to train > for the day after expiring, as you suggest. Your suggestion is > interesting, though it would be a bit expensive to do (doubling > the number of classifications done). Thanks, Alex. That's just what I think of the idea: interesting but who knows. As for it being expensive, it does do two classifications for every message. However, in real life, assuming you do training once per day after making sure all messages are correctly classified, it only has to re-classify one day's messages. That only takes a few seconds on my system with a couple of hundred messages from the Outlook plug-in. I have no idea if the Python scripts using the standard message data structure are slower. Here are a few more potentially dumb questions. Does your script work directly with incremental.py from CVS or do you use a modified version? To implement the expire, reclassify, train regime, would I then modify just incremental.py or are these functions spread around in other modules? I would like to play with this on my own saved mail corpus (it only goes back to September and has just 10K messages but grows daily) and get my feet wet with some cv runs. As always, thanks for your indulgence. 
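For reference, the nightly score-and-sort Alex describes above is only a few lines of glue. This sketch is not SpamBayes code: it assumes one message per file under <month>/ham/ and <month>/spam/, score() stands in for whatever classifier call you use, and the cutoffs are arbitrary.

    import os
    import shutil

    def sort_month(month_dir, score, ham_cutoff=0.2, spam_cutoff=0.9):
        # Sort already-classified messages into the six
        # {ham,spam}{positive,unsure,negative} directories so mistrained
        # messages are easy to eyeball.
        for actual in ("ham", "spam"):
            src = os.path.join(month_dir, actual)
            if not os.path.isdir(src):
                continue
            for name in os.listdir(src):
                path = os.path.join(src, name)
                prob = score(open(path).read())
                if prob >= spam_cutoff:
                    verdict = "positive"        # scored as spam
                elif prob <= ham_cutoff:
                    verdict = "negative"        # scored as ham
                else:
                    verdict = "unsure"
                dest = os.path.join(month_dir, actual + verdict)
                if not os.path.isdir(dest):
                    os.makedirs(dest)
                shutil.copy(path, os.path.join(dest, name))

Anything that ends up in hampositive or spamnegative is then worth a second look as a possible mistraining.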
-- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From skip at pobox.com Wed Jan 14 14:13:21 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 14:13:25 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: <20040114190508.8E9E32DE7D@cashew.wolfskeep.com> References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> <16389.37118.304189.514738@montanaro.dyndns.org> <20040114190508.8E9E32DE7D@cashew.wolfskeep.com> Message-ID: <16389.38097.307314.458183@montanaro.dyndns.org> >> How do you plan to find those mistrained messages? Alex> As part of my nightly retrain, I'm going to make it score each Alex> message (with the fully trained DB) and sort them into 6 Alex> directories for each month: {ham,spam}{positive,unsure,negative}. Alex> Flipping through the hampositive directory for each month should Alex> make it fairly easy to spot the problems... I'm still confused. You've got a spam mistrained as ham. Are you suggesting that you expect that scoring that message against your training database (which includes features gleaned from that message) will reveal that it is something other than ham? I have a very small training database (microscopic compared to yours) and I generally find it easier to just start from scratch when I reach the conclusion that I have some errors in my database. Skip From listsub at wickedgrey.com Wed Jan 14 14:17:55 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Wed Jan 14 14:21:24 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> <16389.37118.304189.514738@montanaro.dyndns.org> Message-ID: <400595E3.50108@wickedgrey.com> Skip Montanaro wrote: > Alex> Total: 4694 ham, 39913 spam (89.48% spam) > Alex> Trained: 204 ham, 10994 spam (98.18% spam) > > Alex> Having such a high imbalance does seem to make me particularly > Alex> susceptible to training errors... but doesn't seem to hurt > Alex> otherwise. Does it hurt more when a FP or FN is mistrained? > How do you plan to find those mistrained messages? Hmm... How feasible is:

trainEverything()
for msg in hamCorpus:
    untrain( msg )
    result = classify( msg )
    if result == spam:
        display( msg )
    train( msg )

This won't work if the mistrained messages are not very spammy, but in that case they shouldn't be affecting classification adversely, right? Eli From popiel at wolfskeep.com Wed Jan 14 14:27:53 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Jan 14 14:27:57 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: Message from Skip Montanaro of "Wed, 14 Jan 2004 13:13:21 CST." <16389.38097.307314.458183@montanaro.dyndns.org> References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> <16389.37118.304189.514738@montanaro.dyndns.org> <20040114190508.8E9E32DE7D@cashew.wolfskeep.com> <16389.38097.307314.458183@montanaro.dyndns.org> Message-ID: <20040114192753.78B9A2DE7D@cashew.wolfskeep.com> In message: <16389.38097.307314.458183@montanaro.dyndns.org> Skip Montanaro writes: > > >> How do you plan to find those mistrained messages? > > Alex> As part of my nightly retrain, I'm going to make it score each > Alex> message (with the fully trained DB) and sort them into 6 > Alex> directories for each month: {ham,spam}{positive,unsure,negative}. > Alex> Flipping through the hampositive directory for each month should > Alex> make it fairly easy to spot the problems... > >I'm still confused.
You've got a spam mistrained as ham. Are you >suggesting that you expect that scoring that message against your training >database (which includes features gleaned from that message) will reveal >that it is something other than ham? Yes, actually, I do. It certainly has a bunch of spams that it trains on that are still classified as ham (all the false negatives that I'm continuing to see), so I'm expecting the reverse is true, too. At worst, I'll have to make it pay attention to the differences between the score it uses during training to determine edge-ness and the score determined after training and the manually determined ham/spam state. That'll point out the cases where messages were not in the expected category to start with and/or training didn't help... - Alex (who has great faith in the system pointing out errors) From papaDoc at videotron.ca Wed Jan 14 14:35:11 2004 From: papaDoc at videotron.ca (papaDoc) Date: Wed Jan 14 14:35:19 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: <16389.38097.307314.458183@montanaro.dyndns.org> References: <20040114180706.EBF342DF16@cashew.wolfskeep.com> <16389.37118.304189.514738@montanaro.dyndns.org> <20040114190508.8E9E32DE7D@cashew.wolfskeep.com> <16389.38097.307314.458183@montanaro.dyndns.org> Message-ID: <400599EF.5000702@videotron.ca> Hi, > >> How do you plan to find those mistrained messages? > > Alex> As part of my nightly retrain, I'm going to make it score each > Alex> message (with the fully trained DB) and sort them into 6 > Alex> directories for each month: {ham,spam}{positive,unsure,negative}. > Alex> Flipping through the hampositive directory for each month should > Alex> make it fairly easy to spot the problems... > >I'm still confused. You've got a spam mistrained as ham. Are you >suggesting that you expect that scoring that message against your training >database (which includes features gleaned from that message) will reveal >that it is something other than ham? > I wrote a little script that looks at the message training header. Then, using the current database, I reclassified the msg. Then I checked whether the training header and the classifying header were the same. If the message was misclassified then usually it showed up as unsure. My database contained thousands of hams and spams. Remi -- /"\ \ / X ASCII Ribbon Campaign / \ Against HTML Email From tim.one at comcast.net Wed Jan 14 14:36:04 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 14:36:17 2004 Subject: [spambayes-dev] Another piece of anecdotal evidence In-Reply-To: <16389.38097.307314.458183@montanaro.dyndns.org> Message-ID: [Skip] >>> How do you plan to find those mistrained messages? [Alex] >> As part of my nightly retrain, I'm going to make it score each >> message (with the fully trained DB) and sort them into directories >> for each month: >> {ham,spam}{positive,unsure,negative} >> Flipping through the hampositive directory for each month should >> make it fairly easy to spot the problems... [Skip] > I'm still confused. You've got a spam mistrained as ham. Are you > suggesting that you expect that scoring that message against your > training database (which includes features gleaned from that message) > will reveal that it is something other than ham? I have a very small > training database (microscopic compared to yours) and I generally > find it easier to just start from scratch when I reach the conclusion > that I have some errors in my database.
I'd suggest running a cross-validation test, with any n >= 2, and setting the testing options to show FP and FN. This is extremely effective (IME) at finding misclassified messages, particulary since a CV run never tests a message against a classifier that's been trained on that msg (unless you've got duplicates of a message, yadda yadda). From tp at diffenbach.org Wed Jan 14 16:26:36 2004 From: tp at diffenbach.org (TP Diffenbach) Date: Wed Jan 14 16:25:48 2004 Subject: [spambayes-dev] FW: Spam Clues: Do you remember me ? Message-ID: Here's an interesting approach, that flew right under my spambayes, despite a 5 MB db. (And just to be clear, few or none of the words in my database are in German.) --Tom Combined Score: 0% (0.00149714) Internal ham score (*H*): 0.997321 Internal spam score (*S*): 0.000315422 # ham trained on: 1491 # spam trained on: 2924 72 Significant Tokens token spamprob #ham #spam 'f?r' 0.0196507 11 0 'lyrics' 0.0412844 5 0 'allen' 0.0460477 15 1 'alle' 0.0505618 4 0 'songs.' 0.0505618 4 0 'mp3' 0.064526 40 5 'tool,' 0.0652174 3 0 'dem' 0.0918367 2 0 'from:addr:webmaster' 0.0918367 2 0 'songs' 0.105593 6 1 'to:' 0.129275 410 119 'motion' 0.129486 15 4 'remain' 0.134622 34 10 'music' 0.141792 32 10 'tools.' 0.148171 4 1 'to:addr:tp' 0.153674 880 313 'auf' 0.155172 1 0 'das' 0.155172 1 0 'hier' 0.155172 1 0 'url:did' 0.155172 1 0 'tool' 0.159102 36 13 'small' 0.172208 94 38 'use' 0.177683 402 170 'links' 0.179874 38 16 'die' 0.18052 17 7 'own' 0.184353 179 79 'der' 0.192406 5 2 'fastest' 0.192406 5 2 'skip:f 10' 0.194903 165 78 'program' 0.195467 177 84 'picture' 0.209617 41 21 'software' 0.210179 152 79 'search' 0.215905 158 85 'was' 0.225071 450 256 'url:11' 0.25286 14 9 'skip:l 10' 0.254718 81 54 'what' 0.259593 476 327 'reply-to:no real name:2**0' 0.260885 220 152 'und' 0.278497 3 2 'simply' 0.286594 159 125 'can' 0.306117 636 550 'skip:s 10' 0.307916 304 265 'have' 0.321612 822 764 'that' 0.322486 823 768 'download' 0.328841 122 117 'easy' 0.344293 108 111 "cd's" 0.348966 4 4 'downloads.' 0.352378 3 3 'url:asp' 0.370775 104 120 'favorite' 0.374678 41 48 'artists.' 0.374974 1 1 'sie' 0.374974 1 1 'from:no real name:2**0' 0.387368 334 414 'header:Reply-To:1' 0.623495 225 731 'url:www' 0.626688 509 1676 'hello!' 0.630363 2 7 'films' 0.662975 3 12 'x-mailer:microsoft outlook express 6.00.2600.0000' 0.672848 16 65 'now!' 0.709423 36 173 'subject: ?' 0.728352 1 6 'here' 0.7485 250 1460 'haben' 0.765605 0 1 'ihr' 0.765605 0 1 'ihren' 0.765605 0 1 'k?nnen' 0.765605 0 1 'header:Received:3' 0.799115 72 563 'subject:you' 0.799557 30 236 'movies' 0.860257 10 123 'sex' 0.872857 16 218 'url:226' 0.951802 1 47 'subject: ' 0.98059 2 220 'url:80' 0.991495 0 51 Message Stream Return-Path: Delivered-To: 556-pop-mail@diffenbach.org Received: (qmail 3831 invoked by uid 110); 14 Jan 2004 18:21:05 -0000 Delivered-To: 556-tp@diffenbach.org Received: (qmail 3818 invoked from network); 14 Jan 2004 18:21:00 -0000 Received: from unknown (HELO indatatek.com) (218.80.62.177) by ww25.hostica.com with SMTP; 14 Jan 2004 18:21:00 -0000 Message-ID: <17f001c3dab3$945ba120$185edf2f@msdfiewnwaibm> Reply-To: From: To: "AOL Users" Subject: Do you remember me ? Date: Wed, 14 Jan 2004 06:32:11 -0900 MIME-Version: 1.0 Content-Type: multipart/alternative; boundary="----=_NextPart_57D_EDF4_1F2939FB.857D4F4F" X-Priority: 3 X-MSMail-Priority: Normal X-Mailer: Microsoft Outlook Express 6.00.2600.0000 X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2600.0000 Hello! Untitled Document

Zu hier ist, was Sie haben können Zugang:

Unbegrenzte Downloads Der Musik-MP3.
Unbegrenzte DVD Film-Downloads.
Unbegrenzte Software-Programm-Downloads.
Alle Ihre Lieblingsliede und Künstler.
Schnellste Downloads der Musik MP3 auf dem Internet.
Brennen Sie Ihr eigenes CD's und DVD's.
Einfach, Suchwerkzeuge zu benutzen.
Links zu allen Ihren Lieblingslieden.
Gehen Sie hier für Zugang jetzt!

Einfach das kleine Tool runterladen, damit ihr völlig unendeckt bleibt und die Neusten Kinofilme runterladen könnt



Here is what you can have access to:

Unlimited MP3 Music Downloads.
Unlimited DVD Movies Downloads.
Unlimited Software Program Downloads.
All your favorite Songs and Artists.
Fastest MP3 Music downloads on the Internet.
Burn your own CD's and DVD's.
Easy to use Search Tools.
Lyrics to all your favorite Songs.
Sex Password's

Go Here for Access Now!

Download simply the small tool , so that you unendeckt completely remain and can download the newest motion picture films

All Message Tokens 143 unique tokens 'access' 'all' 'alle' 'allen' 'and' 'artists.' 'auf' 'benutzen.' 'bleibt' 'brennen' 'burn' 'can' 'cc:none' "cd's" 'completely' 'content-type:text/plain' 'damit' 'das' 'dem' 'der' 'die' 'document' 'download' 'downloads' 'downloads.' 'dvd' "dvd's." 'easy' 'eigenes' 'einfach' 'einfach,' 'f?r' 'fastest' 'favorite' 'films' 'for' 'from:addr:indatatek.com' 'from:addr:webmaster' 'from:no real name:2**0' 'gehen' 'haben' 'have' 'header:Date:1' 'header:From:1' 'header:MIME-Version:1' 'header:Message-ID:1' 'header:Received:3' 'header:Reply-To:1' 'header:Return-Path:1' 'header:Subject:1' 'header:To:1' 'hello!' 'here' 'hier' 'ihr' 'ihre' 'ihren' 'internet.' 'ist,' 'jetzt!' 'k?nnen' 'k?nnt' 'k?nstler.' 'kinofilme' 'kleine' 'links' 'lyrics' 'message-id:@msdfiewnwaibm' 'motion' 'movies' 'mp3' 'music' 'musik' 'musik-mp3.' 'neusten' 'newest' 'now!' 'own' "password's" 'picture' 'program' 'proto:http' 'remain' 'reply-to:addr:indatatek.com' 'reply-to:addr:webmaster' 'reply-to:no real name:2**0' 'runterladen' 'schnellste' 'search' 'sender:none' 'sex' 'sie' 'simply' 'skip:f 10' 'skip:l 10' 'skip:s 10' 'skip:s 20' 'small' 'software' 'songs' 'songs.' 'subject: ' 'subject: ' 'subject: ?' 'subject:remember' 'subject:you' 'that' 'the' 'to:' 'to:2**0' 'to:addr:diffenbach.org' 'to:addr:tp' 'to:name:aol users' 'tool' 'tool,' 'tools.' 'unbegrenzte' 'und' 'unendeckt' 'unlimited' 'untitled' 'url:100746lxxt00c01s' 'url:11' 'url:226' 'url:80' 'url:asp' 'url:com' 'url:dialergateway' 'url:did' 'url:enter' 'url:enterasp' 'url:warezonlinefree' 'url:websamba' 'url:www' 'use' 'v?llig' 'was' 'what' 'x-mailer:microsoft outlook express 6.00.2600.0000' 'you' 'your' 'zugang' 'zugang:' -------------- next part -------------- An embedded message was scrubbed... From: Subject: Do you remember me ? Date: Wed, 14 Jan 2004 10:32:11 -0500 Size: 2248 Url: http://mail.python.org/pipermail/spambayes-dev/attachments/20040114/00f60548/attachment-0001.mht From pje at telecommunity.com Wed Jan 14 19:27:58 2004 From: pje at telecommunity.com (Phillip J. Eby) Date: Wed Jan 14 19:28:04 2004 Subject: [spambayes-dev] "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" Message-ID: <5.1.1.6.0.20040114164136.00ac7ec0@mail.rapidsite.net> I ran across this paper today, that may be of some interest: http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf It specifically discusses: * Classification bias due to unbalanced training data * Classification bias due to clues that usually or always occur together * Other errors due to the way word frequencies in text differ from an ideal "Bayesian" model It appears that their training approach calls for manipulating token counts of the training documents, adjusting the count of a token 't' in a single message, roughly as follows (if my math is correct): # Adjust token counts for the power-law distribution of terms in normal text count[t] = log(count[t]+1) # Smooth noise caused by random correlations between frequently occurring # tokens and a particular classification count[t] *= log(totalMessagesTrained/numberOfMessagesContaining[t]) # Adjust for differences in token frequency probability based on size of # the message count[t] /= sqrt(sum([x*x for x in count.values()])) ...and then that's about where my math gives out, about halfway into their training adjustments. The math definitely calls for new data structures, though, in that IIUC we only keep raw token counts, without a separate count of "messages this token was seen in". 
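For concreteness, here is a rough sketch of those three adjustments in isolation. The names (raw_counts, doc_freq, n_messages) are invented for illustration, and this is just the paper's transform as summarized above, not anything the current classifier does:

    from math import log, sqrt

    def adjust_counts(raw_counts, doc_freq, n_messages):
        # raw_counts: {token: count of token in this one message}
        # doc_freq:   {token: number of training messages containing token}
        # n_messages: total number of training messages
        adjusted = {}
        for tok, count in raw_counts.items():
            # Dampen the power-law distribution of term counts.
            value = log(count + 1)
            # IDF-style smoothing of tokens that occur in many messages.
            value *= log(float(n_messages) / doc_freq.get(tok, 1))
            adjusted[tok] = value
        # Normalize away differences due to message length.
        norm = sqrt(sum(v * v for v in adjusted.values()))
        if norm:
            for tok in adjusted:
                adjusted[tok] /= norm
        return adjusted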
The next steps involved using the *opposite* classification of a message to determine classification weights (i.e. ham and spam weights) for the tokens, and normalizing the weights in order to counteract training sample size bias. I don't understand their math well enough to have any idea if their techniques are similar to the "experimental ham/spam imbalance adjustment" idea or not, or are things already done by the chi-square classifier. From tameyer at ihug.co.nz Wed Jan 14 19:43:40 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 14 19:43:48 2004 Subject: [spambayes-dev] auto-training w/ small db seems like a bad idea In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D486@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A4F@its-xchg4.massey.ac.nz> [Skip] > "small DB/non-edge training" may very well be a great idea. [Eli Stevens] > Is there a regime that simulates what you had been doing manually? The "non-edge" part is definitely there (the 'nonedge' regime). To keep the db small, you could either modify the regime to have bigger edges, or do something more fancy. It does usually keep it pretty small as is, though - on the testing set I've been using the average number of messages used for nonedge was 2.3/day, whereas 'perfect' (train on everything) was something like 30/day. BTW, if you're using the incremental testing stuff, you might want to use the versions in CVS - the sort+group.py script was improved by Tim a while back, and the scripts have also grown docstrings, which might help. =Tony Meyer From tameyer at ihug.co.nz Wed Jan 14 20:19:32 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 14 20:19:39 2004 Subject: [spambayes-dev] beta release time? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D4CD@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A50@its-xchg4.massey.ac.nz> > So, is there any reason we shouldn't do a first beta release > in the next week or so? Heh - getting bored now there's no new Python version to release, huh? Mark's rough plan (back on the 28th of Nov) was: > 1) Merge as above. > 2) Let things settle for a week or so, so poor CVS users > all get to suffer alone. > 3) Put together a binary from my current py2exe setup script, > which includes CVS and a number of sb_ programs. > 4) Announce this binary as a "binary-beta", calling it 0.75 or something. > 5) Any major bugs will presumably be part of the "binary framework", so > maybe 0.76 etc, depending on the damage. > 6) Move towards release 0.8 - this will be simultaneous > windows-binary and source. > 7) Move towards release 0.9 - aim for 4 weeks after 0.8, > addressing only bugs. > 8) Just *before* the 0.9 release, cut a new 1.0 branch. Release 0.9. > 9) Move towards 1.0, again aiming for 4 weeks, possibly with 2x release > candidates. Step 1 is done. Step 2 is well done. Step 3 is done. I think step 4 is done, too - at least Mark did announce an 'experimental' release of the binary, calling it 008.5 (the binary was at 008, the source was at a7). Either there wasn't much interest in this version, or there were few bugs, since we didn't hear all that much. So we can probably consider step 5 done. This leaves us at step 6. Mark's plan, of course, doesn't distinguish between 'alpha' and 'beta', just prerelease (<1.0) and release (1.0). I suppose that where there is "0.8" above, that could be version 009 of the installer and 1.0b1 of the source, "0.9" could be version 010 of the installer and 1.0b2 of the source. 
Then there's 011/1.0rc1 and 012/1.0rc2 and 013/1.0 (or is that 100/1.0?). Anyway, this is a long-winded way of me saying +1 to a release in the next week or so. If we continue along Mark's plan, this means the next one would be that time in Feb, and then a 1.0 in mid March, or so. Whatever we do, it would be nice to have the planned simultaneous source & binary releases. > If people have bugs/features they > want fixed/added before then, can we assemble a list? * The Version.py stuff (Mark was going to do this at some point). * Command-line options, particularly -d/-D/-p. It would be nice if this was consistent :) There's a patch (by Remi?) to make these consistent. Skip's 'any option' option should probably be consistently available, too, rather than just with sb_filter. * I *almost* have the autoconfigure script working on both WinXP and Win98, and I'd like to finish that (I can probably do that this weekend). It's not actually used or advertised, but I'd like it in the release so that I can get people to test it, if they are willing. Recently, there's been a lot of discussion about training regimes, and what we encourage users to do. The web interface probably encourages train-on-everything at the moment; if we are going to change that to encourage nonedge or balanced training, then I think that shouldn't be done between beta releases, so would have to be included. =Tony Meyer From tameyer at ihug.co.nz Wed Jan 14 20:26:47 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 14 20:26:52 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D498@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467783E@its-xchg4.massey.ac.nz> > I've also been kicking around some auto-training ideas hoping for time > to try them. One idea I had was based on a "sliding non-edge" scale. > You would set a max imbalance, say 2:1, beyond which you would train > on everything on the low side. > As your imbalance falls back below the maximum, auto-train > would start skipping the "edge" messages with near perfect > classification scores. The closer you get to a perfect 1:1 > balance, the closer to the cutoff score the message would > need to be before it would get auto-trained. Anyone see any > obvious holes in this idea? I tried almost this with the incremental regime, using a maximum of 2::1 or 1::2. It did pretty consistently worse than the basic nonedge regime. The only difference is that I didn't choose which messages to use if an imbalance would be created. The idea was basically to do nonedge, except if there was an imbalance, and then only train messages that move the balance closer to 1::1. The balanced TOE you described (in a later message) is also similar to a test I did (I called it 'balanced_perfect'). Again, the difference is in the selection of which messages to use when there is an imbalance (I use the first ones that come along, whereas you choose based on the score). Basically any regime with which I tried using this method to keep the database balanced did worse than just letting it go as normal. As well as the 2::1/1::2, I tried the perfect regime with 3::1 and 2::3, and that was better, but still not as good as just the regular regime. If I have time over the weekend, I'll try and come up with a different self-balancing regime and try that (maybe along these lines, where the messages to ignore are chosen based on score). 
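A sketch of that sliding threshold, for anyone who wants to experiment with it. The linear interpolation, the 0.20/0.90 cutoffs and the function itself are illustrative guesses, not the code behind the incremental-regime numbers above:

    def should_autotrain(score, is_spam, nham, nspam,
                         ham_cutoff=0.20, spam_cutoff=0.90, max_ratio=2.0):
        # score: classifier score for the message (0.0 hammy .. 1.0 spammy)
        # is_spam: the category we would train it as
        mine, other = (nspam, nham) if is_spam else (nham, nspam)
        ratio = float(other + 1) / (mine + 1)   # how under-represented we are
        if ratio >= max_ratio:
            return True                         # low side: train everything
        # 0.0 at perfect balance (strictest), approaching 1.0 near max_ratio.
        frac = max(0.0, (ratio - 1.0) / (max_ratio - 1.0))
        if is_spam:
            edge = spam_cutoff + frac * (1.0 - spam_cutoff)
            return score <= edge                # skip near-perfect spam
        edge = ham_cutoff * (1.0 - frac)
        return score >= edge                    # skip near-perfect ham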
=Tony Meyer From ta-meyer at ihug.co.nz Wed Jan 14 20:43:31 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 14 20:43:39 2004 Subject: [spambayes-dev] Outlook plug-in & Windows 95 Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677840@its-xchg4.massey.ac.nz> Does anyone know whether the binary release of the plug-in works with Windows95 (assuming that the version of Outlook is modern enough)? Our website is inconsistent: the FAQ has that it does, but the Windows page has that it doesn't. =Tony Meyer From tim.one at comcast.net Wed Jan 14 20:50:11 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 20:50:29 2004 Subject: [spambayes-dev] Outlook plug-in & Windows 95 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677840@its-xchg4.massey.ac.nz> Message-ID: [Tony Meyer] > Does anyone know whether the binary release of the plug-in works with > Windows95 (assuming that the version of Outlook is modern enough)? I don't recall anyone on Win95 saying so one way or the other. I don't have access to Win95 myself. > Our website is inconsistent: the FAQ has that it does, but the > Windows page has that it doesn't. I don't know of "a reason" it wouldn't work on Win95, so, if nobody else does either, I vote we change the Windows page, then change it back if someone pops up with a convincing claim that it doesn't work on Win95. From listsub at wickedgrey.com Wed Jan 14 20:53:52 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Wed Jan 14 20:54:45 2004 Subject: [spambayes-dev] Another incremental training idea... References: <1ED4ECF91CDED24C8D012BCF2B034F130467783E@its-xchg4.massey.ac.nz> Message-ID: <4005F2B0.6080808@wickedgrey.com> Tony Meyer wrote: > Kenny Pitt wrote: >>I've also been kicking around some auto-training ideas hoping for time >>to try them. One idea I had was based on a "sliding non-edge" scale. >>You would set a max imbalance, say 2:1, beyond which you would train >>on everything on the low side. >>As your imbalance falls back below the maximum, auto-train >>would start skipping the "edge" messages with near perfect >>classification scores. The closer you get to a perfect 1:1 >>balance, the closer to the cutoff score the message would >>need to be before it would get auto-trained. Anyone see any >>obvious holes in this idea? >> > > I tried almost this with the incremental regime, using a maximum of 2::1 or > 1::2. It did pretty consistently worse than the basic nonedge regime. The > only difference is that I didn't choose which messages to use if an > imbalance would be created. The idea was basically to do nonedge, except if > there was an imbalance, and then only train messages that move the balance > closer to 1::1. It sounds like you are saying that non-edge messages on the heavy side were not trained. It seems that would be a key difference. Was that the case in your test? Eli From skip at pobox.com Wed Jan 14 20:57:07 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 20:57:31 2004 Subject: [spambayes-dev] beta release time? In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A50@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D4CD@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F13026F2A50@its-xchg4.massey.ac.nz> Message-ID: <16389.62323.158208.437710@montanaro.dyndns.org> Tony> Skip's 'any option' option should probably be consistently Tony> available, too, rather than just with sb_filter. Yeah, I'll try to get that finished off tonight or tomorrow. 
Skip From tim.one at comcast.net Wed Jan 14 21:28:56 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 21:28:58 2004 Subject: [spambayes-dev] "Tackling the Poor Assumptions of Naive Bayes TextClassifiers" In-Reply-To: <5.1.1.6.0.20040114164136.00ac7ec0@mail.rapidsite.net> Message-ID: [Phillip J. Eby] > I ran across this paper today, that may be of some interest: > > http://haystack.lcs.mit.edu/papers/rennie.icml03.pdf It is interesting (I've seen it before), but what this project does is so far removed from a classical NBC (Naive Bayesian Classifier) that it's unclear whether any of it can apply directly. Interesting ideas for research, though. Just a few comments: > ... > # Adjust token counts for the power-law distribution of terms in > normal text count[t] = log(count[t]+1) We treat documents as a set of words, not a bag, because testing said treating as a set worked better. It's possible that using a log gimmick would work better still (that's somewhere between "set" and "bag"), but it wasn't tried. (IOW, we're not using what the paper calls a multinomial model.) > ... > The math definitely calls for new data structures, though, in that > IIUC we only keep raw token counts, without a separate count of > "messages this token was seen in". Nope, *because* we treat a msg as a set of features, the ham count we store for a token is equal to the number of distinct messages that contained the token, and likewise for the spam count. > The next steps involved using the *opposite* classification of a > message to determine classification weights (i.e. ham and spam > weights) for the tokens, and normalizing the weights in order to > counteract training sample size bias. I don't understand their math > well enough to have any idea if their techniques are similar to the > "experimental ham/spam imbalance adjustment" idea or not, or are > things already done by the chi-square classifier. No, they've got nothing in common, and *this* part wouldn't do us any good at all. An NBC is an N-way classifier, and as the paper says of this part: In contrast, CNB estimates parameters using data from all classes *except* c. We think CNB's estimates will be more effective because each uses a more even amount of training data per class, which will lessen the bias in the weight estimates. If N is substantially larger than 2, then basing the computations for a specific class on N-1 of the classes instead of on just one should indeed be "more even". But when N=2, as it is in this project, there's no difference -- it would just swap the roles of the two classes (IOW, N-1=1 when N=2, and that's all the math you need for this one ). One final point is that an NBC formally assumes statistical independence of the features it's scoring. The chi-squared combining method doesn't -- in fact, the thrust of our combining method is to *exploit* correlation, and our Unsure category is really the result of failing to find more correlation in one direction than in the other. That's not to say that correlation can't hurt us too, but it appears to help us far more often than it hurts us, and our combining method is based on detecting deviation from independence regardless. From tim.one at comcast.net Wed Jan 14 22:05:17 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 22:05:20 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16389.10247.928382.554729@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > ... 
> It does seem a bit arbitrary, but the system seems to suggest > we need to be slaves to balance and that's one way to get it. Cross validation testing is measuring random-time-order TOE performance, and we know imbalance hurts that. We also have overwhelming anecdotal evidence that extreme imbalance hurts users of the Outlook addin, and seemingly no matter how they train (but understanding that the Outlook UI makes it difficult to do any kind of training other than "train on everything in such-a-such set of folders, plus mistakes and unsures" -- so we end up with OL users training on 20,000 ham from the last 5 years, plus the 10 spam they got yesterday). I don't think we've seen enough to draw a conclusion about non-insane imbalance in other ways of training. Alex has presented the most evidence about longer-term effects of non-TOE, time-respecting training, and he seems to do OK under those despite that his imbalance gets worse over time (and certainly more imbalanced than I can tolerate in a variety of real-life ad hoc training regimes). OTOH, that's only one corpus, and Alex is weird . From tim.one at comcast.net Wed Jan 14 22:13:45 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 22:13:49 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1074090145.21051.118.camel@anthem> Message-ID: [Barry Warsaw] > ... > A generalization might be to score each attachment (or possibly just > each message/rfc822 type attachment) separately. Then choose an > algorithm for combining the scores, e.g. outer-only, inner-only, > combined, etc. That should simplify things . Or you could upgrade to Outlook: I don't think we have any real idea which attachments we do and don't get back from Outlook when we synthesize a plain-text message for your picky email parser to chew on ("standards" -- what a stupid idea that was ), but I know for a fact that we *don't* get the body of messages attached to things I get from Mailman in my capacity as list admin. So I routinely train on Mailman-wrapped spam and ham, meaning that I've trained on a grand total of about two of them, and all wrapped msgs from Mailman have scored 0% for me thereafter. Something to note: my personal classifier is using the experimental bigrams gimmick, and bigram Mailmanisms like Confirmation succeeded list administrator, list posting: List: PSF-Board@python.org Reason: Post following mailing act like strong lexical fingerprints for Mailman-generated administrivia, never appearing in ham or spam other than the Mailman stuff. This is one clear way in which bigrams can generate a killer-strong collection of hapaxes sufficient to nail an entire large class of messages from just one training example. Of course, that also sets me up for a spectacularly bad false negative someday. From tim.one at comcast.net Wed Jan 14 22:25:30 2004 From: tim.one at comcast.net (Tim Peters) Date: Wed Jan 14 22:26:22 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: [Kenny Pitt] > My description applies to auto-balancing of "train on mistakes and > unsures" instead of "train on everything" or "train on almost > everything". The algorithm could easily be reversed to do TOE where > there are no configured edge thresholds. Doing TOAE effectively would > probably require your adjustment. > > For mistake-based training, the idea is that as long as my balance is > very close to 1:1, I'm happy to train only on the messages that I > manually reclassify because of mistakes and unsures. 
If that > mistake-based training causes an imbalance then the auto-balancer > kicks in with an edge threshold very close to the classifier cutoff > so that only the worst-scoring messages are trained. As the > imbalance worsens, the edge threshold is dynamically adjusted as > needed to train on more messages and try to push the balance back > towards 1:1. FWIW, no matter which training strategy I decided to experiment with in day-to-day Outlook use, that's the one I always ended up doing: training on Mistakes and Unsures, but forcing balance every few days by tossing in the worst-scoring msgs in the under-represented category. That's worked great for me in real life, with unigrams for about a year, and again now with bigrams but for less than a month. Alas(?!), I'm getting a lot less spam than I used to -- since Christmas Eve of 2003 (when I started saving all my email), I've only gotten 1834 of the beasties, less than 100 per day. It used to be well over 200 a day. Maybe the photos of my penis I sent out convinced spammers there's no point in trying to sell snow to an Eskimo . From skip at pobox.com Wed Jan 14 23:03:11 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 14 23:03:13 2004 Subject: [spambayes-dev] -o cmd line flag Message-ID: <16390.4351.392714.109043@montanaro.dyndns.org> I added the -o flag to the following scripts: sb_dbexpimp.py sb_imapfilter.py sb_mboxtrain.py sb_notesfilter.py sb_pop3dnd.py sb_server.py sb_upload.py sb_xmlrpcserver.py I can't really test them effectively since I don't use any of them, but the change was straightforward. I did try sb_server.py -o global:verbose:true and sb_server.py -o globals:verbose:true The first complained as it should about "global" not being a valid section name. Skip From listsub at wickedgrey.com Thu Jan 15 03:37:42 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Thu Jan 15 03:36:51 2004 Subject: [spambayes-dev] Another incremental training idea... References: Message-ID: <016401c3db42$f3a64330$6401a8c0@kane> Tim Peters wrote: > > Alas(?!), I'm getting a lot less spam than I used to -- since > Christmas Eve of 2003 (when I started saving all my email), I've only > gotten 1834 of the beasties, less than 100 per day. It used to be > well over 200 a day. I've 540 spam on my personal account since June, 2002 and 365 since May 2003 on my list subscription account. Until recently, it was _highly_ repetitive - I suspect my corpus is atypical for people inclined to do test runs (hence my interest). > Maybe the photos of my penis I sent out > convinced spammers there's no point in trying to sell snow to an > Eskimo . Tim, there are impressionable young classifiers present! Imagine what training such vile filth as ham will do to Skip's minimalist hammie.db! Viagra'ly yrs, Eli -- Give a man some mud, and he plays for a day. Teach a man to mud, and he plays for a lifetime. WickedGrey.com uses SpamBayes on incoming email: http://spambayes.sourceforge.net/ -- From tdickenson at geminidataloggers.com Thu Jan 15 06:48:26 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Thu Jan 15 06:48:29 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: Message-ID: <200401151148.26615.tdickenson@geminidataloggers.com> On Thursday 15 January 2004 03:05, Tim Peters wrote: > [Skip Montanaro] > > > ... > > It does seem a bit arbitrary, but the system seems to suggest > > we need to be slaves to balance and that's one way to get it. 
> > Cross validation testing is measuring random-time-order TOE performance, > and we know imbalance hurts that. Ive finally got the cross validation tools working here, and the first thing I looked at was imbalance. My normal training set is currently 14k hams and 2k spams. This test compared that imbalance against three independantly selected balanced sets with 2k of both. If Im reading this right, my 7:1 imbalance doesnt hurt me. filename: unbal bal1 bal2 bal3 ham:spam: 14560:1992 1992:1992 1992:1992 1992:1992 fp total: 0 0 1 0 fp %: 0.00 0.00 0.05 0.00 fn total: 12 6 8 6 fn %: 0.60 0.30 0.40 0.30 unsure t: 102 21 23 29 unsure %: 0.62 0.53 0.58 0.73 real cost: $32.40 $10.20 $22.60 $11.80 best cost: $27.60 $7.00 $9.80 $8.60 h mean: 0.11 0.23 0.30 0.32 h sdev: 1.89 2.47 3.46 3.26 s mean: 96.93 99.06 99.04 99.02 s sdev: 12.11 6.88 6.98 7.21 mean diff: 96.82 98.83 98.74 98.70 k: 6.92 10.57 9.46 9.43 -- Toby Dickenson From skip at pobox.com Thu Jan 15 08:50:22 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Jan 15 08:50:34 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <200401151148.26615.tdickenson@geminidataloggers.com> References: <200401151148.26615.tdickenson@geminidataloggers.com> Message-ID: <16390.39582.295190.923588@montanaro.dyndns.org> Toby> If Im reading this right, my 7:1 imbalance doesnt hurt me. Toby> filename: unbal bal1 bal2 bal3 Toby> ham:spam: 14560:1992 1992:1992 Toby> 1992:1992 1992:1992 Toby> fp total: 0 0 1 0 Toby> fp %: 0.00 0.00 0.05 0.00 Toby> fn total: 12 6 8 6 Toby> fn %: 0.60 0.30 0.40 0.30 Toby> unsure t: 102 21 23 29 Toby> unsure %: 0.62 0.53 0.58 0.73 Toby> real cost: $32.40 $10.20 $22.60 $11.80 Toby> best cost: $27.60 $7.00 $9.80 $8.60 Toby> h mean: 0.11 0.23 0.30 0.32 Toby> h sdev: 1.89 2.47 3.46 3.26 Toby> s mean: 96.93 99.06 99.04 99.02 Toby> s sdev: 12.11 6.88 6.98 7.21 Toby> mean diff: 96.82 98.83 98.74 98.70 Toby> k: 6.92 10.57 9.46 9.43 It doesn't seem to have a negative effect on false positives, but it looks like you will get roughly double the number of false negatives and 4-5x as many unsures. Skip From tdickenson at geminidataloggers.com Thu Jan 15 09:52:00 2004 From: tdickenson at geminidataloggers.com (Toby Dickenson) Date: Thu Jan 15 09:52:06 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16390.39582.295190.923588@montanaro.dyndns.org> References: <200401151148.26615.tdickenson@geminidataloggers.com> <16390.39582.295190.923588@montanaro.dyndns.org> Message-ID: <200401151452.00727.tdickenson@geminidataloggers.com> On Thursday 15 January 2004 13:50, Skip Montanaro wrote: > Toby> If Im reading this right, my 7:1 imbalance doesnt hurt me. > > Toby> filename: unbal bal1 bal2 bal3 > Toby> ham:spam: 14560:1992 1992:1992 > Toby> 1992:1992 1992:1992 > Toby> fp total: 0 0 1 0 > Toby> fp %: 0.00 0.00 0.05 0.00 > Toby> fn total: 12 6 8 6 > Toby> fn %: 0.60 0.30 0.40 0.30 > Toby> unsure t: 102 21 23 29 > Toby> unsure %: 0.62 0.53 0.58 0.73 > It doesn't seem to have a negative effect on false positives, but it looks > like you will get roughly double the number of false negatives and 4-5x as > many unsures. 4x as many unsures, out of a total population that is 4x larger. so no overall percentage change. Am I reading that right? -- Toby Dickenson From kennypitt at hotmail.com Thu Jan 15 09:59:23 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Jan 15 10:00:13 2004 Subject: [spambayes-dev] Version numbering (was: beta release time?) 
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A50@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> If people have bugs/features they >> want fixed/added before then, can we assemble a list? > > * The Version.py stuff (Mark was going to do this at some point). I think I was the first to suggest this, so I'd be happy to make the changes now that I'm on the developer list. However, I'll need input about what you guys are looking for in a versioning scheme. -- Kenny Pitt From kennypitt at hotmail.com Thu Jan 15 10:04:21 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Thu Jan 15 10:05:48 2004 Subject: [spambayes-dev] Outlook plug-in & Windows 95 In-Reply-To: Message-ID: Tim Peters wrote: > [Tony Meyer] >> Does anyone know whether the binary release of the plug-in works with >> Windows95 (assuming that the version of Outlook is modern enough)? > > I don't recall anyone on Win95 saying so one way or the other. I > don't have access to Win95 myself. > >> Our website is inconsistent: the FAQ has that it does, but the >> Windows page has that it doesn't. > > I don't know of "a reason" it wouldn't work on Win95, so, if nobody > else does either, I vote we change the Windows page, then change it > back if someone pops up with a convincing claim that it doesn't work > on Win95. There isn't that much functional difference between Win95 and Win98, and we do consistently claim to support Win98. +1 to claiming support for Win95 as well until someone reports otherwise. -- Kenny Pitt From skip at pobox.com Thu Jan 15 10:16:37 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Jan 15 10:16:44 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <200401151452.00727.tdickenson@geminidataloggers.com> References: <200401151148.26615.tdickenson@geminidataloggers.com> <16390.39582.295190.923588@montanaro.dyndns.org> <200401151452.00727.tdickenson@geminidataloggers.com> Message-ID: <16390.44757.769823.36515@montanaro.dyndns.org> Toby> 4x as many unsures, out of a total population that is 4x Toby> larger. so no overall percentage change. Am I reading that right? Ah yes. Sorry... Skip From Greg at Kavalec.com Thu Jan 15 11:18:02 2004 From: Greg at Kavalec.com (Greg Kavalec) Date: Thu Jan 15 11:18:29 2004 Subject: [spambayes-dev] Newbie Message-ID: <07f001c3db83$394e3010$2d02a8c0@bswa.com> Good morning I have just joined up and am very green on the whole Bayesian statistics arena, so be kind. My dumb question... The following kinds of subjects often get past SB... why get Vi`agra when you can get super Vi-agra p.r.i.c.e.s are v.a.l.i.d until 16th of J.a.n.u.a.r.y Can SB do its magic based on a modified text, i.e. all non-alpha removed? whygetViagrawhenyoucangetsuperViagra pricesarevaliduntil16thofJanuary Or is this already happening? Masalaam, G.Waleed Kavalec ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ We live in a world where nothing is impossible, except peace and happiness. ~~~ From tim.one at comcast.net Thu Jan 15 11:48:25 2004 From: tim.one at comcast.net (Tim Peters) Date: Thu Jan 15 11:48:42 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <200401151148.26615.tdickenson@geminidataloggers.com> Message-ID: [Toby Dickenson] > Ive finally got the cross validation tools working here, and the > first thing I looked at was imbalance. My normal training set is > currently 14k hams and 2k spams. This test compared that imbalance > against three independantly selected balanced sets with 2k of both. 
Well, you're varying both balance and total number of messages in these tests, so it's hard to pin down the hypothesis it's really testing. To test only balance, and if you've got no more than 2K spam, then tests of, e.g., 1900:100, 1800:200, 1700:300, ..., 300:1700, 200:1800, 100:1900 would vary balance while keeping total # of messages fixed. The cv tester's --ham-keep and --spam-keep options can be used to automatically pick random subsets of given sizes, btw, without needing to rearrange your data files. > If Im reading this right, my 7:1 imbalance doesnt hurt me. > > filename: unbal bal1 bal2 bal3 > ham:spam: 14560:1992 1992:1992 > 1992:1992 1992:1992 > ... > fp %: 0.00 0.00 0.05 0.00 > .. > fn %: 0.60 0.30 0.40 0.30 > ... > unsure %: 0.62 0.53 0.58 0.73 Whatever this is really testing, the FN *percentage* is worst in the first column, and the Unsure percentage isn't winning there . Since you kept the total # of spam fixed across all 4 tests, and FN are a subset of spam, a decrease in FN percentage is also a decrease in FN absolute count. IOW, if you had trained on less ham, your results show that you would have gotten fewer false negatives (half to two-thirds of the number you got in the first column), despite that you train on some 12,000 less ham after the first column: fn total: 12 6 8 6 Those columns are all "out of 1992"; no ham *can* be a FN. > real cost: $32.40 $10.20 $22.60 $11.80 > best cost: $27.60 $7.00 $9.80 $8.60 Those two are just misleading when the total # of msgs changes across runs. > h mean: 0.11 0.23 0.30 0.32 > h sdev: 1.89 2.47 3.46 3.26 > s mean: 96.93 99.06 99.04 99.02 > s sdev: 12.11 6.88 6.98 7.21 > mean diff: 96.82 98.83 98.74 98.70 > k: 6.92 10.57 9.46 9.43 The first column shows a much fuzzier idea of what spam is (spam sdev is much larger than in the other columns), and k is much smaller -- k is the number such that hmean + k*hsdev == smean - k*ssdev, and is a measure of population separation. Picture the limit: you train on all ham and no spam. Then no message can get classified as spam (no token "looks spammy"). You'll get lots of FN, and at best a spam will score as Unsure. The classifier's idea of spam is extremely fuzzy. The good news is that you'll get no FP. Add 1 spam to the training data, and the situation improves, but probably not by a whole lot. Etc. It's possible that best results will be achieved at some (non-insane) ratio other than 1:1, but almost certain that, if so, the best ratio will vary across specific email mix. I *think* you're doing better at 7:1 than most people would. An FN rate of 0.6% is low enough that I wouldn't bother to change anything in my personal classifier. For truly high-volume application, though, the defference between 0.6% and 0.4% actually is the 50% it looks like it is . From popiel at wolfskeep.com Thu Jan 15 12:25:05 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Jan 15 12:25:23 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Tim Peters" of "Wed, 14 Jan 2004 22:05:17 EST." References: Message-ID: <20040115172505.47EA72DF18@cashew.wolfskeep.com> In message: "Tim Peters" writes: >and Alex is weird . I prefer 'eccentric'. Or perhaps 'goofy'. 
;-) - Alex From skip at pobox.com Thu Jan 15 12:52:33 2004 From: skip at pobox.com (Skip Montanaro) Date: Thu Jan 15 12:52:39 2004 Subject: [spambayes-dev] Newbie In-Reply-To: <07f001c3db83$394e3010$2d02a8c0@bswa.com> References: <07f001c3db83$394e3010$2d02a8c0@bswa.com> Message-ID: <16390.54113.116509.697678@montanaro.dyndns.org> Greg> The following kinds of subjects often get past SB... Greg> why get Vi`agra when you can get super Vi-agra Greg> p.r.i.c.e.s are v.a.l.i.d until 16th of J.a.n.u.a.r.y Greg> Can SB do its magic based on a modified text, i.e. all non-alpha Greg> removed? Greg> whygetViagrawhenyoucangetsuperViagra Greg> pricesarevaliduntil16thofJanuary Greg> Or is this already happening? Yes, it could. No, it's not. ;-) I implemented an experimental "remove-punctuation" config variable which did the obvious thing. I've since deleted it from my copy of the code base, but it wouldn't be hard to reimplement if desired. If your training database is fairly mature (trained on enough samples of ham and spam) it turns out to not really help much because SpamBayes actually does a very good job based on other clues it finds in the message. Note that you don't really want to remove the whitespace because then you get a single long token for each subject instead of a series of separate words. Skip From popiel at wolfskeep.com Thu Jan 15 13:03:39 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Thu Jan 15 13:03:43 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message from "Tim Peters" of "Wed, 14 Jan 2004 22:05:17 EST." References: Message-ID: <20040115180339.3F77D2DF18@cashew.wolfskeep.com> In message: "Tim Peters" writes: > >I don't think we've seen enough to draw a conclusion about non-insane >imbalance in other ways of training. Alex has presented the most evidence >about longer-term effects of non-TOE, time-respecting training, and he seems >to do OK under those despite that his imbalance gets worse over time (and >certainly more imbalanced than I can tolerate in a variety of real-life ad >hoc training regimes). FWIW, my FN rate has dropped back to what I consider normal (the Nigerian business-deals and maybe 3-6 others per day) with my latest retraining, even though I remain at 50:1 trained (with TOAE). I assume I got enough counter-examples to offset the mistake I made, since I haven't gone through to find the mistake yet (still working on updating my installation to latest code and separating my mail into monthly directories). - Alex From nobody at spamcop.net Thu Jan 15 15:00:14 2004 From: nobody at spamcop.net (Seth Goodman) Date: Thu Jan 15 15:00:18 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <200401151452.00727.tdickenson@geminidataloggers.com> Message-ID: > > Toby> If Im reading this right, my 7:1 imbalance doesnt hurt me. > > > > Toby> filename: unbal bal1 bal2 bal3 > > Toby> ham:spam: 14560:1992 1992:1992 > > Toby> 1992:1992 1992:1992 > > Toby> fp total: 0 0 1 0 > > Toby> fp %: 0.00 0.00 0.05 0.00 > > Toby> fn total: 12 6 8 6 > > Toby> fn %: 0.60 0.30 0.40 0.30 > > Toby> unsure t: 102 21 23 29 > > Toby> unsure %: 0.62 0.53 0.58 0.73 > > > [Skip Montanaro] > > It doesn't seem to have a negative effect on false positives, > > but it looks > > like you will get roughly double the number of false negatives > > and 4-5x as > > many unsures. > > [Toby Dickenson] > 4x as many unsures, out of a total population that is 4x larger. > so no overall > percentage change. Am I reading that right? 
Yes, but if I'm reading it right, the fn's are about double as a percentage. This looks like the case since your nham didn't change across the four data sets, so Skip's original observation on fn's increasing 2X seems right. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From nobody at spamcop.net Thu Jan 15 15:03:47 2004 From: nobody at spamcop.net (Seth Goodman) Date: Thu Jan 15 15:03:50 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <200401151452.00727.tdickenson@geminidataloggers.com> Message-ID: > > Toby> If Im reading this right, my 7:1 imbalance doesnt hurt me. > > > > Toby> filename: unbal bal1 bal2 bal3 > > Toby> ham:spam: 14560:1992 1992:1992 > > Toby> 1992:1992 1992:1992 > > Toby> fp total: 0 0 1 0 > > Toby> fp %: 0.00 0.00 0.05 0.00 > > Toby> fn total: 12 6 8 6 > > Toby> fn %: 0.60 0.30 0.40 0.30 > > Toby> unsure t: 102 21 23 29 > > Toby> unsure %: 0.62 0.53 0.58 0.73 > > > [Skip Montanaro] > > It doesn't seem to have a negative effect on false positives, > > but it looks > > like you will get roughly double the number of false negatives > > and 4-5x as > > many unsures. > > [Toby Dickenson] > 4x as many unsures, out of a total population that is 4x larger. > so no overall > percentage change. Am I reading that right? Correction to previous post: Yes, but if I'm reading it right, the fn's are about double as a percentage. This looks like the case since your nham didn't change across the four data sets, ^nspam, not nham so Skip's original observation on fn's increasing 2X seems right. -- Seth Goodman Humans: off-list replies to sethg [at] GoodmanAssociates [dot] com Spambots: disregard the above From ta-meyer at ihug.co.nz Thu Jan 15 18:07:51 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Thu Jan 15 18:07:58 2004 Subject: [spambayes-dev] Outlook plug-in & Windows 95 In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D80D@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677847@its-xchg4.massey.ac.nz> [Tim] > I don't recall anyone on Win95 saying so one way or the > other. I don't have access to Win95 myself. Does anyone? > I don't know of "a reason" it wouldn't work on Win95, so, if > nobody else does either, I vote we change the Windows page, > then change it back if someone pops up with a convincing > claim that it doesn't work on Win95. [Kenny] > There isn't that much functional difference between Win95 > and Win98, and we do consistently claim to support Win98. > +1 to claiming support for Win95 as well until someone reports > otherwise. Ok then, I'll update the windows page. =Tony Meyer From sl6dt at cc.usu.edu Fri Jan 16 09:28:19 2004 From: sl6dt at cc.usu.edu (sl6dt) Date: Fri Jan 16 09:30:45 2004 Subject: [spambayes-dev] Articles about bayesian filters Message-ID: <4007CC83@webster.usu.edu> Does anyone know any good articles about bayesian filters for spam? John From tim at fourstonesExpressions.com Fri Jan 16 09:33:54 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Fri Jan 16 09:34:05 2004 Subject: [spambayes-dev] Articles about bayesian filters In-Reply-To: <4007CC83@webster.usu.edu> References: <4007CC83@webster.usu.edu> Message-ID: A google on bayesian filter spam will turn up more than you'll ever want to read. There are some references on our site, spambayes.sourceforge.net as well. On Fri, 16 Jan 2004 07:28:19 -0700, sl6dt wrote: > Does anyone know any good articles about bayesian filters for spam? 
-- Tim Stone From jhs at oes.co.th Sun Jan 18 03:05:38 2004 From: jhs at oes.co.th (Jason Smith) Date: Sun Jan 18 03:05:43 2004 Subject: [spambayes-dev] training from IMAP folder? (with patch) Message-ID: <200401181505.38299.jhs@oes.co.th> Hello. I guess I'll explain in reverse chronological order, so you can stop reading when you get bored. This is a patch to allow sb_mboxtrain.py to train from an IMAP folder, similar to its behavior when training from Maildir. I have tested it on Linux against courier-imapd (Debian Woody) and cyrus-imapd-2.3.2 compiled from source. Currently, it only supports plain-text login. It sould be considered somewhat quick-and-dirty, as I read the RFC and implemented it with imaplib in just one evening. Unfortunately, it is against the latest release and not CVS because I cannot seem to access SF CVS at the moment. The reason I need this feature (as opposed to the IMAP filter) is to implement server-side spam filtering (using cyrus) and training which is intuitive for lay mail users. For the record, cyrus is a mail server which isolates the physical mail data from system users (i.e. the only access to mail is via IMAP/POP, unlike courier/maildir). Many deployments do not even have Unix user accounts on the machine. I have successfully integrated spambayes as an incoming filter using procmail much like the documentation. To train, users just need to drag missed spam to INBOX.Spam, and drag good messages to e.g. INBOX.Read. I want to run a cron job nightly to go through each user and train against their personal database. However, Cyrus uses a custom organization system for speed, and it's not decent to go mucking around /var/spool/cyrus by hand. Looks like the most effective way to do this is to access the mail through localhost IMAP. If this looks okay (or if somebody can suggest a better method), I am interested in making Debian packages as well as an implementation HOWTO as part of the new UserLinux project, since I will be rolling out a fairly large email site later this year. -- Jason Smith Open Enterprise Systems Bangkok, Thailand http://www.oes.co.th -------------- next part -------------- A non-text attachment was scrubbed... Name: mboxtrain-IMAP.diff Type: text/x-diff Size: 4537 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040118/b403db29/mboxtrain-IMAP.bin From jhs at oes.co.th Mon Jan 19 05:34:45 2004 From: jhs at oes.co.th (Jason Smith) Date: Mon Jan 19 05:34:53 2004 Subject: [spambayes-dev] training from IMAP folder? (with patch) In-Reply-To: <200401181505.38299.jhs@oes.co.th> References: <200401181505.38299.jhs@oes.co.th> Message-ID: <200401191734.48563.jhs@oes.co.th> On Sunday 18 January 2004 15:05, Jason Smith wrote: > Unfortunately, it is against the latest > release and not CVS because I cannot seem to access SF CVS at the moment. Looks like sourceforge is back to "normal," so I played around and made this patch against current CVS. Hope that gets a little more attention :p -- Jason Smith Open Enterprise Systems Bangkok, Thailand http://www.oes.co.th -------------- next part -------------- A non-text attachment was scrubbed... 
Name: mboxtrain.cvs-IMAP.diff Type: text/x-diff Size: 4712 bytes Desc: not available Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040119/35691816/mboxtrain.cvs-IMAP.bin From fms27 at cam.ac.uk Mon Jan 19 06:19:37 2004 From: fms27 at cam.ac.uk (Frank Stajano) Date: Mon Jan 19 06:19:11 2004 Subject: [spambayes-dev] RE: chance to write it up In-Reply-To: References: <5.2.0.9.1.20040116154236.0258ed28@localhost> Message-ID: <5.2.0.9.1.20040119111618.052a4db0@localhost> Dear Spambayes developers, I sent the following mail to Tim Peters who suggested that this list might include someone interested in this. The First Conference on Email and Anti-Spam (CEAS) Preliminary Call for Papers July 30, 31 and August 1, 2004 Mountain View, CA Immediately Follows AAAI 2004 http://www.ceas.cc In Cooperation with AAAI and IEEE Technical Committee on Security and Privacy >> I write to you personally just because I no longer subscribe to the >> spambayes mailing list (it was too high traffic for me). However I'm >> still using spambayes daily and loving it---would never go back to >> reading mail without it. >> >> A colleague of mine is on the program committee for this conference >> and just circulated a call for papers. I'm not involved in any way, >> but I think that if the spambayes crew wrote up this work and gave a >> demo, you could be in for a best paper award... or at least you >> should if there's any justice! ;-) > >Thanks for the info, Frank! I don't have time for this, but perhaps the >other developers do. Please send info to > > mailto:spambayes-dev@python.org > >(which is a different list than the spambayes list I imagine you're thinking >of). Frank (filologo disneyano) http://www-lce.eng.cam.ac.uk/~fms27/ From skip at pobox.com Mon Jan 19 10:44:13 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 19 10:44:20 2004 Subject: [spambayes-dev] lowest scoring message isn't always "best" one to train on Message-ID: <16395.64333.962880.239404@montanaro.dyndns.org> Based on a suggestion by Eli Stevens, over the weekend I decided to try burning some electrons to decide which message to train on next given a pile of unsures and false negatives (I haven't got any false positives lying around). The script I came up with takes three inputs: a pile of hams, a pile of spams, and a pile of unsures/fn's. It trains on the hams and spams, then for each message in the unsure/fn pile does this:

    for msg in getmbox(unsures):
        h.train(msg)
        newspams = 0
        for trial in getmbox(unsures):
            prob = cls.spamprob(trial)
            if prob > spam_cutoff:
                newspams += 1
        h.untrain(msg)
        print msg['message-id'], newspams

As you can see, since it's O(n*n) in the number of unsures, it's not a script to be run casually with a large number of unsures (alas, that's how I've been running it). I have a little more code in there to avoid scoring messages which already score as spam and to limit a scoring run to the best candidates from a previous run, but it can still take a while to run. It pointed out something interesting, however: if you want the most bang for your buck (push the most messages into the spam region), the best message to train on often seems to be a message with a fairly high score.
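Restated as a self-contained function, with the skip-messages-that-already-score-as-spam shortcut folded in. The classifier object and its train/untrain/spamprob methods are hypothetical stand-ins for whatever hammie interface the real script uses:

    def rank_candidates(classifier, unsures, spam_cutoff=0.90):
        # Return (pushed_count, message) pairs, best training candidates first.
        candidates = [m for m in unsures
                      if classifier.spamprob(m) <= spam_cutoff]
        results = []
        for msg in candidates:
            classifier.train(msg, is_spam=True)          # trial training
            pushed = sum(1 for other in candidates
                         if other is not msg
                         and classifier.spamprob(other) > spam_cutoff)
            classifier.untrain(msg, is_spam=True)        # undo the trial
            results.append((pushed, msg))
        results.sort(key=lambda pair: pair[0], reverse=True)
        return results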
Here's a snippet of output from the start of my latest run: 0.032 4 0.321 <27412761818.707072765454907@python.org> 3 0.539 <4005795A000C4777@occmta11a.terra.com.mx> (added by postmaster@emailcluster.terra.com.mx) 5 0.872 <200401180345.i0I3jvCd013462@manatee.mojam.com> 3 0.682 6 0.869 <5-3w2y$o80o688x-5z0wp58v9h4hi@pm5mn> 6 0.846 <3$1fkv63$0-4u-sn04@otdgq.l1.43isz> 6 0.880 <192k46ax$r$lt@6bslldmd> 12 0.891 15 0.875 11 0.798 <2$-44$$2$mymd27@ulkkm64> 12 0.195 <20040118052804.NDHS11926.out009.verizon.net@terrapin> 10 Note that the first item has a very low spamprob itself, but of the bunch I displayed, the best ones to train on to push the most other spams into spam range all score around 0.8 to 0.9. (I currently have my cutoffs set at 0.1 and 0.9). I suspect this is because I'm selecting messages based on their similarity to lots of other messages in the unsure pile (many of which may already have fairly high unsure scores), so the score of the newly trained message is somewhat less unimportant than its similarity to other unsures. Skip P.S. As an aside, note the message-id for the third message above has "(added by postmaster@..."). I have seen annotations like that a few times. Is it still a valid message-id (from an rfc-2822 standpoint)? It seems like it would be a fairly objective feature to extract from messages. S From nobody at spamcop.net Mon Jan 19 12:20:31 2004 From: nobody at spamcop.net (Seth Goodman) Date: Mon Jan 19 12:20:39 2004 Subject: [spambayes-dev] lowest scoring message isn't always "best" one totrain on In-Reply-To: <16395.64333.962880.239404@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > Note that the first item has a very low spamprob itself, but of > the bunch I > displayed, the best ones to train on to push the most other spams > into spam > range all score around 0.8 to 0.9. ... I can add some anecdotal evidence to that. My manual training regime for coming up with a reduced training set involves iteratively training on the lowest scoring spam until all untrained spam scores above 90%. I've noticed that I also get the most shifting of untrained spam classifications from unsure to spam on the later messages I train on, that is, the ones with higher scores. My recollection is that things start to move much better when the spam I add to the training set is around 75% or higher. The low-scoring unsures do move a few other low-scoring unsures up in score, but I seem to get considerably more "action" out of the higher-scoring ones. Since I stop at 90%, I have no experience as to what cutoff is optimal. I like your concept of doing this explicitly. With a small number of unsures, as in a nightly training session, it would not take very long even though it is O(n^2). -- Seth Goodman replies to sethg [at] GoodmanAssociates [dot] com From kennypitt at hotmail.com Mon Jan 19 14:19:03 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Jan 19 14:19:51 2004 Subject: [spambayes-dev] lowest scoring message isn't always "best" onetotrain on In-Reply-To: Message-ID: Seth Goodman wrote: > [Skip Montanaro] >> Note that the first item has a very low spamprob itself, but of the >> bunch I displayed, the best ones to train on to push the most other >> spams into spam range all score around 0.8 to 0.9. ... > > ... I've noticed that I also get the most shifting of > untrained spam classifications from unsure to spam on the later > messages I train on, that is, the ones with higher scores. 
My > recollection is that things start to move much better when the spam I > add to the training set is around 75% or higher. The low-scoring > unsures do move a few other low-scoring unsures up in score, but I > seem to get considerably more "action" out of the higher-scoring > ones. Speaking theoretically with no evidence to back it up: It seems to me that this is an expected outcome. If you train on a single message, you've added only 1 to the spam count of each token. How much that raises the score of other messages depends both on the size of your current training set and on how similar other messages are to the one you trained. Messages that are similar to the message you choose to train on are probably also going to have similar initial scores. Pushing a message that is already close to the threshold into the spam region doesn't take much of an increase, but pushing a very low-scoring message over the threshold is much more difficult and a single message probably won't be enough to do it in many cases. Just because training a certain message pushes the most other messages into the spam region doesn't necessarily mean it represents the greatest improvement in the classifier. Chances are good that if I have a large number of unsures then I'm not going to stop training after only one message. If I'm going to train on N messages during the same training session, the order in which I train them isn't important. The key is to choose the smallest possible training *set* such that *all* the other unsure messages will be identified properly. Maybe a closer approximation to this would be to look for the message that causes the greatest increase in the mean spam score of the remaining messages. -- Kenny Pitt From tim.one at comcast.net Mon Jan 19 15:40:34 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 19 15:40:44 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <016401c3db42$f3a64330$6401a8c0@kane> Message-ID: [Tim Peters] >> Alas(?!), I'm getting a lot less spam than I used to -- since >> Christmas Eve of 2003 (when I started saving all my email), I've only >> gotten 1834 of the beasties, less than 100 per day. It used to be >> well over 200 a day. More about that: at least 6 distinct email addresses for me end up in my inbox, via 3 distinct POP3 accounts. The reduction is spam appears purely due to my MSN dialup account, which, until recently, appeared to do nothing to inhibit spam or viruses. Now it's doing a remarkably good job, and in fact better than my other two accounts. Those with Hotmail accounts may have noticed something similar over the past couple months (I have a Hotmail account too, which I *don't* use with spambayes, and the number of spam showing up there has dropped by more than a factor 10 over the (approximately) last month). [Eli Stevens (WG.c)] > I've 540 spam on my personal account since June, 2002 and 365 since > May 2003 on my list subscription account. Until recently, it was > _highly_ repetitive - I suspect my corpus is atypical for people > inclined to do test runs (hence my interest). That's appreciated! All email mixes are important to someone . 
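Kenny's alternative criterion above -- pick the candidate whose trial training most raises the mean spam score of the remaining unsures -- amounts to a one-function change against the same hypothetical classifier interface as the earlier sketch; Skip reports measuring exactly this delta mean in the next message:

    def mean_score_gain(classifier, msg, others):
        # How much does trial-training msg as spam raise the average score
        # of the other unsure messages?
        if not others:
            return 0.0
        before = sum(classifier.spamprob(m) for m in others) / len(others)
        classifier.train(msg, is_spam=True)
        after = sum(classifier.spamprob(m) for m in others) / len(others)
        classifier.untrain(msg, is_spam=True)
        return after - before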
From skip at pobox.com Mon Jan 19 16:03:59 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 19 16:04:10 2004 Subject: [spambayes-dev] lowest scoring message isn't always "best" onetotrain on In-Reply-To: References: Message-ID: <16396.17983.944455.129805@montanaro.dyndns.org> Kenny> Maybe a closer approximation to this would be to look for the Kenny> message that causes the greatest increase in the mean spam score Kenny> of the remaining messages. I've started calculating the delta mean as well as the number of messages pushed into spam territory. Just eyeballing a plot of just over 100 pairs of (mean diff, # new spams) suggests there's a weak correlation between the two variables. I'll probably play with it a bit more and check the script into the contrib section so other people can play with it. Skip From skip at pobox.com Mon Jan 19 16:24:00 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 19 16:24:10 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: <016401c3db42$f3a64330$6401a8c0@kane> Message-ID: <16396.19184.460558.772002@montanaro.dyndns.org> Tim> More about that: at least 6 distinct email addresses for me end up Tim> in my inbox, via 3 distinct POP3 accounts. The reduction is spam Tim> appears purely due to my MSN dialup account, which, until recently, Tim> appeared to do nothing to inhibit spam or viruses. Now it's doing Tim> a remarkably good job, and in fact better than my other two Tim> accounts. Ask them if they began using SpamBayes. Skip From tim.one at comcast.net Mon Jan 19 16:41:36 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 19 16:42:00 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <16396.19184.460558.772002@montanaro.dyndns.org> Message-ID: [Tim, sez his major reduction in total spam over the last couple months is entirely due to less spam on his MSN dialup account] [Skip Montanaro] > Ask them if they began using SpamBayes. Heh. I don't think so. Remember the link posted here to an article about the spam filter in Outlook 2003? The one that doesn't learn, and is identical for all Outlook 2003 users? This is "typical Microsoft", IMO: the first release of a thing is crap, but while everyone else is distracted with laughter, they relentlessly improve the thing. Between MSN and Hotmail, Microsoft has an inconceivably large collection of real-life data to work with, and I expect they're finally learning from it. What I remain confused about is how they intend to make Big Bux off it. From kennypitt at hotmail.com Mon Jan 19 17:11:29 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Mon Jan 19 17:12:17 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: Tim Peters wrote: > [Tim, sez his major reduction in total spam over the last couple > months is entirely due to less spam on his MSN dialup account] > > [Skip Montanaro] >> Ask them if they began using SpamBayes. > > Heh. I don't think so. Remember the link posted here to an article > about the spam filter in Outlook 2003? The one that doesn't learn, > and is identical for all Outlook 2003 users? This is "typical > Microsoft", IMO: the first release of a thing is crap, but while > everyone else is distracted with laughter, they relentlessly improve > the thing. Between MSN and Hotmail, Microsoft has an inconceivably > large collection of real-life data to work with, and I expect they're > finally learning from it. 
As I understood the article, the Outlook 2003 filter is actually a very well-trained Bayesian-type filter, and the MSN and Hotmail message flow almost certainly provided the data for that. The problem with the Outlook filter is that it isn't user-trainable. I wonder if Microsoft decided not to include that to avoid the accuracy problems that we often see reported when users mis-train the filter. Given their average user base, any accuracy issues would no doubt be blamed on Microsoft and not user error. Rumor has it that the MSN Explorer mail reading interface for MSN dialup accounts does, in fact, support user training. Can you confirm or deny? > ... What I remain confused about is how they > intend to make Big Bux off it. It may not be anything more complicated than stemming the tide of users choosing or switching to other ISP's that claim to provide better spam filtering. Then again, who can fathom the mind of the Microsoft machine? -- Kenny Pitt From tim at fourstonesExpressions.com Mon Jan 19 17:17:06 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Mon Jan 19 17:17:12 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: References: Message-ID: On Mon, 19 Jan 2004 17:11:29 -0500, Kenny Pitt wrote: >> ... What I remain confused about is how they >> intend to make Big Bux off it. > > It may not be anything more complicated than stemming the tide of users > choosing or switching to other ISP's that claim to provide better spam > filtering. Then again, who can fathom the mind of the Microsoft > machine? Overall, their corporate strategy appears to be much more defensive than anyone is accustomed to seeing. So, I'd say that's a good guess... disintermediation is the worst of M$ fears... to become irrelevant is unthinkable. -- Tim Stone From tim.one at comcast.net Mon Jan 19 18:26:54 2004 From: tim.one at comcast.net (Tim Peters) Date: Mon Jan 19 18:27:05 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: Message-ID: [Kenny Pitt] > As I understood the article, the Outlook 2003 filter is actually a > very well-trained Bayesian-type filter, and the MSN and Hotmail > message flow almost certainly provided the data for that. Yes, I'm sure they did. Spam on Hotmail went way down shortly before OL2003 was shipped -- and shortly thereafter went right back up again. I expect they were testing the static feature weights shipped with OL2003, and spammers quickly learned to out-wit those. > The problem with the Outlook filter is that it isn't user-trainable. Yup. > I wonder if Microsoft decided not to include that to avoid the > accuracy problems that we often see reported when users mis-train > the filter. Given their average user base, any accuracy issues > would no doubt be blamed on Microsoft and not user error. MS accepts blame very well : there's nobody at MS you can talk to about a complaint, unless you pay for the privilege, and vendors incorporating MS stuff as OEM code are stuck with support themselves. I more expect that developing the UI code for OL was intractable in the time they had. I don't think MS would ship something as complex as, e.g., the SpamBayes UI, in a consumer product, and UI code is darned hard regardless. Or it could indeed be that real-life feedback from MSN 8 wasn't as good as anticipated. > Rumor has it that the MSN Explorer mail reading interface for MSN > dialup accounts does, in fact, support user training. Can you > confirm or deny? Not personally -- I never installed MSN 8. 
I retain that dialup account simply because it gets me a solid nationwide phone network for dialup access when I'm on the road. It's the "ISP" part I pay them for, not the "MSN" part. According to this: http://research.microsoft.com/~joshuago/spamconferenceshort.ppt it *does* do user-driven learning, but that's the most detailed account I've seen, and it doesn't really reveal anything. From mhammond at skippinet.com.au Mon Jan 19 22:19:54 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Jan 19 22:20:22 2004 Subject: [spambayes-dev] Outlook plug-in & Windows 95 In-Reply-To: Message-ID: <368001c3df04$46d3ba50$2c00a8c0@eden> > [Tony Meyer] > > Does anyone know whether the binary release of the plug-in > works with > > Windows95 (assuming that the version of Outlook is modern enough)? > > I don't recall anyone on Win95 saying so one way or the > other. I don't have > access to Win95 myself. Sorry I missed all of this. win32all does not work on Windows 95 "out of the box". By installing some magic MS software (probably latest IE, also probably office), the Windows DLLs are updated to the point where we do work. I haven't bothered trying to track this down, nor to fix the win32all issue - as far as I am concerned, win95 is dead :) Mark. From mhammond at skippinet.com.au Mon Jan 19 22:25:25 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Mon Jan 19 22:25:37 2004 Subject: [spambayes-dev] Version numbering (was: beta release time?) In-Reply-To: Message-ID: <368101c3df05$0bcfbca0$2c00a8c0@eden> > Tony Meyer wrote: > >> If people have bugs/features they > >> want fixed/added before then, can we assemble a list? > > > > * The Version.py stuff (Mark was going to do this at some point). > > I think I was the first to suggest this, so I'd be happy to make the > changes now that I'm on the developer list. However, I'll need input > about what you guys are looking for in a versioning scheme. No one knows :) My intent was so keep separate versons for different applications. I expected that more apps would start to appear - eg, the notes filter etc would get more attention. This hasn't happened, and I don't see it happening. Now we are (still slowly) moving towards a single Windows binary for all "stable" apps, it is largely moot. A single "version dictionary", for all apps would probably do. There are only a few places that use this version information, so we could just update them, rather than trying to maintain a b/w compat interface. It *will* be necessary to maintain a couple of entries so existing users doing a version check see the correct thing. The only other thing is to remove/update the __init__ file which has a duplicate version number, and any other references to version numbers you can find :) Mark. From mhammond at skippinet.com.au Tue Jan 20 03:44:05 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Tue Jan 20 03:44:17 2004 Subject: [spambayes-dev] training from IMAP folder? (with patch) In-Reply-To: <200401191734.48563.jhs@oes.co.th> Message-ID: <36d601c3df31$90597840$2c00a8c0@eden> This code looks OK to me, in terms of not impacting too much other code (If the new code doesn't work perfectly, that is far less critical than breaking what used to - but this appears to be completely new functionality) The only code comment is that you use tabs instead of spaces, and that some lines exceed 80 char (like I can talk ) It would be good if someone who understood IMAP looked at it though :) Mark. 
> -----Original Message----- > From: spambayes-dev-bounces+mhammond=keypoint.com.au@python.org > [mailto:spambayes-dev-bounces+mhammond=keypoint.com.au@python.org]On > Behalf Of Jason Smith > Sent: Monday, 19 January 2004 9:35 PM > To: spambayes-dev@python.org > Subject: Re: [spambayes-dev] training from IMAP folder? (with patch) > > > On Sunday 18 January 2004 15:05, Jason Smith wrote: > > Unfortunately, it is against the latest > > release and not CVS because I cannot seem to access SF CVS > at the moment. > > Looks like sourceforge is back to "normal," so I played > around and made this > patch against current CVS. Hope that gets a little more attention :p > > -- > Jason Smith > Open Enterprise Systems > Bangkok, Thailand > http://www.oes.co.th > From jhs at oes.co.th Tue Jan 20 04:14:05 2004 From: jhs at oes.co.th (Jason Smith) Date: Tue Jan 20 04:14:18 2004 Subject: [spambayes-dev] training from IMAP folder? (with patch) In-Reply-To: <36d601c3df31$90597840$2c00a8c0@eden> References: <36d601c3df31$90597840$2c00a8c0@eden> Message-ID: <200401201614.05191.jhs@oes.co.th> > The only code comment is that you use tabs instead of spaces, and that some > lines exceed 80 char (like I can talk ) Sorry.. bad vim settings. I didn't notice due to my expansion settings. IIRC, sourceforge public CVS is a day or two behind the developer versions. I didn't see the patch merged; if not, I can resubmit with better formatting. -- Jason Smith Open Enterprise Systems Bangkok, Thailand http://www.oes.co.th From nobody at spamcop.net Tue Jan 20 14:46:27 2004 From: nobody at spamcop.net (Seth Goodman) Date: Tue Jan 20 14:46:33 2004 Subject: [spambayes-dev] lowest scoring message isn't always "best"onetotrain on In-Reply-To: <16396.17983.944455.129805@montanaro.dyndns.org> Message-ID: [Skip Montanaro] > I've started calculating the delta mean as well as the number of messages > pushed into spam territory. Just eyeballing a plot of just over 100 pairs > of (mean diff, # new spams) suggests there's a weak correlation > between the > two variables. I was thinking about this, so glad you're playing with it. An additional figure of merit for a message to train on might be a reduction in the SD of the spam scores. This is almost as important as an increase in the mean (or average, whichever you choose). -- Seth Goodman non-spam replies to sethg [at] GoodmanAssociates [dot] com From sl6dt at cc.usu.edu Tue Jan 20 17:08:26 2004 From: sl6dt at cc.usu.edu (John Mulholland) Date: Tue Jan 20 17:11:05 2004 Subject: [spambayes-dev] Web filter Message-ID: <200401201508.26295.sl6dt@cc.usu.edu> I was just rereading some of the old discussions about a bayesian web filter. I am going to try to write one this semester for grad level class project. I think that I can find some solutions to problems that you have mentioned. There are some very interesting characteristics about porn pages. First of all, they are often linked through java script. It isn't difficult to automate a process to find lots of them. Maybe I am naive but it seems they are quite similar and very different from most other web pages. At least that is the case with most of them. Since one of the purposes of my program is to protect people from accidently going to a porn site then a false negative is much more serious then a false positive. I definitely agree that an open effort to make a base package of n number of sites we definitely want blocked would be very helpful. 
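As a rough illustration of the fetch-and-classify idea (plain urllib plus any classifier object that offers a spamprob() method; the helper names and the 0.9 cutoff here are invented, and the crude split() is only a stand-in for a real HTML-aware tokenizer):

    import urllib

    def score_page(classifier, url):
        # fetch the raw HTML and score it much like a message body
        html = urllib.urlopen(url).read()
        tokens = html.lower().split()
        return classifier.spamprob(tokens)

    def is_blocked(classifier, url, threshold=0.9):
        return score_page(classifier, url) > threshold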
To check out sites it is as simple as

    telnet abc.com 80
    GET / HTTP/1.1

Then you can get the html and analyze it. If an open effort does start to list sites, we should also make sure to have different categories, because someone may not want to look at nude art but some may think that it is ok.

If people are interested in this please contact me at sl6dt@cc.usu.edu and let me know. I would appreciate any ideas or suggestions because I am fairly new to the linux world and there are many things with this project that I have no idea how to do. My goal is to make a very effective, robust, easy to use, customizable, free web filter that most people can use, including Windows users.

John Mulholland

From tameyer at ihug.co.nz Tue Jan 20 19:02:20 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jan 20 19:02:30 2004
Subject: [spambayes-dev] Outlook plug-in & Windows 95
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304BA369E@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F130467784C@its-xchg4.massey.ac.nz>

> Sorry I missed all of this. win32all does not work on
> Windows 95 "out of the box". By installing some magic MS
> software (probably latest IE, also probably office), the
> Windows DLLs are updated to the point where we do work. I
> haven't bothered trying to track this down, nor to fix the
> win32all issue - as far as I am concerned, win95 is dead :)

Fair enough :) I'll update the website to reflect this.

=Tony Meyer

From tameyer at ihug.co.nz Tue Jan 20 21:20:35 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Tue Jan 20 21:20:47 2004
Subject: [spambayes-dev] training from IMAP folder? (with patch)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304BA366F@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A5C@its-xchg4.massey.ac.nz>

> The reason I need this feature (as opposed to the IMAP
> filter) is to implement server-side spam filtering (using cyrus)
> and training which is intuitive for lay mail users.

I'm not sure why imapfilter wouldn't work here (i.e. users still drag mail to train to particular folders, and on the server, imapfilter is run once for each user (as mboxtrain presumably will be), set to train on the appropriate folders and not classify). Be that as it may, I don't see any problem with adding this to mboxtrain.

Comments on the patch:

Rather than doing this:

    message_flags = string.replace(message_flags, '\\Recent ', '')

You could do this:

    message_flags = message_flags.replace(message_flags, '\\Recent ', '')

This removes the need for importing string, and is, I believe, the more 'correct' way to do it. The import of imaplib should be at the top, too, according to the rather loose SpamBayes coding rules. I'm also curious about whether the single space at the end has the same effect if the \Recent flag is the only flag present and when it isn't the only one.

Rather than getting the message headers and body separately, you could use "RFC822" to get both together. You could also use "BODY.PEEK[]" to get it without setting the \Seen flag. (sb_imapfilter.py needs to be updated to use "BODY.PEEK[]").

Ideally, it would be great if mboxtrain and imapfilter used the same code to do this. It would save a lot of maintenance and hassle if that was the case. It's difficult to import from imapfilter, since it's in the scripts directory, so I suppose the solution would be to create a new module in the spambayes directory, and have both imapfilter and mboxtrain import from that. If you did that, you could just: 1.
Create an IMAPSession object (with server + port details). 2. Call Login() on this object. 3. For each of the folders to train a. Create an IMAPFolder object b. Call Train() on this object. 4. Call Logout() on the IMAPSession object. The only modifications that would need to be done are to match the standard mboxtrain "include_trained" header option and remove trained option. These would be simple additions to the code, though. Thoughts? =Tony Meyer From tameyer at ihug.co.nz Wed Jan 21 01:01:49 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 21 01:01:57 2004 Subject: [spambayes-dev] Web filter In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304BA3C55@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A5F@its-xchg4.massey.ac.nz> > If people are interested in this please contact me at > sl6dt@cc.usu.edu and let me know. I would appreciate > any ideas or suggestions because I am fairly new > to the linux world and there are many things with this > project that I have no idea how to do. Take a look at the mod_spambayes.py script in the contrib directory of the spambayes archive. This implements a SpamBayes based filter using Amit Patel's proxy3 web proxy. It would certainly be a reasonable start (depending on the requirements of the project). I suspect that other than building in training & configuration somehow, the majority of the work that you'd need to do would be creating a tokenizer optimised for web pages, rather than email. =Tony Meyer From tameyer at ihug.co.nz Wed Jan 21 03:34:26 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 21 03:35:55 2004 Subject: [spambayes-dev] Another incremental training idea... In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304A7D814@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A63@its-xchg4.massey.ac.nz> [Tony Meyer] > I tried almost this with the incremental regime, using a maximum of > 2::1 or 1::2. It did pretty consistently worse than the > basic nonedge regime. The only difference is that I didn't choose > which messages to use if an imbalance would be created. The idea > was basically to do nonedge, except if there was an imbalance, and > then only train messages that move the balance closer to 1::1. [Eli Stevens] > It sounds like you are saying that non-edge messages on the > heavy side were not trained. It seems that would be a key difference. > Was that the case in your test? I'm not sure what you mean by "on the heavy side". Do you mean that scored closest to the edge? If so, then yes. Basically, it dealt with messages as they arrived, one-by-one, just as an automated system would. I haven't found time, but an alternative would be to do this by day/group. So at the end of each day, nonedge messages are trained as long as the db stays in balance, but using the messages closest to the edge first (i.e. do a sort by the distance from 0.5). I'm sure it's worth a try, since anecdotal evidence and timcv testing seems to show that imbalance hurts, so some sort of structured balancing must be able to help. =Tony Meyer From ta-meyer at ihug.co.nz Wed Jan 21 03:47:40 2004 From: ta-meyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 21 03:47:44 2004 Subject: [spambayes-dev] Incremental testing Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A61@its-xchg4.massey.ac.nz> I've finally got around to writing up my latest incremental testing results. 
I think I've finally managed to get my head around the incremental setup, so my earlier results are probably better ignored :) The results are summarised here, but for all the pretty graphs, see: (The page has over 60 graphs, so it may take a little bit to load...)

These testing runs had two aims - to test the various regimes, including a few that aren't in the CVS copy of regimes.py (the balanced ones), and also to compare each regime with the experimental bigrams option enabled.

The winner
----------
Expiring data after 30 days did better than keeping it. I suspect this is because with each of the major changes in spam volume, the new spam was of a different type, and the expiring regime managed these changes better. It was still beaten by other regimes, but not by much, and it would be interesting to try, for example, a 'fpfnunsure' regime that expired as well.

The 'nonedge' regime wins for the most part, except when there was a large spike in the amount of spam around day 320, at which point it loses. The 'fpfnunsure' regime seems to be the overall winner, since it almost matches the 'nonedge' regime most of the time, and does much better after day 320.

Self-balancing
--------------
None of the self-balancing regime adaptations that I've tried has improved the results (apart from with nonedge with bigrams, oddly). I'm sure that a balancing regime could be designed that did help, but it seems that this code isn't doing it.

Amount of training
------------------
Although it's not displayed in the graphs, the number of messages that were trained on varied a lot between the regimes. The perfect regime trained on about 12000 messages, or 32.9/day, which is much, much more than any of the partial-training regimes. Interestingly, the balanced options trained on about the same number of total messages as the non-balanced options (indicating that little was gained from the balancing, I think). The nonedge regime trained on about 850 messages, or 2.3/day, just under the nonedge regime with bigrams, which took more messages - about 1050, or 2.9/day.

Bigrams
-------
Bigrams hurt the results of balanced_perfect, imbalanced_perfect, fpfnunsure, and nonedge. Bigrams helped the results of balanced_nonedge, perfect, balanced_corrected, expire1month, and corrected. When bigrams won, they tended to reduce false positives a lot, false negatives a little, and increase unsures. Using bigrams, the results for each set did not differ nearly so much, and, as a result, the regimes were much more clearly delimited.

What does this say about using bigrams? Pretty much that more research (particularly with other corpora) is needed, IMO. BTW, to compare, I ran a timcv.py -n5 test with the same data. Bigrams easily won.

=Tony Meyer

From kennypitt at hotmail.com Wed Jan 21 09:51:52 2004
From: kennypitt at hotmail.com (Kenny Pitt)
Date: Wed Jan 21 09:52:41 2004
Subject: [spambayes-dev] training from IMAP folder? (with patch)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A5C@its-xchg4.massey.ac.nz>
Message-ID:

Tony Meyer wrote:
> Comments on the patch:
>
> Rather than doing this:
>     message_flags = string.replace(message_flags, '\\Recent ', '')
> You could do this:
>     message_flags = message_flags.replace(message_flags, '\\Recent ', '')
>

Wouldn't that be:

    message_flags = message_flags.replace('\\Recent ', '')

?
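For reference, a minimal sketch of the flag handling and BODY.PEEK[] fetch being discussed (imaplib; the server, credentials, folder and message number are placeholders, and real response parsing needs more care than shown here):

    import imaplib

    imap = imaplib.IMAP4("imap.example.com")
    imap.login("user", "password")
    imap.select("INBOX")

    # BODY.PEEK[] returns the whole message without setting the \Seen flag
    typ, data = imap.fetch("1", "(BODY.PEEK[])")
    message_text = data[0][1]

    # FLAGS response, with \Recent dropped since it can't be stored back
    typ, data = imap.fetch("1", "(FLAGS)")
    flags = [f for f in imaplib.ParseFlags(data[0]) if f != "\\Recent"]

    imap.logout()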
-- Kenny Pitt From tim at fourstonesExpressions.com Wed Jan 21 10:47:24 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Wed Jan 21 10:47:30 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: <1662160.1074698665306.JavaMail.jboss@p15135617.pureserver.info> References: <1662160.1074698665306.JavaMail.jboss@p15135617.pureserver.info> Message-ID: Ok, guys, this showed up on the spambayes list. These guys WANT to send us money... but it's checques. We're gonna have to manage this kind of thing. Since I've not contributed much other than opinions for a while now, I'll be glad to administer this kind of thing, if that's good with everyone. - Tim S ------- Forwarded message ------- From: Dawn Wesolek To: spambayes@python.org Subject: [Spambayes] SpamBayes: sponsorship accepted! Date: Wed, 21 Jan 2004 16:24:25 +0100 (CET) > > Dear SpamBayes team, > Congratulations, your project "SpamBayes" has been > accepted for the I3T Award for Software Excellence. > > This is what our sponsorship panel had to say about it: > "An intelligent solution to a difficult problem. I doubt whether Thomas > Bayes could ever have imagined that his work would be employed in such > an interesting manner back in the 18th century. Congratulations on the > award." > > What this means for you and your project: > > o You are now entitled to display the I3T Award for Software > Excellence > emblem on your web-site. > > o You will receive funds every time the institute recruits a member > who > joins after clicking our emblem on your page. > This is typically 20% of their first membership payment, which > amounts > to a massive $20 in most cases. > > o You will be listed on I3T's list of sponsored projects. This > should bring > greater exposure to your project. > > o Best of all, up to 3 members of your team can become sponsored > members of the International Institute of Information > Technologists. This > entitles each of you to the qualification MIinstIT and the full > benefits that > being a member bring. This sponsorship is worth $99 a year for > each > member so your project effectively receives a further $297 of > sponsorship > per year. > > o If you choose to become members, we will be happy to set up a > community > in our members area for your project. > > In order to accept this award you need to make two simple steps: > > 1) Go to the following URL to set up your account. You need to do > this so that > we know where to send the cheques. > > http://i3t.org/sponsorship/activate.jsf?activationCode=698664928 > > 2) Paste the following HTML fragment onto a prominent place on > your web-site. > This is the emblem which shows that your project has won the > award: > > > alt="The I3T Award for Software Excellence"> > > > And that's it. Every month, you will receive an email that will let you > know how > much funds you will receive, and a cheque will be dispatched in the post. > > You can also check up over the internet at any time to see how much you > have been awarded > so far for the current month. > > The URL is: http://i3t.org/sponsorship > > If you have any comments or queries, I'll be happy to hear from you. > > Best wishes, > > Dawn C. 
Wesolek, > I3T - Sponsorship awards > > > > _______________________________________________ > Spambayes@python.org > http://mail.python.org/mailman/listinfo/spambayes > Check the FAQ before asking: http://spambayes.sf.net/faq.html > -- Tim Stone From skip at pobox.com Wed Jan 21 11:19:22 2004 From: skip at pobox.com (Skip Montanaro) Date: Wed Jan 21 11:34:55 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: References: <1662160.1074698665306.JavaMail.jboss@p15135617.pureserver.info> Message-ID: <16398.42634.371644.316367@montanaro.dyndns.org> Tim> Ok, guys, this showed up on the spambayes list. These guys WANT to Tim> send us money... but it's checques. We're gonna have to manage Tim> this kind of thing. Since I've not contributed much other than Tim> opinions for a while now, I'll be glad to administer this kind of Tim> thing, if that's good with everyone. Why not have these guys cut their checques to the PSF? Skip From tim at fourstonesExpressions.com Wed Jan 21 11:52:57 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Wed Jan 21 11:53:04 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: <16398.42634.371644.316367@montanaro.dyndns.org> References: <1662160.1074698665306.JavaMail.jboss@p15135617.pureserver.info> <16398.42634.371644.316367@montanaro.dyndns.org> Message-ID: On Wed, 21 Jan 2004 10:19:22 -0600, Skip Montanaro wrote: > > Why not have these guys cut their checques to the PSF? Sure, that's what I was thinking... In general, it'd be best if things went that way. I'm just checking in to make sure that more than one of us doesn't try to interact with these guys... if someone else would be better, fine, otherwise, I'm quite willing to do it, and handle these kind of things in the future. I think this will happen at least a few more times... Interesting thing is, we got this award and we've not even gone beat yet... :) -- Tim Stone From popiel at wolfskeep.com Wed Jan 21 13:51:06 2004 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Wed Jan 21 13:51:11 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: Message from Tim Stone of "Wed, 21 Jan 2004 09:47:24 CST." References: <1662160.1074698665306.JavaMail.jboss@p15135617.pureserver.info> Message-ID: <20040121185106.7C0AD2DE7C@cashew.wolfskeep.com> In message: Tim Stone writes: >Ok, guys, this showed up on the spambayes list. These guys WANT to send >us money... but it's checques. We're gonna have to manage this kind of >thing. Since I've not contributed much other than opinions for a while >now, I'll be glad to administer this kind of thing, if that's good with >everyone. +1 - Alex From listsub at wickedgrey.com Wed Jan 21 14:52:48 2004 From: listsub at wickedgrey.com (Eli Stevens (WG.c)) Date: Wed Jan 21 14:53:19 2004 Subject: [spambayes-dev] Another incremental training idea... References: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A63@its-xchg4.massey.ac.nz> Message-ID: <400ED890.6080807@wickedgrey.com> Tony Meyer wrote: > [Tony Meyer] > >>I tried almost this with the incremental regime, using a maximum of >>2::1 or 1::2. It did pretty consistently worse than the >>basic nonedge regime. The only difference is that I didn't choose >>which messages to use if an imbalance would be created. The idea >>was basically to do nonedge, except if there was an imbalance, and >>then only train messages that move the balance closer to 1::1. 
>>
>
> [Eli Stevens]
>
>>It sounds like you are saying that non-edge messages on the
>>heavy side were not trained. It seems that would be a key difference.
>>Was that the case in your test?
>>
>
> I'm not sure what you mean by "on the heavy side". Do you mean that scored
> closest to the edge? If so, then yes. Basically, it dealt with messages as
> they arrived, one-by-one, just as an automated system would.

Sorry, I wasn't being very clear. Hmm. Below, isEdge( score, type ) can either be a fixed cutoff, or slide based on how imbalanced things are, but that is orthogonal to my question.

type, score = classify( msg )
if not isEdge( score, type ):
    train( msg, type )
elif more_ham_than_spam and type == spam:
    train( msg, type )
elif more_spam_than_ham and type == ham:
    train( msg, type )

# Corpus contains:
#                 MoreHam    Balanced   MoreSpam
# MsgIsEdgeHam                          Train
# MsgIsHam        Train      Train      Train
# MsgIsSpam       Train      Train      Train
# MsgIsEdgeSpam   Train

Versus:

type, score = classify( msg )
if more_ham_than_spam and type == spam:
    train( msg, type )
elif more_spam_than_ham and type == ham:
    train( msg, type )
elif not isEdge( score, type ): # implies balanced
    train( msg, type )

# Corpus contains:
#                 MoreHam    Balanced   MoreSpam
# MsgIsEdgeHam                          Train
# MsgIsHam                   Train      Train
# MsgIsSpam       Train      Train
# MsgIsEdgeSpam   Train

In the first, training on non-edge takes priority over balance, while in the second, balance takes priority over training on non-edge. The only difference is how non-edge messages that would increase imbalance are treated. The second is how I interpreted your description, and is what I meant by non-edge messages on the heavy side not being trained (heavy side = message type with the most trained messages = what we are imbalanced towards... It was a poor choice of words).

Does that make sense? I'm still not sure if this is putting it clearly. :/

Eli

From tameyer at ihug.co.nz Wed Jan 21 16:29:21 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Wed Jan 21 16:29:29 2004
Subject: [spambayes-dev] training from IMAP folder? (with patch)
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304BA3D08@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677862@its-xchg4.massey.ac.nz>

> Wouldn't that be:
> message_flags = message_flags.replace('\\Recent ', '')

Oops. Yes, it would. Thanks :)

=Tony Meyer

From skip at pobox.com Wed Jan 21 16:44:47 2004
From: skip at pobox.com (Skip Montanaro)
Date: Wed Jan 21 16:44:59 2004
Subject: [spambayes-dev] contrib/findbest.py
Message-ID: <16398.62159.34529.538566@montanaro.dyndns.org>

I just checked findbest.py into the contrib directory. Here's the docstring.

Find the next "best" unsure message to train on.

%(prog)s [ -h ] [ -s ] [ -b N ] ham spam unsure

Given a number of unsure messages and a desire to keep your training database small, the question naturally arises, "Which message should I add to my database next?". A common approach is to sort the unsures by their SpamBayes scores and train on the one which scores lowest. This is a reasonable approach, but there is no guarantee the lowest scoring unsure is in any way related to the other unsure messages.

This script offers a different approach. Given an existing pile of ham and spam, it trains on them to establish a baseline, then for each message in the unsure pile, it trains on that message, scores the entire unsure pile against the resulting training database, then untrains on that message.
For each such message the following output is generated:

* spamprob of the candidate message
* number of other unsure messages which would score as spam if it was added to the training database
* overall mean of all scored messages after training
* standard deviation of all scored messages after training
* message-id of the candidate message

With no options, all candidate unsure messages are trained and scored against. At the end of the run, a file, "best.pck", is written out which is a dictionary keyed by the overall mean rounded to three decimal places. The values are lists of message-ids which generate that mean.

Three options affect the behavior of the program. If the -h flag is given, this help message is displayed and the program exits. If the -s flag is given, no messages which score as spam are tested as candidates. If the -b N flag is given, only the messages which generated the N highest means in the last run without the -b flag are tested as candidates. Because the program runtime can be very slow (O(n^2) in the number of unsure messages), if you have a fairly large pile of unsure messages, these options can speed things up dramatically. If the -b flag is used, a new "best.pck" file is not written.

Typically you would run once without the -b flag, then several times with the -b flag, adding one message to the spam pile after each run. After adding several messages to your spam file, you might then redistribute the unsure pile to move spams and hams to their respective folders, then start again with a smaller unsure pile.

The ham, spam and unsure command line arguments can be anything suitable for feeding to spambayes.mboxutils.getmbox(). The "best.pck" file is searched for and written to these files in this order:

* best.pck in the current directory
* $HOME/tmp/best.pck
* $HOME/best.pck

[To do? Someone might consider the reverse operation. Given a pile of ham and spam, which message can be removed with the least impact? What pile of mail should that removal be tested against?]

I'm sure there are mistakes in there. Feel free to rap my virtual knuckles or fix them...

Skip

From listsub at wickedgrey.com Wed Jan 21 22:28:57 2004
From: listsub at wickedgrey.com (Eli Stevens (WG.c))
Date: Wed Jan 21 22:29:45 2004
Subject: [spambayes-dev] Preventing FAQ 3.13
References:
Message-ID: <400F4379.5080104@wickedgrey.com>

Kenny Pitt wrote, as have many before:
>
> This is covered by FAQ 3.13:
>
> http://spambayes.sourceforge.net/faq.html#help-i-deleted-the-unsure-spam-folder

Would it be possible to detect the error state resulting from deleting the spam folder from within the plugin? Having a dialog that pops up describing the problem and possible solutions (and perhaps a few do-it-for-me buttons?) might cut down a fair bit of the support questions that come through.

But I say this ignorant of the crufty details that comprise Outlook. :)

Eli

From tim.one at comcast.net Thu Jan 22 00:31:10 2004
From: tim.one at comcast.net (Tim Peters)
Date: Thu Jan 22 00:31:19 2004
Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted!
In-Reply-To:
Message-ID:

[Tim Stone]
> Ok, guys, this showed up on the spambayes list. These guys WANT to
> send us money... but it's checques.

I advise caution on this. The only thing a Google search turns up about the I3T Award for Software Excellence is that it's been mentioned twice in the spambayes archives. That makes it arguably less flattering than, say, a Nobel Prize. I'm not clear on motivations, either.
Yes, they say they'll send us funds, but that's contingent upon our first putting up a link to their site, and then someone has to click on that link and pay *them* $99 first. So, on the face of it, it seems no more or less than that a company is offering to pay us a modest 20% commission on click-throughs to their site that result in sales for them. If that's what we want, we could get a better deal from a porn site . Now maybe they're wonderful people and this is a thoroughly legit deal. I don't know -- information about I3T seems hard to come by. But for that reason also, I'd be opposed to putting their link on our site: I don't want to lend the reputation of our project to an organization we know nothing about. > We're gonna have to manage this kind of thing. Since I've not > contributed much other than opinions for a while now, I'll be > glad to administer this kind of thing, if that's good with everyone. That's cool, and appreciated, keeping in mind that everyone working on a spam project should have an overactive sense of skepticism. From mhammond at skippinet.com.au Thu Jan 22 02:28:05 2004 From: mhammond at skippinet.com.au (Mark Hammond) Date: Thu Jan 22 02:28:21 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: Message-ID: <007a01c3e0b9$49327140$2c00a8c0@eden> > That's cool, and appreciated, keeping in mind that everyone > working on a spam project should have an overactive sense of > skepticism. Why do you think I started with SpamBayes? I was going broke trying to extend my dick by 3 inches! Not-to-mention-larger-breasts ly, Mark. From jhs at oes.co.th Thu Jan 22 02:50:33 2004 From: jhs at oes.co.th (Jason Smith) Date: Thu Jan 22 02:51:34 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: <007a01c3e0b9$49327140$2c00a8c0@eden> References: <007a01c3e0b9$49327140$2c00a8c0@eden> Message-ID: <200401221450.37185.jhs@oes.co.th> On Thursday 22 January 2004 14:28, Mark Hammond wrote: > Why do you think I started with SpamBayes? I was going broke trying to Blast! I can't train my DB from this message! Here's where I rush to delete it before cron comes around. -- Jason Smith Open Enterprise Systems Bangkok, Thailand http://www.oes.co.th -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 189 bytes Desc: signature Url : http://mail.python.org/pipermail/spambayes-dev/attachments/20040122/e11951fd/attachment.bin From tim at fourstonesExpressions.com Thu Jan 22 08:22:47 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Jan 22 08:22:53 2004 Subject: [spambayes-dev] Fwd: [Spambayes] SpamBayes: sponsorship accepted! In-Reply-To: References: Message-ID: On Thu, 22 Jan 2004 00:31:10 -0500, Tim Peters wrote: > [Tim Stone] >> Ok, guys, this showed up on the spambayes list. These guys WANT to >> send us money... but it's checques. > > I advise caution on this. Absolutely. I've begun researching already. > > That's cool, and appreciated, keeping in mind that everyone working on a > spam project should have an overactive sense of skepticism. Well, I don't know if I quite buy that. Ever-so-skeptically-yours, Tim Stone From tim at fourstonesExpressions.com Thu Jan 22 21:00:26 2004 From: tim at fourstonesExpressions.com (Tim Stone) Date: Thu Jan 22 21:00:35 2004 Subject: [spambayes-dev] The latest I3T award Message-ID: Ok, I've done some research on the I3T. I recommend not "accepting" their award. A couple reasons: 1. 
Though they say "we are an a-political" organization, they definitely have a political agenda. I don't think we need to be making political statements with our efforts.

2. To gain any benefit, we have to display advertising for them. Any monetary benefit will be very, very small.

Unless anyone else cares to weigh in, it's -1 from me on this one ...

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com

From anthony at interlink.com.au Thu Jan 22 22:16:37 2004
From: anthony at interlink.com.au (Anthony Baxter)
Date: Thu Jan 22 22:16:56 2004
Subject: [spambayes-dev] The latest I3T award
In-Reply-To:
Message-ID: <20040123031637.E665325ADD5@bonanza.off.ekorp.com>

>>> Tim Stone wrote
> Ok, I've done some research on the I3T. I recommend not "accepting" their
> award. A couple reasons:

It seemed to me (from their email) that "winning" meant we could become an affiliate reseller of their memberships. Woo.

--
Anthony Baxter     It's never too late to have a happy childhood.

From tim.one at comcast.net Thu Jan 22 23:45:06 2004
From: tim.one at comcast.net (Tim Peters)
Date: Thu Jan 22 23:45:23 2004
Subject: [spambayes-dev] The latest I3T award
In-Reply-To: <20040123031637.E665325ADD5@bonanza.off.ekorp.com>
Message-ID:

[Tim Stone]
>> Ok, I've done some research on the I3T. I recommend not "accepting"
>> their award. A couple reasons:
>> ...

[Anthony Baxter]
> It seemed to me (from their email) that "winning" meant we could
> become an affiliate reseller of their memberships. Woo.

We can still *treat* it as if it were a Nobel Prize, telling all our friends and relatives about it. We just don't link to their site -- although for $100, we'll tell the curious where the site is .

From popiel at wolfskeep.com Fri Jan 23 13:01:15 2004
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri Jan 23 13:01:23 2004
Subject: [spambayes-dev] The latest I3T award
In-Reply-To: Message from Anthony Baxter of "Fri, 23 Jan 2004 14:16:37 +1100." <20040123031637.E665325ADD5@bonanza.off.ekorp.com>
References: <20040123031637.E665325ADD5@bonanza.off.ekorp.com>
Message-ID: <20040123180115.962592DF15@cashew.wolfskeep.com>

In message: <20040123031637.E665325ADD5@bonanza.off.ekorp.com> Anthony Baxter writes:
>
>>>> Tim Stone wrote
>> Ok, I've done some research on the I3T. I recommend not "accepting" their
>> award. A couple reasons:
>
>It seemed to me (from their email) that "winning" meant we could become
>an affiliate reseller of their memberships. Woo.

Yeah, this does sound fairly pathetic. I'd like to recast the +1 I sent earlier (which was truthfully more for Tim acting as point) to a -1 for any involvement. Seems like a bad deal.

- Alex

From tim at fourstonesExpressions.com Fri Jan 23 13:42:31 2004
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Fri Jan 23 13:42:38 2004
Subject: [spambayes-dev] The latest I3T award
In-Reply-To: <20040123180115.962592DF15@cashew.wolfskeep.com>
References: <20040123031637.E665325ADD5@bonanza.off.ekorp.com> <20040123180115.962592DF15@cashew.wolfskeep.com>
Message-ID:

On Fri, 23 Jan 2004 10:01:15 -0800, T. Alexander Popiel wrote:

> Yeah, this does sound fairly pathetic. I'd like to recast the +1
> I sent earlier (which was truthfully more for Tim acting as point)
> to a -1 for any involvement. Seems like a bad deal.

Yeah, in the absence of any negative feedback, I'll take on the duty of responding to such "opportunities" as well as any donation inquiries, etc.
--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone
See my photography at www.fourstonesExpressions.com

From tim at fourstonesExpressions.com Mon Jan 26 10:43:54 2004
From: tim at fourstonesExpressions.com (Tim Stone)
Date: Mon Jan 26 10:44:03 2004
Subject: [spambayes-dev] Australia Day
Message-ID:

Have (or hope you had) a rippa beaut, cobbers :)

--
Vous exprimer; Exprésese; Te stesso esprimere; Express yourself!
Tim Stone

From nobody at spamcop.net Mon Jan 26 15:45:38 2004
From: nobody at spamcop.net (Seth Goodman)
Date: Mon Jan 26 15:45:44 2004
Subject: [spambayes-dev] subject parsing
Message-ID:

Just in passing, I noticed that a spam with the following subject line:

    Try a Free H-G-H sample ... $49.95 value!

generated the following subject tokens (from 'All Message Tokens' in spam clues):

    'subject: '
    'subject: ... $'
    'subject:!'
    'subject:$49.95'
    'subject:-'
    'subject:.'
    'subject:...'
    'subject:Free'
    'subject:Try'
    'subject:sample'
    'subject:value'

Three observations:

1) having a token for '-' but not for 'H-G-H' appears to be ignoring important information
2) a token for a single space seems of dubious value, but if it worked better in testing, fine
3) the token for ' ... $' seems to be an odd choice for parsing

--
Seth Goodman

off-list replies to sethg [at] GoodmanAssociates [dot] com

From tameyer at ihug.co.nz Mon Jan 26 17:12:31 2004
From: tameyer at ihug.co.nz (Tony Meyer)
Date: Mon Jan 26 17:12:45 2004
Subject: [spambayes-dev] Exposing experimental options
In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F130499C9A7@its-xchg4.massey.ac.nz>
Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A68@its-xchg4.massey.ac.nz>

[Skip, 6/1/04 in check-in message]
> (Are ImapUI.py and ProxyUI.py supposed to be so different in
> the advanced options they expose?

For the most part, yes. The main difference is that there are a *lot* of options (many are fairly new) for controlling the look/behaviour of the review page, which imapfilter doesn't have. The only options ImapUI.py was missing were a couple of newish ones for controlling access to the web interface; I've checked in a change so that these are available.

> Should x- options *not* be exposed?)

This is a good question, and one that we should probably figure out before this upcoming release.
My instinctive response is that they > shouldn't be, because they are experimental, and so maybe not safe > for people to use. However, I do think that if we don't offer these > anywhere then no-one will use them and we'll still be in the dark (or > maybe a dim light ) when we come to decide whether to dump them > or not. The config page currently shows the "Advanced Configuration" button that gets you to these settings. What if we added another button "Experimental Configuration" that would give you a similar page to set the experimental options. The options shown could be controlled by a map exp_map separate from adv_map, and if the exp_map is empty then we wouldn't show the "Experimental Configuration" button. -- Kenny Pitt From tameyer at ihug.co.nz Mon Jan 26 17:48:55 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Mon Jan 26 17:49:05 2004 Subject: [spambayes-dev] Exposing experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304D0B31E@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F1304677895@its-xchg4.massey.ac.nz> > The config page currently shows the "Advanced Configuration" > button that gets you to these settings. What if we added > another button "Experimental Configuration" that would give > you a similar page to set the experimental options. Hmm. I quite like this option, since we could also have a blurb explaining that the options were experimental and what that meant at the top. Maybe even a request for people to report back what they experiences with them were. It shouldn't be all that much code, either (if we do go this way, I'm happy to write it up, since it's basically just the same as when I added the advanced page). > The options shown could be controlled by a map exp_map separate > from adv_map, and if the exp_map is empty then we wouldn't > show the "Experimental Configuration" button. We wouldn't even have to have another map if we were happy to just show all the experimental options (which I think is best). Just automatically include them all, which also means no need to maintain the list as experimental options come and go. (Well, in a simple form this would also include deprecated options, but I don't see any problem with including those as well). =Tony Meyer From skip at pobox.com Mon Jan 26 20:46:34 2004 From: skip at pobox.com (Skip Montanaro) Date: Mon Jan 26 20:46:42 2004 Subject: [spambayes-dev] Exposing experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677895@its-xchg4.massey.ac.nz> References: <1ED4ECF91CDED24C8D012BCF2B034F1304D0B31E@its-xchg4.massey.ac.nz> <1ED4ECF91CDED24C8D012BCF2B034F1304677895@its-xchg4.massey.ac.nz> Message-ID: <16405.49914.899730.714491@montanaro.dyndns.org> >> The options shown could be controlled by a map exp_map separate from >> adv_map, and if the exp_map is empty then we wouldn't show the >> "Experimental Configuration" button. Tony> We wouldn't even have to have another map if we were happy to just Tony> show all the experimental options (which I think is best). I think we should adopt some convention to distinguish between experimental and deprecated options, since in my mind they'd both start with 'x-'. That convention might be as simple as requiring '(EXPERIMENTAL)' or '(DEPRECATED)' at the start of the short description string. Skip From anthony at interlink.com.au Tue Jan 27 01:36:41 2004 From: anthony at interlink.com.au (Anthony Baxter) Date: Tue Jan 27 01:36:59 2004 Subject: [spambayes-dev] sb_server reads all config from current directory!?! 
Message-ID: <20040127063641.2797725B024@bonanza.off.ekorp.com> On Linux, sb_server.py reads all configuration &c from the current directory, rather than the home directory. This is completely insane and has caused me problems a couple of times now (the messages all end up in my unsure folder, as there is no training database). Can someone suggest a reason that I shouldn't fix this misfeature? I know of no other Unix program that behaves in this way. Anthony -- Anthony Baxter It's never too late to have a happy childhood. From tameyer at ihug.co.nz Tue Jan 27 02:21:21 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 27 02:21:30 2004 Subject: [spambayes-dev] sb_server reads all config from current directory!?! In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304D0B410@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13026F2A6B@its-xchg4.massey.ac.nz> > On Linux, sb_server.py reads all configuration &c from the > current directory, rather than the home directory. Well, it's not just sb_server.py (or it shouldn't be, at least). All the scripts should be doing this consistently. It (should) try to find a config file in the BAYESCUSTOMIZE envar, or (failing that) called [cwd]/bayescustomize.ini or ~/.spambayesrc, and if that fails, it defaults to [cwd]/bayescustomize.ini. (Note that there was a recent change that meant that all paths are relative to this file, rather than to the cwd). On Windows, if the process fails to find a config file, then (if win32all is available) it defaults to a file in the user's "Application Data" directory, instead of [cwd]\bayescustomize.ini. Presumably, then, with linux it should default to "~/.spambayesrc". This is around line 1155 of Options.py. (I'm guessing this is right for OS X, as well). > Can someone suggest a reason that I shouldn't fix this > misfeature? +1 to making the change. =Tony Meyer From tameyer at ihug.co.nz Tue Jan 27 02:23:24 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 27 02:23:33 2004 Subject: [spambayes-dev] Exposing experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304D0B3B1@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046778A0@its-xchg4.massey.ac.nz> > I think we should adopt some convention to distinguish > between experimental and deprecated options, since in my mind > they'd both start with 'x-'. That convention might be as > simple as requiring '(EXPERIMENTAL)' or '(DEPRECATED)' at the > start of the short description string. +1. (The ones we have do this already, anyway, IIRC). =Tony Meyer From kennypitt at hotmail.com Tue Jan 27 09:49:32 2004 From: kennypitt at hotmail.com (Kenny Pitt) Date: Tue Jan 27 09:50:36 2004 Subject: [spambayes-dev] Exposing experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304677895@its-xchg4.massey.ac.nz> Message-ID: Tony Meyer wrote: >> The options shown could be controlled by a map exp_map separate >> from adv_map, and if the exp_map is empty then we wouldn't >> show the "Experimental Configuration" button. > > We wouldn't even have to have another map if we were happy to just > show all the experimental options (which I think is best). Just > automatically include them all, which also means no need to maintain > the list as experimental options come and go. I like the idea of automatically collecting the options because I'm all for reduced maintenance. On the other hand, how would we handle it if we had experimental options that didn't apply to all contexts? 
For example, what if we were experimenting with some new POP3 proxy options but the user was running the IMAP filter? -- Kenny Pitt From tameyer at ihug.co.nz Tue Jan 27 22:15:35 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Tue Jan 27 22:15:45 2004 Subject: [spambayes-dev] Exposing experimental options In-Reply-To: <1ED4ECF91CDED24C8D012BCF2B034F1304D0B4D8@its-xchg4.massey.ac.nz> Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046778A5@its-xchg4.massey.ac.nz> > I like the idea of automatically collecting the options > because I'm all for reduced maintenance. On the other hand, > how would we handle it if we had experimental options that > didn't apply to all contexts? For example, what if we were > experimenting with some new POP3 proxy options but the user > was running the IMAP filter? We could always then explicitly remove them, after doing the automatic add. For the moment, most of the experimental options tend to be global, anyway, like tokenizer changes. =Tony Meyer From tameyer at ihug.co.nz Wed Jan 28 03:07:54 2004 From: tameyer at ihug.co.nz (Tony Meyer) Date: Wed Jan 28 03:08:11 2004 Subject: [spambayes-dev] FW: [Spambayes] Installation problem Message-ID: <1ED4ECF91CDED24C8D012BCF2B034F13046778A7@its-xchg4.massey.ac.nz> >From a message I received offlist - he didn't say exactly, but it does look like the new py2exe installer will get rid of the 0x00000000 problem :) =Tony Meyer -----Original Message----- From: Paul [mailto:paul@cometozion.org] Sent: Wednesday, 28 January 2004 8:51 p.m. To: 'Tony Meyer' Subject: RE: [Spambayes] Installation problem Thank you The Hammond version worked immediately and works well in HIS service All day Every day Paul Sherbow Inspired Faith Broadcasting Network http://inspiredfaith.org/radio Come to Zion Ministry http://cometozion.org C&M Webhosting http://christian-webhosting.net -----Original Message----- From: Tony Meyer [mailto:tameyer@ihug.co.nz] Sent: January 28, 2004 02:31 To: paul@cometozion.org; spambayes@python.org Subject: RE: [Spambayes] Installation problem > Everything goes fine until I get to spambayes_addin.dll > My computer will not register this file. What is the error message that you get? Is it something about not being able to register and an error 0x00000000? If so, then this is a known problem, but not one with a known solution (although I gather that various things have worked for various people - the sourceforge tracker has more details). If it is, you might want to try the experimental binary that's on Mark's website (search for "experimental spambayes hammond" and you should find it), since that's built using a different process, and (we hope) will eliminate this problem. If it is a different error message, then it'd help to know exactly what it was. =Tony Meyer --- Please always include the list (spambayes@python.org) in your replies (reply-all), and please don't send me personal mail about SpamBayes. This way, you get everyone's help, and avoid a lack of replies when I'm busy. From ryancl at tcd.ie Wed Jan 28 05:35:35 2004 From: ryancl at tcd.ie (Claire Ryan) Date: Wed Jan 28 05:35:44 2004 Subject: [spambayes-dev] Web filtering Message-ID: <403A6243@ntserver-e2w2.tcd.ie> Hi, I was wondering if you could give me some informtation on making a bayesian filter for web pages to block undesired content. I noticed this topic was posted on the site but couldn't access the directory that the information was stored on. 
Anything you have on this topic would be greatly appreciated as I am currently researching it for a project I am doing.

Thank You,
Claire Ryan

From skip at pobox.com Wed Jan 28 09:02:48 2004
From: skip at pobox.com (Skip Montanaro)
Date: Wed Jan 28 09:02:55 2004
Subject: [spambayes-dev] Web filtering
In-Reply-To: <403A6243@ntserver-e2w2.tcd.ie>
References: <403A6243@ntserver-e2w2.tcd.ie>
Message-ID: <16407.49416.397910.159494@montanaro.dyndns.org>

Claire> I was wondering if you could give me some informtation on making
Claire> a bayesian filter for web pages to block undesired content. I
Claire> noticed this topic was posted on the site but couldn't access
Claire> the directory that the information was stored on.

I experimented briefly with this. In the source distribution there is a mod_spambayes.py plugin for Amit Patel's proxy3 web proxy.

Claire> Anything you have on this topic would be greatly appreciated as
Claire> I am currently researching it for a project I am doing.

Mod_spambayes.py is nothing more than a bare bones implementation. That leaves you plenty of room for enhancements (training, web page-specific tokenizing, etc). If you don't have the source distribution you'll probably want to get it.

Skip

From sdyer at dyermail.net Wed Jan 28 22:31:49 2004
From: sdyer at dyermail.net (Shawn Dyer)
Date: Wed Jan 28 22:31:54 2004
Subject: [spambayes-dev] Suggested patch for sb_mboxtrain.py 1.0a7
Message-ID: <4892.10.1.0.2.1075347109.squirrel@mail>

When I ran sb_mboxtrain.py on my Maildir inbox, the timestamp on all of the files was touched, which messed up the way Courier IMAP sorts by receive date. Here is a suggested patch for sb_mboxtrain.py to preserve the timestamps on the Maildir messages that it trains.

It is a fairly trivial patch. I now use it locally and can confirm that it preserves the original modified time from the original email files when it adds the trained header. The same technique would probably be useful for MHDir messages, but I do not use that and so cannot test.

*** spambayes-1.0a7/scripts/sb_mboxtrain.py Tue Nov 4 04:02:40 2003
--- ../spambayes-1.0a7/scripts/sb_mboxtrain.py Mon Jan 26 00:53:02 2004
***************
*** 92,97 ****
--- 92,98 ----
  import time
  import socket
+ import shutil
  pid = os.getpid()
  host = socket.gethostname()
***************
*** 120,125 ****
--- 121,127 ----
  f = file(tfn, "wb")
  f.write(msg.as_string())
  f.close()
+ shutil.copystat(cfn,tfn)
  # XXX: This will raise an exception on Windows. Do any Windows
  # people actually use Maildirs?
  os.rename(tfn, cfn)

From skip at pobox.com Thu Jan 29 09:40:01 2004
From: skip at pobox.com (Skip Montanaro)
Date: Thu Jan 29 09:40:30 2004
Subject: [spambayes-dev] Suggested patch for sb_mboxtrain.py 1.0a7
In-Reply-To: <4892.10.1.0.2.1075347109.squirrel@mail>
References: <4892.10.1.0.2.1075347109.squirrel@mail>
Message-ID: <16409.6977.879596.354068@montanaro.dyndns.org>

Shawn> When I ran sb_mboxtrain.py on my Maildir inbox, the timestamp on
Shawn> all of the files was touched, which messed up the way Courier
Shawn> IMAP sorts by receive date. Here is a suggested patch for
Shawn> sb_mboxtrain.py ...

Thanks. Note that the best way to have your patches and bug reports not get lost or forgotten is to submit them at SourceForge:

patches: http://sourceforge.net/tracker/?group_id=61702&atid=498105
bugs: http://sourceforge.net/tracker/?group_id=61702&atid=498103

I applied your patch, including the suggested change for MH files.
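For anyone following along, a minimal illustration of the copystat approach used in the patch above, separated from the mboxtrain specifics (the function name and path handling here are made up):

    import os, shutil

    def rewrite_preserving_times(path, new_text):
        # write the new contents to a temporary file...
        tmp = path + ".tmp"
        f = open(tmp, "wb")
        f.write(new_text)
        f.close()
        # ...copy mode and atime/mtime from the original, then swap it in
        shutil.copystat(path, tmp)
        os.rename(tmp, path)   # renaming over an existing file fails on Windows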
Skip

From tim.one at comcast.net Fri Jan 30 10:36:26 2004
From: tim.one at comcast.net (Tim Peters)
Date: Fri Jan 30 10:36:26 2004
Subject: [spambayes-dev] RE: SpamBayes proposal for OSCON?
In-Reply-To:
Message-ID:

Kevin Altis reminded me that OSCON is coming up at the end of July, and that a SpamBayes presentation (45 minute session) in the Python track would likely be a welcome proposal. There's approximately no chance I can attend. Anyone else?

http://conferences.oreillynet.com/cs/os2004/create/e_sess

From belgix at globetrotter.net Fri Jan 30 13:01:56 2004
From: belgix at globetrotter.net (Francois Chenier)
Date: Fri Jan 30 13:02:16 2004
Subject: [spambayes-dev] How do I configure Eudora for use with SpamBayes?
Message-ID: <6.0.1.1.2.20040130123140.02196600@localhost>

Hi,

Just in case you are not aware of it, Spambayes 0.7a also works fine with Eudora 6.0.x for Windows. The procedure explained in the FAQ is accurate for this version, but with Eudora 6.0.x you can also enable header filtering if you add the following line to SpamHeaders.txt in the Plugins directory of the Eudora folder (usually C:\Program Files\Eudora):

"X-Spambayes-Spam-Probability:.*(0.[0-9]+).*" 0 0 100 kTextFloatMatch

Naturally, Eudora users must enable spam filtering in Eudora to use this functionality. To configure spam filtering in Eudora, see the Eudora User Guide.

NOTE: This feature works only for users who have Eudora 6 in "Paid Mode", because in "Sponsored Mode" or "Light Mode" the SpamWatch feature is disabled and not available.

F. Chenier
Quebec City
CANADA

-------------- next part --------------
---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.566 / Virus Database: 357 - Release Date: 2004/01/22
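The pattern in that SpamHeaders.txt line captures the numeric score from the X-Spambayes-Spam-Probability header (presumably so that Eudora can map it onto its own junk score). For the curious, the same regular expression exercised from Python, with a made-up sample header value:

    import re

    # pattern copied verbatim from the Eudora filter line above
    pattern = re.compile(r"X-Spambayes-Spam-Probability:.*(0.[0-9]+).*")

    m = pattern.search("X-Spambayes-Spam-Probability: 0.97")
    if m:
        print m.group(1)    # prints 0.97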