[Spambayes-checkins] SF.net SVN: spambayes: [3156] trunk/website
montanaro at users.sourceforge.net
Wed Jul 25 15:51:11 CEST 2007
Revision: 3156
http://spambayes.svn.sourceforge.net/spambayes/?rev=3156&view=rev
Author: montanaro
Date: 2007-07-25 06:51:11 -0700 (Wed, 25 Jul 2007)
Log Message:
-----------
read the file name incorrectly as a misspelling of "prefs changelog" instead of "pre sf changelog"!
Added Paths:
-----------
trunk/website/presfchangelog.ht
Removed Paths:
-------------
trunk/website/prefschangelog.ht
Deleted: trunk/website/prefschangelog.ht
===================================================================
--- trunk/website/prefschangelog.ht 2007-07-25 13:49:42 UTC (rev 3155)
+++ trunk/website/prefschangelog.ht 2007-07-25 13:51:11 UTC (rev 3156)
@@ -1,905 +0,0 @@
-<h2>Pre-Sourceforge ChangeLog</h2>
-<p>This changelog lists the commits on the spambayes projects before the
- separate project was set up. See also the
-<a href="http://spambayes.cvs.sourceforge.net/python/python/nondist/sandbox/spambayes/?hideattic=0">old CVS repository</a>, but don't forget that it's now out of date, and you probably want to be looking at <a href="http://spambayes.cvs.sourceforge.net/spambayes/spambayes/">the current CVS</a>.
-</p>
-<pre>
-2002-09-06 02:27 tim_one
-
- * GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
- cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
- (1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
-
- This code has been moved to a new SourceForge project (spambayes).
-
-2002-09-05 15:37 tim_one
-
- * classifier.py (1.11):
-
- Added note about MINCOUNT oddities.
-
-2002-09-05 14:32 tim_one
-
- * timtest.py (1.17):
-
- Added note about word length.
-
-2002-09-05 13:48 tim_one
-
- * timtest.py (1.16):
-
- tokenize_word(): Oops! This was awfully permissive in what it
- took as being "an email address". Tightened that, and also
- avoided 5-gram'ing of email addresses w/ high-bit characters.
-
- false positive percentages
- 0.000 0.000 tied
- 0.000 0.000 tied
- 0.050 0.050 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.050 lost
- 0.075 0.075 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
-
- won 0 times
- tied 19 times
- lost 1 times
-
- total unique fp went from 7 to 8
-
- false negative percentages
- 0.764 0.691 won
- 0.691 0.655 won
- 0.981 0.945 won
- 1.309 1.309 tied
- 1.418 1.164 won
- 0.873 0.800 won
- 0.800 0.763 won
- 1.163 1.163 tied
- 1.491 1.345 won
- 1.200 1.127 won
- 1.381 1.345 won
- 1.454 1.490 lost
- 1.164 0.909 won
- 0.655 0.582 won
- 0.655 0.691 lost
- 1.163 1.163 tied
- 1.200 1.018 won
- 0.982 0.873 won
- 0.982 0.909 won
- 1.236 1.127 won
-
- won 15 times
- tied 3 times
- lost 2 times
-
- total unique fn went from 260 to 249
-
- Note: Each of the two losses there consists of just 1 msg difference.
- The wins are bigger as well as being more common, and 260-249 = 11
- spams no longer sneak by any run (which is more than 4% of the 260
- spams that used to sneak thru!).
-
-2002-09-05 11:51 tim_one
-
- * classifier.py (1.10):
-
- Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
- really matter; leaving it alone.
-
-2002-09-05 10:02 tim_one
-
- * classifier.py (1.9):
-
- A now-rare pure win, changing spamprob() to work harder to find more
- evidence when competing 0.01 and 0.99 clues appear. Before in the left
- column, after in the right:
-
- false positive percentages
- 0.000 0.000 tied
- 0.000 0.000 tied
- 0.050 0.050 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.075 0.075 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.075 0.025 won
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
-
- won 1 times
- tied 19 times
- lost 0 times
-
- total unique fp went from 9 to 7
-
- false negative percentages
- 0.909 0.764 won
- 0.800 0.691 won
- 1.091 0.981 won
- 1.381 1.309 won
- 1.491 1.418 won
- 1.055 0.873 won
- 0.945 0.800 won
- 1.236 1.163 won
- 1.564 1.491 won
- 1.200 1.200 tied
- 1.454 1.381 won
- 1.599 1.454 won
- 1.236 1.164 won
- 0.800 0.655 won
- 0.836 0.655 won
- 1.236 1.163 won
- 1.236 1.200 won
- 1.055 0.982 won
- 1.127 0.982 won
- 1.381 1.236 won
-
- won 19 times
- tied 1 times
- lost 0 times
-
- total unique fn went from 284 to 260
-
-2002-09-04 11:21 tim_one
-
- * timtest.py (1.15):
-
- Augmented the spam callback to display spams with low probability.
-
-2002-09-04 09:53 tim_one
-
- * Tester.py (1.3), timtest.py (1.14):
-
- Added support for simple histograms of the probability distributions for
- ham and spam.
-
-2002-09-03 12:13 tim_one
-
- * timtest.py (1.13):
-
- A reluctant "on principle" change no matter what it does to the stats:
- take a stab at removing HTML decorations from plain text msgs. See
- comments for why it's *only* in plain text msgs. This puts an end to
- false positives due to text msgs talking *about* HTML. Surprisingly, it
- also gets rid of some false negatives. Not surprisingly, it introduced
- another small class of false positives due to the dumbass regexp trick
- used to approximate HTML tag removal removing pieces of text that had
- nothing to do with HTML tags (e.g., this happened in the middle of a
- uuencoded .py file in such a way that it just happened to leave behind
- a string that "looked like" a spam phrase; but before this it looked
- like a pile of "too long" lines that didn't generate any tokens --
- it's a nonsense outcome either way).
-
- false positive percentages
- 0.000 0.000 tied
- 0.000 0.000 tied
- 0.050 0.050 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.025 lost
- 0.075 0.075 tied
- 0.050 0.025 won
- 0.025 0.025 tied
- 0.000 0.025 lost
- 0.050 0.075 lost
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
-
- won 1 times
- tied 16 times
- lost 3 times
-
- total unique fp went from 8 to 9
-
- false negative percentages
- 0.945 0.909 won
- 0.836 0.800 won
- 1.200 1.091 won
- 1.418 1.381 won
- 1.455 1.491 lost
- 1.091 1.055 won
- 1.091 0.945 won
- 1.236 1.236 tied
- 1.564 1.564 tied
- 1.236 1.200 won
- 1.563 1.454 won
- 1.563 1.599 lost
- 1.236 1.236 tied
- 0.836 0.800 won
- 0.873 0.836 won
- 1.236 1.236 tied
- 1.273 1.236 won
- 1.018 1.055 lost
- 1.091 1.127 lost
- 1.490 1.381 won
-
- won 12 times
- tied 4 times
- lost 4 times
-
- total unique fn went from 292 to 284
-
-2002-09-03 06:57 tim_one
-
- * classifier.py (1.8):
-
- Added a new xspamprob() method, which computes the combined probability
- "correctly", and a long comment block explaining what happened when I
- tried it. There's something worth pursuing here (it greatly improves
- the false negative rate), but this change alone pushes too many marginal
- hams into the spam camp.
-
-2002-09-03 05:23 tim_one
-
- * timtest.py (1.12):
-
- Made "skip:" tokens shorter.
-
- Added a surprising treatment of Organization headers, with a tiny f-n
- benefit for a tiny cost. No change in f-p stats.
-
- false negative percentages
- 1.091 0.945 won
- 0.945 0.836 won
- 1.236 1.200 won
- 1.454 1.418 won
- 1.491 1.455 won
- 1.091 1.091 tied
- 1.127 1.091 won
- 1.236 1.236 tied
- 1.636 1.564 won
- 1.345 1.236 won
- 1.672 1.563 won
- 1.599 1.563 won
- 1.236 1.236 tied
- 0.836 0.836 tied
- 1.018 0.873 won
- 1.236 1.236 tied
- 1.273 1.273 tied
- 1.055 1.018 won
- 1.091 1.091 tied
- 1.527 1.490 won
-
- won 13 times
- tied 7 times
- lost 0 times
-
- total unique fn went from 302 to 292
-
-2002-09-03 02:18 tim_one
-
- * timtest.py (1.11):
-
- tokenize_word(): dropped the prefix from the signature; it's faster
- to let the caller do it, and this also repaired a bug in one place it
- was being used (well, a *conceptual* bug anyway, in that the code didn't
- do what I intended there). This changes the stats in an insignificant
- way. The f-p stats didn't change. The f-n stats shifted by one message
- in a few cases:
-
- false negative percentages
- 1.091 1.091 tied
- 0.945 0.945 tied
- 1.200 1.236 lost
- 1.454 1.454 tied
- 1.491 1.491 tied
- 1.091 1.091 tied
- 1.091 1.127 lost
- 1.236 1.236 tied
- 1.636 1.636 tied
- 1.382 1.345 won
- 1.636 1.672 lost
- 1.599 1.599 tied
- 1.236 1.236 tied
- 0.836 0.836 tied
- 1.018 1.018 tied
- 1.236 1.236 tied
- 1.273 1.273 tied
- 1.055 1.055 tied
- 1.091 1.091 tied
- 1.527 1.527 tied
-
- won 1 times
- tied 16 times
- lost 3 times
-
- total unique unchanged
-
-2002-09-02 19:30 tim_one
-
- * timtest.py (1.10):
-
- Don't ask me why this helps -- I don't really know! When skipping "long
- words", generating a token with a brief hint about what and how much got
- skipped makes a definite improvement in the f-n rate, and doesn't affect
- the f-p rate at all. Since experiment said it's a winner, I'm checking
- it in. Before (left column) and after (right column):
-
- false positive percentages
- 0.000 0.000 tied
- 0.000 0.000 tied
- 0.050 0.050 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.075 0.075 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.050 0.050 tied
- 0.025 0.025 tied
- 0.025 0.025 tied
- 0.000 0.000 tied
- 0.025 0.025 tied
- 0.050 0.050 tied
-
- won 0 times
- tied 20 times
- lost 0 times
-
- total unique fp went from 8 to 8
-
- false negative percentages
- 1.236 1.091 won
- 1.164 0.945 won
- 1.454 1.200 won
- 1.599 1.454 won
- 1.527 1.491 won
- 1.236 1.091 won
- 1.163 1.091 won
- 1.309 1.236 won
- 1.891 1.636 won
- 1.418 1.382 won
- 1.745 1.636 won
- 1.708 1.599 won
- 1.491 1.236 won
- 0.836 0.836 tied
- 1.091 1.018 won
- 1.309 1.236 won
- 1.491 1.273 won
- 1.127 1.055 won
- 1.309 1.091 won
- 1.636 1.527 won
-
- won 19 times
- tied 1 times
- lost 0 times
-
- total unique fn went from 336 to 302
-
-2002-09-02 17:55 tim_one
-
- * timtest.py (1.9):
-
- Some comment changes and nesting reduction.
-
-2002-09-02 11:18 tim_one
-
- * timtest.py (1.8):
-
- Fixed some out-of-date comments.
-
- Made URL clumping lumpier: now distinguishes among just "first field",
- "second field", and "everything else".
-
- Changed tag names for email address fields (semantically neutral).
-
- Added "From:" line tagging.
-
- These add up to an almost pure win. Before-and-after f-n rates across 20
- runs:
-
- 1.418 1.236
- 1.309 1.164
- 1.636 1.454
- 1.854 1.599
- 1.745 1.527
- 1.418 1.236
- 1.381 1.163
- 1.418 1.309
- 2.109 1.891
- 1.491 1.418
- 1.854 1.745
- 1.890 1.708
- 1.818 1.491
- 1.055 0.836
- 1.164 1.091
- 1.599 1.309
- 1.600 1.491
- 1.127 1.127
- 1.164 1.309
- 1.781 1.636
-
- It only increased in one run. The variance appears to have been reduced
- too (I didn't bother to compute that, though).
-
- Before-and-after f-p rates across 20 runs:
-
- 0.000 0.000
- 0.000 0.000
- 0.075 0.050
- 0.000 0.000
- 0.025 0.025
- 0.050 0.025
- 0.075 0.050
- 0.025 0.025
- 0.025 0.025
- 0.025 0.000
- 0.100 0.075
- 0.050 0.050
- 0.025 0.025
- 0.000 0.000
- 0.075 0.050
- 0.025 0.025
- 0.025 0.025
- 0.000 0.000
- 0.075 0.025
- 0.100 0.050
-
- Note that 0.025% is a single message; it's really impossible to *measure*
- an improvement in the f-p rate anymore with 4000-msg ham sets.
-
- Across all 20 runs,
-
- the total # of unique f-n fell from 353 to 336
- the total # of unique f-p fell from 13 to 8
-
-2002-09-02 10:06 tim_one
-
- * timtest.py (1.7):
-
- A number of changes. The most significant is paying attention to the
- Subject line (I was wrong before when I said my c.l.py ham corpus was
- unusable for this due to Mailman-injected decorations). In all, across
- my 20 test runs,
-
- the total # of unique false positives fell from 23 to 13
- the total # of unique false negatives rose from 337 to 353
-
- Neither result is statistically significant, although I bet the first
- one would be if I pissed away a few days trying to come up with a more
- realistic model for what "stat. sig." means here <wink>.
-
-2002-09-01 17:22 tim_one
-
- * classifier.py (1.7):
-
- Added a comment block about HAMBIAS experiments. There's no clearer
- example of trading off precision against recall, and you can favor either
- at the expense of the other to any degree you like by fiddling this knob.
-
-2002-09-01 14:42 tim_one
-
- * timtest.py (1.6):
-
- Long new comment block summarizing all my experiments with character
- n-grams. Bottom line is that they have nothing going for them, and a
- lot going against them, under Graham's scheme. I believe there may
- still be a place for them in *part* of a word-based tokenizer, though.
-
-2002-09-01 10:05 tim_one
-
- * classifier.py (1.6):
-
- spamprob(): Never count unique words more than once anymore. Counting
- up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
- that's now a small drag instead.
-
-2002-09-01 07:33 tim_one
-
- * rebal.py (1.3), timtest.py (1.5):
-
- Folding case is here to stay. Read the new comments for why. This may
- be a bad idea for other languages, though.
-
- Refined the embedded-URL tagging scheme. Curious: as a protocol,
- http is spam-neutral, but https is a strong spam indicator. That
- surprised me.
-
-2002-09-01 06:47 tim_one
-
- * classifier.py (1.5):
-
- spamprob(): Removed useless check that wordstream isn't empty. For one
- thing, it didn't work, since wordstream is often an iterator. Even if
- it did work, it isn't needed -- the probability of an empty wordstream
- gets computed as 0.5 based on the total absence of evidence.
-
-2002-09-01 05:37 tim_one
-
- * timtest.py (1.4):
-
- textparts(): Worm around what feels like a bug in msg.walk() (Barry has
- details).
-
-2002-09-01 05:09 tim_one
-
- * rebal.py (1.2):
-
- Aha! Staring at the checkin msg revealed a logic bug that explains why
- my ham directories sometimes remained unbalanced after running this --
- if the randomly selected reservoir msg turned out to be spam, it wasn't
- pushing the too-small directory on the stack again.
-
-2002-09-01 04:56 tim_one
-
- * timtest.py (1.3):
-
- textparts(): This was failing to weed out redundant HTML in cases like
- this:
-
- multipart/alternative
- text/plain
- multipart/related
- text/html
-
- The tokenizer here also transforms everything to lowercase, but that's
- an accident due simply to that I'm testing that now. Can't say for
- sure until the test runs end, but so far it looks like a bad idea for
- the false positive rate.
-
-2002-09-01 04:52 tim_one
-
- * rebal.py (1.1):
-
- A little script I use to rebalance the ham corpora after deleting what
- turns out to be spam. I have another Ham/reservoir directory with a
- few thousand randomly selected msgs from the presumably-good archive.
- These aren't used in scoring or training. This script marches over all
- the ham corpora directories that are used, and if any have gotten too
- big (this never happens anymore) deletes msgs at random from them, and
- if any have gotten too small plugs the holes by moving in random
- msgs from the reservoir.
-
-2002-09-01 03:25 tim_one
-
- * classifier.py (1.4), timtest.py (1.2):
-
- Boost UNKNOWN_SPAMPROB.
- # The spam probability assigned to words never seen before. Graham used
- # 0.2 here. Neil Schemenauer reported that 0.5 seemed to work better. In
- # Tim's content-only tests (no headers), boosting to 0.5 cut the false
- # negative rate by over 1/3. The f-p rate increased, but there were so few
- # f-ps that the increase wasn't statistically significant. It also caught
- # 13 more spams erroneously classified as ham. By eyeball (and common
- # sense <wink>), this has most effect on very short messages, where there
- # simply aren't many high-value words. A word with prob 0.5 is (in effect)
- # completely ignored by spamprob(), in favor of *any* word with *any* prob
- # differing from 0.5. At 0.2, an unknown word favors ham at the expense
- # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
- # on the face of it.
-
-2002-08-31 16:50 tim_one
-
- * timtest.py (1.1):
-
- This is a driver I've been using for test runs. It's specific to my
- corpus directories, but has useful stuff in it all the same.
-
-2002-08-31 16:49 tim_one
-
- * classifier.py (1.3):
-
- The explanation for these changes was on Python-Dev. You'll find out
- why if the moderator approves the msg <wink>.
-
-2002-08-29 07:04 tim_one
-
- * Tester.py (1.2), classifier.py (1.2):
-
- Tester.py: Repaired a comment. The false_{positive,negative}_rate()
- functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
- too hard to get motivated to reduce 0.01 <0.1 wink>).
-
- GrahamBayes.spamprob: New optional bool argument; when true, a list of
- the 15 strongest (word, probability) pairs is returned as well as the
- overall probability (this is how to find out why a message scored as it
- did).
-
-2002-08-28 13:45 montanaro
-
- * GBayes.py (1.15):
-
- ehh - it actually didn't work all that well. the spurious report that it
- did well was pilot error. besides, tim's report suggests that a simple
- str.split() may be the best tokenizer anyway.
-
-2002-08-28 10:45 montanaro
-
- * setup.py (1.1):
-
- trivial little setup.py file - i don't expect most people will be interested
- in this, but it makes it a tad simpler to work with now that there are two
- files
-
-2002-08-28 10:43 montanaro
-
- * GBayes.py (1.14):
-
- add simple trigram tokenizer - this seems to yield the best results I've
- seen so far (but has not been extensively tested)
-
-2002-08-28 08:10 tim_one
-
- * Tester.py (1.1):
-
- A start at a testing class. There isn't a lot here, but it automates
- much of the tedium, and as the doctest shows it can already do
- useful things, like remembering which inputs were misclassified.
-
-2002-08-27 06:45 tim_one
-
- * mboxcount.py (1.5):
-
- Updated stats to what Barry and I both get now. Fiddled output.
-
-2002-08-27 05:09 bwarsaw
-
- * split.py (1.5), splitn.py (1.2):
-
- _factory(): Return the empty string instead of None in the except
- clauses, so that for-loops won't break prematurely. mailbox.py's base
- class defines an __iter__() that raises a StopIteration on None
- return.
-
-2002-08-27 04:55 tim_one
-
- * GBayes.py (1.13), mboxcount.py (1.4):
-
- Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
-
-2002-08-27 04:40 bwarsaw
-
- * mboxcount.py (1.3):
-
- Some stats after splitting b/w good messages and unparseable messages
-
-2002-08-27 04:23 bwarsaw
-
- * mboxcount.py (1.2):
-
- _factory(): Use a marker object to designate between good messages and
- unparseable messages. For some reason, returning None from the except
- clause in _factory() caused Python 2.2.1 to exit early out of the for
- loop.
-
- main(): Print statistics about both the number of good messages and
- the number of unparseable messages.
-
-2002-08-27 03:06 tim_one
-
- * cleanarch (1.2):
-
- "From " is a header more than a separator, so don't bump the msg count
- at the end.
-
-2002-08-24 01:42 tim_one
-
- * GBayes.py (1.12), classifier.py (1.1):
-
- Moved all the interesting code that was in the *original* GBayes.py into
- a new classifier.py. It was designed to have a very clean interface,
- and there's no reason to keep slamming everything into one file. The
- ever-growing tokenizer stuff should probably also be split out, leaving
- GBayes.py a pure driver.
-
- Also repaired _test() (Skip's checkin left it without a binding for
- the tokenize function).
-
-2002-08-24 01:17 tim_one
-
- * splitn.py (1.1):
-
- Utility to split an mbox into N random pieces in one gulp. This gives
- a convenient way to break a giant corpus into multiple files that can
- then be used independently across multiple training and testing runs.
- It's important to do multiple runs on different random samples to avoid
- drawing conclusions based on accidents in a single random training corpus;
- if the algorithm is robust, it should have similar performance across
- all runs.
-
-2002-08-24 00:25 montanaro
-
- * GBayes.py (1.11):
-
- Allow command line specification of tokenize functions
- run w/ -t flag to override default tokenize function
- run w/ -H flag to see list of tokenize functions
-
- When adding a new tokenizer, make docstring a short description and add a
- key/value pair to the tokenizers dict. The key is what the user specifies.
- The value is a tokenize function.
-
- Added two new tokenizers - tokenize_wordpairs_foldcase and
- tokenize_words_and_pairs. It's not obvious that either is better than any
- of the preexisting functions.
-
- Should probably add info to the pickle which indicates the tokenizing
- function used to build it. This could then be the default for spam
- detection runs.
-
- Next step is to drive this with spam/non-spam corpora, selecting each of the
- various tokenizer functions, and presenting the results in tabular form.
-
-2002-08-23 13:10 tim_one
-
- * GBayes.py (1.10):
-
- spamprob(): Commented some subtleties.
-
- clearjunk(): Undid Guido's attempt to space-optimize this. The problem
- is that you can't delete entries from a dict that's being crawled over
- by .iteritems(), which is why I (I suddenly recall) materialized a
- list of words to be deleted the first time I wrote this. It's a lot
- better to materialize a list of to-be-deleted words than to materialize
- the entire database in a dict.items() list.
-
-2002-08-23 12:36 tim_one
-
- * mboxcount.py (1.1):
-
- Utility to count and display the # of msgs in (one or more) Unix mboxes.
-
-2002-08-23 12:11 tim_one
-
- * split.py (1.4):
-
- Open files in binary mode. Else, e.g., about 400MB of Barry's python-list
- corpus vanishes on Windows. Also use file.write() instead of print>>, as
- the latter invents an extra newline.
-
-2002-08-22 07:01 tim_one
-
- * GBayes.py (1.9):
-
- Renamed "modtime" to "atime", to better reflect its meaning, and added a
- comment block to explain that better.
-
-2002-08-21 08:07 bwarsaw
-
- * split.py (1.3):
-
- Guido suggests a different order for the positional args.
-
-2002-08-21 07:37 bwarsaw
-
- * split.py (1.2):
-
- Get rid of the -1 and -2 arguments and make them positional.
-
-2002-08-21 07:18 bwarsaw
-
- * split.py (1.1):
-
- A simple mailbox splitter
-
-2002-08-21 06:42 tim_one
-
- * GBayes.py (1.8):
-
- Added a bunch of simple tokenizers. The originals are renamed to
- tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
- New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
- tokenize_5gram, tokenize_10gram, and tokenize_15gram. I don't expect
- any of these to be the last word. When Barry has the test corpus
- set up it should be easy to let the data tell us which "pure" strategy
- works best. Straight character n-grams are very appealing because
- they're the simplest and most language-neutral; I didn't have any luck
- with them over the weekend, but the size of my training data was
- trivial.
-
-2002-08-21 05:08 bwarsaw
-
- * cleanarch (1.1):
-
- An archive cleaner, adapted from the Mailman 2.1b3 version, but
- de-Mailman-ified.
-
-2002-08-21 04:44 gvanrossum
-
- * GBayes.py (1.7):
-
- Indent repair in clearjunk().
-
-2002-08-21 04:22 gvanrossum
-
- * GBayes.py (1.6):
-
- Some minor cleanup:
-
- - Move the identifying comment to the top, clarify it a bit, and add
- author info.
-
- - There's no reason for _time and _heapreplace to be hidden names;
- change these back to time and heapreplace.
-
- - Rename main1() to _test() and main2() to main(); when main() sees
- there are no options or arguments, it runs _test().
-
- - Get rid of a list comprehension from clearjunk().
-
- - Put wordinfo.get as a local variable in _add_msg().
-
-2002-08-20 15:16 tim_one
-
- * GBayes.py (1.5):
-
- Neutral typo repairs, except that clearjunk() has a better chance of
- not blowing up immediately now <wink -- I have yet to try it!>.
-
-2002-08-20 13:49 montanaro
-
- * GBayes.py (1.4):
-
- help make it more easily executable... ;-)
-
-2002-08-20 09:32 bwarsaw
-
- * GBayes.py (1.3):
-
- Lots of hacks great and small to the main() program, but I didn't
- touch the guts of the algorithm.
-
- Added a module docstring/usage message.
-
- Added a bunch of switches to train the system on an mbox of known good
- and known spam messages (using PortableUnixMailbox only for now).
- Uses the email package but does no decoding of message bodies. Also,
- allows you to specify a file for pickling the training data, and for
- setting a threshold, above which messages get an X-Bayes-Score
- header. Also output messages (marked and unmarked) to an output file
- for retraining.
-
- Print some statistics at the end.
-
-2002-08-20 05:43 tim_one
-
- * GBayes.py (1.2):
-
- Turned off debugging vrbl mistakenly checked in at True.
-
- unlearn(): Gave this an update_probabilities=True default arg, for
- symmetry with learn().
-
-2002-08-20 03:33 tim_one
-
- * GBayes.py (1.1):
-
- An implementation of Paul Graham's Bayes-like spam classifier.
-
-</pre>
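
A note on the 2002-08-23 GBayes.py (1.10) entry in the changelog above: it says clearjunk() went back to materializing a list of to-be-deleted words, because entries cannot be deleted from a dict while it is being iterated over. The sketch below illustrates that pattern in Python 3 (the original code used Python 2's .iteritems()). The function names, the shape of wordinfo, and the is_junk predicate are hypothetical stand-ins chosen for illustration; this is not the actual GBayes.py code.

```python
def clear_junk(wordinfo, is_junk):
    """Delete every entry whose record is junk; return how many were removed.

    wordinfo and is_junk are hypothetical: a plain dict mapping word -> record,
    and a predicate deciding whether a record should be dropped.
    """
    # Collect only the doomed keys first. This list is usually far smaller
    # than a full copy of the database would be.
    doomed = [word for word, record in wordinfo.items() if is_junk(record)]
    for word in doomed:
        del wordinfo[word]
    return len(doomed)


def clear_junk_unsafe(wordinfo, is_junk):
    """The tempting version: deletes while iterating, which Python rejects."""
    for word, record in wordinfo.items():
        if is_junk(record):
            del wordinfo[word]  # changes the dict's size mid-iteration


if __name__ == "__main__":
    db = {"old": 0, "fresh": 5, "stale": 0}
    print(clear_junk(db, lambda count: count == 0), db)

    try:
        clear_junk_unsafe({"old": 0, "fresh": 5}, lambda count: count == 0)
    except RuntimeError as exc:
        print("unsafe version failed:", exc)
```

In Python 3 the unsafe variant raises RuntimeError once the loop continues after a deletion. Collecting only the doomed keys keeps the extra memory proportional to the number of deletions rather than to the size of the whole database, which is the trade-off the changelog entry describes.
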
Copied: trunk/website/presfchangelog.ht (from rev 3155, trunk/website/prefschangelog.ht)
===================================================================
--- trunk/website/presfchangelog.ht (rev 0)
+++ trunk/website/presfchangelog.ht 2007-07-25 13:51:11 UTC (rev 3156)
@@ -0,0 +1,905 @@
+<h2>Pre-Sourceforge ChangeLog</h2>
+<p>This changelog lists the commits on the spambayes projects before the
+ separate project was set up. See also the
+<a href="http://spambayes.cvs.sourceforge.net/python/python/nondist/sandbox/spambayes/?hideattic=0">old CVS repository</a>, but don't forget that it's now out of date, and you probably want to be looking at <a href="http://spambayes.cvs.sourceforge.net/spambayes/spambayes/">the current CVS</a>.
+</p>
+<pre>
+2002-09-06 02:27 tim_one
+
+ * GBayes.py (1.16), Tester.py (1.4), classifier.py (1.12),
+ cleanarch (1.3), mboxcount.py (1.6), rebal.py (1.4), setup.py
+ (1.2), split.py (1.6), splitn.py (1.3), timtest.py (1.18):
+
+ This code has been moved to a new SourceForge project (spambayes).
+
+2002-09-05 15:37 tim_one
+
+ * classifier.py (1.11):
+
+ Added note about MINCOUNT oddities.
+
+2002-09-05 14:32 tim_one
+
+ * timtest.py (1.17):
+
+ Added note about word length.
+
+2002-09-05 13:48 tim_one
+
+ * timtest.py (1.16):
+
+ tokenize_word(): Oops! This was awfully permissive in what it
+ took as being "an email address". Tightened that, and also
+ avoided 5-gram'ing of email addresses w/ high-bit characters.
+
+ false positive percentages
+ 0.000 0.000 tied
+ 0.000 0.000 tied
+ 0.050 0.050 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.050 lost
+ 0.075 0.075 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+
+ won 0 times
+ tied 19 times
+ lost 1 times
+
+ total unique fp went from 7 to 8
+
+ false negative percentages
+ 0.764 0.691 won
+ 0.691 0.655 won
+ 0.981 0.945 won
+ 1.309 1.309 tied
+ 1.418 1.164 won
+ 0.873 0.800 won
+ 0.800 0.763 won
+ 1.163 1.163 tied
+ 1.491 1.345 won
+ 1.200 1.127 won
+ 1.381 1.345 won
+ 1.454 1.490 lost
+ 1.164 0.909 won
+ 0.655 0.582 won
+ 0.655 0.691 lost
+ 1.163 1.163 tied
+ 1.200 1.018 won
+ 0.982 0.873 won
+ 0.982 0.909 won
+ 1.236 1.127 won
+
+ won 15 times
+ tied 3 times
+ lost 2 times
+
+ total unique fn went from 260 to 249
+
+ Note: Each of the two losses there consists of just 1 msg difference.
+ The wins are bigger as well as being more common, and 260-249 = 11
+ spams no longer sneak by any run (which is more than 4% of the 260
+ spams that used to sneak thru!).
+
+2002-09-05 11:51 tim_one
+
+ * classifier.py (1.10):
+
+ Comment about test results moving MAX_DISCRIMINATORS back to 15; doesn't
+ really matter; leaving it alone.
+
+2002-09-05 10:02 tim_one
+
+ * classifier.py (1.9):
+
+ A now-rare pure win, changing spamprob() to work harder to find more
+ evidence when competing 0.01 and 0.99 clues appear. Before in the left
+ column, after in the right:
+
+ false positive percentages
+ 0.000 0.000 tied
+ 0.000 0.000 tied
+ 0.050 0.050 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.075 0.075 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.075 0.025 won
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+
+ won 1 times
+ tied 19 times
+ lost 0 times
+
+ total unique fp went from 9 to 7
+
+ false negative percentages
+ 0.909 0.764 won
+ 0.800 0.691 won
+ 1.091 0.981 won
+ 1.381 1.309 won
+ 1.491 1.418 won
+ 1.055 0.873 won
+ 0.945 0.800 won
+ 1.236 1.163 won
+ 1.564 1.491 won
+ 1.200 1.200 tied
+ 1.454 1.381 won
+ 1.599 1.454 won
+ 1.236 1.164 won
+ 0.800 0.655 won
+ 0.836 0.655 won
+ 1.236 1.163 won
+ 1.236 1.200 won
+ 1.055 0.982 won
+ 1.127 0.982 won
+ 1.381 1.236 won
+
+ won 19 times
+ tied 1 times
+ lost 0 times
+
+ total unique fn went from 284 to 260
+
+2002-09-04 11:21 tim_one
+
+ * timtest.py (1.15):
+
+ Augmented the spam callback to display spams with low probability.
+
+2002-09-04 09:53 tim_one
+
+ * Tester.py (1.3), timtest.py (1.14):
+
+ Added support for simple histograms of the probability distributions for
+ ham and spam.
+
+2002-09-03 12:13 tim_one
+
+ * timtest.py (1.13):
+
+ A reluctant "on principle" change no matter what it does to the stats:
+ take a stab at removing HTML decorations from plain text msgs. See
+ comments for why it's *only* in plain text msgs. This puts an end to
+ false positives due to text msgs talking *about* HTML. Surprisingly, it
+ also gets rid of some false negatives. Not surprisingly, it introduced
+ another small class of false positives due to the dumbass regexp trick
+ used to approximate HTML tag removal removing pieces of text that had
+ nothing to do with HTML tags (e.g., this happened in the middle of a
+ uuencoded .py file in such a way that it just happened to leave behind
+ a string that "looked like" a spam phrase; but before this it looked
+ like a pile of "too long" lines that didn't generate any tokens --
+ it's a nonsense outcome either way).
+
+ false positive percentages
+ 0.000 0.000 tied
+ 0.000 0.000 tied
+ 0.050 0.050 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.025 lost
+ 0.075 0.075 tied
+ 0.050 0.025 won
+ 0.025 0.025 tied
+ 0.000 0.025 lost
+ 0.050 0.075 lost
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+
+ won 1 times
+ tied 16 times
+ lost 3 times
+
+ total unique fp went from 8 to 9
+
+ false negative percentages
+ 0.945 0.909 won
+ 0.836 0.800 won
+ 1.200 1.091 won
+ 1.418 1.381 won
+ 1.455 1.491 lost
+ 1.091 1.055 won
+ 1.091 0.945 won
+ 1.236 1.236 tied
+ 1.564 1.564 tied
+ 1.236 1.200 won
+ 1.563 1.454 won
+ 1.563 1.599 lost
+ 1.236 1.236 tied
+ 0.836 0.800 won
+ 0.873 0.836 won
+ 1.236 1.236 tied
+ 1.273 1.236 won
+ 1.018 1.055 lost
+ 1.091 1.127 lost
+ 1.490 1.381 won
+
+ won 12 times
+ tied 4 times
+ lost 4 times
+
+ total unique fn went from 292 to 284
+
+2002-09-03 06:57 tim_one
+
+ * classifier.py (1.8):
+
+ Added a new xspamprob() method, which computes the combined probability
+ "correctly", and a long comment block explaining what happened when I
+ tried it. There's something worth pursuing here (it greatly improves
+ the false negative rate), but this change alone pushes too many marginal
+ hams into the spam camp.
+
+2002-09-03 05:23 tim_one
+
+ * timtest.py (1.12):
+
+ Made "skip:" tokens shorter.
+
+ Added a surprising treatment of Organization headers, with a tiny f-n
+ benefit for a tiny cost. No change in f-p stats.
+
+ false negative percentages
+ 1.091 0.945 won
+ 0.945 0.836 won
+ 1.236 1.200 won
+ 1.454 1.418 won
+ 1.491 1.455 won
+ 1.091 1.091 tied
+ 1.127 1.091 won
+ 1.236 1.236 tied
+ 1.636 1.564 won
+ 1.345 1.236 won
+ 1.672 1.563 won
+ 1.599 1.563 won
+ 1.236 1.236 tied
+ 0.836 0.836 tied
+ 1.018 0.873 won
+ 1.236 1.236 tied
+ 1.273 1.273 tied
+ 1.055 1.018 won
+ 1.091 1.091 tied
+ 1.527 1.490 won
+
+ won 13 times
+ tied 7 times
+ lost 0 times
+
+ total unique fn went from 302 to 292
+
+2002-09-03 02:18 tim_one
+
+ * timtest.py (1.11):
+
+ tokenize_word(): dropped the prefix from the signature; it's faster
+ to let the caller do it, and this also repaired a bug in one place it
+ was being used (well, a *conceptual* bug anyway, in that the code didn't
+ do what I intended there). This changes the stats in an insignificant
+ way. The f-p stats didn't change. The f-n stats shifted by one message
+ in a few cases:
+
+ false negative percentages
+ 1.091 1.091 tied
+ 0.945 0.945 tied
+ 1.200 1.236 lost
+ 1.454 1.454 tied
+ 1.491 1.491 tied
+ 1.091 1.091 tied
+ 1.091 1.127 lost
+ 1.236 1.236 tied
+ 1.636 1.636 tied
+ 1.382 1.345 won
+ 1.636 1.672 lost
+ 1.599 1.599 tied
+ 1.236 1.236 tied
+ 0.836 0.836 tied
+ 1.018 1.018 tied
+ 1.236 1.236 tied
+ 1.273 1.273 tied
+ 1.055 1.055 tied
+ 1.091 1.091 tied
+ 1.527 1.527 tied
+
+ won 1 times
+ tied 16 times
+ lost 3 times
+
+ total unique unchanged
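
A minimal sketch of how the won/tied/lost summary can be derived from the two columns, assuming lower percentages are better. The name `tally` is hypothetical and is not taken from the project's test driver; the data is the f-n table from the entry above.

```python
def tally(before, after):
    """Count runs where the error rate fell (won), stayed put (tied), or rose (lost)."""
    won = sum(1 for b, a in zip(before, after) if a < b)
    lost = sum(1 for b, a in zip(before, after) if a > b)
    tied = len(before) - won - lost
    return won, tied, lost


# False negative percentages, before (left column) and after (right column).
before = [1.091, 0.945, 1.200, 1.454, 1.491, 1.091, 1.091, 1.236, 1.636, 1.382,
          1.636, 1.599, 1.236, 0.836, 1.018, 1.236, 1.273, 1.055, 1.091, 1.527]
after = [1.091, 0.945, 1.236, 1.454, 1.491, 1.091, 1.127, 1.236, 1.636, 1.345,
         1.672, 1.599, 1.236, 0.836, 1.018, 1.236, 1.273, 1.055, 1.091, 1.527]

print(tally(before, after))  # (1, 16, 3)
```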
+
+2002-09-02 19:30 tim_one
+
+ * timtest.py (1.10):
+
+ Don't ask me why this helps -- I don't really know! When skipping "long
+ words", generating a token with a brief hint about what and how much got
+ skipped makes a definite improvement in the f-n rate, and doesn't affect
+ the f-p rate at all. Since experiment said it's a winner, I'm checking
+ it in. Before (left column) and after (right column):
+
+ false positive percentages
+ 0.000 0.000 tied
+ 0.000 0.000 tied
+ 0.050 0.050 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.075 0.075 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.050 0.050 tied
+ 0.025 0.025 tied
+ 0.025 0.025 tied
+ 0.000 0.000 tied
+ 0.025 0.025 tied
+ 0.050 0.050 tied
+
+ won 0 times
+ tied 20 times
+ lost 0 times
+
+ total unique fp went from 8 to 8
+
+ false negative percentages
+ 1.236 1.091 won
+ 1.164 0.945 won
+ 1.454 1.200 won
+ 1.599 1.454 won
+ 1.527 1.491 won
+ 1.236 1.091 won
+ 1.163 1.091 won
+ 1.309 1.236 won
+ 1.891 1.636 won
+ 1.418 1.382 won
+ 1.745 1.636 won
+ 1.708 1.599 won
+ 1.491 1.236 won
+ 0.836 0.836 tied
+ 1.091 1.018 won
+ 1.309 1.236 won
+ 1.491 1.273 won
+ 1.127 1.055 won
+ 1.309 1.091 won
+ 1.636 1.527 won
+
+ won 19 times
+ tied 1 times
+ lost 0 times
+
+ total unique fn went from 336 to 302
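
One way to picture the "brief hint" idea is sketched below. The entry does not spell out the token format, so everything here is an assumption for illustration: the 12-character cutoff, the "skip:" layout (first character plus length rounded down to a multiple of 10), and the name `tokenize_word_sketch`.

```python
def tokenize_word_sketch(word, max_len=12):
    """Yield the word itself, or a short summary token if it is too long.

    max_len and the summary format are assumed values, not the project's.
    """
    if len(word) <= max_len:
        yield word
    else:
        # Record what got skipped (first character) and roughly how much
        # (length rounded down to a multiple of 10).
        yield "skip:%s %d" % (word[0], len(word) // 10 * 10)


print(list(tokenize_word_sketch("hello")))
print(list(tokenize_word_sketch("x" * 37)))
```

Rounding the length keeps the number of distinct "skip:" tokens small, so each one is seen often enough in training to acquire a meaningful probability.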
+
+2002-09-02 17:55 tim_one
+
+ * timtest.py (1.9):
+
+ Some comment changes and nesting reduction.
+
+2002-09-02 11:18 tim_one
+
+ * timtest.py (1.8):
+
+ Fixed some out-of-date comments.
+
+ Made URL clumping lumpier: now distinguishes among just "first field",
+ "second field", and "everything else".
+
+ Changed tag names for email address fields (semantically neutral).
+
+ Added "From:" line tagging.
+
+ These add up to an almost pure win. Before-and-after f-n rates across 20
+ runs:
+
+ 1.418 1.236
+ 1.309 1.164
+ 1.636 1.454
+ 1.854 1.599
+ 1.745 1.527
+ 1.418 1.236
+ 1.381 1.163
+ 1.418 1.309
+ 2.109 1.891
+ 1.491 1.418
+ 1.854 1.745
+ 1.890 1.708
+ 1.818 1.491
+ 1.055 0.836
+ 1.164 1.091
+ 1.599 1.309
+ 1.600 1.491
+ 1.127 1.127
+ 1.164 1.309
+ 1.781 1.636
+
+ It only increased in one run. The variance appears to have been reduced
+ too (I didn't bother to compute that, though).
+
+ Before-and-after f-p rates across 20 runs:
+
+ 0.000 0.000
+ 0.000 0.000
+ 0.075 0.050
+ 0.000 0.000
+ 0.025 0.025
+ 0.050 0.025
+ 0.075 0.050
+ 0.025 0.025
+ 0.025 0.025
+ 0.025 0.000
+ 0.100 0.075
+ 0.050 0.050
+ 0.025 0.025
+ 0.000 0.000
+ 0.075 0.050
+ 0.025 0.025
+ 0.025 0.025
+ 0.000 0.000
+ 0.075 0.025
+ 0.100 0.050
+
+ Note that 0.025% is a single message; it's really impossible to *measure*
+ an improvement in the f-p rate anymore with 4000-msg ham sets.
+
+ Across all 20 runs,
+
+ the total # of unique f-n fell from 353 to 336
+ the total # of unique f-p fell from 13 to 8
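
The "0.025% is a single message" remark is simple arithmetic on the 4000-message ham sets. A small sketch converting the reported f-p percentages back to message counts (the variable names are illustrative only):

```python
ham_per_run = 4000                       # ham set size stated in the entry
one_message_pct = 100.0 / ham_per_run    # percentage represented by one message

# A few f-p rates taken from the table above, converted to whole messages.
rates = [0.075, 0.050, 0.025]
counts = [round(r / one_message_pct) for r in rates]

print(one_message_pct)
print(counts)
```

With a resolution of one message per 0.025%, a run-to-run change of 0.025 is the smallest difference the test can register at all.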
+
+2002-09-02 10:06 tim_one
+
+ * timtest.py (1.7):
+
+ A number of changes. The most significant is paying attention to the
+ Subject line (I was wrong before when I said my c.l.py ham corpus was
+ unusable for this due to Mailman-injected decorations). In all, across
+ my 20 test runs,
+
+ the total # of unique false positives fell from 23 to 13
+ the total # of unique false negatives rose from 337 to 353
+
+ Neither result is statistically significant, although I bet the first
+ one would be if I pissed away a few days trying to come up with a more
+ realistic model for what "stat. sig." means here <wink>.
+
+2002-09-01 17:22 tim_one
+
+ * classifier.py (1.7):
+
+ Added a comment block about HAMBIAS experiments. There's no clearer
+ example of trading off precision against recall, and you can favor either
+ at the expense of the other to any degree you like by fiddling this knob.
+
+2002-09-01 14:42 tim_one
+
+ * timtest.py (1.6):
+
+ Long new comment block summarizing all my experiments with character
+ n-grams. Bottom line is that they have nothing going for them, and a
+ lot going against them, under Graham's scheme. I believe there may
+ still be a place for them in *part* of a word-based tokenizer, though.
+
+2002-09-01 10:05 tim_one
+
+ * classifier.py (1.6):
+
+ spamprob(): Never count unique words more than once anymore. Counting
+ up to twice gave a small benefit when UNKNOWN_SPAMPROB was 0.2, but
+ that's now a small drag instead.
+
+2002-09-01 07:33 tim_one
+
+ * rebal.py (1.3), timtest.py (1.5):
+
+ Folding case is here to stay. Read the new comments for why. This may
+ be a bad idea for other languages, though.
+
+ Refined the embedded-URL tagging scheme. Curious: as a protocol,
+ http is spam-neutral, but https is a strong spam indicator. That
+ surprised me.
+
+2002-09-01 06:47 tim_one
+
+ * classifier.py (1.5):
+
+ spamprob(): Removed useless check that wordstream isn't empty. For one
+ thing, it didn't work, since wordstream is often an iterator. Even if
+ it did work, it isn't needed -- the probability of an empty wordstream
+ gets computed as 0.5 based on the total absence of evidence.
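
Why a truthiness check cannot detect an empty iterator, as a minimal sketch (the helper name `looks_empty` is hypothetical):

```python
def looks_empty(wordstream):
    """Naive emptiness test: fine for lists, wrong for iterators."""
    return not wordstream


print(looks_empty([]))        # True
print(looks_empty(iter([])))  # False: an iterator object is always truthy
```

An iterator has no length, so the only way to learn it is empty is to try to consume it.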
+
+2002-09-01 05:37 tim_one
+
+ * timtest.py (1.4):
+
+ textparts(): Worm around what feels like a bug in msg.walk() (Barry has
+ details).
+
+2002-09-01 05:09 tim_one
+
+ * rebal.py (1.2):
+
+ Aha! Staring at the checkin msg revealed a logic bug that explains why
+ my ham directories sometimes remained unbalanced after running this --
+ if the randomly selected reservoir msg turned out to be spam, it wasn't
+ pushing the too-small directory on the stack again.
+
+2002-09-01 04:56 tim_one
+
+ * timtest.py (1.3):
+
+ textparts(): This was failing to weed out redundant HTML in cases like
+ this:
+
+ multipart/alternative
+ text/plain
+ multipart/related
+ text/html
+
+ The tokenizer here also transforms everything to lowercase, but that's
+ only because I happen to be testing case folding right now. Can't say for
+ sure until the test runs end, but so far it looks like a bad idea for
+ the false positive rate.
+
+2002-09-01 04:52 tim_one
+
+ * rebal.py (1.1):
+
+ A little script I use to rebalance the ham corpora after deleting what
+ turns out to be spam. I have another Ham/reservoir directory with a
+ few thousand randomly selected msgs from the presumably-good archive.
+ These aren't used in scoring or training. This script marches over all
+ the ham corpora directories that are used, and if any have gotten too
+ big (this never happens anymore) deletes msgs at random from them, and
+ if any have gotten too small plugs the holes by moving in random
+ msgs from the reservoir.
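
The rebalancing idea can be sketched with in-memory lists standing in for directories. This is an assumption-laden illustration, not the script itself: `rebalance`, `target`, and the list-based corpora are all hypothetical.

```python
import random


def rebalance(corpora, reservoir, target, rng=random):
    """Trim oversized corpora at random; refill undersized ones from the reservoir."""
    for msgs in corpora:
        # Too big: delete randomly chosen messages.
        while len(msgs) > target:
            msgs.pop(rng.randrange(len(msgs)))
        # Too small: move in randomly chosen reservoir messages.
        while len(msgs) < target and reservoir:
            msgs.append(reservoir.pop(rng.randrange(len(reservoir))))


corpora = [[1, 2, 3, 4, 5], [6]]
reservoir = [10, 11, 12, 13]
rebalance(corpora, reservoir, target=3, rng=random.Random(0))
print([len(c) for c in corpora], len(reservoir))
```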
+
+2002-09-01 03:25 tim_one
+
+ * classifier.py (1.4), timtest.py (1.2):
+
+ Boost UNKNOWN_SPAMPROB.
+ # The spam probability assigned to words never seen before. Graham used
+ # 0.2 here. Neil Schemenauer reported that 0.5 seemed to work better. In
+ # Tim's content-only tests (no headers), boosting to 0.5 cut the false
+ # negative rate by over 1/3. The f-p rate increased, but there were so few
+ # f-ps that the increase wasn't statistically significant. It also caught
+ # 13 more spams erroneously classified as ham. By eyeball (and common
+ # sense <wink>), this has most effect on very short messages, where there
+ # simply aren't many high-value words. A word with prob 0.5 is (in effect)
+ # completely ignored by spamprob(), in favor of *any* word with *any* prob
+ # differing from 0.5. At 0.2, an unknown word favors ham at the expense
+ # of kicking out a word with a prob in (0.2, 0.8), and that seems dubious
+ # on the face of it.
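
A minimal sketch of why a 0.5 word is "completely ignored", assuming Graham's combining rule (keep the words farthest from 0.5, then combine as prod(p) / (prod(p) + prod(1-p))). The function name `spamprob_sketch` and the `wordprob` dict are hypothetical stand-ins, not the classifier's actual interface.

```python
UNKNOWN_SPAMPROB = 0.5
MAX_DISCRIMINATORS = 15  # number of strongest words kept


def spamprob_sketch(words, wordprob):
    """Combine per-word spam probabilities, Graham style."""
    # Each unique word counts once; unseen words get UNKNOWN_SPAMPROB.
    probs = [wordprob.get(w, UNKNOWN_SPAMPROB) for w in set(words)]
    # Strongest evidence first: largest distance from the neutral 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs[:MAX_DISCRIMINATORS]:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)


known = {"viagra": 0.99}
print(spamprob_sketch(["viagra"], known))
print(spamprob_sketch(["viagra", "neverseen"], known))
print(spamprob_sketch([], known))
```

A 0.5 word multiplies both products by the same factor, so it cancels out of the ratio, and it also sorts last, so any word with a probability other than 0.5 displaces it. With no words at all, both products stay at 1.0 and the result is 0.5.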
+
+2002-08-31 16:50 tim_one
+
+ * timtest.py (1.1):
+
+ This is a driver I've been using for test runs. It's specific to my
+ corpus directories, but has useful stuff in it all the same.
+
+2002-08-31 16:49 tim_one
+
+ * classifier.py (1.3):
+
+ The explanation for these changes was on Python-Dev. You'll find out
+ why if the moderator approves the msg <wink>.
+
+2002-08-29 07:04 tim_one
+
+ * Tester.py (1.2), classifier.py (1.2):
+
+ Tester.py: Repaired a comment. The false_{positive,negative}_rate()
+ functions return a percentage now (e.g., 1.0 instead of 0.01 -- it's
+ too hard to get motivated to reduce 0.01 <0.1 wink>).
+
+ GrahamBayes.spamprob: New optional bool argument; when true, a list of
+ the 15 strongest (word, probability) pairs is returned as well as the
+ overall probability (this is how to find out why a message scored as it
+ did).
+
+2002-08-28 13:45 montanaro
+
+ * GBayes.py (1.15):
+
+ ehh - it actually didn't work all that well. the spurious report that it
+ did well was pilot error. besides, tim's report suggests that a simple
+ str.split() may be the best tokenizer anyway.
+
+2002-08-28 10:45 montanaro
+
+ * setup.py (1.1):
+
+ trivial little setup.py file - i don't expect most people will be interested
+ in this, but it makes it a tad simpler to work with now that there are two
+ files
+
+2002-08-28 10:43 montanaro
+
+ * GBayes.py (1.14):
+
+ add simple trigram tokenizer - this seems to yield the best results I've
+ seen so far (but has not been extensively tested)
+
+2002-08-28 08:10 tim_one
+
+ * Tester.py (1.1):
+
+ A start at a testing class. There isn't a lot here, but it automates
+ much of the tedium, and as the doctest shows it can already do
+ useful things, like remembering which inputs were misclassified.
+
+2002-08-27 06:45 tim_one
+
+ * mboxcount.py (1.5):
+
+ Updated stats to what Barry and I both get now. Fiddled output.
+
+2002-08-27 05:09 bwarsaw
+
+ * split.py (1.5), splitn.py (1.2):
+
+ _factory(): Return the empty string instead of None in the except
+ clauses, so that for-loops won't break prematurely. mailbox.py's base
+ class defines an __iter__() that raises a StopIteration on None
+ return.
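
One way this kind of early exit arises is an iterator that treats None as its end-of-data sentinel. The sketch below uses Python's two-argument `iter(callable, sentinel)` to show the effect; `make_reader` is a hypothetical stand-in and not mailbox.py's actual code.

```python
def make_reader(items):
    """Return a callable that hands out one item per call, then None."""
    it = iter(items)

    def next_msg():
        return next(it, None)  # None signals "no more messages"

    return next_msg


# None in the middle stands for an unparseable message.
with_none = ["msg1", None, "msg3"]
print(list(iter(make_reader(with_none), None)))   # ['msg1']

# Returning "" instead keeps the loop going past the bad message.
with_empty = ["msg1", "", "msg3"]
print(list(iter(make_reader(with_empty), None)))  # ['msg1', '', 'msg3']
```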
+
+2002-08-27 04:55 tim_one
+
+ * GBayes.py (1.13), mboxcount.py (1.4):
+
+ Whitespace normalization (and some ambiguous tabs snuck into mboxcount).
+
+2002-08-27 04:40 bwarsaw
+
+ * mboxcount.py (1.3):
+
+ Some stats after splitting b/w good messages and unparseable messages
+
+2002-08-27 04:23 bwarsaw
+
+ * mboxcount.py (1.2):
+
+ _factory(): Use a marker object to distinguish between good messages and
+ unparseable messages. For some reason, returning None from the except
+ clause in _factory() caused Python 2.2.1 to exit early out of the for
+ loop.
+
+ main(): Print statistics about both the number of good messages and
+ the number of unparseable messages.
+
+2002-08-27 03:06 tim_one
+
+ * cleanarch (1.2):
+
+ "From " is a header more than a separator, so don't bump the msg count
+ at the end.
+
+2002-08-24 01:42 tim_one
+
+ * GBayes.py (1.12), classifier.py (1.1):
+
+ Moved all the interesting code that was in the *original* GBayes.py into
+ a new classifier.py. It was designed to have a very clean interface,
+ and there's no reason to keep slamming everything into one file. The
+ ever-growing tokenizer stuff should probably also be split out, leaving
+ GBayes.py a pure driver.
+
+ Also repaired _test() (Skip's checkin left it without a binding for
+ the tokenize function).
+
+2002-08-24 01:17 tim_one
+
+ * splitn.py (1.1):
+
+ Utility to split an mbox into N random pieces in one gulp. This gives
+ a convenient way to break a giant corpus into multiple files that can
+ then be used independently across multiple training and testing runs.
+ It's important to do multiple runs on different random samples to avoid
+ drawing conclusions based on accidents in a single random training corpus;
+ if the algorithm is robust, it should have similar performance across
+ all runs.
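
A minimal sketch of splitting a corpus into N random pieces in a single pass, working on any iterable of messages rather than a real mbox. How the script itself assigns messages is not stated here, so the uniform random bucket choice and the name `split_n` are assumptions.

```python
import random


def split_n(messages, n, seed=None):
    """Assign each message to one of n pieces, chosen uniformly at random."""
    rng = random.Random(seed)
    pieces = [[] for _ in range(n)]
    for msg in messages:
        pieces[rng.randrange(n)].append(msg)
    return pieces


pieces = split_n(range(100), 5, seed=1)
print([len(p) for p in pieces])
```

Piece sizes come out only approximately equal with this approach, which is fine when the pieces are used as independent random samples.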
+
+2002-08-24 00:25 montanaro
+
+ * GBayes.py (1.11):
+
+ Allow command line specification of tokenize functions
+ run w/ -t flag to override default tokenize function
+ run w/ -H flag to see list of tokenize functions
+
+ When adding a new tokenizer, make docstring a short description and add a
+ key/value pair to the tokenizers dict. The key is what the user specifies.
+ The value is a tokenize function.
+
+ Added two new tokenizers - tokenize_wordpairs_foldcase and
+ tokenize_words_and_pairs. It's not obvious that either is better than any
+ of the preexisting functions.
+
+ Should probably add info to the pickle which indicates the tokenizing
+ function used to build it. This could then be the default for spam
+ detection runs.
+
+ Next step is to drive this with spam/non-spam corpora, selecting each of the
+ various tokenizer functions, and presenting the results in tabular form.
+
+2002-08-23 13:10 tim_one
+
+ * GBayes.py (1.10):
+
+ spamprob(): Commented some subtleties.
+
+ clearjunk(): Undid Guido's attempt to space-optimize this. The problem
+ is that you can't delete entries from a dict that's being crawled over
+ by .iteritems(), which is why I (I suddenly recall) materialized a
+ list of words to be deleted the first time I wrote this. It's a lot
+ better to materialize a list of to-be-deleted words than to materialize
+ the entire database in a dict.items() list.
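
The dict-mutation problem and the "materialize only the doomed keys" fix, sketched in modern Python, where `.items()` is a live view much like the old `.iteritems()`. The mapping of word to access time and the names `clearjunk_sketch` and `oldesttime` are assumptions for illustration.

```python
def clearjunk_naive(wordinfo, oldesttime):
    """Broken: deletes from the dict while iterating over it."""
    for word, atime in wordinfo.items():
        if atime < oldesttime:
            del wordinfo[word]


def clearjunk_sketch(wordinfo, oldesttime):
    """Collect the stale words first, then delete them."""
    tonuke = [w for w, atime in wordinfo.items() if atime < oldesttime]
    for w in tonuke:
        del wordinfo[w]


db = {"a": 1, "b": 9, "c": 2}
try:
    clearjunk_naive(dict(db), oldesttime=5)
except RuntimeError as exc:
    print("naive version failed:", exc)

clearjunk_sketch(db, oldesttime=5)
print(db)  # {'b': 9}
```

The temporary list holds only the words being removed, which is usually far smaller than a full copy of the database.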
+
+2002-08-23 12:36 tim_one
+
+ * mboxcount.py (1.1):
+
+ Utility to count and display the # of msgs in (one or more) Unix mboxes.
+
+2002-08-23 12:11 tim_one
+
+ * split.py (1.4):
+
+ Open files in binary mode. Else, e.g., about 400MB of Barry's python-list
+ corpus vanishes on Windows. Also use file.write() instead of print>>, as
+ the latter invents an extra newline.
+
+2002-08-22 07:01 tim_one
+
+ * GBayes.py (1.9):
+
+ Renamed "modtime" to "atime", to better reflect its meaning, and added a
+ comment block to explain that better.
+
+2002-08-21 08:07 bwarsaw
+
+ * split.py (1.3):
+
+ Guido suggests a different order for the positional args.
+
+2002-08-21 07:37 bwarsaw
+
+ * split.py (1.2):
+
+ Get rid of the -1 and -2 arguments and make them positional.
+
+2002-08-21 07:18 bwarsaw
+
+ * split.py (1.1):
+
+ A simple mailbox splitter
+
+2002-08-21 06:42 tim_one
+
+ * GBayes.py (1.8):
+
+ Added a bunch of simple tokenizers. The originals are renamed to
+ tokenize_words_foldcase and tokenize_5gram_foldcase_wscollapse.
+ New ones are tokenize_words, tokenize_split_foldcase, tokenize_split,
+ tokenize_5gram, tokenize_10gram, and tokenize_15gram. I don't expect
+ any of these to be the last word. When Barry has the test corpus
+ set up it should be easy to let the data tell us which "pure" strategy
+ works best. Straight character n-grams are very appealing because
+ they're the simplest and most language-neutral; I didn't have any luck
+ with them over the weekend, but the size of my training data was
+ trivial.
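
For readers unfamiliar with straight character n-grams, a minimal sketch follows. It is not the code of tokenize_5gram or its siblings, which may differ in details such as case folding or whitespace handling; `char_ngrams` is a hypothetical name.

```python
def char_ngrams(text, n):
    """Yield every run of n consecutive characters in text."""
    for i in range(len(text) - n + 1):
        yield text[i:i + n]


print(list(char_ngrams("abcdefg", 5)))  # ['abcde', 'bcdef', 'cdefg']
```

No notion of "word" is needed, which is what makes the approach language-neutral; the cost is a much larger number of distinct tokens.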
+
+2002-08-21 05:08 bwarsaw
+
+ * cleanarch (1.1):
+
+ An archive cleaner, adapted from the Mailman 2.1b3 version, but
+ de-Mailman-ified.
+
+2002-08-21 04:44 gvanrossum
+
+ * GBayes.py (1.7):
+
+ Indent repair in clearjunk().
+
+2002-08-21 04:22 gvanrossum
+
+ * GBayes.py (1.6):
+
+ Some minor cleanup:
+
+ - Move the identifying comment to the top, clarify it a bit, and add
+ author info.
+
+ - There's no reason for _time and _heapreplace to be hidden names;
+ change these back to time and heapreplace.
+
+ - Rename main1() to _test() and main2() to main(); when main() sees
+ there are no options or arguments, it runs _test().
+
+ - Get rid of a list comprehension from clearjunk().
+
+ - Put wordinfo.get as a local variable in _add_msg().
+
+2002-08-20 15:16 tim_one
+
+ * GBayes.py (1.5):
+
+ Neutral typo repairs, except that clearjunk() has a better chance of
+ not blowing up immediately now <wink -- I have yet to try it!>.
+
+2002-08-20 13:49 montanaro
+
+ * GBayes.py (1.4):
+
+ help make it more easily executable... ;-)
+
+2002-08-20 09:32 bwarsaw
+
+ * GBayes.py (1.3):
+
+ Lots of hacks great and small to the main() program, but I didn't
+ touch the guts of the algorithm.
+
+ Added a module docstring/usage message.
+
+ Added a bunch of switches to train the system on an mbox of known good
+ and known spam messages (using PortableUnixMailbox only for now).
+ Uses the email package but does no decoding of message bodies. Also,
+ allows you to specify a file for pickling the training data, and for
+ setting a threshold, above which messages get an X-Bayes-Score
+ header. Also output messages (marked and unmarked) to an output file
+ for retraining.
+
+ Print some statistics at the end.
+
+2002-08-20 05:43 tim_one
+
+ * GBayes.py (1.2):
+
+ Turned off debugging vrbl mistakenly checked in at True.
+
+ unlearn(): Gave this an update_probabilities=True default arg, for
+ symmetry with learn().
+
+2002-08-20 03:33 tim_one
+
+ * GBayes.py (1.1):
+
+ An implementation of Paul Graham's Bayes-like spam classifier.
+
+</pre>
This was sent by the SourceForge.net collaborative development platform, the world's largest Open Source development site.