[Spambayes-checkins] website background.ht,1.1,1.2

Anthony Baxter anthonybaxter@users.sourceforge.net
Mon Nov 4 06:39:44 2002


Update of /cvsroot/spambayes/website
In directory usw-pr-cvs1:/tmp/cvs-serv16178

Modified Files:
	background.ht 
Log Message:
A bit of a potted history. There are probably a bunch of things here
that need to be cleaned up and made more obvious, but hey, it's a start.


Index: background.ht
===================================================================
RCS file: /cvsroot/spambayes/website/background.ht,v
retrieving revision 1.1
retrieving revision 1.2
diff -C2 -d -r1.1 -r1.2
*** background.ht	19 Sep 2002 23:39:24 -0000	1.1
--- background.ht	4 Nov 2002 06:39:42 -0000	1.2
***************
*** 15,18 ****
--- 15,67 ----
  <p><i>more links? mail anthony at interlink.com.au</i></p>
  
+ <h2>Overall Approach</h2>
+ <p><b>Please note that I (Anthony) am writing this based on memory and
+ limited understanding of some of the subtler points of the maths. Gentle
+ corrections are welcome, or even encouraged.</b></p>
+ <h3>Tokenizing</h3>
+ <p>The architecture of the spambayes system has a couple of distinct
+ parts. The first, and most obvious, is the <i>tokenizer</i>. This takes
+ a mail message and breaks it up into a series of tokens. At the moment
+ it splits words out of the text parts of a message, and a variety of
+ header tokenization goes on as well. The code in tokenizer.py and the
+ comments in the Tokenizer section of Options.py contain more
+ information about the various approaches to tokenizing.</p>
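+ <p>As a rough illustration (a sketch only, not the actual code in
+ tokenizer.py), splitting words out of a message body might look
+ something like:</p>
+ <pre>
+ def tokenize(msg_text):
+     """Yield word tokens from the text of a message body (sketch)."""
+     for word in msg_text.split():
+         # Very long "words" are usually noise; the real tokenizer
+         # handles these more cleverly (e.g. with special "skip" tokens).
+         if len(word) > 12:
+             continue
+         yield word.lower()
+ </pre>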
+ 
+ <h3>Combining and Scoring</h3>
+ <p>The next part of the system does the scoring and combining. This
+ is where the hairy mathematics and statistics come in.</p>
+ <p>We started with Paul Graham's original combining scheme, which has
+ a number of "magic numbers" and "fuzz factors" built into it. Aside
+ from the magic in the internal fudge factors, the Graham scheme has a
+ significant problem: it tends to produce scores of either 1 or 0, with
+ very little middle ground in between, so it rarely claims to be
+ "unsure" and gets messages wrong as a result. There are a number of
+ discussions back and forth between Tim Peters and Gary Robinson on
+ this subject in the mailing list archives; I'll try to put links to
+ the relevant threads up at some point.</p>
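+ <p>For the curious, the core of Graham's combining (with his fudge
+ factors left out) is just a ratio of products, which is why the scores
+ pile up at the extremes:</p>
+ <pre>
+ def graham_combine(probs):
+     """Combine per-word spam probabilities, Graham-style (sketch)."""
+     p = q = 1.0
+     for prob in probs:
+         p *= prob        # product of the spam probabilities
+         q *= 1.0 - prob  # product of their complements
+     return p / (p + q)
+ 
+ # Even a few mildly spammy words force an extreme score:
+ # graham_combine([0.9, 0.8, 0.7]) is about 0.99
+ </pre>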
+ <p>Gary produced a number of alternative approaches to combining and
+ scoring word probabilities. The initial one, after much back and forth
+ on the mailing list, is in the code today as 'gary_combining' (sketched
+ below). A couple of other approaches, using the Central Limit Theorem,
+ were also tried. They produced interesting output, but histograms of
+ the ham and spam distributions had a disturbingly large overlap in the
+ middle. There was also an issue with incremental training and
+ untraining of messages that made them harder to use in the "real
+ world". These two central limit approaches were dropped after Tim,
+ Gary and Rob Hooft produced a combining scheme using chi-squared
+ probabilities. This is now the default combining scheme.</p>
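+ <p>For a feel of the shape of the 'gary_combining' scheme: it takes
+ geometric means of the word probabilities and of their complements.
+ A sketch only; the real code in classifier.py differs in its details:</p>
+ <pre>
+ import math
+ 
+ def gary_combine(probs):
+     """Robinson-style combining (sketch, not the real code)."""
+     # Assumes every p is strictly between 0 and 1.
+     n = len(probs)
+     sum_ln_p = sum_ln_comp = 0.0
+     for p in probs:
+         sum_ln_p += math.log(p)
+         sum_ln_comp += math.log(1.0 - p)
+     # P: 1 minus the nth root of the product of (1-p); Q likewise for p.
+     P = 1.0 - math.exp(sum_ln_comp / n)
+     Q = 1.0 - math.exp(sum_ln_p / n)
+     # Final score runs from 0 (hammy) to 1 (spammy); 0.5 means "no idea".
+     return (1.0 + (P - Q) / (P + Q)) / 2.0
+ </pre>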
+ <p>The chi-squared approach produces two numbers: a "ham probability"
+ ("*H*") and a "spam probability" ("*S*"). A typical spam will have a
+ high *S* and a low *H*, while a typical ham will have a high *H* and a
+ low *S*. In the case where the message looks entirely unlike anything
+ the system has been trained on, you can end up with a low *H* and a
+ low *S*; this is the code saying "I don't know what this message is".
+ So at the end of the processing you end up with one of three possible
+ results: "Spam", "Ham", or "Unsure". It's possible to tweak the high
+ and low cutoffs for the Unsure window; this trades off unsure messages
+ against possible false positives or negatives.</p>
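+ <p>A sketch of the chi-squared combining and the cutoff logic follows.
+ It leaves out the floating-point underflow guards the real code needs,
+ and the cutoff values are only examples; see classifier.py and
+ Options.py for the real thing.</p>
+ <pre>
+ import math
+ 
+ def chi2Q(x2, v):
+     """Probability that a chi-squared variable with v (even)
+     degrees of freedom exceeds x2."""
+     m = x2 / 2.0
+     total = term = math.exp(-m)
+     for i in range(1, v // 2):
+         term *= m / i
+         total += term
+     return min(total, 1.0)
+ 
+ def classify(probs, ham_cutoff=0.20, spam_cutoff=0.90):
+     """Chi-squared combining (sketch); probs are per-word spam
+     probabilities, each strictly between 0 and 1."""
+     n = len(probs)
+     sum_ln_p = sum_ln_comp = 0.0
+     for p in probs:
+         sum_ln_p += math.log(p)
+         sum_ln_comp += math.log(1.0 - p)
+     S = 1.0 - chi2Q(-2.0 * sum_ln_comp, 2 * n)  # "spam probability"
+     H = 1.0 - chi2Q(-2.0 * sum_ln_p, 2 * n)     # "ham probability"
+     score = (S - H + 1.0) / 2.0
+     if score >= spam_cutoff:
+         return "Spam"
+     if score > ham_cutoff:
+         return "Unsure"    # the low-H, low-S case lands here too
+     return "Ham"
+ </pre>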
+ 
+ <h3>Training</h3>
+ <p>TBD</p>
+ 
  <h2>Mailing list archives</h2>
  <p>There's a lot of background on what's been tried available from