From tim.one at comcast.net  Tue May 27 23:17:07 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue May 27 22:20:08 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
Message-ID: <LNBBLJKPBEHFEDALKOLCCEMAEHAB.tim.one@comcast.net>

Just noticed that in the FAQ

    http://spambayes.sourceforge.net/faq.html

The links in 1b aren't clickable (http://www.python.org/download/ and
http://mimelib.sf.net).

But mostly sending this just to test that the new list works!


From bill at parducci.net  Tue May 27 20:59:10 2003
From: bill at parducci.net (bill parducci)
Date: Tue May 27 23:06:10 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
References: <LNBBLJKPBEHFEDALKOLCCEMAEHAB.tim.one@comcast.net>
Message-ID: <3ED425FE.8090109@parducci.net>

i can fix that...if i could find where the FAQ is in cvs. i have just 
updated /cvsroot/spambayes and don't see it here. is there another 
repository for the website stuff?

also, the link in the faq for:

"If you have any suggestions about other questions and answers that 
should be included here, please mail the list with them."

points to spambayes@python.org. should this be directed to spambayes-dev?

b

Tim Peters wrote:
> Just noticed that in the FAQ
> 
>     http://spambayes.sourceforge.net/faq.html
> 
> The links in 1b aren't clickable (http://www.python.org/download/ and
> http://mimelib.sf.net).
> 
> But mostly sending this just to test that the new list works!
> 
> 
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev@python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev


From tim.one at comcast.net  Wed May 28 00:34:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue May 27 23:35:03 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
In-Reply-To: <3ED425FE.8090109@parducci.net>
Message-ID: <LNBBLJKPBEHFEDALKOLCMEMDEHAB.tim.one@comcast.net>

[bill parducci]
> i can fix that...if i could find where the FAQ is in cvs. i have just
> updated /cvsroot/spambayes and don't see it here. is there another
> repository for the website stuff?

It's in the same repository, but in a different "module".  If you look at

    http://cvs.sf.net/cgi-bin/viewcvs.cgi/spambayes/#dirlist

you'll see that both "spambayes" and "website" live under the root.  So you
need to cvs checkout the website module:

    cvs -d:...:/cvsroot/spambayes co website

where "..." is whatever gibberish you used to check out the spambayes module
to begin with.  The FAQ will then live on your box as

    website/FAQ.ht
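
As a rough sketch of the checkout step described above, the command line could be assembled like this in Python. The helper name checkout_command is invented for illustration, and the CVSROOT string is left for the caller to supply, since it depends on how you normally reach the repository.

```python
def checkout_command(cvsroot, module):
    """Build the argv list for checking out one CVS module.

    cvsroot is the full -d argument you already use for the spambayes
    module; this helper does not try to guess it.
    """
    if not cvsroot:
        raise ValueError("supply the CVSROOT you normally use")
    # Equivalent to: cvs -d <cvsroot> co <module>
    return ["cvs", "-d", cvsroot, "co", module]
```

The returned list can be handed to subprocess.call() once a real CVSROOT is filled in; nothing is executed by the sketch itself.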

> also, the link in the faq for:
>
> "If you have any suggestions about other questions and answers that
> should be included here, please mail the list with them."
>
> points to spambayes@python.org. should this be directed to
> spambayes-dev?

I don't know, but best guess is "yes".


From noreply at sourceforge.net  Tue May 27 22:26:56 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed May 28 00:31:55 2003
Subject: [spambayes-dev] [ spambayes-Bugs-744380 ] W982E/Outlook 2000:
	exception on loading
Message-ID: <E19KsWe-0003lg-00@sc8-sf-web4.sourceforge.net>

Bugs item #744380, was opened at 2003-05-27 09:51
Message generated for change (Comment added) made by jobbins
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Steve Clift (sclift)
Assigned to: Mark Hammond (mhammond)
Summary: W982E/Outlook 2000: exception on loading

Initial Comment:
Windows 98 2nd Edition
Outlook 2000 SR-1 - Corporate or Workgroup

SpamBayes throws an exception when loading. From the log file:

SpamAddin - Connecting to Outlook
pythoncom error: Failed to call the universal dispatcher
Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 141, in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 182, in LocateDataDirectory
  File "E:\src\python-cvs\lib\ntpath.py", line 269, in isdir
exceptions.LookupError: no codec search functions registered: can't find encoding


----------------------------------------------------------------------

Comment By: Larry Jobbins (jobbins)
Date: 2003-05-27 21:26

Message:
Logged In: YES 
user_id=788287

Same error.  Installed Setup-002.exe from 
http://starship.python.net/crew/mhammond/spambayes/ 
Using Win98SE, Outlook 2000, all MS updates.  Shows add-in, 
but won't stay checked, no icon appears.

Install log looks same - pythoncom error: Failed to call the 
universal dispatcher, etc

----------------------------------------------------------------------


From tim_one at email.msn.com  Wed May 28 03:00:22 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Wed May 28 02:00:58 2003
Subject: [spambayes-dev] RE: [Spambayes] Intersection of two databases
In-Reply-To: <16083.55644.605271.500891@montanaro.dyndns.org>
Message-ID: <LNBBLJKPBEHFEDALKOLCOEGEELAB.tim_one@email.msn.com>

[Skip]
> ...
> Using that list, I then merged the corresponding entries from the two
> source databases.

Skip, how did you do the merge?  That is, word w in your database had a
certain hamcount and spamcount, while word w in Alex's had a presumably
different pair of counts.  Did you add them?  Take the max?  Something else?

It was a very interesting experiment regardless of the answer <wink>.


From bill at parducci.net  Wed May 28 01:18:08 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 03:18:15 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
References: <LNBBLJKPBEHFEDALKOLCMEMDEHAB.tim.one@comcast.net>
Message-ID: <3ED462B0.8090002@parducci.net>

attached is updated version with requested fixes (including e-mail 
address update). since it was "tidy'd" already i retidy'd the file for 
consistency.

b

Tim Peters wrote:
> [bill parducci]
> 
>>i can fix that...if i could find where the FAQ is in cvs. i have just
>>updated /cvsroot/spambayes and don't see it here. is there another
>>repository for the website stuff?
> 
> 
> It's in the same repository, but in a different "module".  If you look at
> 
>     http://cvs.sf.net/cgi-bin/viewcvs.cgi/spambayes/#dirlist
> 
> you'll see that both "spambayes" and "website" live under the root.  So you
> need to cvs checkout the website module:
> 
>     cvs -d:...:/cvsroot/spambayes co website
> 
> where "..." is whatever gibberish you used to check out the spambayes module
> to begin with.  The FAQ will then live on your box as
> 
>     website/FAQ.ht
> 
> 
>>also, the link in the faq for:
>>
>>"If you have any suggestions about other questions and answers that
>>should be included here, please mail the list with them."
>>
>>points to spambayes@python.org. should this be directed to
>>spambayes-dev?
> 
> 
> I don't know, but best guess is "yes".

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20030528/f638e834/faq-0001.htm
From skip at pobox.com  Wed May 28 08:17:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 08:17:06 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
In-Reply-To: <LNBBLJKPBEHFEDALKOLCCEMAEHAB.tim.one@comcast.net>
References: <LNBBLJKPBEHFEDALKOLCCEMAEHAB.tim.one@comcast.net>
Message-ID: <16084.43196.134724.367964@montanaro.dyndns.org>


    Tim> Just noticed that in the FAQ
    Tim>     http://spambayes.sourceforge.net/faq.html

    Tim> The links in 1b aren't clickable (http://www.python.org/download/
    Tim> and http://mimelib.sf.net).

Fixed.

    Tim> But mostly sending this just to test that the new list works!

It does...

Skip

From skip at pobox.com  Wed May 28 08:48:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 08:48:13 2003
Subject: [spambayes-dev] RE: [Spambayes] Intersection of two databases
In-Reply-To: <LNBBLJKPBEHFEDALKOLCOEGEELAB.tim_one@email.msn.com>
References: <16083.55644.605271.500891@montanaro.dyndns.org>
	<LNBBLJKPBEHFEDALKOLCOEGEELAB.tim_one@email.msn.com>
Message-ID: <16084.45065.242138.717158@montanaro.dyndns.org>


    >> Using that list, I then merged the corresponding entries from the two
    >> source databases.

    Tim> Skip, how did you do the merge?  That is, word w in your database
    Tim> had a certain hamcount and spamcount, while word w in Alex's had a
    Tim> presumably different pair of counts.  Did you add them?  Take the
    Tim> max?  Something else?

I simply added them.  I also added the 'saved state' values.  This made
intuitive sense to me, though we all know intuition is often wrong.  I was
effectively training using both databases, just eliminating the less useful
tokens.  (Ignore for the moment that actually training on the complete set
of emails Alex and I have would probably have generated slightly different
results.)
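
A minimal sketch of the "simply added them" merge, assuming each database can be viewed as a mapping from token to a (spamcount, hamcount) pair. The function name, the dictionary layout, and the sample tokens are all invented for illustration and are not the real SpamBayes storage format; the 'saved state' values mentioned above would be summed the same way and are not modelled here.

```python
def merge_counts(db_a, db_b, keep):
    """Merge two token databases by adding counts, keeping only `keep`.

    Each database maps token -> (spamcount, hamcount).  A token missing
    from one database contributes (0, 0) from that side.
    """
    merged = {}
    for token in keep:
        spam_a, ham_a = db_a.get(token, (0, 0))
        spam_b, ham_b = db_b.get(token, (0, 0))
        merged[token] = (spam_a + spam_b, ham_a + ham_b)
    return merged


# Made-up sample data, purely for demonstration.
mine = {"viagra": (40, 1), "python": (0, 25), "the": (90, 95)}
alexs = {"viagra": (55, 0), "python": (1, 30), "meeting": (2, 12)}
shared_useful = ["viagra", "python"]  # hypothetical list of tokens to keep

merged = merge_counts(mine, alexs, shared_useful)
print(merged)  # {'viagra': (95, 1), 'python': (1, 55)}
```

Taking the max instead of the sum would only change the two additions inside the loop.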

Skip

From noreply at sourceforge.net  Wed May 28 10:30:16 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed May 28 12:35:03 2003
Subject: [spambayes-dev] [ spambayes-Bugs-745003 ] hammiebulk.py: Untrain
	does not work
Message-ID: <E19L3oe-0000gu-00@sc8-sf-web1.sourceforge.net>

Bugs item #745003, was opened at 2003-05-28 11:30
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745003&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paramjit Oberoi (psoberoi)
Assigned to: Nobody/Anonymous (nobody)
Summary: hammiebulk.py: Untrain does not work

Initial Comment:
hammiebulk.py: Untrain bug

Untraining does not work since when the "-U"
option is detected, the "untrain" variable is
set to "1", overriding the function definition...
                                                                                          
patch:
                                                                                          
145c145
<     untrain = 0
---
>     untrain_mode = 0
169c169
<             untrain = 1
---
>             untrain_mode = 1
182c182
<     if not untrain:
---
>     if not untrain_mode:
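
The name clash described in this report can be reproduced in a few lines. This is only a sketch of the general problem, not the actual hammiebulk.py code: the function bodies and the "-U" handling below are simplified stand-ins.

```python
def untrain(message):
    """Stand-in for the real untraining function."""
    return "untrained %s" % message


def broken(args):
    untrain = 0                 # local flag hides the module-level function
    if "-U" in args:
        untrain = 1
    if untrain:
        return untrain("msg")   # raises TypeError: 'int' object is not callable
    return "nothing to do"


def fixed(args):
    untrain_mode = 0            # renamed flag, as in the patch above
    if "-U" in args:
        untrain_mode = 1
    if untrain_mode:
        return untrain("msg")   # now reaches the real function
    return "nothing to do"


print(fixed(["-U"]))  # prints "untrained msg"
```

Renaming the flag is the smallest fix; the function itself is left untouched.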


----------------------------------------------------------------------


From bill at parducci.net  Wed May 28 10:29:24 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 12:36:24 2003
Subject: [spambayes-dev] FAQ update
Message-ID: <3ED4E3E4.20508@parducci.net>

1. added "why don't you bounce back spam?"

2. made the page w3c compliant (html 4.01)

3. reTIDY'd (using attached tidy.conf if anyone cares -- would be nice 
if there was one in cvs ;-)


side note: the site itself uses a lot of deprecated tags, etc.:
http://validator.w3.org/check?uri=http%3A%2F%2Fspambayes.sourceforge.net%2F
would it be of any benefit if i cleaned it up? (is that even possible 
within the constraints of sf.net?)

b
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20030528/bac7ef81/faq-0001.htm
-------------- next part --------------
break-before-br: no
char-encoding: latin1
enclose-text: yes
enclose-block-text: yes
indent-spaces: 2
indent: yes
input-xml: no
markup: yes
numeric-entities: yes
output-xml: no
quote-marks: yes
quote-nbsp: yes
show-warnings: yes
tidy-mark: no
uppercase-attributes: no
uppercase-tags: no
wrap: 72
wrap-attributes: yes
wrap-script-literals: yes
From skip at pobox.com  Wed May 28 17:18:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 17:18:31 2003
Subject: [spambayes-dev] FAQ updated
Message-ID: <16085.10143.186656.280842@montanaro.dyndns.org>

(Sending to both spambayes and spambayes-dev to catch all interested
parties.) 

Folks,

With a little assistance from Anthony Baxter, I updated the faq.ht file to
automatically number both the table of contents and the main section with
the answers.  I also ran it through ispell and added a comment near the top
to help people figure out how to add new content.  If you see anything
amiss, feel free to send me a correction or check it in yourself if you're
so enabled.

Skip

From skip at pobox.com  Wed May 28 17:25:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 17:25:27 2003
Subject: [spambayes-dev] FAQ update
In-Reply-To: <3ED4E3E4.20508@parducci.net>
References: <3ED4E3E4.20508@parducci.net>
Message-ID: <16085.10545.151192.514702@montanaro.dyndns.org>


    bill> 1. added "why don't you bounce back spam?"

    bill> 2. made the page w3c compliant (html 4.01)

    bill> 3. reTIDY'd (using attached tidy.conf if anyone cares -- would be
    bill>    nice if there was one in cvs ;-)

Thanks.  For future reference it would be a lot simpler to incorporate
changes if you could just post a context diff against the latest version in
CVS.  Before seeing your email I checked in a massive change to the way
numbering is done.  Any chance you can send me a context diff against that?
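
For anyone unsure what a context diff looks like, here is a small sketch using Python's difflib. The sample lines and file labels are made up; against a real checkout the usual route is "cvs diff -c" run in the website directory rather than anything hand-rolled.

```python
import difflib

# Made-up before/after content standing in for two versions of faq.ht.
old = ["<li>Question one</li>\n", "<li>Question two</li>\n"]
new = ["<li>Question one</li>\n", "<li>Question 2</li>\n",
       "<li>Question three</li>\n"]

diff = list(difflib.context_diff(old, new,
                                 fromfile="faq.ht (CVS)",
                                 tofile="faq.ht (local)"))
print("".join(diff), end="")
```

Changed lines are marked with "!" and unchanged context lines are indented by two spaces, which is what makes this format easy to review and apply.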

Thanks for the tidy.conf file.  I'll probably check it in as well.

Thx,

Skip

From bill at parducci.net  Wed May 28 16:39:41 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 18:52:30 2003
Subject: [spambayes-dev] FAQ update
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
Message-ID: <3ED53AAD.7060608@parducci.net>

1. readded "why don't you bounce back spam?"

2. remade the page w3c compliant (html 4.01)

3. skipped TIDY (manually conformed to format)

> Thanks.  For future reference it would be a lot simpler to incorporate
> changes if you could just post a context diff against the latest version in
> CVS.  Before seeing your email I checked in a massive change to the way
> numbering is done.  Any chance you can send me a context diff against that?

sure, as long as you check your e-mail before introducing massive 
changes ;-)

b
-------------- next part --------------
Index: faq.ht
===================================================================
RCS file: /cvsroot/spambayes/website/faq.ht,v
retrieving revision 1.15
diff -c -r1.15 faq.ht
*** faq.ht	28 May 2003 20:56:23 -0000	1.15
--- faq.ht	28 May 2003 22:39:34 -0000
***************
*** 1,960 ****
! Title: SpamBayes: Frequently Asked Questions
! Author-Email: spambayes@python.org
  Author: spambayes
! 
! <!-- ************************************************************
! This really needs to be done differently.  How about a Wiki or a blog?
! ************************************************************ -->
! 
! <!--
! 
! Note to future maintainers: This FAQ has a table of contents listing
! sections and questions and a parallel body which lists the sections and
! questions followed by answers.  Please try and keep the ordering the same
! between the two so the generated numbers align.
! 
! To add a new question and answer please do the following:
! 
!     * Insert the question at the end of the appropriate section of the table
!       of contents.  Highlight it with an anchor tag whose href attribute
!       references the name you'll give to the question in the body, e.g., <a
!       href="#blarg">question</a>.  Surround the whole question with a
!       <li>...</li> pair.
! 
!     * Insert the question followed by the answer at the end of the
!       appropriate section of the body.  Introduce the q&a with <li>.  The
!       question needs to be surrounded by a named anchor: <a
!       name="blarg">question</a>, where "blarg" is the target of the href.
!       Surround the question and anchor in <span
!       class="faq-header">...</span>.  The answer should be one or more
!       <p>aragraphs.  Any <pre>formatted text should be indented four spaces.
!       Terminate the answer with </li>.
! 
! If you feel it's necessary to reorder a section or add a new section, do it
! carefully.  Good luck!
! 
! -->
! 
! <h1>
!   Frequently Asked Questions
! </h1>
! <ol type="1">
!   <li>
!     Overview<br>
!     <ol type="a">
!       <li>
!         <a href="#whatisit">So what is Spambayes?</a>
!       </li>
!       <li>
!         <a href="#requirements">What do I need to install
!         Spambayes?</a>
!       </li>
!       <li>
!         <a href="#tenkfoot">Is there a "ten thousand foot view" that
!         shows how this thing works?</a>
!       </li>
!       <li>
!         <a href="#whereis">Where does all this stuff live?</a>
!       </li>
!     </ol>
!   </li>
!   <li>
!     Compatibility<br>
!     <ol type="a">
!       <li>
!         <a href="#outlookversions">What version of Outlook does it
!         work with?</a>
!       </li>
!       <li>
!         <a href="#outlookexpress">Does Spambayes work with Outlook
!         Express?</a>
!       </li>
!       <li>
!         <a href="#nopython">Do I have to have python installed to use
!         Spambayes with Outlook?</a>
!       </li>
!       <li>
!         <a href="#nonoutlook">Forget Outlook, what clients will
!         Spambayes work with in general?</a>
!       </li>
!       <li>
!         <a href="#exchange">We have Outlook 2000 connecting to an
!         Exchange 2000 server. Will spambayes work for us?</a>
!       </li>
!     </ol>
!   </li>
!   <li>
!     Using Spambayes<br>
!     <ol type="a">
!       <li>
!         <a href="#configs">How do I configure Spambayes?</a>
!       </li>
!       <li>
!         <a href="#webinterface">How do I train Spambayes (web
!         method)</a>
!       </li>
!       <li>
!         <a href="#smtptraining">How do I train Spambayes
!         (forward/bounce method)</a>
!       </li>
!       <li>
!         <a href="#cmdline">How do I train Spambayes (command line
!         method)</a>
!       </li>
!       <li>
!         <a href="#unsure">I just got a spam, but the system said it
!         was "unsure". Why couldn't it tell that it was spam - it's
!         obvious?</a>
!       </li>
!       <li>
!         <a href="#stillunsure">OK, I trained on that message. But I
!         just got <em>another</em> one, and the stupid system still
!         thinks it's unsure. Why did it ignore me???</a>
!       </li>
!       <li>
!         <a href="#wipetraining">I've mucked up my training and I want
!         to start all over again, but there isn't an option for this
!         anywhere. What do I do?</a>
!       </li>
!       <li>
!         <a href="#configfiles">I can't use a web browser, so I can't
!         configure pop3proxy/imapfilter.<br>
!          Also: how do I configure hammiefilter and the other
!         applications that don't have a user interface?</a>
!       </li>
!       <li>
!         <a href="#optionstoset">That's great, now I know what the
!         format looks like, but what options do I need to set?</a>
!       </li>
!       <li>
!         <a href="#configlocation">I've made a configuration file, but
!         Spambayes is ignoring it. Now what?</a>
!       </li>
!       <li>
!         <a href="#shortwords">Why don't short words or long words
!         show up in the clues?</a>
        </li>
!       <li>
!         <a href="#whatelse">Is there anything else I should know?</a>
        </li>
!     </ol>
!   </li>
!   <li>
!     Development<br>
!     <ol type="a">
!       <li>
!         <a href="#tokentrick">Hey! Why don't you implement cool
!         tokenizer trick X? I think it would really foil those
!         spammers!</a>
!       </li>
!       <li>
!         <a href="#serverside">This software is great! I want to
!         implement it for all my users. Are there plans to develop a
!         server-side spambayes solution?</a>
!       </li>
!       <li>
!         <a href="#ngrams">Forget tokenising words - you should use
!         character n-grams!</a>
!       </li>
!       <li>
!         <a href="#clues">The clues for my mail are all in lower case,
!         but "FREE" is a much better clue than "free". Why do you
!         force everything into lower case?</a>
        </li>
!     </ol>
!   </li>
! </ol>
! <p>
!   If you have any suggestions about other questions and answers that
!   should be included here, please mail <a href=
!   "mailto:spambayes@python.org?Subject=(from%20FAQ)">the list</a>
!   with them.
! </p>
! 
! <ol type="1">
! 
! <li>
!   <span class="faq-section">Overview</span>
!     <ol type="a">
! 
!       <li>
!         <span class="faq-header"><a name="whatisit">So what is
!       Spambayes?</a></span>
!         <p>
!           Spambayes is a tool used to segregate unwanted mail (spam) from
!           the mail you want (ham). Before Spambayes can be your spam filter
!           of choice you need to train it on representative samples of email
!           you receive. After it's been trained, you use Spambayes to
!           classify new mail according to its spamminess and hamminess
!           qualities.
!         </p>
!         <p>
!           To train Spambayes (which you don't need to do if you're going to
!           be using the POP3 proxy to classify messages, but you'll get
!           better results from the outset if you do) you need to save your
!           incoming email for awhile, segregating it into two piles, known
!           spam and known ham (ham is our nickname for good mail). It's best
!           to train on recent email, because your interests and the nature of
!           what spam looks like change over time. Once you've collected a
!           fair portion of each (anything is better than nothing, but it
!           helps to have a couple hundred of each), you can tell Spambayes,
!           "Here's my ham and my spam". It will then process that mail and
!           save information about different patterns which appear in ham and
!           spam. That information is then used during the filtering
!           stage. See the "Command-line training" section below for details.
!         </p>
!         <p>
!           When Spambayes filters your email, it compares each unclassified
!           message against the information it saved from training and makes a
!           decision about whether it thinks the message qualifies as ham or
!           spam, or if it's unsure about how to classify the message. It then
!           adds its classification to the message, either by adding a header
!           (X-Spambayes-Classification: spam|ham|unsure), modifying the To:
!           or Subject: headers, or adding a "Spam" field to the message.
!           Depending on which Spambayes application you are using, it may
!           then filter this message for you, or you can set up your own
!           filters (to file away suspected spam into its own mail folder, for
!           example).
!         </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="requirements">What do I need to
!         install Spambayes?</a></span>
!         <p>
!           Unless you are using the Outlook plugin, you must have a recent
!           version of Python installed on your computer, version 2.2 or
!           later.  (Don't ask about backporting it to earlier versions of
!           Python. It's almost a certainty this won't happen.) If you need to
!           install Python on your system, check the Python download page for
!           the version appropriate to your computer: <a
!           href="http://www.python.org/download/">http://www.python.org/download/</a>
!           You also need version 2.4.3 or above of the Python "email"
!           package.  If you're running Python 2.2.2 or above, then you
!           already have this. If not, you can download it from <a
!           href="http://mimelib.sf.net/">http://mimelib.sf.net/</a> and
!           install it - unpack the archive, cd to the email-2.4.3 directory
!           and type "python setup.py install" (YMMV on different
!           platforms). This will install it into your Python site-packages
!           directory. You'll also need to move aside the standard "email"
!           library - go to your Python "Lib" directory and rename "email" to
!           "email_old".
!         </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="tenkfoot">Is there a "ten thousand
!       foot view" that shows how this thing works?</a></span>
! 
!         <p>
!           There are eight main components to the Spambayes system:
!         </p>
!         <ol type="i">
!           <li>
!             A database. Loosely speaking, this is a collection of words and
!             associated spam and ham probabilities. The database says "If a
!             message contains the word 'Viagra' then there's a 98% chance
!             that it's spam, and a 2% chance that it's ham." This database is
!             created by training - you give it messages, tell it whether
!             those messages are ham or spam, and it adjusts its probabilities
!             accordingly. How to train it is covered below. By default it
!             lives in a file called "hammie.db" or (for the Outlook plugin)
!             "default_bayes_database".
!           </li>
!           <li>
!             The tokenizer/classifier. This is the core engine of the system.
!             The tokenizer splits emails into tokens (words, roughly
!             speaking), and the classifier looks at those tokens to determine
!             whether the message looks like spam or not. You don't use the
!             tokenizer/classifier directly - it powers the other parts of the
!             system.
!           </li>
!           <li>
!             The POP3 proxy. This sits between your email client (Eudora,
!             Outlook Express, etc) and your incoming email server, and adds
!             the classification header to emails as you download them. A
!             typical user's email setup looks like this:
! 
! <pre>
!     +-----------------+                       +-------------+
!     | Outlook Express |      Internet or      |             |
!     |  (or similar)   | &lt;-------------------&gt; | POP3 server |
!     |                 |      Intranet         |             |
!     +-----------------+                       +-------------+
! </pre>
!             The POP3 server runs either at your ISP for Internet mail, or
!             somewhere on your internal network for corporate mail. The POP3
!             proxy sits in the middle and adds the classification header as
!             you retrieve your email:
! <pre>
!     +-----------------+      +------------+      +-------------+
!     | Outlook Express |      | Spambayes  |      |             |
!     |  (or similar)   | &lt;--&gt; | POP3 proxy | &lt;--&gt; | POP3 server |
!     |                 |      |            |      |             |
!     +-----------------+      +------------+      +-------------+
! </pre>
!             So where you currently have your email client configured to talk
!             to, say, "pop3.my-isp.com", you instead configure the
!             <em>proxy</em> to talk to "pop3.my-isp.com" and configure your
!             email client to talk to the proxy.  The POP3 proxy can live on
!             your PC, or on the same machine as the POP3 server, or on a
!             different machine entirely; it really doesn't matter. If it's
!             living on your PC, you'd configure your email client to talk to
!             "localhost". You can configure the proxy to talk to multiple
!             POP3 servers if you have more than one email account.
!           </li>
!           <li>
!             The SMTP proxy. This sits between your email client (Eudora,
!             Outlook Express, etc) and your outgoing email server. Any mail
!             sent to spambayes_spam@localhost or spambayes_ham@localhost is
!             intercepted and trained appropriately. A typical user's email
!             setup looks like this:
! 
! <pre>
!     +-----------------+                       +-------------+
!     | Outlook Express |      Internet or      |             |
!     |  (or similar)   | &lt;-------------------&gt; | SMTP server |
!     |                 |      Intranet         |             |
!     +-----------------+                       +-------------+
! </pre>
! 
!             The SMTP server runs either at your ISP for Internet mail, or
!             somewhere on your internal network for corporate mail. The SMTP
!             proxy sits in the middle and checks for mail to train on as you
!             send your email:
! 
! <pre>
!     +-----------------+      +------------+      +-------------+
!     | Outlook Express |      | Spambayes  |      |             |
!     |  (or similar)   | &lt;--&gt; | SMTP proxy | &lt;--&gt; | SMTP server |
!     |                 |      |            |      |             |
!     +-----------------+      +------------+      +-------------+
! </pre>
! 
!             So where you currently have your email client configured to talk
!             to, say, "smtp.my-isp.com", you instead configure the
!             <em>proxy</em> to talk to "smtp.my-isp.com" and configure your
!             email client to talk to the proxy. The SMTP proxy can live on
!             your PC, or on the same machine as the SMTP server, or on a
!             different machine entirely; it really doesn't matter. If it's
!             living on your PC, you'd configure your email client to talk to
!             "localhost". You can configure the proxy to talk to multiple
!             SMTP servers if you have more than one email account.
! 
            </li>
            <li>
! 
!             The web interface. This is a server that runs alongside the POP3
!             proxy, SMTP proxy, and IMAP filter (see below) and lets you
!             control it through the web. You can upload emails to it for
!             training or classification, query the probabilities database
!             ("How many valid emails really <em>do</em> contain the word
!             Viagra"), find particular messages, and most importantly, train
!             it on the emails you've received. When you start using the
!             system, unless you train it using the Hammie script it will
!             classify most things as Unsure, and often make mistakes. But it
!             keeps copies of all the emails it's seen, and through the web
!             interface you can train it by going through a list of all the
!             emails you've received and checking a Ham/Spam box next to each
!             one. After training on a few messages (say 20 spams and 20
!             hams), you'll find that it's getting it right most of the
!             time. The web training interface automatically checks the
!             Ham/Spam boxes according to what it thinks, so all you need to
!             do is correct the odd mistake - it's very quick and easy.
! 
            </li>
            <li>
! 
!             The Outlook plug-in. For Outlook 2000 and Outlook XP (2002)
!             users (not Outlook Express) this lets you manage the whole thing
!             from within Outlook. You set up a Ham folder and a Spam folder,
!             and train it simply by dragging messages into those folders.
!             Alternatively there are buttons to do the same thing. And it
!             integrates into Outlook's filtering system to make it easy to
!             file all the suspected spam into its own folder, for instance.
! 
            </li>
            <li>
! 
!             The Hammie script. This does three jobs: command-line training,
!             procmail filtering, and XML-RPC. See below for details of how to
!             use Hammie for training, and how to use it as a procmail filter.
!             Hammie can also run as an XML-RPC server, so that a programmer
!             can write code that uses a remote server to classify emails
!             programmatically - see hammiesrv.py.
! 
            </li>
            <li>
! 
!             The IMAP filter. This is a cross between the POP3 proxy and the
!             Outlook plugin. If your mail sits on an IMAP server, you can use
!             this to filter your mail. You can designate folders that
!             contain mail to train as ham and folders that contain mail to
!             train as spam, and the filter does this for you. You can also
!             designate folders to filter, along with a folder for messages
!             Spambayes is unsure about, and a folder for suspected spam. When
!             new mail arrives, the filter will move mail to the appropriate
!             location (ham is left in the original folder).
! 
            </li>
          </ol>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="whereis">Where does all this
!     stuff live?</a></span>
! 
!       <p>
!         The Hammie script is called hammie.py. The POP3 proxy lives in
!         pop3proxy.py, and the SMTP proxy lives in smtpproxy.py. The IMAP
!         filter lives in imapfilter.py. The Outlook plug-in lives in the
!         Outlook2000 subdirectory &mdash; see the README.txt in that
!         directory for more information on that.
!       </p>
!       <p>
!         As well as these components, there's also a whole pile of utility
!         scripts, test harnesses and so on &mdash; see README.txt and
!         TESTING.txt in the spambayes distribution for more information.
!       </p>
!     </li>
!   </ol>
! </li>
! 
! <li>
!       <span class="faq-section">Compatibility</span>
! 
!     <ol type="a">
!     <li>
!     <span class="faq-header"><a name="outlookversions">What version of
!     Outlook does it work with?</a></span>
! 
!       <p>
!         The most up to date list of known compatible versions of Outlook
!         may be found <a href=
!         "http://spambayes.sourceforge.net/windows.html">here</a>.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="outlookexpress">Does Spambayes
!     work with Outlook Express?</a></span>
! 
!         <p>
!           Outlook Express isn't a version of Outlook; it's a completely
!           separate program (from the same company). Because they give it
!           away for free, Outlook Express is a really stripped down program,
!           and it's extremely difficult to create a plugin for it.
!         </p>
!         <p>
!           You can use pop3proxy and/or imapfilter with Outlook Express,
!           but you must have either the alpha 3 release or a recent CVS
!           snapshot in order to do so (alpha 2 does not include all the
!           necessary features). Because Outlook Express does not let you
!           filter on arbitrary headers (like X-Spambayes-Classification),
!           pop3proxy must add the classification to the "To:" line, or the
!           "Subject" line.
!         </p>
!         <p>
!           Pop3proxy/imapfilter aren't quite as 'transparent' as the Outlook
!           plugin, but they're still quite easy to use and set up, and they
!           use the same core, so the results will be the same.
!         </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="nopython">Do I have to have
!     Python installed to use Spambayes with Outlook?</a></span>
! 
!         <p>
!           No. You should be able to download the Outlook plugin binary and
!           install that, and that's all you need.
!         </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="nonoutlook">Forget Outlook, what
!     clients will Spambayes work with in general?</a></span>
! 
!       <p>
!         Spambayes will work with most POP3 or IMAP compatible clients. How
!         you implement depends on your local architecture. Users with access
!         to procmail can just write a recipe that invokes spambayes like
!         this:
! <pre>
!     :0fw
!     | /opt/spambayes/hammiefilter.py
! 
!     Followed by a recipe to check the results and take action:
! 
!     :0
!     * ^X-Spambayes-Classification: spam
!     ${MAILDIR}/spam
  </pre>
! 
!       </p>
! 
!       <p>
!         Emacs and XEmacs both come with VM, one of a choice of several
!         Emacs-based mail packages. Emacs is extensible using Emacs Lisp or
!         Pymacs. This extensibility allows you to easily segregate your
!         incoming mail for training purposes. Here's one such example. If you
!         place the following code in your ~/.vm file:
! <pre>
!     (defun copy-to-spam ()
!       (interactive)
!       (vm-save-message (expand-file-name "~/tmp/newspam"))
!       (vm-undelete-message 1))
! 
!     (defun copy-to-nonspam ()
!       (interactive)
!       (vm-save-message (expand-file-name "~/tmp/newham"))
!       (vm-undelete-message 1))
! 
!     (define-key vm-mode-map "ls" 'copy-to-spam)
!     (define-key vm-summary-mode-map "ls" 'copy-to-spam)
!     (define-key vm-mode-map "lh" 'copy-to-nonspam)
!     (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)
  </pre>
- 
-       "ls" will save a copy of the current message to ~/tmp/newspam and "lh"
-       will save a copy of the current message to ~/tmp/newham. You can then
-       use those files later as arguments to hammie.py for training.
- 
-       </p>
- 
- 
-       <p>
-         Users limited to POP3/IMAP communications to the server can use the
-         <a href=
-         "http://spambayes.sourceforge.net/applications.html#pop3">POP3</a>
-         or <a href=
-         "http://spambayes.sourceforge.net/applications.html#imap">IMAP
-         proxy</a> with the <a href=
-         "https://sourceforge.net/project/showfiles.php?group_id=61702">Spambayes
-         source code</a>.
-       </p>
-     </li>
- 
-     <li>
-       <span class="faq-header"><a name="exchange">We have Outlook 2000
-     connecting to an Exchange 2000 server. Will spambayes work for
-     us?</a></span>
- 
      <p>
!       It should, yes. There haven't been any problems reported using that
!       combination.
      </p>
!     </li>
!   </ol>
! </li>
! 
! <li>
!       <span class="faq-section">Using Spambayes</span>
! 
!     <ol type="a">
! 
!     <li>
!       <span class="faq-header"><a name="configs">How do I configure
!     Spambayes?</a></span>
! 
!       <p>
!         The system is configured through a file called "bayescustomize.ini".
!         In here you can configure the name and type of your database, the
!         POP3 server(s) you want to proxy to, the ports you want the proxy
!         and the web interface to run on, and so on. You can also control
!         details like how sure you want the system to be that a message really
!         is spam before it marks it as such. The default values for all the
!         options, and the documentation for them, all live in Options.py.
!       </p>
!       <p>
!         To change an option, create a bayescustomize.ini and add the option
!         to that - don't edit Options.py. If you are using the POP3 proxy,
!         SMTP proxy or IMAP filter, you can also change most of the options
!         you will need to access via the web user interface. You will
!         probably find this at http://localhost:8880. To configure the
!         Outlook plugin, you should click on the Anti-Spam button on the
!         toolbar.
!       </p>
!       <p>
!         To set up the POP3 and SMTP proxies (optional), run:
! <pre>
!     pop3proxy.py -b
! </pre>
! 	from the command line. The web interface should open in your default
!       	browser. You need to click on the "Configuration Link" to go to the
!       	setup page. The minimum you need to do to get started is enter the
!       	servers and ports information in the POP3 proxy and SMTP proxy
!       	sections.
!       </p>
! 
!       <p>
!         The POP3 proxy is then ready for your email client to connect to it
!         on port 110 and the SMTP proxy is ready for connections on port 25.
!         You now need to configure your email client to talk to the proxies
!         instead of the real email servers. Change your equivalent of
!         "pop3.my-isp.com" to "localhost" (or to the name of the machine
!         you're running the proxy on) in your email client's setup, and do
!         the same with your equivalent of "smtp.my-isp.com". Hit "Get new
!         email" and look at the headers of the emails (send yourself an email
!         if you don't have any!) - there should be an
!         X-Spambayes-Classification header there. It probably says "unsure",
!         if you haven't done any training yet. You should be able to create a
!         mail folder called "Suspected spam" and set up a filtering rule that
!         puts emails with an "X-Spambayes-Classification: spam" header into
!         that folder. (Eventually we should publish instructions on how to do
!         this in all the popular email clients).
!       </p>
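!       <p>
!         If you want to check the header outside your mail client, a small
!         script along the following lines may help. It is only a sketch that
!         uses Python's standard "email" package; the sample message and the
!         function name are made up for illustration and are not part of
!         Spambayes.
!       </p>

```python
from email import message_from_string

# Hypothetical message, as it might look after passing through the proxy.
SAMPLE = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spambayes-Classification: unsure\n"
    "\n"
    "Just testing.\n"
)


def classification(raw_message):
    """Return the X-Spambayes-Classification value, or None if absent."""
    msg = message_from_string(raw_message)
    return msg["X-Spambayes-Classification"]


if __name__ == "__main__":
    print(classification(SAMPLE))
```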
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="webinterface">How do I train
!       Spambayes (web method)</a></span>
!       <p>
!         Follow the "Review messages" link and you'll see a list of the
!         emails that the system has seen so far. Check the appropriate boxes
!         and hit Train. The messages disappear (eventually you'll be able to
!         get back to them, for instance to correct any training mistakes) and
!         if you go back to the home page you'll see that the "Total emails
!         trained" has increased.
!       </p>
!       <p>
!         Once you've done this on a few spams and a few hams, you'll find
!         that the X-Spambayes-Classification header is getting it right most
!         of the time. The more you train it the more accurate it gets.
!         There's no need to train it on every message you receive, but you
!         should train on a few spams and a few hams on a regular basis. You
!         should also try to train it on about the same number of spams as
!         hams.
!       </p>
!       <p>
!         You can train it on lots of messages in one go by either using the
!         Hammie script as explained in the "Command-line training" section,
!         or by giving messages to the web interface via the "Train" form on
!         the Home page. You can train on individual messages (which is
!         tedious) or using mbox files.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="smtptraining">How do I train
!     Spambayes (forward/bounce method)</a></span>
!       <p>
!         Alternatively, when you receive an incorrectly classified message,
!         you can forward it to the SMTP proxy for training. If the message
!         should have been classified as spam, forward or bounce the message
!         to spambayes_spam@localhost, and if the message should have been
!         classified as ham, forward it to spambayes_ham@localhost. You can
!         still review the training through the web interface, if you wish to
!         do so.
!       </p>
!       <p>
!         Note that you must set (via the web interface) the "add mail id to"
!         option in order to use this. You can also use this id to find a
!         particular message via the web interface.
!       </p>
!       <p>
!         Note that some mail clients (particularly Outlook Express) do not
!         forward all headers when you bounce, forward or redirect mail. For
!         these clients, you will need to set (via the web interface) the "add
!         mail id to" option to body, which will add a unique id to the body
!         of each message you receive.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="cmdline">How do I train Spambayes
!     (command line method)</a></span>
! 
!       <p>
!         Given a pair of Unix mailbox format files (each message starts with
!         a line which begins with 'From '), one containing nothing but spam
!         and the other containing nothing but ham, you can train Spambayes
!         using a command like:
! <pre>
!     hammie.py -g ~/tmp/newham -s ~/tmp/newspam
! </pre>
!         The exact form of the above command depends on your operating
!         system (e.g., a UNIX shell or the Windows command prompt).  You can
!         also use the web interface for training as detailed above.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="unsure">I just got a spam, but the
!       system said it was "unsure".  Why couldn't it tell that it was spam
!       &mdash; it's obvious?</a></span>
! 
!       <p>
!         It may be obvious to you, but the classifier only works on the
!         information it has been given. Maybe this is "new" (you've never
!         seen this particular flavor of spam before), or maybe there aren't
!         enough clues in the message which the system is aware of as strong
!         spam clues.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="stillunsure">OK, I trained on that
!       message. But I just got <em>another</em> one, and the stupid system
!       still thinks it's unsure.  Why did it ignore me?</a></span>
! 
!       <p>
!         It didn't, but you may need to train on a few more of this type of
!         message to get it classified as "spam". The classification algorithm
!         weights its results based on the number of times it has seen a
!         particular clue, so that clues unique to this type of message may
!         need a few more instances to become "convincing".
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="wipetraining">I've mucked up my
!       training and I want to start all over again, but there isn't an option
!       for this anywhere.  What do I do?</a></span>
! 
!       <p>
!         Because training from scratch is a very rare occurrence, and because
!         deleting all your training information is something you don't want
!         to do by accident, there isn't an option for this. However, you can
!         quite simply do this manually. All the training data is stored in a
!         file, usually called hammie.db, and if you delete (or rename) this,
!         then you will start training from scratch. If you are using the web
!         interface for the POP3 proxy, the configuration page tells you what
!         this file is called (and where it is) down towards the bottom of the
!         page.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="configfiles">I can't use a web
!       browser, so I can't configure pop3proxy/imapfilter.  Also: how do I
!       configure hammiefilter and the other applications that don't have a
!       user interface?</a></span>
! 
!       <p>
!         You need to create a configuration file. This is in the 'standard'
!         ini file format (originally created for Windows 3.1, I believe).
!         You can find documentation on this format in the <a href=
!         "http://www.python.org/doc/current/lib/module-ConfigParser.html">Python
!         ConfigParser doc</a>, but basically, it's just a text file: lines
!         beginning with # are comments, sections start with a line like
!         "[Section Name]", and options are set out within the appropriate
!         section with lines like "opt = val" or "opt: val" (either is okay).
!         Whitespace other than line endings is for the most part ignored, so
!         you can make it look like whatever you like. You can see what a
!         configuration file of all the defaults would look like if you
!         execute the following Python commands:
! <pre>
!     &gt;&gt;&gt; from spambayes.Options import options
!     &gt;&gt;&gt; print options.display()
  </pre>
!       </p>
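!       <p>
!         To make the file format described above concrete, here is a small
!         sketch. It uses the "configparser" module from current Python 3
!         (the successor of the ConfigParser module linked above), and the
!         section and option names are invented for illustration; they are
!         not real Spambayes options.
!       </p>

```python
import configparser

# A made-up configuration file showing comments, a section header,
# and both accepted option styles ("opt = val" and "opt: val").
SAMPLE = """\
# Lines beginning with # are comments.
[Example Section]
first_option = 1
second_option: some value
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

if __name__ == "__main__":
    for section in parser.sections():
        for name, value in parser.items(section):
            print(section, name, value)
```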
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="optionstoset">That's great, now I
!       know what the format looks like, but what options do I need to
!       set?</a></span>
! 
!       <p>
!         This depends on exactly what you want to do, and which application
!         you are intending to use. The easiest thing is to execute the
!         following Python commands:
! <pre>
!     &gt;&gt;&gt; from spambayes.Options import options
!     &gt;&gt;&gt; print options.display_full()
  </pre>
        This will print out a complete list of the options, including a
!       description of the option, and its default value. You can also look up
!       a single section, if you know its name:<br>
! 
! <pre>
!     &gt;&gt;&gt; print options.display_full("section_name")
  </pre>
! 
!       Or just a single option:<br>
! 
! <pre>
!     &gt;&gt;&gt; print options.display_full("section_name", "option_name")
  </pre>
!       If you want a list of all the sections, you can use this command:<br>
! 
! <pre>
!     &gt;&gt;&gt; print options.sections()
  </pre>
! 
!       If you want a list of all the options, you can use this command:<br>
! 
! <pre>
!     &gt;&gt;&gt; print options.options(prepend_section_name=False)
  </pre>
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="configlocation">I've made a
!     configuration file, but Spambayes is ignoring it. Now what?</a></span>
! 
!       <p>
!         Spambayes looks for your configuration file in three places - if it
!         can't find it, then, obviously, your options will not be loaded.
!         The first place that Spambayes checks is the environment variable
!         BAYESCUSTOMIZE. You can set this to the path of your configuration
!         file, wherever it is, and it will be loaded. You can also specify
!         more than one file, separated by the appropriate path separator for
!         your platform. This is the recommended method of specifying the
!         location of the file, unless you do so via a user interface (as
!         provided by the POP3 proxy, the Outlook plugin, and the IMAP
!         filter). If Spambayes doesn't find anything in the BAYESCUSTOMIZE
!         variable, then it checks the current working directory and your home
!         directory for a bayescustomize.ini or .spambayesrc file
!         (respectively).
!       </p>
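!       <p>
!         The lookup order described above can be sketched roughly as
!         follows. This is an illustration only, not the actual Spambayes
!         code: the function name and its parameters are hypothetical, and
!         it only lists the candidate paths without checking that the files
!         exist.
!       </p>

```python
import os


def candidate_config_files(environ=None, cwd=None, home=None):
    """List the places a configuration file would be looked for, in order."""
    environ = os.environ if environ is None else environ
    value = environ.get("BAYESCUSTOMIZE")
    if value:
        # Several files may be given, separated by the platform's
        # path separator (":" on Unix, ";" on Windows).
        return [path for path in value.split(os.pathsep) if path]
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home
    return [
        os.path.join(cwd, "bayescustomize.ini"),
        os.path.join(home, ".spambayesrc"),
    ]


if __name__ == "__main__":
    for path in candidate_config_files():
        print(path)
```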
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="shortwords">Why don't short words or
!       long words show up in the clues?</a></span>
! 
!       <p>
!         Words shorter than 3 characters are skipped, and words longer than
!         12 characters are converted into a special 'long-word' token. These
!         numbers (3 and 12) were determined by brute force testing, and
!         produced the best overall results (including compared to no upper
!         or lower limits).
!       </p>
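!       <p>
!         As a simplified sketch of that rule (the real tokenizer does a
!         great deal more, and the exact token it emits for long words may
!         differ from the placeholder used here):
!       </p>

```python
def tokenize_words(text, min_len=3, max_len=12):
    """Yield word tokens, applying the short/long word rule.

    "long-word" is a placeholder name taken from the description above,
    not necessarily the token the real tokenizer generates.
    """
    for word in text.split():
        if len(word) < min_len:
            continue  # too short: skipped entirely
        if len(word) > max_len:
            yield "long-word"  # too long: replaced by a special token
        else:
            yield word


if __name__ == "__main__":
    print(list(tokenize_words("an offer of unbelievableness today")))
```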
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="graybutton">Why is the enable filter
!       button grayed out in Outlook?</a></span>
! 
!       <p>
!         You need to have done these things to enable that button:
!         <ol type="i">
!           <li>
!             Trained at least 5 ham and 5 spam
!           </li>
!           <li>
!             Set at least one folder to watch
!           </li>
!           <li>
!             Set folders to move spam to, and to move unsures to
!           </li>
!           <li>
!             Changed the action to "copy" or "move", rather than "untouched"
!           </li>
!         </ol>
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="whatelse">Is there anything else I
!       should know?</a></span>
! 
!       <p>
!         While Spambayes does an excellent job of classifying incoming mail,
!         it is only as good as the data on which it was trained. Here are
!         some tips to help you create a good training set:
!       </p>
!       <ul>
!         <li>
!           Don't use old mail. The characteristics of your email change over
!           time, sometimes subtly, sometimes dramatically, so it's best to
!           use very recent mail to train Spambayes. If you've abandoned an
!           email address in the past because it was getting spammed heavily,
!           there are probably some clues in mail sent to your old address
!           which would bias Spambayes.
!         </li>
!         <li>
!           Check and recheck your training collections. While you are
!           manually classifying mail as spam or ham, it's easy to make a
!           mistake and toss a message or ten in the wrong file. Such
!           miscategorized mail will throw off the classifier.
!         </li>
!       </ul>
!     </li>
!   </ol>
! 
!   <li>
!       <span class="faq-section">Development</span>
! 
!     <ol type="a">
!     <li>
!       <span class="faq-header"><a name="tokentrick">Hey! Why don't you
!       implement cool tokenizer trick X? I think it would really foil those
!       spammers!</a></span>
! 
!       <p>
!         Have you run your tokenizer trick against a set of messages to see
!         if it actually works? Many times what seems like a good idea turns
!         out not to help much, and sometimes even hurts. If you have a good
!         idea, you've run it against a batch of messages and can prove that
!         it helps, paste the code for your technique and the proof to the
!         mailing list. If you're not a coder, but are really keen on your
!         idea, post a feature request on the project page, and wait for
!         someone else to code it for you (but make sure you do some testing
!         when it's done). Otherwise, you will likely get a message from Tim
!         Peters about why you need to test your idea :) Note that as a
!         general rule, we've found that with the tokenizer, "stupid beats
!         smart" &mdash; that is, very specialized tokenizer behavior usually
!         produces worse results than a more general approach that just
!         generates tokens and throws them at the classifier.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="serverside">This software is great!
!       I want to implement it for all my users. Are there plans to develop a
!       server-side spambayes solution?</a></span>
! 
!       <p>
!         The problem with a server-side solution is that everyone has a
!         different idea of what is spam - that's the whole strength of the
!         bayesian-style filtering concept. If you are certain that
!         <em>all</em> of your users would agree on what is spam and what is
!         not, then this might work for you, but otherwise you really have to
!         have individual databases for each user. Either way, you should be
!         able to modify spambayes easily enough to fit into your setup.
!         Please let the list know if you do have success in this area, and
!         we'll update this answer.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="ngrams">Forget tokenizing words -
!       you should use character n-grams!</a></span>
! 
!       <p>
!         This was quite carefully tested. Character 3-grams gave five times
!         as many false positives, and twice as many false negatives as
!         splitting on whitespace (words). Character 5-grams came fairly close
!         to words with false positives, but the number of false negatives was
!         worse than with 3-grams. N-grams also create many more unique
!         tokens, which means much slower operation. In addition, it's much
!         harder to figure out <em>why</em> a message scored as it did with
!         n-grams. On the other hand, words are easy to understand.  There
!         was, however, one area where n-grams were much better: detecting
!         spam in Asian languages. Since a 'word' in an Asian language message
!         ends up being an entire line, words don't work very well at all.
!       </p>
!     </li>
! 
!     <li>
!       <span class="faq-header"><a name="clues">The clues for my mail are all
!       in lower case, but "FREE" is a much better clue than "free". Why do
!       you force everything into lower case?</a></span>
! 
!       <p>
!         This was very carefully considered. On the one hand, removing
!         case does hide information (and we're not really sure what it does
!         to non-English languages), but on the other hand, keeping case makes
!         the database a lot bigger, and requires more training. In the end,
!         testing with case removed resulted in no change in the false
!         positive rate, and a small reduction in the false negative rate, so
!         that's what we do. There is one exception: we keep case in subject
!         lines, because testing showed an improvement if we did that.
!       </p>
!     </li>
!   </ol>
! </ol>
--- 1,867 ----
! Title: SpamBayes: Frequently Asked Questions 
! Author-Email: spambayes-dev@python.org 
  Author: spambayes
!     <h1>
!       Frequently Asked Questions
!     </h1>
!     <ol>
!       <li>Overview<br>
!         <ol style="list-style:lower-alpha;">
!           <li>
!             <a href="#whatisit">So what is Spambayes?</a>
!           </li>
!           <li>
!             <a href="#requirements">What do I need to install
!             Spambayes?</a>
!           </li>
!           <li>
!             <a href="#tenkfoot">Is there a &quot;ten thousand foot
!             view&quot; that shows how this thing works?</a>
!           </li>
!           <li>
!             <a href="#whereis">Where does all this stuff live?</a>
!           </li>
!         </ol>
        </li>
!       <li>Compatibility<br>
!         <ol style="list-style:lower-alpha;">
!           <li>
!             <a href="#outlookversions">What version of Outlook does it
!             work with?</a>
!           </li>
!           <li>
!             <a href="#outlookexpress">Does Spambayes work with Outlook
!             Express?</a>
!           </li>
!           <li>
!             <a href="#nopython">Do I have to have python installed to
!             use Spambayes with Outlook?</a>
!           </li>
!           <li>
!             <a href="#nonoutlook">Forget Outlook, what clients will
!             Spambayes work with in general?</a>
!           </li>
!           <li>
!             <a href="#exchange">We have Outlook 2000 connecting to an
!             Exchange 2000 server. Will spambayes work for us?</a>
!           </li>
!         </ol>
        </li>
!       <li>Using Spambayes<br>
!         <ol style="list-style:lower-alpha;">
!           <li>
!             <a href="#configs">How do I configure Spambayes?</a>
!           </li>
!           <li>
!             <a href="#webinterface">How do I train Spambayes (web
!             method)</a>
!           </li>
!           <li>
!             <a href="#smtptraining">How do I train Spambayes
!             (forward/bounce method)</a>
!           </li>
!           <li>
!             <a href="#cmdline">How do I train Spambayes (command line
!             method)</a>
!           </li>
!           <li>
!             <a href="#unsure">I just got a spam, but the system said it
!             was &quot;unsure&quot;. Why couldn&#39;t it tell that it
!             was spam - it&#39;s obvious?</a>
!           </li>
!           <li>
!             <a href="#stillunsure">OK, I trained on that message. But I
!             just got <i>another</i> one, and the stupid system still
!             thinks it&#39;s unsure. Why did it ignore me???</a>
!           </li>
!           <li>
!             <a href="#wipetraining">I&#39;ve mucked up my training and
!             I want to start all over again, but there isn&#39;t an
!             option for this anywhere. What do I do?</a>
!           </li>
!           <li>
!             <a href="#configfiles">I can&#39;t use a web browser, so I
!             can&#39;t configure pop3proxy/imapfilter.<br>
!             Also: how do I configure hammiefilter and the other
!             applications that don&#39;t have a user interface?</a>
!           </li>
!           <li>
!             <a href="#optionstoset">That&#39;s great, now I know what
!             the format looks like, but what options do I need to
!             set?</a>
!           </li>
!           <li>
!             <a href="#configlocation">I&#39;ve made a configuration
!             file, but Spambayes is ignoring it. Now what?</a>
!           </li>
!           <li>
!             <a href="#shortwords">Why don&#39;t short words or long
!             words show up in the clues?</a>
!           </li>
!           <li>
!             <a href="#whatelse">Is there anything else I should
!             know?</a>
!           </li>
!         </ol>
        </li>
!       <li>Development<br>
!         <ol style="list-style:lower-alpha;">
!           <li>
!             <a href="#tokentrick">Hey! Why don&#39;t you implement cool
!             tokenizer trick X? I think it would really foil those
!             spammers!</a>
            </li>
            <li>
!             <a href="#serverside">This software is great! I want to
!             implement it for all my users. Are there plans to develop a
!             server-side spambayes solution?</a>
            </li>
            <li>
!             <a href="#ngrams">Forget tokenising words - you should use
!             character n-grams!</a>
            </li>
            <li>
!             <a href="#clues">The clues for my mail are all in lower
!             case, but &quot;FREE&quot; is a much better clue than
!             &quot;free&quot;. Why do you force everything into lower
!             case?</a>
            </li>
            <li>
!             <a href="#bounce">Why don&#39;t you provide the ability to
!             bounce spam back to the sender?</a>
            </li>
          </ol>
!       </li>
!     </ol>
!     <p>
!       If you have any suggestions about other questions and answers
!       that should be included here, please mail <a href= 
!       "mailto:spambayes-dev@python.org?Subject=(from%20FAQ)">the list</a>
!       with them.
!     </p>
!     <h2>
!       1. Overview
!     </h2>
!     <h3>
!       <a name="whatisit">1a. So what is Spambayes?</a>
!     </h3>
!     <p>
!       Spambayes is a tool used to segregate unwanted mail (spam) from
!       the mail you want (ham). Before Spambayes can be your spam filter
!       of choice you need to train it on representative samples of email
!       you receive. After it&#39;s been trained, you use Spambayes to
!       classify new mail according to its spamminess and hamminess
!       qualities.
!     </p>
!     <p>
!       To train Spambayes (which you don&#39;t need to do if you&#39;re
!       going to be using the POP3 proxy to classify messages, but
!       you&#39;ll get better results from the outset if you do) you need
!       to save your incoming email for a while, segregating it into two
!       piles, known spam and known ham (ham is our nickname for good
!       mail). It&#39;s best to train on recent email, because your
!       interests and the nature of what spam looks like change over
!       time. Once you&#39;ve collected a fair portion of each (anything
!       is better than nothing, but it helps to have a couple hundred of
!       each), you can tell Spambayes, &quot;Here&#39;s my ham and my
!       spam&quot;. It will then process that mail and save information
!       about different patterns which appear in ham and spam. That
!       information is then used during the filtering stage. See the
!       &quot;Command-line training&quot; section below for details.
!     </p>
!     <p>
!       When Spambayes filters your email, it compares each unclassified
!       message against the information it saved from training and makes
!       a decision about whether it thinks the message qualifies as ham
!       or spam, or if it&#39;s unsure about how to classify the message.
!       It then adds its classification to the message, either by adding
!       a header (X-Spambayes-Classification: spam|ham|unsure), modifying
!       the To: or Subject: headers, or adding a &quot;Spam&quot; field
!       to the message. Depending on which Spambayes application you are
!       using, it may then filter this message for you, or you can set up
!       your own filters (to file away suspected spam into its own mail
!       folder, for example).
!     </p>
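!     <p>
!       As a rough illustration of acting on that header yourself, the
!       Python sketch below reads X-Spambayes-Classification with the
!       standard email package and picks a folder. The function names and
!       folder names are made up for this example and are not part of
!       Spambayes.
!     </p>

```python
from email import message_from_string


def classification_of(raw_message):
    """Return the verdict stored in the header, or None if it is absent."""
    msg = message_from_string(raw_message)
    value = msg.get("X-Spambayes-Classification")
    return value.strip().lower() if value is not None else None


def folder_for(raw_message):
    # Folder names are hypothetical; use whatever your mail setup expects.
    verdict = classification_of(raw_message)
    return {"spam": "Spam", "unsure": "Unsure"}.get(verdict, "Inbox")


sample = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spambayes-Classification: spam\n"
    "\n"
    "body text\n"
)
print(folder_for(sample))
```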
!     <h3>
!       <a name="requirements">1b. What do I need to install
!       Spambayes?</a>
!     </h3>
!     <p>
!       Unless you are using the Outlook plugin, you must have a recent
!       version of Python installed on your computer, version 2.2 or
!       later. (Don&#39;t ask about backporting it to earlier versions of
!       Python. It&#39;s almost a certainty this won&#39;t happen.) If
!       you need to install Python on your system, check the Python
!       download page for the version appropriate to your computer:
!     </p>
!     <p style="text-indent: 1em;">
!       <a href=
!       "http://www.python.org/download/">http://www.python.org/download/</a>
!     </p>
!     <p>
!       You also need version 2.4.3 or above of the Python
!       &quot;email&quot; package. If you&#39;re running Python 2.2.2 or
!       above, then you already have this. If not, you can download it
!       from <a href="http://mimelib.sf.net">http://mimelib.sf.net</a>
!       and install it - unpack the archive, cd to the email-2.4.3
!       directory and type &quot;python setup.py install&quot; (YMMV on
!       different platforms). This will install it into your Python
!       site-packages directory. You&#39;ll also need to move aside the
!       standard &quot;email&quot; library - go to your Python
!       &quot;Lib&quot; directory and rename &quot;email&quot; to
!       &quot;email_old&quot;.
!     </p>
!     <h3>
!       <a name="tenkfoot">1c. Is there a &quot;ten thousand foot
!       view&quot; that shows how this thing works?</a>
!     </h3>
!     <p>
!       There are eight main components to the Spambayes system:
!     </p>
!     <ol>
!       <li>A database. Loosely speaking, this is a collection of words
!       and associated spam and ham probabilities. The database says
!       &quot;If a message contains the word &#39;Viagra&#39; then
!       there&#39;s a 98% chance that it&#39;s spam, and a 2% chance that
!       it&#39;s ham.&quot; This database is created by training - you
!       give it messages, tell it whether those messages are ham or spam,
!       and it adjusts its probabilities accordingly. How to train it is
!       covered below. By default it lives in a file called
!       &quot;hammie.db&quot; or (for the Outlook plugin)
!       &quot;default_bayes_database&quot;.
!       </li>
!       <li>The tokeniser/classifier. This is the core engine of the
!       system. The tokenizer splits emails into tokens (words, roughly
!       speaking), and the classifier looks at those tokens to determine
!       whether the message looks like spam or not. You don&#39;t use the
!       tokeniser/classifier directly - it powers the other parts of the
!       system.
!       </li>
!       <li>The POP3 proxy. This sits between your email client (Eudora,
!       Outlook Express, etc) and your incoming email server, and adds
!       the classification header to emails as you download them. A
!       typical user&#39;s email setup looks like this: 
!         <pre>
!    +-----------------+                              +-------------+
!    | Outlook Express |      Internet or intranet    |             |
!    |  (or similar)   | &lt;--------------------------&gt; | POP3 server |
!    |                 |                              |             |
!    +-----------------+                              +-------------+
! </pre>The POP3 server runs either at your ISP for internet mail, or
! somewhere on your internal network for corporate mail. The POP3 proxy
! sits in the middle and adds the classification header as you retrieve
! your email: 
!         <pre>
!    +-----------------+        +------------+        +-------------+
!    | Outlook Express |        | Spambayes  |        |             |
!    |  (or similar)   | &lt;----&gt; | POP3 proxy | &lt;----&gt; | POP3 server |
!    |                 |        |            |        |             |
!    +-----------------+        +------------+        +-------------+
! </pre>So where you currently have your email client configured to talk
! to, say, &quot;pop3.my-isp.com&quot;, you instead configure the <i>
!         proxy</i> to talk to &quot;pop3.my-isp.com&quot; and configure
!         your email client to talk to the proxy. The POP3 proxy can live
!         on your PC, or on the same machine as the POP3 server, or on a
!         different machine entirely; it really doesn&#39;t matter. If
!         it&#39;s living on your PC, you&#39;d configure your email
!         client to talk to &quot;localhost&quot;. You can configure the
!         proxy to talk to multiple POP3 servers, if you have more than
!         one email account.
!       </li>
!       <li>The SMTP proxy. This sits between your email client (Eudora,
!       Outlook Express, etc) and your outgoing email server. Any mail
!       sent to spambayes_spam@localhost or spambayes_ham@localhost is
!       intercepted and trained appropriately. A typical user&#39;s email
!       setup looks like this: 
!         <pre>
!    +-----------------+                              +-------------+
!    | Outlook Express |      Internet or intranet    |             |
!    |  (or similar)   | &lt;--------------------------&gt; | SMTP server |
!    |                 |                              |             |
!    +-----------------+                              +-------------+
! </pre>The SMTP server runs either at your ISP for internet mail, or
! somewhere on your internal network for corporate mail. The SMTP proxy
! sits in the middle and checks for mail to train on as you send your
! email: 
!         <pre>
!    +-----------------+        +------------+        +-------------+
!    | Outlook Express |        | Spambayes  |        |             |
!    |  (or similar)   | &lt;----&gt; | SMTP proxy | &lt;----&gt; | SMTP server |
!    |                 |        |            |        |             |
!    +-----------------+        +------------+        +-------------+
! </pre>So where you currently have your email client configured to talk
! to, say, &quot;smtp.my-isp.com&quot;, you instead configure the <i>
!         proxy</i> to talk to &quot;smtp.my-isp.com&quot; and configure
!         your email client to talk to the proxy. The SMTP proxy can live
!         on your PC, or on the same machine as the SMTP server, or on a
!         different machine entirely; it really doesn&#39;t matter. If
!         it&#39;s living on your PC, you&#39;d configure your email
!         client to talk to &quot;localhost&quot;. You can configure the
!         proxy to talk to multiple SMTP servers, if you have more than
!         one email account.
!       </li>
!       <li>The web interface. This is a server that runs alongside the
!       POP3 proxy, SMTP proxy, and IMAP filter (see below) and lets you
!       control it through the web. You can upload emails to it for
!       training or classification, query the probabilities database
!       (&quot;How many of my emails really <i>do</i> contain the word
!       Viagra?&quot;), find particular messages, and most importantly,
!       train it on the emails you&#39;ve received. When you start using
!       the system, unless you train it using the Hammie script it will
!       classify most things as Unsure, and often make mistakes. But it
!       keeps copies of all the emails it&#39;s seen, and through the web
!       interface you can train it by going through a list of all the
!       emails you&#39;ve received and checking a Ham/Spam box next to
!       each one. After training on a few messages (say 20 spams and 20
!       hams), you&#39;ll find that it&#39;s getting it right most of the
!       time. The web training interface automatically checks the
!       Ham/Spam boxes according to what it thinks, so all you need to do
!       is correct the odd mistake - it&#39;s very quick and easy.
!       </li>
!       <li>The Outlook plug-in. For Outlook 2000 and Outlook XP (2002)
!       users (not Outlook Express) this lets you manage the whole thing
!       from within Outlook. You set up a Ham folder and a Spam folder,
!       and train it simply by dragging messages into those folders.
!       Alternatively there are buttons to do the same thing. And it
!       integrates into Outlook&#39;s filtering system to make it easy to
!       file all the suspected spam into its own folder, for instance.
!       </li>
!       <li>The Hammie script. This does three jobs: command-line
!       training, procmail filtering, and XML-RPC. See below for details
!       of how to use Hammie for training, and how to use it as a procmail
!       filter. Hammie can also run as an XML-RPC server, so that a
!       programmer can write code that uses a remote server to classify
!       emails programmatically - see hammiesrv.py.
!       </li>
!       <li>The IMAP filter. This is a cross between the POP3 proxy and
!       the Outlook plugin. If your mail sits on an IMAP server, you can
!       use this to filter your mail. You can designate folders that
!       contain mail to train as ham and folders that contain mail to
!       train as spam, and the filter does this for you. You can also
!       designate folders to filter, along with a folder for messages
!       Spambayes is unsure about, and a folder for suspected spam. When
!       new mail arrives, the filter will move mail to the appropriate
!       location (ham is left in the original folder).
!       </li>
!     </ol>
!     <h3>
!       <a name="whereis">1d. Where does all this stuff live?</a>
!     </h3>
!     <p>
!       The Hammie script is called hammie.py. The POP3 proxy lives in
!       pop3proxy.py, and the SMTP proxy lives in smtpproxy.py. The IMAP
!       filter lives in imapfilter.py. The Outlook plug-in lives in the
!       Outlook2000 subdirectory &#8212; see the README.txt in that
!       directory for more information on that.
!     </p>
!     <p>
!       As well as these components, there&#39;s also a whole pile of
!       utility scripts, test harnesses and so on &#8212; see README.txt
!       and TESTING.txt in the spambayes distribution for more
!       information.
!     </p>
!     <h2>
!       2. Compatibility
!     </h2>
!     <h3>
!       <a name="outlookversions">2a. What version of Outlook does it
!       work with?</a>
!     </h3>
!     <p>
!       The most up to date list of known compatible versions of Outlook
!       may be found <a href= 
!       "http://spambayes.sourceforge.net/windows.html">here</a>.
!     </p>
!     <h3>
!       <a name="outlookexpress">2b. Does Spambayes work with Outlook
!       Express?</a>
!     </h3>
!     <p>
!       Outlook Express isn&#39;t a version of Outlook, it&#39;s a
!       completely separate program (from the same company). Because they
!       give it away for free, OE is a really stripped down program, and
!       it&#39;s extremely difficult to create a plugin for it.
!     </p>
!     <p>
!       You can use pop3proxy and/or imapfilter with Outlook Express,
!       however you must have either the alpha 3 release, or a recent CVS
!       snapshot in order to do so (alpha 2 does not include all the
!       necessary features). Because Outlook Express does not let you
!       filter on arbitrary headers (like X-Spambayes-Classification),
!       pop3proxy must add the classification to the &quot;To:&quot;
!       line, or the &quot;Subject&quot; line.
!     </p>
!     <p>
!       Pop3proxy/imapfilter aren&#39;t quite as &#39;transparent&#39; as
!       the Outlook plugin, but they&#39;re still quite easy to
!       use/setup, and they use the same core, so the results will be the
!       same.
!     </p>
!     <h3>
!       <a name="nopython">2c. Do I have to have Python installed to use
!       Spambayes with Outlook?</a>
!     </h3>
!     <p>
!       You should be able to download the Outlook plugin binary and
!       install that, and that&#39;s all you need.
!     </p>
!     <h3>
!       <a name="nonoutlook">2d. Forget Outlook, what clients will
!       Spambayes work with in general?</a>
!     </h3>
!     <p>
!       Spambayes will work with most POP3 or IMAP compatible clients.
!       How you set it up depends on your local architecture. Users with
!       access to procmail can just write a recipe that invokes spambayes
!       like this:
!     </p>
!     <pre>
!  :0fw 
!  | /opt/spambayes/hammiefilter.py
  </pre>
!     <p>
!       Followed by a recipe to check the results and take action:
!     </p>
!     <pre>
!  :0
!  * ^X-Spambayes-Classification: spam 
!  ${MAILDIR}/spam
  </pre>
      <p>
!       Emacs and XEmacs both come with VM, one of a choice of several
!       Emacs-based mail packages. Emacs is extensible using Emacs Lisp
!       or Pymacs. This extensibility allows you to easily segregate your
!       incoming mail for training purposes. Here&#39;s one such example.
!       If you place the following code in your ~/.vm file:
      </p>
!     <pre>
!  (defun copy-to-spam ()
!    (interactive)
!    (vm-save-message (expand-file-name &quot;~/tmp/newspam&quot;))
!    (vm-undelete-message 1))
! 
!  (defun copy-to-nonspam ()
!    (interactive)
!    (vm-save-message (expand-file-name &quot;~/tmp/newham&quot;))
!    (vm-undelete-message 1))
! 
!  (define-key vm-mode-map &quot;ls&quot; &#39;copy-to-spam)
!  (define-key vm-summary-mode-map &quot;ls&quot; &#39;copy-to-spam)
!  (define-key vm-mode-map &quot;lh&quot; &#39;copy-to-nonspam)
!  (define-key vm-summary-mode-map &quot;lh&quot; &#39;copy-to-nonspam)
  </pre>
!     <p>
!       &quot;ls&quot; will save a copy of the current message to
!       ~/tmp/newspam and &quot;lh&quot; will save a copy of the current
!       message to ~/tmp/newham. You can then use those files later as
!       arguments to hammie.py for training.
!     </p>
!     <p>
!       Users limited to POP3/IMAP communications to the server can use
!       the <a href= 
!       "http://spambayes.sourceforge.net/applications.html#pop3">POP3</a>
!       or <a href= 
!       "http://spambayes.sourceforge.net/applications.html#imap">IMAP
!       proxy</a> with the <a href= 
!       "https://sourceforge.net/project/showfiles.php?group_id=61702">Spambayes
!       source code</a>.
!     </p>
!     <h3>
!       <a name="exchange">2e. We have Outlook 2000 connecting to an
!       Exchange 2000 server. Will spambayes work for us?</a>
!     </h3>
!     <p>
!       It should, yes. There haven&#39;t been any problems reported
!       using that combination.
!     </p>
!     <h2>
!       3. Using Spambayes
!     </h2>
!     <h3>
!       <a name="configs">3a. How do I configure Spambayes?</a>
!     </h3>
!     <p>
!       The system is configured through a file called
!       &quot;bayescustomize.ini&quot;. In here you can configure the
!       name and type of your database, the POP3 server(s) you want to
!       proxy to, the ports you want the proxy and the web interface to
!       run on, and so on. You can also control details like how sure you
!       want the system to be that a message really is spam before it
!       marks it as such. The default values for all the options, and
!       the documentation for them, all live in Options.py.
!     </p>
!     <p>
!       To change an option, create a bayescustomize.ini and add the
!       option to that - don&#39;t edit Options.py. If you are using the
!       POP3 proxy, SMTP proxy or IMAP filter, you can also change most
!       of the options you will need to access via the web user
!       interface. You will probably find this at
!       &lt;http://localhost:8880&gt;. To configure the Outlook plugin,
!       you should click on the Anti-Spam button on the toolbar.
!     </p>
!     <p>
!       To set up the POP3 and SMTP proxies (optional), run:
!     </p>
!     <pre>
!   pop3proxy.py -b
! </pre>
!     <p>
!       from the command line. The web interface should open in your
!       default browser. You need to click on the &quot;Configuration
!       Link&quot; to go to the setup page. The minimum you need to do to
!       get started is enter the servers and ports information in the
!       POP3 proxy and SMTP proxy sections.
!     </p>
!     <p>
!       The POP3 proxy is then ready for your email client to connect to
!       it on port 110 and the SMTP proxy is ready for connections on
!       port 25. You now need to configure your email client to talk to
!       the proxies instead of the real email servers. Change your
!       equivalent of &quot;pop3.my-isp.com&quot; to
!       &quot;localhost&quot; (or to the name of the machine you&#39;re
!       running the proxy on) in your email client&#39;s setup, and do
!       the same with your equivalent of &quot;smtp.my-isp.com&quot;. Hit
!       &quot;Get new email&quot; and look at the headers of the emails
!       (send yourself an email if you don&#39;t have any!) - there
!       should be an X-Spambayes-Classification header there. It probably
!       says &quot;unsure&quot;, if you haven&#39;t done any training
!       yet. You should be able to create a mail folder called
!       &quot;Suspected spam&quot; and set up a filtering rule that puts
!       emails with an &quot;X-Spambayes-Classification: spam&quot;
!       heading into that folder. (Eventually we should publish
!       instructions on how to do this in all the popular email clients).
!     </p>
!     <h3>
!       <a name="webinterface">3b. How do I train Spambayes (web
!       method)</a>
!     </h3>
!     <p>
!       Follow the &quot;Review messages&quot; link and you&#39;ll see a
!       list of the emails that the system has seen so far. Check the
!       appropriate boxes and hit Train. The messages disappear
!       (eventually you&#39;ll be able to get back to them, for instance
!       to correct any training mistakes) and if you go back to the home
!       page you&#39;ll see that the &quot;Total emails trained&quot; has
!       increased.
!     </p>
!     <p>
!       Once you&#39;ve done this on a few spams and a few hams,
!       you&#39;ll find that the X-Spambayes-Classification header is
!       getting it right most of the time. The more you train it the more
!       accurate it gets. There&#39;s no need to train it on every
!       message you receive, but you should train on a few spams and a
!       few hams on a regular basis. You should also try to train it on
!       about the same number of spams as hams.
!     </p>
!     <p>
!       You can train it on lots of messages in one go by either using
!       the Hammie script as explained in the &quot;Command-line
!       training&quot; section, or by giving messages to the web
!       interface via the &quot;Train&quot; form on the Home page. You
!       can train on individual messages (which is tedious) or using mbox
!       files.
!     </p>
!     <h3>
!       <a name="smtptraining">3c. How do I train Spambayes
!       (forward/bounce method)</a>
!     </h3>
!     <p>
!       Alternatively, when you receive an incorrectly classified
!       message, you can forward it to the SMTP proxy for training. If
!       the message should have been classified as spam, forward or
!       bounce the message to spambayes_spam@localhost, and if the
!       message should have been classified as ham, forward it to
!       spambayes_ham@localhost. You can still review the training
!       through the web interface, if you wish to do so.
!     </p>
!     <p>
!       Note that you must set (via the web interface) the &quot;add mail
!       id to&quot; option in order to use this. You can also use this id
!       to find a particular message via the web interface.
!     </p>
!     <p>
!       Note that some mail clients (particularly Outlook Express) do not
!       forward all headers when you bounce, forward or redirect mail.
!       For these clients, you will need to set (via the web interface)
!       the &quot;add mail id to&quot; option to body, which will add a
!       unique id to the body of each message you receive.
!     </p>
!     <h3>
!       <a name="cmdline">3d. How do I train Spambayes (command line
!       method)</a>
!     </h3>
!     <p>
!       Given a pair of Unix mailbox format files (each message starts
!       with a line which begins with &#39;From &#39;), one containing
!       nothing but spam and the other containing nothing but ham, you
!       can train Spambayes using a command like:
!     </p>
!     <pre>
!   hammie.py -g ~/tmp/newham -s ~/tmp/newspam
  </pre>
+     <p>
+       The above command is run from your operating system&#39;s command
+       line (e.g. a UNIX shell or the Windows command prompt). You can
+       also use the web interface for training as detailed above.
+     </p>
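+     <p>
+       Before training from the command line, it can help to confirm
+       that a file really is in Unix mailbox format. The sketch below
+       uses the standard Python mailbox module to count the messages in
+       such a file; the count_messages helper and the two-message demo
+       file are assumptions made up for this example.
+     </p>

```python
import mailbox
import os
import tempfile


def count_messages(path):
    """Count the messages in a Unix mbox file."""
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


# Build a tiny two-message mbox purely for illustration.
raw = (
    "From sender@example.com Tue May 27 23:17:07 2003\n"
    "Subject: first\n"
    "\n"
    "hello\n"
    "\n"
    "From other@example.com Tue May 27 23:18:07 2003\n"
    "Subject: second\n"
    "\n"
    "world\n"
)
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "newham")
with open(demo_path, "w", newline="\n") as fh:
    fh.write(raw)

print(count_messages(demo_path))
```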
+     <h3>
+       <a name="unsure">3e. I just got a spam, but the system said it
+       was &quot;unsure&quot;. Why couldn&#39;t it tell that it was spam
+       &#8212; it&#39;s obvious?</a>
+     </h3>
+     <p>
+       It may be obvious to you, but the classifier only works on the
+       information it has been given. Maybe this is &quot;new&quot;
+       (you&#39;ve never seen this particular flavour of spam before),
+       or maybe there aren&#39;t enough clues in the message which the
+       system is aware of as strong spam clues.
+     </p>
+     <h3>
+       <a name="stillunsure">3f. OK, I trained on that message. But I
+       just got <i>another</i> one, and the stupid system still thinks
+       it&#39;s unsure. Why did it ignore me?</a>
+     </h3>
+     <p>
+       It didn&#39;t, but you may need to train on a few more of this
+       type of message to get it classified as &quot;spam&quot;. The
+       classification algorithm weights its results based on the number
+       of times it has seen a particular clue, so that clues unique to
+       this type of message may need a few more instances to become
+       &quot;convincing&quot;.
+     </p>
+     <h3>
+       <a name="wipetraining">3g. I&#39;ve mucked up my training and I
+       want to start all over again, but there isn&#39;t an option for
+       this anywhere. What do I do?</a>
+     </h3>
+     <p>
+       Because training from scratch is a very rare occurrence, and
+       because deleting all your training information is something you
+       don&#39;t want to do by accident, there isn&#39;t an option for
+       this. However, you can quite simply do this manually. All the
+       training data is stored in a file, usually called hammie.db, and
+       if you delete (or rename) this, then you will start training from
+       scratch. If you are using the web interface for the POP3 proxy,
+       the configuration page tells you what this file is called (and
+       where it is) down towards the bottom of the page.
+     </p>
+     <h3>
+       <a name="configfiles">3h. I can&#39;t use a web browser, so I
+       can&#39;t configure pop3proxy/imapfilter.<br>
+       Also: how do I configure hammiefilter and the other applications
+       that don&#39;t have a user interface?</a>
+     </h3>
+     <p>
+       You need to create a configuration file. This is in the
+       &#39;standard&#39; ini file format (originally created for
+       Windows 3.1, I believe). You can find documentation on this
+       format in the <a href= 
+       "http://www.python.org/doc/current/lib/module-ConfigParser.html">Python
+       ConfigParser doc</a>, but basically, it&#39;s just a text file:
+       lines beginning with # are comments, sections start with a line
+       like &quot;[Section Name]&quot;, and options are set out within
+       the appropriate section with lines like &quot;opt = val&quot; or
+       &quot;opt: val&quot; (either is ok). Whitespace other than line
+       endings is for the most part ignored, so you can make it look
+       like whatever you like. You can see what a configuration file
+       containing all the defaults would look like if you execute the
+       following Python commands:
+     </p>
+     <pre>
+   &gt;&gt;&gt; from spambayes.Options import options
+   &gt;&gt;&gt; print options.display()
+ </pre>
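+     <p>
+       To see the file format itself in action, the sketch below parses
+       a small ini fragment with the Python 3 configparser module (the
+       successor to the ConfigParser module linked above). The section
+       and option names are placeholders, not real Spambayes options.
+     </p>

```python
import configparser

# Section and option names below are placeholders, not real Spambayes options.
sample = """\
# lines beginning with # are comments
[Section Name]
opt = val
other_opt: another value
"""

parser = configparser.ConfigParser()
parser.read_string(sample)

# Both the "opt = val" and "opt: val" spellings are accepted.
print(parser.get("Section Name", "opt"))
print(parser.get("Section Name", "other_opt"))
```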
+     <h3>
+       <a name="optionstoset">3i. That&#39;s great, now I know what the
+       format looks like, but what options do I need to set?</a>
+     </h3>
+     <p>
+       This depends on exactly what you want to do, and which
+       application you are intending to use. The easiest thing is to
+       execute the following Python commands:
+     </p>
+     <pre>
+   &gt;&gt;&gt; from spambayes.Options import options
+   &gt;&gt;&gt; print options.display_full()
+ </pre>
+     <p>
        This will print out a complete list of the options, including a
!       description of the option, and its default value. You can also
!       look up a single section, if you know its name:
!     </p>
!     <pre>
!   &gt;&gt;&gt; print options.display_full(&quot;section_name&quot;)
  </pre>
!     <p>
!       Or just a single option:
!     </p>
!     <pre>
!   &gt;&gt;&gt; print options.display_full(&quot;section_name&quot;, &quot;option_name&quot;)
  </pre>
!     <p>
!       If you want a list of all the sections, you can use this command:
!     </p>
!     <pre>
!   &gt;&gt;&gt; print options.sections()
  </pre>
!     <p>
!       If you want a list of all the options, you can use this command:
!     </p>
!     <pre>
!   &gt;&gt;&gt; print options.options(prepend_section_name=False)
  </pre>
!     <h3>
!       <a name="configlocation">3j. I&#39;ve made a configuration file,
!       but Spambayes is ignoring it. Now what?</a>
!     </h3>
!     <p>
!       Spambayes looks for your configuration file in three places - if
!       it can&#39;t find it, then, obviously, your options will not be
!       loaded. The first place that Spambayes checks is the environment
!       variable BAYESCUSTOMIZE. You can set this to the path of your
!       configuration file, wherever it is, and it will be loaded. You
!       can also specify more than one file, separated by the appropriate
!       path separator for your platform. This is the recommended method
!       of specifying the location of the file, unless you do so via a
!       user interface (as provided by the POP3 proxy, the Outlook
!       plugin, and the IMAP filter). If Spambayes doesn&#39;t find
!       anything in the BAYESCUSTOMIZE variable, then it checks the
!       current working directory and your home directory for a
!       bayescustomize.ini or .spambayesrc file (respectively).
!     </p>
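!     <p>
!       The lookup order described above can be sketched roughly as
!       follows. This is an illustration of the description only, not
!       the code Spambayes actually runs; the candidate_config_files
!       function and its parameters are hypothetical.
!     </p>

```python
import os


def candidate_config_files(environ=None, cwd=None, home=None):
    """List configuration file paths in the order the text describes."""
    environ = os.environ if environ is None else environ
    from_env = environ.get("BAYESCUSTOMIZE")
    if from_env:
        # Several files may be given, split on the platform path separator.
        return [p for p in from_env.split(os.pathsep) if p]
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home
    return [
        os.path.join(cwd, "bayescustomize.ini"),
        os.path.join(home, ".spambayesrc"),
    ]


print(candidate_config_files())
```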
!     <h3>
!       <a name="shortwords">3k. Why don&#39;t short words or long words
!       show up in the clues?</a>
!     </h3>
!     <p>
!       Words less than 3 characters long are skipped, and words greater
!       than 12 characters long are converted into a special
!       &#39;long-word&#39; token. These numbers (3 and 12) were
!       determined by brute force testing, and produced the best overall
!       results (including compared to no upper or lower limits).
!     </p>
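!     <p>
!       A simplified sketch of that length rule is shown below. It is
!       not the real Spambayes tokenizer, which does considerably more;
!       the simple_tokens function and the exact spelling of the
!       placeholder token are assumptions for illustration.
!     </p>

```python
MIN_LEN = 3   # words shorter than this are skipped
MAX_LEN = 12  # words longer than this collapse into one token
LONG_WORD = "long-word"  # placeholder name for the special token


def simple_tokens(text):
    """Yield tokens from text, applying only the length rule."""
    for word in text.lower().split():
        if len(word) < MIN_LEN:
            continue
        if len(word) > MAX_LEN:
            yield LONG_WORD
        else:
            yield word


print(list(simple_tokens("Go to extraordinarily cheap deals")))
```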
!     <h3>
!       <a name="graybutton">Why is the enable filter button grayed out
!       in Outlook?</a>
!     </h3>
!     <p>
!       You need to have done these things to enable that button:
!     </p>
!     <ol>
!       <li>Trained at least 5 ham and 5 spam
!       </li>
!       <li>Set at least one folder to watch
!       </li>
!       <li>Set folders to move spam to, and to move unsures to
!       </li>
!       <li>Changed the action to &quot;copy&quot; or &quot;move&quot;,
!       rather than &quot;untouched&quot;
!       </li>
!     </ol>
!     <h3>
!       <a name="whatelse">3l. Is there anything else I should know?</a>
!     </h3>
!     <p>
!       While Spambayes does an excellent job of classifying incoming
!       mail, it is only as good as the data on which it was trained.
!       Here are some tips to help you create a good training set:
!     </p>
!     <ul>
!       <li>Don&#39;t use old mail. The characteristics of your email
!       change over time, sometimes subtly, sometimes dramatically, so
!       it&#39;s best to use very recent mail to train Spambayes. If
!       you&#39;ve abandoned an email address in the past because it was
!       getting spammed heavily, there are probably some clues in mail
!       sent to your old address which would bias Spambayes.
!       </li>
!       <li>Check and recheck your training collections. While you are
!       manually classifying mail as spam or ham, it&#39;s easy to make a
!       mistake and toss a message or ten in the wrong file. Such
!       miscategorized mail will throw off the classifier.
!       </li>
!     </ul>
!     <h2>
!       4. Development
!     </h2>
!     <h3>
!       <a name="tokentrick">4a. Hey! Why don&#39;t you implement cool
!       tokenizer trick X? I think it would really foil those
!       spammers!</a>
!     </h3>
!     <p>
!       Have you run your tokenizer trick against a set of messages to
!       see if it actually works? Many times what seems like a good idea
!       turns out not to help much, and sometimes even hurts. If you have
!       a good idea, you&#39;ve run it against a batch of messages and
!       can prove that it helps, paste the code for your technique and
!       the proof to the mailing list. If you&#39;re not a coder, but are
!       really keen on your idea, post a feature request on the project
!       page, and wait for someone else to code it for you (but make sure
!       you do some testing when it&#39;s done). Otherwise, you will
!       likely get a message from Tim Peters about why you need to test
!       your idea :) Note that as a general rule, we&#39;ve found that
!       with the tokenizer, &quot;stupid beats smart&quot; &#8212; that
!       is, very specialised tokenizer behaviour usually produces worse
!       results than a more general approach that just generates tokens
!       and throws them at the classifier.
!     </p>
!     <h3>
!       <a name="serverside">4b. This software is great! I want to
!       implement it for all my users. Are there plans to develop a
!       server-side spambayes solution?</a>
!     </h3>
!     <p>
!       The problem with a server-side solution is that everyone has a
!       different idea of what is spam - that&#39;s the whole strength of
!       the bayesian-style filtering concept. If you are certain that
!       <i>all</i> of your users would agree on what is spam and what is
!       not, then this might work for you, but otherwise you really have
!       to have individual databases for each user. Either way, you
!       should be able to modify spambayes easily enough to fit into your
!       setup. Please let the list know if you do have success in this
!       area, and we&#39;ll update this answer.
!     </p>
!     <h3>
!       <a name="ngrams">4c. Forget tokenising words - you should use
!       character n-grams!</a>
!     </h3>
!     <p>
!       This was quite carefully tested. Character 3-grams gave five
!       times as many false positives, and twice as many false negatives
!       as splitting on whitespace (words). Character 5-grams came fairly
!       close to words with false positives, but the number of false
!       negatives was worse than with 3-grams. n-grams also create many
!       more unique tokens, which means much slower operation. In
!       addition, it&#39;s much harder to figure out <i>why</i> a message
!       scored as it did with n-grams. On the other hand, words are easy
!       to understand. There was, however, one area where n-grams were
!       much better: detecting spam in Asian languages. Since a
!       &#39;word&#39; in an Asian language message ends up being an
!       entire line, words don&#39;t work very well at all.
!     </p>
!     <h3>
!       <a name="clues">4d. The clues for my mail are all in lower case,
!       but &quot;FREE&quot; is a much better clue than &quot;free&quot;.
!       Why do you force everything into lower case?</a>
!     </h3>
!     <p>
!       This was very carefully weighed up. On the positive side,
!       removing case does hide information (and we&#39;re not really
!       sure what it does to non-English languages), but on the negative
!       side, it makes the database a lot bigger, and requires more
!       training. In the end, testing with case removed resulted in no
!       change in the false positive rate, and a small reduction in the
!       false negative rate, so that&#39;s what we do. There is one
!       exception: we keep case in subject lines, because testing showed
!       an improvement if we did that.
!     </p>
!     <h3>
!       <a name="bounce">4e. Why don&#39;t you provide the ability to
!       bounce spam back to the sender?</a>
!     </h3>
!     <p>
!       Most spammers these days don&#39;t accept incoming email, or
!       (worse) forge the From and sender addresses, so it&#39;s
!       unlikely that bouncing would do any good, and it may well do
!       some innocent party much harm.
!     </p>
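
To make the words versus n-grams comparison in 4c concrete, here is a
minimal sketch of the two tokenizing strategies. The function names are
hypothetical and this is not the spambayes tokenizer; it only illustrates
why character n-grams tend to produce many more distinct tokens than
splitting on whitespace.

```python
def word_tokens(text):
    """Split on whitespace: the 'words' approach described in 4c."""
    return text.split()


def char_ngrams(text, n):
    """Slide a window of n characters over the text (character n-grams)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "buy cheap pills now"
words = word_tokens(sample)
trigrams = char_ngrams(sample, 3)
print(len(set(words)), "unique words;", len(set(trigrams)), "unique 3-grams")
```

Even on one short line the n-gram list is several times longer than the
word list, which hints at the database growth mentioned above.
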
From skip at pobox.com  Wed May 28 21:31:11 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 21:31:20 2003
Subject: [spambayes-dev] FAQ update
In-Reply-To: <3ED53AAD.7060608@parducci.net>
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
Message-ID: <16085.25311.732508.773180@montanaro.dyndns.org>


    bill> 1. readded "why don't you bounce back spam?"

I added this.

    bill> 2. remade the page w3c compliant (html 4.01)

Can you explain in general what you did?  I can't apply your patch as it
stands because it would completely undo what I did to create version 1.15.

    bill> 3. skipped TIDY (manually conformed to format)

Tidy's not normally a huge deal, but with all the nested lists it helps get
all the <ol>'s and <li>'s lined up with the corresponding </ol>'s and
</li>'s.  (And when I first programmed LISP on the CDC Cyber at Iowa I
thought all the parens would drive me nuts.  Parens are downright docile
compared with HTML tags.)

This exercise has convinced me this is a really bad way to maintain the FAQ.
We either need to maintain it in another form which can be converted to
something with a TOC and body as part of the ht2html/make process or switch
to another technology altogether (faq wizard, blog, wiki).  Any idea what,
if anything could be run on SF?

Inputs, give me inputs!  My kingdom for an input!

Skip

From noreply at sourceforge.net  Wed May 28 20:23:20 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed May 28 22:30:56 2003
Subject: [spambayes-dev] [ spambayes-Bugs-745292 ] Logs Show COM error
Message-ID: <E19LD4a-0002at-00@sc8-sf-web2.sourceforge.net>

Bugs item #745292, was opened at 2003-05-28 22:23
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745292&group_id=61702

Category: Outlook
Group: v1.0 (example)
Status: Open
Resolution: None
Priority: 5
Submitted By: Bryan Hunt (brhunt)
Assigned to: Mark Hammond (mhammond)
Summary: Logs Show COM error

Initial Comment:
I installed, configured and trained one day.  Everything 
worked great.  

Next day, it says that I no longer have any items in the 
database.  The "delete as spam" and "filter now" buttons 
no longer work.  The log files show that there are COM 
errors.

This looks similar to bug 689298.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745292&group_id=61702

From bill at parducci.net  Wed May 28 22:12:49 2003
From: bill at parducci.net (bill parducci)
Date: Thu May 29 00:19:46 2003
Subject: [spambayes-dev] FAQ update
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
	<16085.25311.732508.773180@montanaro.dyndns.org>
Message-ID: <3ED588C1.4020805@parducci.net>

Skip Montanaro wrote:
>     bill> 2. remade the page w3c compliant (html 4.01)
> 
> Can you explain in general what you did?  I can't apply your patch as it
> stands because it would completely undo what I did to create version 1.15.

mostly it was misplaced "</p>" (they cannot contain "<pre>" tags). the 
changes are not major, which is why i was surprised that diff responded 
thus. i think that if you open both versions in a browser you will only 
see the addition of the 'why not respond to spam' passage. i used the 
latest cvs version so ALL of your changes should be there.

>     bill> 3. skipped TIDY (manually conformed to format)
> 
> Tidy's not normally a huge deal, but with all the nested lists it helps get
> all the <ol>'s and <li>'s lined up with the corresponding </ol>'s and
> </li>'s.  (And when I first programmed LISP on the CDC Cyber at Iowa I
> thought all the parens would drive me nuts.  Parens are downright docile
> compared with HTML tags.)

either way works for me. personally, i like the indentation it provides.

> This exercise has convinced me this is a really bad way to maintain the FAQ.
> We either need to maintain it in another form which can be converted to
> something with a TOC and body as part of the ht2html/make process or switch
> to another technology altogether (faq wizard, blog, wiki).  Any idea what,
> if anything could be run on SF?

i for one am not a big wiki fan because it takes the formatting syntax 
to a whole new level of obscurity. :) a blog might work, but then you 
have to read a whole thread to see what is going on.

i noticed that sf.net uses php. perhaps we could whip up something that 
will take simple txt files and generate the necessary html. i do 
something like this for websites that i admin where users can update 
their content via e-mail.

b


From anthony at interlink.com.au  Thu May 29 15:25:13 2003
From: anthony at interlink.com.au (Anthony Baxter)
Date: Thu May 29 00:26:09 2003
Subject: [spambayes-dev] FAQ update 
In-Reply-To: <3ED588C1.4020805@parducci.net> 
Message-ID: <200305290425.h4T4PEf12646@localhost.localdomain>


>>> bill parducci wrote
> i noticed that sf.net uses php. perhaps we could whip up something that 
> will take simple txt files and generate the necessary html. i do 
> something like this for websites that i admin where users can update 
> their content via e-mail.

The other approach would be to make a directory full of text or 
html files, one per question, and "assemble" the FAQ with the 
Makefile (use a simple python script). I can do this if people 
think it's the right thing to do.
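
A minimal sketch of what such an assembly script might look like. The
file layout (one .txt file per question, question on the first line,
answer on the rest) and the name assemble_faq are assumptions made for
illustration, not an agreed format.

```python
import os


def assemble_faq(directory):
    """Concatenate one-question-per-file text into a single FAQ string.

    Assumed (hypothetical) layout: each *.txt file holds the question on
    its first line and the answer on the remaining lines. Files are taken
    in sorted name order so the numbering stays stable.
    """
    toc, body = [], []
    names = sorted(n for n in os.listdir(directory) if n.endswith(".txt"))
    for number, name in enumerate(names, 1):
        with open(os.path.join(directory, name)) as f:
            question, _, answer = f.read().partition("\n")
        question = question.strip()
        # One table-of-contents line and one body entry per file.
        toc.append("%d. %s" % (number, question))
        body.append("%d. %s\n\n%s\n" % (number, question, answer.strip()))
    return "\n".join(toc) + "\n\n" + "\n".join(body)
```

The Makefile would then only need to run the script and redirect its
output; adding a question means adding a file.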

Anthony

From skip at pobox.com  Thu May 29 09:20:34 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu May 29 09:20:41 2003
Subject: [spambayes-dev] FAQ update
In-Reply-To: <3ED588C1.4020805@parducci.net>
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
	<16085.25311.732508.773180@montanaro.dyndns.org>
	<3ED588C1.4020805@parducci.net>
Message-ID: <16086.2338.196846.547051@montanaro.dyndns.org>


    bill> i noticed that sf.net uses php. perhaps we could whip up something
    bill> that will take simple txt files and generate the necessary html. 

The most likely candidate in the Python world would be reStructured Text,
the package used to do PEPs and the python-dev summary, among other things.
It looks as if it can be used to generate HTML FAQs:

    http://docutils.sourceforge.net/FAQ.html

I will investigate.

Skip

From bill at parducci.net  Thu May 29 07:24:22 2003
From: bill at parducci.net (bill parducci)
Date: Thu May 29 09:31:20 2003
Subject: [spambayes-dev] FAQ update
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
	<16085.25311.732508.773180@montanaro.dyndns.org>
	<3ED588C1.4020805@parducci.net>
	<16086.2338.196846.547051@montanaro.dyndns.org>
Message-ID: <3ED60A06.3010302@parducci.net>

html writer module looks interesting as well: html 4.01 compliant output 
and stylesheet aware.

b

Skip Montanaro wrote:
>     bill> i noticed that sf.net uses php. perhaps we could whip up something
>     bill> that will take simple txt files and generate the necessary html. 
> 
> The most likely candidate in the Python world would be reStructured Text,
> the package used to do PEPs and the python-dev summary, among other things.
> It looks as if it can be used to generate HTML FAQs:
> 
>     http://docutils.sourceforge.net/FAQ.html
> 
> I will investigate.
> 
> Skip


From skip at pobox.com  Thu May 29 10:11:17 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu May 29 10:11:21 2003
Subject: [spambayes-dev] FAQ update
In-Reply-To: <16086.2338.196846.547051@montanaro.dyndns.org>
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
	<16085.25311.732508.773180@montanaro.dyndns.org>
	<3ED588C1.4020805@parducci.net>
	<16086.2338.196846.547051@montanaro.dyndns.org>
Message-ID: <16086.5381.225847.348619@montanaro.dyndns.org>

    Skip> It looks as if it can be used to generate HTML FAQs:

    Skip>     http://docutils.sourceforge.net/FAQ.html

    Skip> I will investigate.

Okay, please check out

    http://spambayes.sf.net/faq-rest.txt
    http://spambayes.sf.net/faq-rest.html

If people like it well enough, I will replace the current stuff with it.  To
generate the HTML from the reST file you'll have to have the Docutils code
installed.  It's the usual "python setup.py install" thing, though for some
reason the actual front-end program isn't installed in /usr/local/bin.
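
For anyone who would rather not hunt for the front-end script, the same
conversion can be driven from Python. A minimal sketch, assuming a
reasonably current Docutils (the API of the 2003 release may differ);
rest_to_html and the sample text are made up for illustration and are
not part of the spambayes site build.

```python
def rest_to_html(rest_source):
    """Render reStructuredText to an HTML string with Docutils.

    Docutils is imported inside the function so this module can still be
    loaded on a machine where Docutils is not installed.
    """
    from docutils.core import publish_string

    return publish_string(
        source=rest_source,
        writer_name="html",
        # Ask for a text string back instead of encoded bytes.
        settings_overrides={"output_encoding": "unicode"},
    )


# Hypothetical two-question FAQ; the contents directive asks Docutils
# to build the table of contents from the section titles.
FAQ_SOURCE = """\
.. contents::

First question
==============

First answer.

Second question
===============

Second answer.
"""

if __name__ == "__main__":
    print(rest_to_html(FAQ_SOURCE))
```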

Skip

From skip at pobox.com  Thu May 29 10:17:24 2003
From: skip at pobox.com (Skip Montanaro)
Date: Thu May 29 10:17:52 2003
Subject: [spambayes-dev] FAQ update 
In-Reply-To: <200305290425.h4T4PEf12646@localhost.localdomain>
References: <3ED588C1.4020805@parducci.net>
	<200305290425.h4T4PEf12646@localhost.localdomain>
Message-ID: <16086.5748.383124.408066@montanaro.dyndns.org>


    >> i noticed that sf.net uses php. perhaps we could whip up something
    >> that will take simple txt files and generate the necessary html. 

    Anthony> The other approach would be to make a directory full of text or
    Anthony> html files, one per question, and "assemble" the FAQ with the
    Anthony> Makefile (use a simple python script). 

Let's try out preexisting solutions first.  reST can be used to generate
FAQs though it doesn't yet have specific support for them.  I converted the
current faq.ht to reST format and uploaded both it and the HTML generated
from it to spambayes.sf.net:

    http://spambayes.sf.net/faq-rest.txt
    http://spambayes.sf.net/faq-rest.html

If we're going to develop something I think it would be worthwhile to get in
touch with the Docutils folks and discuss a FAQ generator using their code
base.

Skip

From noreply at sourceforge.net  Thu May 29 08:07:34 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu May 29 10:18:01 2003
Subject: [spambayes-dev] [ spambayes-Bugs-745518 ] Dragging multiple files
	doens't update stats
Message-ID: <E19LO46-0000zs-00@sc8-sf-web4.sourceforge.net>

Bugs item #745518, was opened at 2003-05-30 00:07
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Dragging multiple files doens't update stats

Initial Comment:
>From the mailing list - unverified

Using the latest binary for Windows 1.02a. If I select
multiple messages in
the possible spam folder and drag to the spam folder,
the statistics
displayed next to the "Train Now" button are not
updated. I am unsure if the
database is actually updated. Doing each message
separately does update the
statistics.


----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702

From noreply at sourceforge.net  Thu May 29 08:10:05 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu May 29 10:18:02 2003
Subject: [spambayes-dev] [ spambayes-Bugs-745518 ] Dragging multiple files
	doens't update stats
Message-ID: <E19LO6X-0003CO-00@sc8-sf-web3.sourceforge.net>

Bugs item #745518, was opened at 2003-05-30 00:07
Message generated for change (Comment added) made by mhammond
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702

Category: Outlook
Group: None
>Status: Pending
Resolution: None
Priority: 5
Submitted By: Mark Hammond (mhammond)
Assigned to: Mark Hammond (mhammond)
Summary: Dragging multiple files doens't update stats

Initial Comment:
>From the mailing list - unverified

Using the latest binary for Windows 1.02a. If I select
multiple messages in
the possible spam folder and drag to the spam folder,
the statistics
displayed next to the "Train Now" button are not
updated. I am unsure if the
database is actually updated. Doing each message
separately does update the
statistics.


----------------------------------------------------------------------

>Comment By: Mark Hammond (mhammond)
Date: 2003-05-30 00:10

Message:
Logged In: YES 
user_id=14198

Works for me in CVS.  Just did 2 "maybe" via dragging, and
the dialog shows 2 additional spam.

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702

From bill at parducci.net  Thu May 29 08:43:50 2003
From: bill at parducci.net (bill parducci)
Date: Thu May 29 10:43:54 2003
Subject: [spambayes-dev] FAQ update
References: <3ED4E3E4.20508@parducci.net>
	<16085.10545.151192.514702@montanaro.dyndns.org>
	<3ED53AAD.7060608@parducci.net>
	<16085.25311.732508.773180@montanaro.dyndns.org>
	<3ED588C1.4020805@parducci.net>
	<16086.2338.196846.547051@montanaro.dyndns.org>
	<16086.5381.225847.348619@montanaro.dyndns.org>
Message-ID: <3ED61CA6.9090101@parducci.net>

Skip Montanaro wrote:
> Okay, please check out
> 
>     http://spambayes.sf.net/faq-rest.txt
>     http://spambayes.sf.net/faq-rest.html

very nice. even works with links & lynx. :)

b


From noreply at sourceforge.net  Thu May 29 10:28:24 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu May 29 12:31:23 2003
Subject: [spambayes-dev] [ spambayes-Bugs-744380 ] W982E/Outlook 2000:
	exception on loading
Message-ID: <E19LQGO-0008A0-00@sc8-sf-web4.sourceforge.net>

Bugs item #744380, was opened at 2003-05-27 09:51
Message generated for change (Comment added) made by jobbins
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Steve Clift (sclift)
Assigned to: Mark Hammond (mhammond)
Summary: W982E/Outlook 2000: exception on loading

Initial Comment:
Windows 98 2nd Edition
Outlook 2000 SR-1 - Corporate or Workgroup

SpamBayes throws an exception when loading. From the log file:

SpamAddin - Connecting to Outlook
pythoncom error: Failed to call the universal dispatcher
Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in 
dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, 
in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, 
in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, 
in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in 
OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in 
GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 141, in 
__init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 182, in 
LocateDataDirectory
  File "E:\src\python-cvs\lib\ntpath.py", line 269, in isdir
exceptions.LookupError: no codec search functions registered: 
can't find encoding


----------------------------------------------------------------------

Comment By: Larry Jobbins (jobbins)
Date: 2003-05-29 09:28

Message:
Logged In: YES 
user_id=788287

Looks similar to 725449 and 740893.

----------------------------------------------------------------------

Comment By: Larry Jobbins (jobbins)
Date: 2003-05-27 21:26

Message:
Logged In: YES 
user_id=788287

Same error.  Installed Setup-002.exe from 
http://starship.python.net/crew/mhammond/spambayes/ 
Using Win98SE, Outlook 2000, all MS updates.  Shows add-in, 
but won't stay checked, no icon appears.

Install log looks same - pythoncom error: Failed to call the 
universal dispatcher, etc

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702

From skip at pobox.com  Fri May 30 12:34:54 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 12:35:00 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq
Message-ID: <16087.34862.433359.570136@montanaro.dyndns.org>


(spambayes-dev now seems to have enough people on it to form a quorum of
sorts (26), so I'm excluding spambayes...)

I posted URLs yesterday for a trial version of the Spambayes FAQ built using
Docutils tools:

    http://spambayes.sourceforge.net/faq-rest.txt
    http://spambayes.sourceforge.net/faq-rest.html

So far, the only person who responded was Bill Parducci.  It's not
surprising that he responded, because he's the other person who seems to be
in FAQ maintenance hell at the moment.  Still, I thought one or two other
people would have responded.

The Docutils version has the advantage that it will automagically generate
the table of contents.  Its main drawback is that there is no explicit faq
generator script in Docutils, so the format of the questions is a bit
constrained. 

Before I make an executive decision and simply adopt this new stuff (I
believe it will be much easier to maintain than the status quo), I would
like some feedback:

    * Does anyone have a problem that it doesn't follow the ht2html format
      for the spambayes site as a whole?  I can probably worm around this by
      extracting the body of the faq to faq.ht then let the usual ht->html
      dependency work its magic.

    * Should anyone wanting to update the FAQ be required to install a
      recent version of Docutils or should I somehow worm around that?
      (This is especially disconcerting because when you install Docutils it
      doesn't install the html.py front-end script used to generate the
      faq.html file, so I can't really assume it will be in PATH.)

If either of these is a big deal, please let me know.

Skip

From tim at fourstonesexpressions.com  Fri May 30 12:53:51 2003
From: tim at fourstonesexpressions.com (Tim Stone)
Date: Fri May 30 12:55:29 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq
In-Reply-To: <16087.34862.433359.570136@montanaro.dyndns.org>
References: <16087.34862.433359.570136@montanaro.dyndns.org>
Message-ID: <oprpzvv1ig34x2gn@mail.fourstonesexpressions.com>

On Fri, 30 May 2003 11:34:54 -0500, Skip Montanaro <skip@pobox.com> wrote:

>
> (spambayes-dev now seems to have enough people on it to form a quorum of
> sorts (26), so I'm excluding spambayes...)
>
> I posted URLs yesterday for a trial version of the Spambayes FAQ built 
> using
> Docutils tools:
>
> http://spambayes.sourceforge.net/faq-rest.txt
> http://spambayes.sourceforge.net/faq-rest.html
>
> So far, the only person who responded was Bill Parducci.  It's not
> surprising that he responded, because he's the other person who seems to 
> be
> in FAQ maintenance hell at the moment.  Still, I thought one or two other
> people would have responded.

I didn't get my -dev subscription to work till yesterday... not sure what 
happened... anyway, the faq looks very good.  I see that the infoworld 
article brought quite a number of users, and subsequent questions.  We'll 
need to mine the user list for faq regularly.
>
> The Docutils version has the advantage that it will automagically 
> generate
> the table of contents.  Its main drawback is that there is no explicit 
> faq
> generator script in Docutils, so the format of the questions is a bit
> constrained.
>
> Before I make an executive decision and simply adopt this new stuff (I
> believe it will be much easier to maintain than the status quo), I would
> like some feedback:
>
> * Does anyone have a problem that it doesn't follow the ht2html format
> for the spambayes site as a whole?

I don't have a problem with it.

> I can probably worm around this by
> extracting the body of the faq to faq.ht then let the usual ht->html
> dependency work its magic.

This would be more confusing, I would think.

>
> * Should anyone wanting to update the FAQ be required to install a
> recent version of Docutils or should I somehow worm around that?

Doesn't matter, as long as we can find out somewhere exactly what we gotta 
have and do to update faq.  OOOORRR, we could assume that you and Bill will 
be more than happy to incorporate any q&a we send ya <wink>


c'est moi - TimS

From popiel at wolfskeep.com  Fri May 30 11:04:16 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri May 30 13:04:21 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Fri,
	30 May 2003 11:34:54 CDT."
	<16087.34862.433359.570136@montanaro.dyndns.org> 
References: <16087.34862.433359.570136@montanaro.dyndns.org> 
Message-ID: <20030530170416.CC7B52DE9C@cashew.wolfskeep.com>

In message:  <16087.34862.433359.570136@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>(spambayes-dev now seems to have enough people on it to form a quorum of
>sorts (26), so I'm excluding spambayes...)

Reasonable.  Heck, I'm close to dropping off the original spambayes
list, for lack of interest in hearing about people fighting with
Outlook. <.5 wink>

>I posted URLs yesterday for a trial version of the Spambayes FAQ built using
>Docutils tools:
>
>    http://spambayes.sourceforge.net/faq-rest.txt
>    http://spambayes.sourceforge.net/faq-rest.html

The <em> markup in the words vs. n-grams question doesn't seem
to work.  Similarly the &mdash; in the cool tokenizer trick
question.  This is not saying that docutils is bad, just that
there'll be some trivial cleanup.  Overall, the docutils format
seems reasonably coherent.

Back with the editor hat on, the positive and negative seem
to be reversed (or at least confusing) in the case-folding
question.


>So far, the only person who responded was Bill Parducci.  It's not
>surprising that he responded, because he's the other person who
>seems to be in FAQ maintenance hell at the moment.  Still, I thought
>one or two other people would have responded.

Sorry; I haven't had more than about 2 minutes to rub together.

>    * Does anyone have a problem that it doesn't follow the ht2html format
>      for the spambayes site as a whole?  I can probably worm around this by
>      extracting the body of the faq to faq.ht then let the usual ht->html
>      dependency work its magic.

Doesn't bug me; I'd say don't add the extra step.

>    * Should anyone wanting to update the FAQ be required to install a
>      recent version of Docutils or should I somehow worm around that?
>      (This is especially disconcerting because when you install Docutils it
>      doesn't install the html.py front-end script used to generate the
>      faq.html file, so I can't really assume it will be in PATH.)

Making FAQ devs install a recent version seems reasonable...
but put that requirement in the FAQ under 'Why can't I get
Docutils to work on my locally-edited version of this FAQ?'. ;-)

- Alex

From skip at pobox.com  Fri May 30 13:18:49 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 13:19:05 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq 
In-Reply-To: <20030530170416.CC7B52DE9C@cashew.wolfskeep.com>
References: <16087.34862.433359.570136@montanaro.dyndns.org>
	<20030530170416.CC7B52DE9C@cashew.wolfskeep.com>
Message-ID: <16087.37497.497330.922379@montanaro.dyndns.org>


    >> http://spambayes.sourceforge.net/faq-rest.txt
    >> http://spambayes.sourceforge.net/faq-rest.html

    Alex> The <em> markup in the words vs. n-grams question doesn't seem to
    Alex> work.  Similarly the &mdash; in the cool tokenizer trick question.

I believe those were holdover bits of markup from faq.ht, not something
Docutils did.  I'll fix them.  Thanks.

    Alex> Back with the editor hat on, the positive and negative seem to be
    Alex> reversed (or at least confusing) in the case-folding question.

I'll take a look.

    >> * Does anyone have a problem that it doesn't follow the ht2html
    >>   format

    Alex> Doesn't bug me; I'd say don't add the extra step.

That's cool.  The less work the better.

    Alex> Making FAQ devs install a recent version seems reasonable...  but
    Alex> put that requirement in the FAQ under 'Why can't I get Docutils to
    Alex> work on my locally-edited version of this FAQ?'. ;-)

I think that can be arranged.

Thanks for the feedback.

Skip

From tim at fourstonesexpressions.com  Fri May 30 13:31:28 2003
From: tim at fourstonesexpressions.com (Tim Stone)
Date: Fri May 30 13:31:53 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value
Message-ID: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>

I'm working on this problem that crops up in pop3proxy and hammie where 
malformed headers cause the parser to raise an uncaught exception, 
rendering spambayes more helpless than a baby seal.  It's no problem to 
catch the exception, but what to do with the message is the issue.  I have 
two suggested approaches:

1. Assume that the mail is spam.

2. Add a new possible value to the classification header, something like 
'unclassifiable'.

Other alternatives, votes, or general booing and hissing?
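
To make the two options concrete, here is a rough sketch of a wrapper
that catches the parser exception and returns a fallback value. The
names classify, parse, score and on_error are hypothetical stand-ins,
not the pop3proxy or hammie code, and the cutoffs are only illustrative.

```python
def classify(raw_message, parse, score,
             ham_cutoff=0.2, spam_cutoff=0.9, on_error="unclassifiable"):
    """Return 'ham', 'unsure' or 'spam', or on_error if parsing fails.

    parse and score are supplied by the caller (stand-ins for the real
    parser and classifier). on_error="spam" corresponds to option 1,
    on_error="unclassifiable" to option 2.
    """
    try:
        message = parse(raw_message)
    except Exception:
        # The parser choked on the message, so no score is available.
        return on_error
    probability = score(message)
    if probability >= spam_cutoff:
        return "spam"
    if probability <= ham_cutoff:
        return "ham"
    return "unsure"
```

Keeping the fallback as a parameter means the choice between the
options stays a one-line configuration decision.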

-- 


c'est moi - TimS

From popiel at wolfskeep.com  Fri May 30 11:35:06 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri May 30 13:35:09 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value 
In-Reply-To: Message from Tim Stone <tim@fourstonesexpressions.com> of "Fri,
	30 May 2003 12:31:28 CDT."
	<oprpzxmqwi34x2gn@mail.fourstonesexpressions.com> 
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com> 
Message-ID: <20030530173506.0F6992DE9C@cashew.wolfskeep.com>

In message:  <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
             Tim Stone <tim@fourstonesexpressions.com> writes:
>I'm working on this problem that crops up in pop3proxy and hammie where 
>malformed headers cause the parser to raise an uncaught exception, 
>rendering spambayes more helpless than a baby seal.  It's no problem to 
>catch the exception, but what to do with the message is the issue.  I have 
>two suggested approaches:
>
>1. Assume that the mail is spam.
>
>2. Add a new possible value to the classification header, something like 
>'unclassifiable'.
>
>Other alternatives, votes, or general booing and hissing?

3. Mark it as unsure.

- Alex

From bill at parducci.net  Fri May 30 15:00:34 2003
From: bill at parducci.net (bill parducci)
Date: Fri May 30 17:07:39 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq
References: <16087.34862.433359.570136@montanaro.dyndns.org>
	<oprpzvv1ig34x2gn@mail.fourstonesexpressions.com>
Message-ID: <3ED7C672.7080707@parducci.net>

Tim Stone wrote:
> Doesn't matter, as long as we can find out somewhere exactly what we 
> gotta have and do to update faq.  OOOORRR, we could assume that you and 
> Bill will be more than happy to incorporate any q&a we send ya <wink>

fine by me (i will just need to get whatever ultimate contraption is 
agreed upon installed locally :o)

b


From bill at parducci.net  Fri May 30 15:13:56 2003
From: bill at parducci.net (bill parducci)
Date: Fri May 30 17:20:56 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
Message-ID: <3ED7C994.3090709@parducci.net>

Tim Stone wrote:
> I'm working on this problem that crops up in pop3proxy and hammie where 
> malformed headers cause the parser to raise an uncaught exception, 
> rendering spambayes more helpless than a baby seal.  It's no problem to 
> catch the exception, but what to do with the message is the issue.  I 
> have two suggested approaches:
> 
> 1. Assume that the mail is spam.
> 
> 2. Add a new possible value to the classification header, something like 
> 'unclassifiable'.
> 
> Other alternatives, votes, or general booing and hissing?

would it be possible to catch the exception and move to the next line in 
the header (and/or payload) for parsing?

if so, then you could at least make a highly informed guess as to the 
nature of the message by creating a 'malform token' that can be weighed 
like anything else (exchanging the contents of the malformed header 
entry with a single, 'reserved' token).

in other words, let the stats decide if malformedness is a bad thing ;-)
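
a rough sketch of the idea, assuming we could get at the raw, already
unfolded header lines (which may not be possible from outside the
parser). header_tokens and the reserved token name are made up for
illustration and are not the spambayes tokenizer.

```python
MALFORMED_TOKEN = "header:malformed"  # hypothetical reserved token


def header_tokens(header_lines):
    """Yield 'name:word' tokens; emit one reserved token per bad line."""
    for line in header_lines:
        name, sep, value = line.partition(":")
        name = name.strip()
        if not sep or not name or " " in name:
            # No usable 'Name: value' shape: emit the reserved token
            # instead of raising, so training decides what it means.
            yield MALFORMED_TOKEN
            continue
        for word in value.split():
            yield "%s:%s" % (name.lower(), word)
```

the reserved token then gets a spam probability from training like any
other token.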

b


From popiel at wolfskeep.com  Fri May 30 15:38:18 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri May 30 17:38:22 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value 
In-Reply-To: Message from bill parducci <bill@parducci.net> 
	of "Fri, 30 May 2003 14:13:56 PDT." <3ED7C994.3090709@parducci.net> 
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
	<3ED7C994.3090709@parducci.net> 
Message-ID: <20030530213818.438042DE9C@cashew.wolfskeep.com>

In message:  <3ED7C994.3090709@parducci.net>
             bill parducci <bill@parducci.net> writes:
>Tim Stone wrote:
>> I'm working on this problem that crops up in pop3proxy and hammie where 
>> malformed headers cause the parser to raise an uncaught exception, 
>> rendering spambayes more helpless than a baby seal.  It's no problem to 
>> catch the exception, but what to do with the message is the issue.  I 
>> have two suggested approaches:
>> 
>> 1. Assume that the mail is spam.
>> 
>> 2. Add a new possible value to the classification header, something like 
>> 'unclassifiable'.
>> 
>> Other alternatives, votes, or general booing and hissing?
>
>would it be possible to catch the exception and move to the next line in 
>the header (and/or payload) for parsing?

Alas, we're at the wrong level to do that sort of thing.  To do that
level of granularity properly, we'd have to be in the guts of the
parser... and while I'm sure that Barry would love for us to come up
with a way to recover inside the parser, I think it's a bit out of
scope for something external to the parser.

- Alex

From noreply at sourceforge.net  Fri May 30 15:44:02 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri May 30 19:22:45 2003
Subject: [spambayes-dev] [ spambayes-Bugs-740843 ] No Disk Error with
	Outlook 2000 on startup
Message-ID: <E19LrfO-00062j-00@sc8-sf-web1.sourceforge.net>

Bugs item #740843, was opened at 2003-05-20 18:39
Message generated for change (Comment added) made by portola
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=740843&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Sam Snow (snowsam)
Assigned to: Mark Hammond (mhammond)
Summary: No Disk Error with Outlook 2000 on startup

Initial Comment:
After installing SpamBayes-Outlook-Setup-002.exe I am
now getting an error dialog on Outlook startup. The box
says:

(Header) Inbox - Microsoft Outlook:OUTLOOK.EXE - No Disk
(Body) There is no disk in the drive. Please insert a
disk into drive \Device\Harddisk0\DR0.
(Buttons) Cancel, Try Again, Continue

I am able to click cancel or continue several times and
then outlook goes ahead and opens up. I just installed
this evening, so I am not sure if the filtering is
still working correctly. I was able to train the
program successfully.

I am using Office 2000 SP3 on Win 2000. I will try to
attach a jpg of the dialog box. 

My error log says the following:

SpamAddin - Connecting to Outlook
Loaded bayes database from 'C:\Documents and
Settings\Snow1\Application
Data\SpamBayes\default_bayes_database.db'
Loaded message database from 'C:\Documents and
Settings\Snow1\Application
Data\SpamBayes\default_message_database.db'
Bayes database initialized with 0 spam and 0 good messages
Loaded databases in 4.64165ms
AntiSpam: Watching for new messages in folder Inbox
AntiSpam: Watching for new messages in folder Spam
Processing 0 missed spam in folder 'Inbox' took 31.9599ms
pythoncom error: Python error invoking COM method.
Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\server\policy.py",
line 275, in _Invoke_
  File "E:\src\pythonex\com\win32com\server\policy.py",
line 280, in _invoke_
  File "E:\src\pythonex\com\win32com\server\policy.py",
line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py",
line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line
203, in OnItemAdd
  File "E:\src\spambayes\Outlook2000\addin.py", line
163, in ProcessMessage
  File "E:\src\spambayes\Outlook2000\filter.py", line
15, in filter_message
  File "E:\src\spambayes\Outlook2000\manager.py", line
440, in score
  File "e:\src\spambayes\spambayes\classifier.py", line
217, in chi2_spamprob
  File "e:\src\spambayes\spambayes\classifier.py", line
465, in _getclues
  File "e:\src\spambayes\spambayes\classifier.py", line
316, in probability
exceptions.AssertionError: 


----------------------------------------------------------------------

Comment By: Dennis Austin (portola)
Date: 2003-05-30 14:44

Message:
Logged In: YES 
user_id=787905

I have noted an additional piece of information.  There is no 
alert the first time I start Outlook after logging in.  If Outlook is 
closed and reopened, the alert appears and requires three 
Cancel clicks before it goes away.  (Unless I put a disk in the 
CD drive.)

I'm using Outlook 2002 sp2 on Windows XP sp1.  I have two 
CD drives on the secondary IDE channel and the error 
appears on the second drive.

----------------------------------------------------------------------

Comment By: Dennis Austin (portola)
Date: 2003-05-27 10:16

Message:
Logged In: YES 
user_id=787905

I also usually see this error when I start Outlook, although 
not every time.  I also see it at the end of running the installer.

In my configuration it shows up as "No disk in drive E:".  E: 
is CD-ROM 1 on this machine.  I can get past the error either 
by clicking Cancel several times, or by putting any old CD in 
the drive and clicking Try Again.

The error does not seem to affect any function of the add-on.

----------------------------------------------------------------------

Comment By: Ferruccio Barletta (fgb)
Date: 2003-05-25 07:40

Message:
Logged In: YES 
user_id=786210

I may have found the root cause of this problem. When I brought 
up disk management on my notebook I noticed that my hard 
drive was Disk1 and the SD media drive was Disk0. When I 
disabled the SD drive and rebooted, the hard drive became 
Disk0 and the problem disappeared.


----------------------------------------------------------------------

Comment By: Ferruccio Barletta (fgb)
Date: 2003-05-24 18:30

Message:
Logged In: YES 
user_id=786210

I get the same error with Office 2002 SP1 on Windows XP SP1


----------------------------------------------------------------------


From skip at pobox.com  Fri May 30 19:49:51 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 19:49:56 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value 
In-Reply-To: <20030530213818.438042DE9C@cashew.wolfskeep.com>
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
	<3ED7C994.3090709@parducci.net>
	<20030530213818.438042DE9C@cashew.wolfskeep.com>
Message-ID: <16087.60959.324302.344655@montanaro.dyndns.org>


    >> would it be possible to catch the exception and move to the next line
    >> in the header (and/or payload) for parsing?

    Alex> Alas, we're at the wrong level to do that sort of thing.  To do
    Alex> that level of granularity properly, we'd have to be in the guts of
    Alex> the parser... 

This is just a shot in the dark, but would it be possible to modify the
email parser sufficiently so that it gave more detail about where it was
when the error condition was detected (e.g., what body line number, header
and/or MIME part)?  That might allow Spambayes to tweak the message in the
right spot and retry the parse.

Skip

From noreply at sourceforge.net  Fri May 30 18:36:09 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri May 30 20:47:00 2003
Subject: [spambayes-dev] [ spambayes-Bugs-706520 ] assert fails in classifier
Message-ID: <E19LuLx-0000cR-00@sc8-sf-web4.sourceforge.net>

Bugs item #706520, was opened at 2003-03-19 12:46
Message generated for change (Comment added) made by leobru
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Adam Glass (adamglass)
Assigned to: Nobody/Anonymous (nobody)
Summary: assert fails in classifier

Initial Comment:
This morning, I noticed that my emails no longer had a
X-Spambayes-Classification header, so I looked through
my procmail logs, and sure enough, hammiefilter.py is
giving a traceback when an assertion fails.  This
happens on all messages now; it is not specific to a
single message, or intermittent.  Therefore, I suspect
my .hammiedb is corrupted... I can supply it to anyone
who would like to investigate it for debugging purposes.

I am using Spambayes 1.0a2, installed on a system with
Python 2.2.1, with the new version of the email library
(as per the install docs.)

Please contact me if you require any further details.

Example of how to generate the error follows, along
with traceback:

adam$ /usr/local/bin/hammiefilter.py -f -d
$HOME/.hammiedb < example
Traceback (most recent call last):
  File "/usr/local/bin/hammiefilter.py", line 179, in ?
    main()
  File "/usr/local/bin/hammiefilter.py", line 175, in main
    action(msg)
  File "/usr/local/bin/hammiefilter.py", line 113, in
filter
    return h.filter(msg)
  File
"/usr/local/lib/python2.2/site-packages/spambayes/hammie.py",
line 108, in filter
    prob, clues = self._scoremsg(msg, True)
  File
"/usr/local/lib/python2.2/site-packages/spambayes/hammie.py",
line 38, in _scoremsg
    return self.bayes.spamprob(tokenize(msg), evidence)
  File
"/usr/local/lib/python2.2/site-packages/spambayes/classifier.py",
line 217, in chi2_spamprob
    clues = self._getclues(wordstream)
  File
"/usr/local/lib/python2.2/site-packages/spambayes/classifier.py",
line 441, in _getclues
    prob = self.probability(record)
  File
"/usr/local/lib/python2.2/site-packages/spambayes/classifier.py",
line 304, in probability
    assert spamcount <= nspam
AssertionError


----------------------------------------------------------------------

Comment By: Leonid (leobru)
Date: 2003-05-30 17:36

Message:
Logged In: YES 
user_id=790676

This happens, e.g., if a forced re-training was performed on
a non-empty database, thus screwing up the message counts -
this is for sure, I was bitten by it myself;

or, potentially, if hammiefilter.py -t and mboxtrain.py were
running at the same time ???

To avoid: do not do it (I do not use hammiefilter.py -t to
be on the safe side).

To fix, once it happens: start from scratch.

Good to have in the next version: a database validator and
corrector. 
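
A validator could start from the same invariant the failing assert
checks (spamcount <= nspam).  A minimal sketch, assuming a
hypothetical in-memory layout where `wordinfo` maps each token to a
(spamcount, hamcount) pair; the real database layout may differ, and
how to correct a bad record is left open:

```python
def find_bad_records(nspam, nham, wordinfo):
    """Return (word, spamcount, hamcount) for every inconsistent record.

    A record is inconsistent if either count is negative or exceeds
    the number of messages trained for that class.
    """
    bad = []
    for word, (spamcount, hamcount) in sorted(wordinfo.items()):
        if not (0 <= spamcount <= nspam and 0 <= hamcount <= nham):
            bad.append((word, spamcount, hamcount))
    return bad
```

An empty result means no record would trip the assertion; anything
else lists the tokens whose counts no longer match the totals.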


----------------------------------------------------------------------


From vanhorn at whidbey.com  Fri May 30 18:56:33 2003
From: vanhorn at whidbey.com (G. Armour Van Horn)
Date: Fri May 30 20:56:37 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq
References: <16087.34862.433359.570136@montanaro.dyndns.org>
Message-ID: <3ED7FDC1.8AA3E871@whidbey.com>

Well, I didn't jump on it yesterday because I already thought the FAQ was pretty
impressive. But now that you're laying on the guilt trip I went to check out the
changes and got bonged by the HTML version:

Not Found
The requested URL /default.css was not found on this server.
Apache/1.3.26 Server at spambayes.sourceforge.net Port 80

Van


Skip Montanaro wrote:

> (spambayes-dev now seems to have enough people on it to form a quorum of
> sorts (26), so I'm excluding spambayes...)
>
> I posted URLs yesterday for a trial version of the Spambayes FAQ built using
> Docutils tools:
>
>     http://spambayes.sourceforge.net/faq-rest.txt
>     http://spambayes.sourceforge.net/faq-rest.html
>
>

--
----------------------------------------------------------
Sign up now for Quotes of the Day, a handful of quotations
on a theme delivered every morning.
Enlightenment! Daily, for free!
mailto:twisted@whidbey.com?subject=Subscribe_QOTD

For web hosting and maintenance,
visit Van's home page: http://www.domainvanhorn.com/van/
----------------------------------------------------------

From skip at pobox.com  Fri May 30 21:18:48 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 21:18:48 2003
Subject: [spambayes-dev] Next to no feedback on the trial faq
In-Reply-To: <3ED7FDC1.8AA3E871@whidbey.com>
References: <16087.34862.433359.570136@montanaro.dyndns.org>
	<3ED7FDC1.8AA3E871@whidbey.com>
Message-ID: <16088.760.32882.625210@montanaro.dyndns.org>


    Van> Well, I didn't jump on it yesterday because I already thought the
    Van> FAQ was pretty impressive. But now that you're laying on the guilt
    Van> trip I went to check out the changes and got bonged by the HTML
    Van> version:

    Van> Not Found
    Van> The requested URL /default.css was not found on this server.
    Van> Apache/1.3.26 Server at spambayes.sourceforge.net Port 80

Thanks, I forgot that Docutils expects its own css file.  I use Safari,
which for one reason or another didn't complain.  I'll correct that.

Skip

From skip at pobox.com  Fri May 30 21:40:22 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 21:40:21 2003
Subject: [spambayes-dev] New faq.txt
Message-ID: <16088.2054.75594.38536@montanaro.dyndns.org>

Folks,

I just checked in faq.txt, default.css, Makefile and scripts/make.rules, and
cvs removed faq.ht.  Together, this means you should edit faq.txt now to
update the faq.  The last question (4.6) has some info about what you need
to install to rebuild faq.html from faq.txt.

Thanks for the feedback.  We now return you to your regularly scheduled
programming.

Skip


From popiel at wolfskeep.com  Fri May 30 20:49:14 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri May 30 22:49:18 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value 
In-Reply-To: Message from Skip Montanaro <skip@pobox.com> of "Fri,
	30 May 2003 18:49:51 CDT."
	<16087.60959.324302.344655@montanaro.dyndns.org> 
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
	<3ED7C994.3090709@parducci.net>
	<20030530213818.438042DE9C@cashew.wolfskeep.com>
	<16087.60959.324302.344655@montanaro.dyndns.org> 
Message-ID: <20030531024914.2416C2DE9C@cashew.wolfskeep.com>

In message:  <16087.60959.324302.344655@montanaro.dyndns.org>
             Skip Montanaro <skip@pobox.com> writes:
>
>    >> would it be possible to catch the exception and move to the next line
>    >> in the header (and/or payload) for parsing?
>
>    Alex> Alas, we're at the wrong level to do that sort of thing.  To do
>    Alex> that level of granularity properly, we'd have to be in the guts of
>    Alex> the parser... 
>
>This is just a shot in the dark, but would it be possible to modify the
>email parser sufficiently so that it gave more detail about where it was
>when the error condition was detected (e.g., what body line number, header
>and/or MIME part)?  That might allow Spambayes to tweak the message in the
>right spot and retry the parse.

It is my belief that tweaking the message intelligently (as opposed
to just forcing the entire body to be treated as plain text by blowing
away the MIME headers) would require more intelligence than doing the
parsing in the first place.  After all, you'd be permuting the data
to make it parse, which means that you understand all about the
parsing.  If we're going to get that smart, we might as well not use
the email package... which has already been circled around a few times.

Personally, I'd like to see us have a simpler parser which just
understood headers vs. body, and didn't try to decode the individual
headers (for charset, or anything like that).  Ideally, we'd give
this simple parser the message (as a string) and a list of headers
to remove from the message, and it would return a modified message
(again as a string).  We could use this simpler parser both for
blowing away the MIME headers (as alluded to above for dealing with
malformed messages) and for annotating the message with the
classification results (blow away all the classification headers,
then prepend the new ones (properly formatted) to the message).
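
A minimal sketch of such a parser, string in and string out.  The
names `strip_headers` and `annotate` are made up for illustration,
and the only structure it assumes is "headers, blank line, body" with
folded continuation lines starting with a space or tab:

```python
def strip_headers(message, names):
    """Return `message` with every header named in `names` removed.

    Header values are never decoded.  Continuation lines are dropped
    together with the header they continue.
    """
    unwanted = {name.lower() for name in names}
    lines = message.splitlines(keepends=True)
    kept = []
    dropping = False
    body_start = len(lines)
    for i, line in enumerate(lines):
        if line.strip("\r\n") == "":
            body_start = i          # blank line: the body starts here
            break
        if line[0] in " \t":
            if not dropping:        # continuation shares its header's fate
                kept.append(line)
            continue
        name = line.split(":", 1)[0].strip().lower()
        dropping = name in unwanted
        if not dropping:
            kept.append(line)
    return "".join(kept) + "".join(lines[body_start:])


def annotate(message, name, value):
    """Remove any existing `name` headers, then prepend a fresh one."""
    return "%s: %s\n%s" % (name, value, strip_headers(message, [name]))
```

Passing the MIME header names to strip_headers() would be the "blow
away the MIME headers" case; annotate() is the classification case.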

Of course, that would take about two hours of work, and I'm lucky
to get two consecutive minutes right now...

- Alex

From tim.one at comcast.net  Sat May 31 00:06:49 2003
From: tim.one at comcast.net (Tim Peters)
Date: Fri May 30 23:08:05 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification
	headervalue
In-Reply-To: <20030531024914.2416C2DE9C@cashew.wolfskeep.com>
Message-ID: <LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net>

[T. Alexander Popiel]
> ...
> Personally, I'd like to see us have a simpler parser which just
> understood headers vs. body, and didn't try to decode the individual
> headers (for charset, or anything like that).  Ideally, we'd give
> this simple parser the message (as a string) and a list of headers
> to remove from the message, and it would return a modified message
> (again as a string).  We could use this simpler parser both for
> blowing away the MIME headers (as alluded to above for dealing with
> malformed messages) and for annotating the message with the
> classification results (blow away all the classification headers,
> then prepend the new ones (properly formatted) to the message).
>
> Of course, that would take about two hours of work, and I'm lucky
> to get two consecutive minutes right now...

I don't expect this would help.  Decoding base64 and quoted-printable are
important, but base64 if and only if it's a text section.  Identifying
this stuff requires decoding the MIME structure too.  Decoding
charsets probably isn't important for *me*, because virtually all my ham is
in 7-bit ASCII English, but for non-English users I can easily believe it's
vital.  Etc -- the email package does a lot of stuff, and it's valuable.

As to fiddling damaged msgs to get them thru the parser, the next time just
try it.  I've had easy success with this every time I've seen it pop up in
the Outlook client.  Appending a newline is sometimes all it takes.  In one
case, it required falling back to a different base64 decoder, because the
email pkg's decoder is too(!) forgiving.
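
The retry idea can be sketched generically.  Everything here is
hypothetical (the function name, the `parse` callable, the fixup
list); it only shows the shape of "try, fiddle, try again", with the
append-a-newline trick as the one fixup.  The base64 fallback is not
shown:

```python
def parse_with_fixups(text, parse, fixups):
    """Call parse(text); on failure retry with each fixup applied.

    `fixups` is a sequence of (label, function) pairs, each function
    mapping the raw text to a tweaked copy.  Returns a pair
    (parsed result, label of the fixup that worked or None).
    If nothing works, the original error is re-raised.
    """
    try:
        return parse(text), None
    except Exception as exc:
        original = exc
    for label, fix in fixups:
        try:
            return parse(fix(text)), label
        except Exception:
            continue
    raise original


# The one fixup mentioned above: sometimes a trailing newline is enough.
FIXUPS = [("append newline", lambda text: text + "\n")]
```

Routing every parse request through one function like this is one
way to keep the workarounds in a single place.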

The reason this crap keeps popping up has been covered before:  we don't
have a chokepoint now for asking the email pkg to parse stuff, so
workarounds are spread around the codebase.  Of course this won't get fixed
until someone who actually likes the email package makes time to make it fly
<wink>.


From skip at pobox.com  Fri May 30 23:14:52 2003
From: skip at pobox.com (Skip Montanaro)
Date: Fri May 30 23:14:52 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification
	headervalue
In-Reply-To: <LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net>
References: <20030531024914.2416C2DE9C@cashew.wolfskeep.com>
	<LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net>
Message-ID: <16088.7724.849970.104717@montanaro.dyndns.org>

    Tim> Of course this won't get fixed until someone who actually likes the
    Tim> email package makes time to make it fly <wink>.

Maybe you could arrange to trip and spill your Dunkin' Donuts coffee on
Barry every morning when you arrive at the office.  Just apologize with
something like, "Oh, sorry Barry.  I was thinking about that email parsing
problem in Spambayes again.  I just can't seem to figure it out."  After
a while he'll get the hint and solve the problem for you^H^H^Hus. ;-)

Skip

From tim at fourstonesexpressions.com  Fri May 30 23:38:11 2003
From: tim at fourstonesexpressions.com (Tim Stone)
Date: Fri May 30 23:38:37 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header
	value 
In-Reply-To: <20030530173506.0F6992DE9C@cashew.wolfskeep.com>
References: <oprpzxmqwi34x2gn@mail.fourstonesexpressions.com>
	<20030530173506.0F6992DE9C@cashew.wolfskeep.com>
Message-ID: <oprp0ppxaz34x2gn@mail.fourstonesexpressions.com>

>> Other alternatives, votes, or general booing and hissing?
>
> 3. Mark it as unsure.

This will almost certainly cause someone to try to train it, which will 
again break things.  We need some way to make sure we never look at this 
message again.
>
> - Alex
>
>


-- 


c'est moi - TimS

From popiel at wolfskeep.com  Fri May 30 21:55:58 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Fri May 30 23:56:03 2003
Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification
	headervalue 
In-Reply-To: Message from Tim Peters <tim.one@comcast.net> of "Fri,
	30 May 2003 23:06:49 EDT."
	<LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net> 
References: <LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net> 
Message-ID: <20030531035558.243A32DE9C@cashew.wolfskeep.com>

In message:  <LNBBLJKPBEHFEDALKOLCAEFEEIAB.tim.one@comcast.net>
             Tim Peters <tim.one@comcast.net> writes:
>[T. Alexander Popiel]
>> ...
>> Personally, I'd like to see us have a simpler parser
>
>I don't expect this would help.  Decoding base64 and quoted-printable are
>important, but base64 if and only if it's a text section.  Identifying
>this stuff requires decoding the MIME structure too.  Decoding
>charsets probably isn't important for *me*, because virtually all my ham is
>in 7-bit ASCII English, but for non-English users I can easily believe it's
>vital.  Etc -- the email package does a lot of stuff, and it's valuable.

I see I wasn't clear... I only want the simpler parser for handling
the classification headers annotation and the cases where the email
package barfs.  Definitely keep the email package for what we've got
it for now... because yes, it helps immensely.

- Alex

From bill at parducci.net  Sat May 31 08:59:15 2003
From: bill at parducci.net (bill parducci)
Date: Sat May 31 11:00:21 2003
Subject: [spambayes-dev] cvs resets
References: <LNBBLJKPBEHFEDALKOLCMEDEECAB.tim.one@comcast.net>
	<a05200f00baaf40b11225@[192.168.1.20]>
	<3E89A90A.3060600@parducci.net>
	<16009.43979.172138.854043@montanaro.dyndns.org>
	<3ED7CAFE.4000906@parducci.net>
	<16087.59891.631899.666778@montanaro.dyndns.org>
Message-ID: <3ED8C343.1020204@parducci.net>

i see this a lot:

$ cvs up -Pd
cvs [update aborted]: end of file from server (consult above messages if 
any)

is the python cvs server overloaded frequently, or is it just me? 
(usually by the fifth or sixth try i get through.)

thanks

b


From skip at pobox.com  Sat May 31 12:19:30 2003
From: skip at pobox.com (Skip Montanaro)
Date: Sat May 31 12:19:30 2003
Subject: [spambayes-dev] cvs resets
In-Reply-To: <3ED8C343.1020204@parducci.net>
References: <LNBBLJKPBEHFEDALKOLCMEDEECAB.tim.one@comcast.net>
	<a05200f00baaf40b11225@[192.168.1.20]>
	<3E89A90A.3060600@parducci.net>
	<16009.43979.172138.854043@montanaro.dyndns.org>
	<3ED7CAFE.4000906@parducci.net>
	<16087.59891.631899.666778@montanaro.dyndns.org>
	<3ED8C343.1020204@parducci.net>
Message-ID: <16088.54802.82666.585488@montanaro.dyndns.org>


    bill> i see this a lot:
    bill> $ cvs up -Pd
    bill> cvs [update aborted]: end of file from server (consult above messages if 
    bill> any)

SourceForge is busy a lot.  It tends to give lower priority to anonymous CVS
requests to allow developers continued access.

Skip

From popiel at wolfskeep.com  Sat May 31 15:43:02 2003
From: popiel at wolfskeep.com (T. Alexander Popiel)
Date: Sat May 31 17:43:07 2003
Subject: [spambayes-dev] More testing on the common db
Message-ID: <20030531214302.2EBCD2DDF2@cashew.wolfskeep.com>

Here's some more results from testing with the common db and
my own private db:

Testing a selection of messages 4-9 months old:
Ham (2052 msgs):
             ham  unsure    spam
  common    2011      36       5
  popiel    2041       8       3

Spam (3838 msgs):
             ham  unsure    spam
  common       5      53    3773
  popiel       8      75    3748


Testing only the most recent 500 messages of each type:
Ham (500 msgs):
             ham  unsure    spam
  common     488      11       1
  popiel     495       5       0

Spam (500 msgs):
             ham  unsure    spam
  common       1      21     478
  popiel       1      10     489


I find it rather interesting that the common db did better on
the old spam than my personal one did; I think this is evidence
of mail mutations having a real effect on accuracy (since my
personal db only contains info from the most recent 4 months),
but it could also be attributable to other things... such as
differences between Skip's training regime and my own.

For the most recent mail, the personal db was a clear win over
the common db.
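
For anyone who prefers rates to raw counts, a small sketch that turns
the tables above into percentages.  The counts are copied from the
tables; the dictionary layout and function name are just for
illustration.  Note that both old-spam rows add up to 3831 rather
than 3838, so each row is divided by its own total:

```python
# (ham, unsure, spam) counts copied from the tables above.
RESULTS = {
    ("old", "ham"): {"common": (2011, 36, 5), "popiel": (2041, 8, 3)},
    ("old", "spam"): {"common": (5, 53, 3773), "popiel": (8, 75, 3748)},
    ("recent", "ham"): {"common": (488, 11, 1), "popiel": (495, 5, 0)},
    ("recent", "spam"): {"common": (1, 21, 478), "popiel": (1, 10, 489)},
}


def rates(ham, unsure, spam):
    """Return (ham %, unsure %, spam %) relative to the row total."""
    total = ham + unsure + spam
    return tuple(100.0 * n / total for n in (ham, unsure, spam))


if __name__ == "__main__":
    for (age, kind), rows in RESULTS.items():
        for db, counts in rows.items():
            print("%-6s %-4s %-6s  %6.2f%% %6.2f%% %6.2f%%"
                  % ((age, kind, db) + rates(*counts)))
```

For ham rows the third column is the false positive rate; for spam
rows the first column is the false negative rate.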

- Alex

From matt at mondoinfo.com  Sat May 31 22:19:41 2003
From: matt at mondoinfo.com (Matthew Dixon Cowles)
Date: Sat May 31 22:19:47 2003
Subject: [spambayes-dev] Re: [Spambayes] Database cleaning?
In-Reply-To: <20030531170037.10DB82DDF2@cashew.wolfskeep.com>
References: <3ED6F33F.9050000@mailcom.com>
	<20030531170037.10DB82DDF2@cashew.wolfskeep.com>
Message-ID: <1054430548.31.1335@sake.mondoinfo.com>

[Alex Popiel on nonsense words in spam]
> Yes, those words cause database pollution, and yes, they can be
> weeded out with just a handful of lines of code... but it's hard to
> tell which hapax legomena will be useless, and which will soon get
> reinforced by other occurrences, so it's (IMNSHO) generally not
> worth the hassle.

With an eye toward reducing the size of the database, I instrumented
the classifier a while ago and found a very strong indication that
that's true. Indeed, hapaxes often figured in scoring. I didn't
bother to calculate exact numbers because the results were strong
enough to persuade me that removing hapaxes wasn't a useful strategy.

I tore that code out and instead hacked the classifier so that I
could determine how soon after a word figures in scoring it's
used again. I think that the results are at least slightly
interesting. Note that the histogram below is log scaled.


Unique tokens used for scoring  60627
Used Once                       17388

Days prev  Count Histogram is log scaled
        0 903644 **************************************************
        1  27694 *************************************
        2  15121 ***********************************
        3   7024 ********************************
        4   4694 *******************************
        5   3634 ******************************
        6   3134 *****************************
        7   2443 ****************************
        8   1697 ***************************
        9   1340 **************************
       10    982 *************************
       11    801 ************************
       12    671 ************************
       13    871 *************************
       14    630 ************************
       15    494 ***********************
       16    374 **********************
       17    343 *********************
       18    227 ********************
       19    216 ********************
       20    199 *******************
       21    226 ********************
       22    126 ******************
       23    114 *****************
       24     55 ***************
       25     22 ***********
       26     49 **************


My mail may be unrepresentative in ways that exaggerate the slope
here. Specifically, I read postmaster, webmaster, etc addresses for
several domains so it's common for me to get multiple copies of the
same spam.
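
On the log scaling of the histogram: a sketch of one way to draw such
bars.  The formula is an assumption (bar length proportional to the
log of the count, with the largest count filling 50 columns); it
appears to reproduce the rows above that were checked by hand, but it
is not the actual instrumentation code:

```python
import math


def log_bar(count, largest, width=50):
    """Return a bar of '*' whose length grows with log(count).

    The largest count gets the full `width`; a count of 1 gets an
    empty bar, since log(1) is 0.
    """
    if count <= 1 or largest <= 1:
        return ""
    return "*" * round(width * math.log(count) / math.log(largest))


if __name__ == "__main__":
    counts = [903644, 27694, 15121, 7024]   # first rows of the table
    for days, count in enumerate(counts):
        print("%9d %6d %s" % (days, count, log_bar(count, counts[0])))
```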

Regards,
Matt