From tim.one at comcast.net Tue May 27 23:17:07 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue May 27 22:20:08 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
Message-ID:

Just noticed that in the FAQ

    http://spambayes.sourceforge.net/faq.html

The links in 1b aren't clickable (http://www.python.org/download/ and
http://mimelib.sf.net).

But mostly sending this just to test that the new list works!

From bill at parducci.net Tue May 27 20:59:10 2003
From: bill at parducci.net (bill parducci)
Date: Tue May 27 23:06:10 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
References:
Message-ID: <3ED425FE.8090109@parducci.net>

i can fix that...if i could find where the FAQ is in cvs. i have just
updated /cvsroot/spambayes and don't see it here. is there another
repository for the website stuff?

also, the link in the faq for:

"If you have any suggestions about other questions and answers that
should be included here, please mail the list with them."

points to spambayes@python.org. should this be directed to
spambayes-dev?

b

Tim Peters wrote:
> Just noticed that in the FAQ
>
> http://spambayes.sourceforge.net/faq.html
>
> The links in 1b aren't clickable (http://www.python.org/download/ and
> http://mimelib.sf.net).
>
> But mostly sending this just to test that the new list works!
>
>
> _______________________________________________
> spambayes-dev mailing list
> spambayes-dev@python.org
> http://mail.python.org/mailman/listinfo/spambayes-dev

From tim.one at comcast.net Wed May 28 00:34:28 2003
From: tim.one at comcast.net (Tim Peters)
Date: Tue May 27 23:35:03 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
In-Reply-To: <3ED425FE.8090109@parducci.net>
Message-ID:

[bill parducci]
> i can fix that...if i could find where the FAQ is in cvs. i have just
> updated /cvsroot/spambayes and don't see it here. is there another
> repository for the website stuff?

It's in the same repository, but in a different "module".
If you look at

    http://cvs.sf.net/cgi-bin/viewcvs.cgi/spambayes/#dirlist

you'll see that both "spambayes" and "website" live under the root. So you
need to cvs checkout the website module:

    cvs -d:...:/cvsroot/spambayes co website

where "..." is whatever gibberish you used to check out the spambayes module
to begin with. The FAQ will then live on your box as

    website/FAQ.ht

> also, the link in the faq for:
>
> "If you have any suggestions about other questions and answers that
> should be included here, please mail the list with them."
>
> points to spambayes@python.org. should this be directed to
> spambayes-dev?

I don't know, but best guess is "yes".

From noreply at sourceforge.net Tue May 27 22:26:56 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed May 28 00:31:55 2003
Subject: [spambayes-dev] [ spambayes-Bugs-744380 ] W982E/Outlook 2000: exception on loading
Message-ID:

Bugs item #744380, was opened at 2003-05-27 09:51
Message generated for change (Comment added) made by jobbins
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702

Category: Outlook
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Steve Clift (sclift)
Assigned to: Mark Hammond (mhammond)
Summary: W982E/Outlook 2000: exception on loading

Initial Comment:
Windows 98 2nd Edition
Outlook 2000 SR-1 - Corporate or Workgroup

SpamBayes throws an exception when loading.
From the log file:

SpamAddin - Connecting to Outlook
pythoncom error: Failed to call the universal dispatcher

Traceback (most recent call last):
  File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch
  File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_
  File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_
  File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection
  File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager
  File "E:\src\spambayes\Outlook2000\manager.py", line 141, in __init__
  File "E:\src\spambayes\Outlook2000\manager.py", line 182, in LocateDataDirectory
  File "E:\src\python-cvs\lib\ntpath.py", line 269, in isdir
exceptions.LookupError: no codec search functions registered: can't find encoding

----------------------------------------------------------------------

Comment By: Larry Jobbins (jobbins)
Date: 2003-05-27 21:26

Message:
Logged In: YES
user_id=788287

Same error. Installed Setup-002.exe from
http://starship.python.net/crew/mhammond/spambayes/
Using Win98SE, Outlook 2000, all MS updates. Shows add-in, but won't
stay checked, no icon appears. Install log looks same - pythoncom error:
Failed to call the universal dispatcher, etc

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702

From tim_one at email.msn.com Wed May 28 03:00:22 2003
From: tim_one at email.msn.com (Tim Peters)
Date: Wed May 28 02:00:58 2003
Subject: [spambayes-dev] RE: [Spambayes] Intersection of two databases
In-Reply-To: <16083.55644.605271.500891@montanaro.dyndns.org>
Message-ID:

[Skip]
> ...
> Using that list, I then merged the corresponding entries from the two
> source databases.

Skip, how did you do the merge?
That is, word w in your database had a certain hamcount and spamcount,
while word w in Alex's had a presumably different pair of counts. Did you
add them? Take the max? Something else?

It was a very interesting experiment regardless of the answer.

From bill at parducci.net Wed May 28 01:18:08 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 03:18:15 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
References:
Message-ID: <3ED462B0.8090002@parducci.net>

attached is updated version with requested fixes (including e-mail
address update). since it was "tidy'd" already i retidy'd the file for
consistency.

b

Tim Peters wrote:
> [bill parducci]
>
>>i can fix that...if i could find where the FAQ is in cvs. i have just
>>updated /cvsroot/spambayes and don't see it here. is there another
>>repository for the website stuff?
>
>
> It's in the same repository, but in a different "module". If you look at
>
> http://cvs.sf.net/cgi-bin/viewcvs.cgi/spambayes/#dirlist
>
> you'll see that both "spambayes" and "website" live under the root. So you
> need to cvs checkout the website module:
>
> cvs -d:...:/cvsroot/spambayes co website
>
> where "..." is whatever gibberish you used to check out the spambayes module
> to begin with. The FAQ will then live on your box as
>
> website/FAQ.ht
>
>
>>also, the link in the faq for:
>>
>>"If you have any suggestions about other questions and answers that
>>should be included here, please mail the list with them."
>>
>>points to spambayes@python.org. should this be directed to
>>spambayes-dev?
>
>
> I don't know, but best guess is "yes".

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20030528/f638e834/faq-0001.htm

From skip at pobox.com Wed May 28 08:17:00 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 08:17:06 2003
Subject: [spambayes-dev] Website bug: Inactive links in FAQ
In-Reply-To:
References:
Message-ID: <16084.43196.134724.367964@montanaro.dyndns.org>

    Tim> Just noticed that in the FAQ
    Tim> http://spambayes.sourceforge.net/faq.html
    Tim> The links in 1b aren't clickable (http://www.python.org/download/
    Tim> and http://mimelib.sf.net).

Fixed.

    Tim> But mostly sending this just to test that the new list works!

It does...

Skip

From skip at pobox.com Wed May 28 08:48:09 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 08:48:13 2003
Subject: [spambayes-dev] RE: [Spambayes] Intersection of two databases
In-Reply-To:
References: <16083.55644.605271.500891@montanaro.dyndns.org>
Message-ID: <16084.45065.242138.717158@montanaro.dyndns.org>

    >> Using that list, I then merged the corresponding entries from the two
    >> source databases.

    Tim> Skip, how did you do the merge? That is, word w in your database
    Tim> had a certain hamcount and spamcount, while word w in Alex's had a
    Tim> presumably different pair of counts. Did you add them? Take the
    Tim> max? Something else?

I simply added them. I also added the 'saved state' values. This made
intuitive sense to me, though we all know intuition is often wrong. I was
effectively training using both databases, just eliminating the less useful
tokens. (Ignore for the moment that actually training on the complete set
of emails Alex and I have would probably have generated slightly different
results.)
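
For readers of the archive, the additive merge described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionary layout, the names merge_counts and merge_state, and the reading of the 'saved state' values as total spam and ham message counts are all hypothetical and are not the actual SpamBayes database format.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical layout: word -> (spamcount, hamcount)
Counts = Dict[str, Tuple[int, int]]


def merge_counts(db_a: Counts, db_b: Counts, keep: Iterable[str]) -> Counts:
    """Add the per-word counts of two databases, keeping only words in `keep`."""
    merged = {}
    for word in keep:
        spam_a, ham_a = db_a.get(word, (0, 0))
        spam_b, ham_b = db_b.get(word, (0, 0))
        merged[word] = (spam_a + spam_b, ham_a + ham_b)
    return merged


def merge_state(state_a: Tuple[int, int], state_b: Tuple[int, int]) -> Tuple[int, int]:
    """Add the assumed (nspam, nham) message totals of the two databases."""
    return (state_a[0] + state_b[0], state_a[1] + state_b[1])
```

Adding counts (rather than taking the max) is what makes the result behave like training on both corpora, apart from the tokens that were dropped.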
Skip

From noreply at sourceforge.net Wed May 28 10:30:16 2003
From: noreply at sourceforge.net (SourceForge.net)
Date: Wed May 28 12:35:03 2003
Subject: [spambayes-dev] [ spambayes-Bugs-745003 ] hammiebulk.py: Untrain does not work
Message-ID:

Bugs item #745003, was opened at 2003-05-28 11:30
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745003&group_id=61702

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Paramjit Oberoi (psoberoi)
Assigned to: Nobody/Anonymous (nobody)
Summary: hammiebulk.py: Untrain does not work

Initial Comment:
hammiebulk.py: Untrain bug

Untraining does not work since when the "-U" option is detected, the
"untrain" variable is set to "1", overriding the function definition...

patch:

145c145
< untrain = 0
---
> untrain_mode = 0
169c169
< untrain = 1
---
> untrain_mode = 1
182c182
< if not untrain:
---
> if not untrain_mode:

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745003&group_id=61702

From bill at parducci.net Wed May 28 10:29:24 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 12:36:24 2003
Subject: [spambayes-dev] FAQ update
Message-ID: <3ED4E3E4.20508@parducci.net>

1. added "why don't you bounce back spam?"
2. made the page w3c compliant (html 4.01)
3. reTIDY'd (using attached tidy.conf if anyone cares -- would be nice if
   there was one in cvs ;-)

side note: the site itself uses a lot of deprecated tags, etc.:

http://validator.w3.org/check?uri=http%3A%2F%2Fspambayes.sourceforge.net%2F

would it be of any benefit if i cleaned it up? (is that even possible
within the constraints of sf.net?)

b

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20030528/bac7ef81/faq-0001.htm
-------------- next part --------------
break-before-br: no
char-encoding: latin1
enclose-text: yes
enclose-block-text: yes
indent-spaces: 2
indent: yes
input-xml: no
markup: yes
numeric-entities: yes
output-xml: no
quote-marks: yes
quote-nbsp: yes
show-warnings: yes
tidy-mark: no
uppercase-attributes: no
uppercase-tags: no
wrap: 72
wrap-attributes: yes
wrap-script-literals: yes

From skip at pobox.com Wed May 28 17:18:23 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 17:18:31 2003
Subject: [spambayes-dev] FAQ updated
Message-ID: <16085.10143.186656.280842@montanaro.dyndns.org>

(Sending to both spambayes and spambayes-dev to catch all interested
parties.)

Folks,

With a little assistance from Anthony Baxter, I updated the faq.ht file to
automatically number both the table of contents and the main section with
the answers. I also ran it through ispell and added a comment near the top
to help people figure out how to add new content.

If you see anything amiss, feel free to send me a correction or check it in
yourself if you're so enabled.

Skip

From skip at pobox.com Wed May 28 17:25:05 2003
From: skip at pobox.com (Skip Montanaro)
Date: Wed May 28 17:25:27 2003
Subject: [spambayes-dev] FAQ update
In-Reply-To: <3ED4E3E4.20508@parducci.net>
References: <3ED4E3E4.20508@parducci.net>
Message-ID: <16085.10545.151192.514702@montanaro.dyndns.org>

    bill> 1. added "why don't you bounce back spam?"
    bill> 2. made the page w3c compliant (html 4.01)
    bill> 3. reTIDY'd (using attached tidy.conf if anyone cares -- would be
    bill> nice if there was one in cvs ;-)

Thanks. For future reference it would be a lot simpler to incorporate
changes if you could just post a context diff against the latest version in
CVS. Before seeing your email I checked in a massive change to the way
numbering is done. Any chance you can send me a context diff against that?
Thanks for the tidy.conf file. I'll probably check it in as well.

Thx,

Skip

From bill at parducci.net Wed May 28 16:39:41 2003
From: bill at parducci.net (bill parducci)
Date: Wed May 28 18:52:30 2003
Subject: [spambayes-dev] FAQ update
References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org>
Message-ID: <3ED53AAD.7060608@parducci.net>

1. readded "why don't you bounce back spam?"
2. remade the page w3c compliant (html 4.01)
3. skipped TIDY (manually conformed to format)

> Thanks. For future reference it would be a lot simpler to incorporate
> changes if you could just post a context diff against the latest version in
> CVS. Before seeing your email I checked in a massive change to the way
> numbering is done. Any chance you can send me a context diff against that?

sure, as long as you check your e-mail before introducing massive changes
;-)

b

-------------- next part --------------
Index: faq.ht
===================================================================
RCS file: /cvsroot/spambayes/website/faq.ht,v
retrieving revision 1.15
diff -c -r1.15 faq.ht
*** faq.ht 28 May 2003 20:56:23 -0000 1.15
--- faq.ht 28 May 2003 22:39:34 -0000
***************
*** 1,960 ****
! Title: SpamBayes: Frequently Asked Questions
! Author-Email: spambayes@python.org
  Author: spambayes
! ! ! ! ! !

! Frequently Asked Questions !

!
    !
  1. ! Overview
    !
      !
    1. ! So what is Spambayes? !
    2. !
    3. ! What do I need to install ! Spambayes? !
    4. !
    5. ! Is there a "ten thousand foot view" that ! shows how this thing works? !
    6. !
    7. ! Where does all this stuff live? !
    8. !
    !
  2. !
  3. ! Compatibility
    !
      !
    1. ! What version of Outlook does it ! work with? !
    2. !
    3. ! Does Spambayes work with Outlook ! Express? !
    4. !
    5. ! Do I have to have python installed to use ! Spambayes with Outlook? !
    6. !
    7. ! Forget Outlook, what clients will ! Spambayes work with in general? !
    8. !
    9. ! We have Outlook 2000 connecting to an ! Exchange 2000 server. Will spambayes work for us? !
    10. !
    !
  4. !
  5. ! Using Spambayes
    !
      !
    1. ! How do I configure Spambayes? !
    2. !
    3. ! How do I train Spambayes (web ! method) !
    4. !
    5. ! How do I train Spambayes ! (forward/bounce method) !
    6. !
    7. ! How do I train Spambayes (command line ! method) !
    8. !
    9. ! I just got a spam, but the system said it ! was "unsure". Why couldn't it tell that it was spam - it's ! obvious? !
    10. !
    11. ! OK, I trained on that message. But I ! just got another one, and the stupid system still ! thinks it's unsure. Why did it ignore me??? !
    12. !
    13. ! I've mucked up my training and I want ! to start all over again, but there isn't an option for this ! anywhere. What do I do? !
    14. !
    15. ! I can't use a web browser, so I can't ! configure pop3proxy/imapfilter.
      ! Also: how do I configure hammiefilter and the other ! applications that don't have a user interface?
      !
    16. !
    17. ! That's great, now I know what the ! format looks like, but what options do I need to set? !
    18. !
    19. ! I've made a configuration file, but ! Spambayes is ignoring it. Now what? !
    20. !
    21. ! Why don't short words or long words ! show up in the clues?
    22. !
    23. ! Is there anything else I should know?
    24. !
    !
  6. !
  7. ! Development
    !
      !
    1. ! Hey! Why don't you implement cool ! tokenizer trick X? I think it would really foil those ! spammers! !
    2. !
    3. ! This software is great! I want to ! implement it for all my users. Are there plans to develop a ! server-side spambayes solution? !
    4. !
    5. ! Forget tokenising words - you should use ! character n-grams! !
    6. !
    7. ! The clues for my mail are all in lower case, ! but "FREE" is a much better clue than "free". Why do you ! force everything into lower case?
    8. !
    !
  8. !
!

! If you have any suggestions about other questions and answers that ! should be included here, please mail the list ! with them. !

! !
    ! !
  1. ! Overview !
      ! !
    1. ! So what is ! Spambayes? !

      ! Spambayes is a tool used to segregate unwanted mail (spam) from ! the mail you want (ham). Before Spambayes can be your spam filter ! of choice you need to train it on representative samples of email ! you receive. After it's been trained, you use Spambayes to ! classify new mail according to its spamminess and hamminess ! qualities. !

      !

      ! To train Spambayes (which you don't need to do if you're going to ! be using the POP3 proxy to classify messages, but you'll get ! better results from the outset if you do) you need to save your ! incoming email for awhile, segregating it into two piles, known ! spam and known ham (ham is our nickname for good mail). It's best ! to train on recent email, because your interests and the nature of ! what spam looks like change over time. Once you've collected a ! fair portion of each (anything is better than nothing, but it ! helps to have a couple hundred of each), you can tell Spambayes, ! "Here's my ham and my spam". It will then process that mail and ! save information about different patterns which appear in ham and ! spam. That information is then used during the filtering ! stage. See the "Command-line training" section below for details. !

      !

      ! When Spambayes filters your email, it compares each unclassified ! message against the information it saved from training and makes a ! decision about whether it thinks the message qualifies as ham or ! spam, or if it's unsure about how to classify the message. It then ! adds its classification to the message, either by adding a header ! (X-Spambayes-Classification: spam|ham|unsure), modifying the To: ! or Subject: headers, or adding a "Spam" field to the message. ! Depending on which Spambayes application you are using, it may ! then filter this message for you, or you can set up your own ! filters (to file away suspected spam into its own mail folder, for ! example). !
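
As a rough illustration of how a downstream filter might read the classification header described above, here is a minimal sketch using only Python's standard email package. The function name classification and the "unclassified" fallback are assumptions made for this example, and splitting on ";" is a defensive guess in case extra text follows the label; none of this is taken from the Spambayes source.

```python
from email import message_from_string

HEADER = "X-Spambayes-Classification"
KNOWN = {"spam", "ham", "unsure"}


def classification(raw_message: str) -> str:
    """Return 'spam', 'ham' or 'unsure', or 'unclassified' if no usable header."""
    msg = message_from_string(raw_message)
    value = msg.get(HEADER, "")
    # Keep only the label itself; anything after ';' is ignored (assumption).
    label = value.split(";")[0].strip().lower()
    return label if label in KNOWN else "unclassified"
```

A mail client rule or a procmail recipe does the same job in practice; the sketch only shows what such a rule is matching on.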

      !
    2. ! !
    3. ! What do I need to ! install Spambayes? !

      ! Unless you are using the Outlook plugin, you must have a recent ! version of Python installed on your computer, version 2.2 or ! later. (Don't ask about backporting it to earlier versions of ! Python. It's almost a certainty this won't happen.) If you need to ! install Python on your system, check the Python download page for ! the version appropriate to your computer: http://www.python.org/download/ ! You also need version 2.4.3 or above of the Python "email" ! package. If you're running Python 2.2.2 or above, then you ! already have this. If not, you can download it from http://mimelib.sf.net/ and ! install it - unpack the archive, cd to the email-2.4.3 directory ! and type "python setup.py install" (YMMV on different ! platforms). This will install it into your Python site-packages ! directory. You'll also need to move aside the standard "email" ! library - go to your Python "Lib" directory and rename "email" to ! "email_old". !

      !
    4. ! !
    5. ! Is there a "ten thousand ! foot view" that shows how this thing works? ! !

      ! There are eight main components to the Spambayes system: !

      !
        !
      1. ! A database. Loosely speaking, this is a collection of words and ! associated spam and ham probabilities. The database says "If a ! message contains the word 'Viagra' then there's a 98% chance ! that it's spam, and a 2% chance that it's ham." This database is ! created by training - you give it messages, tell it whether ! those messages are ham or spam, and it adjusts its probabilities ! accordingly. How to train it is covered below. By default it ! lives in a file called "hammie.db" or (for the Outlook plugin) ! "default_bayes_database". !
      2. !
      3. ! The tokenizer/classifier. This is the core engine of the system. ! The tokenizer splits emails into tokens (words, roughly ! speaking), and the classifier looks at those tokens to determine ! whether the message looks like spam or not. You don't use the ! tokenizer/classifier directly - it powers the other parts of the ! system. !
      4. !
      5. ! The POP3 proxy. This sits between your email client (Eudora, ! Outlook Express, etc) and your incoming email server, and adds ! the classification header to emails as you download them. A ! typical user's email setup looks like this: ! !
        !     +-----------------+                       +-------------+
        !     | Outlook Express |      Internet or      |             |
        !     |  (or similar)   | <-------------------> | POP3 server |
        !     |                 |      Intranet         |             |
        !     +-----------------+                       +-------------+
        ! 
        ! The POP3 server runs either at your ISP for Internet mail, or ! somewhere on your internal network for corporate mail. The POP3 ! proxy sits in the middle and adds the classification header as ! you retrieve your email: !
        !     +-----------------+      +------------+      +-------------+
        !     | Outlook Express |      | Spambayes  |      |             |
        !     |  (or similar)   | <--> | POP3 proxy | <--> | POP3 server |
        !     |                 |      |            |      |             |
        !     +-----------------+      +------------+      +-------------+
        ! 
        ! So where you currently have your email client configured to talk ! to say, "pop3.my-isp.com", you instead configure the ! proxy to talk to "pop3.my-isp.com" and configure your ! email client to talk to the proxy. The POP3 proxy can live on ! your PC, or on the same machine as the POP3 server, or on a ! different machine entirely, it really doesn't matter. Say it's ! living on your PC, you'd configure your email client to talk to ! "localhost". You can configure the proxy to talk to multiple ! POP3 servers, if you have more than one email account. !
      6. !
      7. ! The SMTP proxy. This sits between your email client (Eudora, ! Outlook Express, etc) and your outgoing email server. Any mail ! sent to spambayes_spam@localhost or spambayes_ham@localhost is ! intercepted and trained appropriately. A typical user's email ! setup looks like this: ! !
        !     +-----------------+                       +-------------+
        !     | Outlook Express |      Internet or      |             |
        !     |  (or similar)   | <-------------------> | SMTP server |
        !     |                 |      Intranet         |             |
        !     +-----------------+                       +-------------+
        ! 
        ! ! The SMTP server runs either at your ISP for Internet mail, or ! somewhere on your internal network for corporate mail. The SMTP ! proxy sits in the middle and checks for mail to train on as you ! send your email: ! !
        !     +-----------------+      +------------+      +-------------+
        !     | Outlook Express |      | Spambayes  |      |             |
        !     |  (or similar)   | <--> | SMTP proxy | <--> | SMTP server |
        !     |                 |      |            |      |             |
        !     +-----------------+      +------------+      +-------------+
        ! 
        ! ! So where you currently have your email client configured to talk ! to say, "smtp.my-isp.com", you instead configure the ! proxy to talk to "smtp.my-isp.com" and configure your ! email client to talk to the proxy. The SMTP proxy can live on ! your PC, or on the same machine as the SMTP server, or on a ! different machine entirely, it really doesn't matter. Say it's ! living on your PC, you'd configure your email client to talk to ! "localhost". You can configure the proxy to talk to multiple ! SMTP servers, if you have more than one email account. !
      8. ! ! The web interface. This is a server that runs alongside the POP3 ! proxy, SMTP proxy, and IMAP filter (see below) and lets you ! control it through the web. You can upload emails to it for ! training or classification, query the probabilities database ! ("How many valid emails really do contain the word ! Viagra?"), find particular messages, and most importantly, train ! it on the emails you've received. When you start using the ! system, unless you train it using the Hammie script it will ! classify most things as Unsure, and often make mistakes. But it ! keeps copies of all the emails it's seen, and through the web ! interface you can train it by going through a list of all the ! emails you've received and checking a Ham/Spam box next to each ! one. After training on a few messages (say 20 spams and 20 ! hams), you'll find that it's getting it right most of the ! time. The web training interface automatically checks the ! Ham/Spam boxes according to what it thinks, so all you need to ! do is correct the odd mistake - it's very quick and easy. !
      9. ! ! The Outlook plug-in. For Outlook 2000 and Outlook XP (2002) ! users (not Outlook Express) this lets you manage the whole thing ! from within Outlook. You set up a Ham folder and a Spam folder, ! and train it simply by dragging messages into those folders. ! Alternatively there are buttons to do the same thing. And it ! integrates into Outlook's filtering system to make it easy to ! file all the suspected spam into its own folder, for instance. !
      10. ! ! The Hammie script. This does three jobs: command-line training, ! procmail filtering, and XML-RPC. See below for details of how to ! use Hammie for training, and how to use it as procmail filter. ! Hammie can also run as an XML-RPC server, so that a programmer ! can write code that uses a remote server to classify emails ! programmatically - see hammiesrv.py. !
      11. ! ! The IMAP filter. This is a cross between the POP3 proxy and the ! Outlook plugin. If your mail sits on an IMAP server, you can use ! this to filter your mail. You can designate folders that ! contain mail to train as ham and folders that contain mail to ! train as spam, and the filter does this for you. You can also ! designate folders to filter, along with a folder for messages ! Spambayes is unsure about, and a folder for suspected spam. When ! new mail arrives, the filter will move mail to the appropriate ! location (ham is left in the original folder). !
      !
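
To make the database component in the list above more concrete, here is a deliberately simplified sketch of how per-word counts could be turned into a "98% chance that it's spam" figure. It is an assumption-laden illustration: the function name and the plain ratio formula are hypothetical, and the real Spambayes classifier adjusts and combines word scores in more sophisticated ways than this.

```python
def spam_probability(spamcount: int, hamcount: int, nspam: int, nham: int) -> float:
    """Estimate how spammy a word is from how often it appeared in each pile.

    spamcount/hamcount: trained messages of each kind that contained the word.
    nspam/nham: total trained messages of each kind.
    """
    spam_ratio = spamcount / nspam if nspam else 0.0
    ham_ratio = hamcount / nham if nham else 0.0
    if spam_ratio + ham_ratio == 0:
        return 0.5  # word never seen: no evidence either way
    return spam_ratio / (spam_ratio + ham_ratio)
```

With 100 spams and 100 hams trained, a word seen in 49 spams and 1 ham comes out at roughly 0.98 under this simplified formula, which mirrors the 'Viagra' example in the text.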
    6. ! !
    7. ! Where does all this ! stuff live? ! !

      ! The Hammie script is called hammie.py. The POP3 proxy lives in ! pop3proxy.py, and the smtpproxy lives in smtpproxy.py. The IMAP ! filter lives in imapfilter.py. The Outlook plug-in lives in the ! Outlook2000 subdirectory — see the README.txt in that ! directory for more information on that. !

      !

      ! As well as these components, there's also a whole pile of utility ! scripts, test harnesses and so on — see README.txt and ! TESTING.txt in the spambayes distribution for more information. !

      !
    8. !
    !
  2. ! !
  3. ! Compatibility ! !
      !
    1. ! What version of ! Outlook does it work with? ! !

      ! The most up to date list of known compatible versions of Outlook ! may be found here. !

      !
    2. ! !
    3. ! Does Spambayes ! work with Outlook Express? ! !

      ! Outlook Express isn't a version of Outlook, it's a completely ! separate program (from the same company). Because they give it ! away for free, Outlook Express is a really stripped down program, and it's ! extremely difficult to create a plugin for it. !

      !

      ! You can use pop3proxy and/or imapfilter with Outlook Express, ! however you must have either the alpha 3 release, or a recent CVS ! snapshot in order to do so (alpha 2 does not include all the ! necessary features). Because Outlook Express does not let you ! filter on arbitrary headers (like X-Spambayes-Classification), ! pop3proxy must add the classification to the "To:" line, or the ! "Subject" line. !

      !

      ! Pop3proxy/imapfilter aren't quite as 'transparent' as the Outlook ! plugin, but they're still quite easy to use/setup, and they use the ! same core, so the results will be the same. !

      !
    4. ! !
    5. ! Do I have to have ! Python installed to use Spambayes with Outlook? ! !

      ! You should be able to download the Outlook plugin binary and ! install that, and that's all you need. !

      !
    6. ! !
    7. ! Forget Outlook, what ! clients will Spambayes work with in general? ! !

      ! Spambayes will work with most POP3 or IMAP compatible clients. How ! you implement depends on your local architecture. Users with access ! to procmail can just write a recipe that invokes spambayes like ! this: !

      !     :0fw
      !     | /opt/spambayes/hammiefilter.py
      ! 
      !     Followed by a recipe to check the results and take action:
      ! 
      !     :0
      !     * ^X-Spambayes-Classification: spam
      !     ${MAILDIR}/spam
        
      ! !

      ! !

      ! Emacs and XEmacs both come with VM, one of a choice of several ! Emacs-based mail packages. Emacs is extensible using Emacs Lisp or ! Pymacs. This extensibility allows you to easily segregate your ! incoming mail for training purposes. Here's one such example. If you ! place the following code in your ~/.vm file: !

      !     (defun copy-to-spam ()
      !       (interactive)
      !       (vm-save-message (expand-file-name "~/tmp/newspam"))
      !       (vm-undelete-message 1))
      ! 
      !     (defun copy-to-nonspam ()
      !       (interactive)
      !       (vm-save-message (expand-file-name "~/tmp/newham"))
      !       (vm-undelete-message 1))
      ! 
      !     (define-key vm-mode-map "ls" 'copy-to-spam)
      !     (define-key vm-summary-mode-map "ls" 'copy-to-spam)
      !     (define-key vm-mode-map "lh" 'copy-to-nonspam)
      !     (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)
        
      - - "ls" will save a copy of the current message to ~/tmp/newspam and "lh" - will save a copy of the current message to ~/tmp/newham. You can then - use those files later as arguments to hammie.py for training. - -

      - - -

      - Users limited to POP3/IMAP communications to the server can use the - POP3 - or IMAP - proxy with the Spambayes - source code. -

      -
    8. - -
    9. - We have Outlook 2000 - connecting to an Exchange 2000 server. Will spambayes work for - us? -

      ! It should, yes. There haven't been any problems reported using that ! combination.

      !
    10. !
    !
  4. ! !
  5. ! Using Spambayes ! !
      ! !
    1. ! How do I configure ! Spambayes? ! !

      ! The system is configured through a file called "bayescustomize.ini". ! In here you can configure the name and type of your database, the ! POP3 server(s) you want to proxy to, the ports you want the proxy ! and the web interface to run on, and so on. You can also control ! details like how sure you want the system to be that a message really ! is spam before it marks it as such. The default values for all the ! options, and the documentation for them, all live in Options.py. !

      !

      ! To change an option, create a bayescustomize.ini and add the option ! to that - don't edit Options.py. If you are using the POP3 proxy, ! SMTP proxy or IMAP filter, you can also change most of the options ! you will need to access via the web user interface. You will ! probably find this at http://localhost:8880. To configure the ! Outlook plugin, you should click on the Anti-Spam button on the ! toolbar. !
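
The "defaults in Options.py, overrides in bayescustomize.ini" arrangement described above can be pictured with the following sketch, which uses Python's standard configparser module. The section name, option names and the load_overrides helper are placeholders invented for this example; the real section and option names are the ones documented in Options.py.

```python
import configparser

# Placeholder contents of an override file; real names come from Options.py.
EXAMPLE_INI = """\
[Example Section]
example_option: 0.90
"""


def load_overrides(ini_text: str, defaults: dict) -> dict:
    """Return a copy of `defaults` with any values from `ini_text` layered on top."""
    parser = configparser.ConfigParser()
    parser.read_string(ini_text)
    merged = {section: dict(options) for section, options in defaults.items()}
    for section in parser.sections():
        # Only the options actually present in the file replace the defaults.
        merged.setdefault(section, {}).update(parser[section])
    return merged
```

Only the options you list in the file change; everything else keeps its default, which is why editing Options.py itself is unnecessary.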

      !

      ! To set up the POP3 and SMTP proxies (optional), run: !

      !     pop3proxy.py -b
      ! 
      ! from the command line. The web interface should open in your default ! browser. You need to click on the "Configuration Link" to go to the ! setup page. The minimum you need to do to get started is enter the ! servers and ports information in the POP3 proxy and SMTP proxy ! sections. !

      ! !

      ! The POP3 proxy is then ready for your email client to connect to it ! on port 110 and the SMTP proxy is ready for connections on port 25. ! You now need to configure your email client to talk to the proxies ! instead of the real email servers. Change your equivalent of ! "pop3.my-isp.com" to "localhost" (or to the name of the machine ! you're running the proxy on) in your email client's setup, and do ! the same with your equivalent of "smtp.my-isp.com". Hit "Get new ! email" and look at the headers of the emails (send yourself an email ! if you don't have any!) - there should be an ! X-Spambayes-Classification header there. It probably says "unsure", ! if you haven't done any training yet. You should be able to create a ! mail folder called "Suspected spam" and set up a filtering rule that ! puts emails with an "X-Spambayes-Classification: spam" heading into ! that folder. (Eventually we should publish instructions on how to do ! this in all the popular email clients). !

      !
    2. ! !
    3. ! How do I train ! Spambayes (web method) !

      ! Follow the "Review messages" link and you'll see a list of the ! emails that the system has seen so far. Check the appropriate boxes ! and hit Train. The messages disappear (eventually you'll be able to ! get back to them, for instance to correct any training mistakes) and ! if you go back to the home page you'll see that the "Total emails ! trained" has increased. !

      !

      ! Once you've done this on a few spams and a few hams, you'll find ! that the X-Spambayes-Classification header is getting it right most ! of the time. The more you train it the more accurate it gets. ! There's no need to train it on every message you receive, but you ! should train on a few spams and a few hams on a regular basis. You ! should also try to train it on about the same number of spams as ! hams. !

      !

      ! You can train it on lots of messages in one go by either using the ! Hammie script as explained in the "Command-line training" section, ! or by giving messages to the web interface via the "Train" form on ! the Home page. You can train on individual messages (which is ! tedious) or using mbox files. !

      !
    4. ! !
    5. ! How do I train ! Spambayes (forward/bounce method) !

      ! Alternatively, when you receive an incorrectly classified message, ! you can forward it to the SMTP proxy for training. If the message ! should have been classified as spam, forward or bounce the message ! to spambayes_spam@localhost, and if the message should have been ! classified as ham, forward it to spambayes_ham@localhost. You can ! still review the training through the web interface, if you wish to ! do so. !

      !

      ! Note that you must set (via the web interface) the "add mail id to" ! option in order to use this. You can also use this id to find a ! particular message via the web interface. !

      !

      ! Note that some mail clients (particularly Outlook Express) do not ! forward all headers when you bounce, forward or redirect mail. For ! these clients, you will need to set (via the web interface) the "add ! mail id to" option to body, which will add a unique id to the body ! of each message you receive. !

      !
    6. ! !
    7. ! How do I train Spambayes ! (command line method) ! !

      ! Given a pair of Unix mailbox format files (each message starts with ! a line which begins with 'From '), one containing nothing but spam ! and the other containing nothing but ham, you can train Spambayes ! using a command like: !

      !     hammie.py -g ~/tmp/newham -s ~/tmp/newspam
      ! 
      ! The above command is OS-centric (e.g., UNIX, or Windows command ! prompt). You can also use the web interface for training as ! detailed above. !
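Before handing mbox files to hammie.py it can help to check that each one really is in Unix mailbox format and that the two piles are roughly the same size. The following is a minimal sketch using Python's standard mailbox module, not part of Spambayes itself; the helper names and the demo messages are made up for illustration, and it is written for a current Python 3 rather than the Python 2.2 of this FAQ.

```python
import mailbox
import os
import tempfile


def count_messages(path):
    """Return the number of messages in a Unix mbox file."""
    box = mailbox.mbox(path)
    try:
        return len(box)
    finally:
        box.close()


def write_sample_mbox(path, subjects):
    """Create a small mbox file with one message per subject (demo data only)."""
    box = mailbox.mbox(path)
    try:
        for subject in subjects:
            msg = mailbox.mboxMessage()
            msg["From"] = "sender@example.com"
            msg["Subject"] = subject
            msg.set_payload("body text\n")
            box.add(msg)
        box.flush()
    finally:
        box.close()


# Demo files in a temporary directory stand in for ~/tmp/newham and ~/tmp/newspam.
demo_dir = tempfile.mkdtemp()
ham_path = os.path.join(demo_dir, "newham")
spam_path = os.path.join(demo_dir, "newspam")
write_sample_mbox(ham_path, ["lunch?", "meeting notes"])
write_sample_mbox(spam_path, ["FREE offer", "act now"])

ham_count = count_messages(ham_path)
spam_count = count_messages(spam_path)
print("ham:", ham_count, "spam:", spam_count)
```

If the two counts are wildly different, collect more of the smaller pile before training.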

      !
    8. ! !
    9. ! I just got a spam, but the ! system said it was "unsure". Why couldn't it tell that it was spam ! — it's obvious? ! !

      ! It may be obvious to you, but the classifier only works on the ! information it has been given. Maybe this is "new" (you've never ! seen this particular flavor of spam before), or maybe there aren't ! enough clues in the message which the system is aware of as strong ! spam clues. !

      !
    10. ! !
    11. ! OK, I trained on that ! message. But I just got another one, and the stupid system ! still thinks it's unsure. Why did it ignore me? ! !

      ! It didn't, but you may need to train on a few more of this type of ! message to get it classified as "spam". The classification algorithm ! weights its results based on the number of times it has seen a ! particular clue, so that clues unique to this type of message may ! need a few more instances to become "convincing". !

      !
    12. ! !
    13. ! I've mucked up my ! training and I want to start all over again, but there isn't an option ! for this anywhere. What do I do? ! !

      ! Because training from scratch is a very rare occurrence, and because ! deleting all your training information is something you don't want ! to do by accident, there isn't an option for this. However, you can ! quite simply do this manually. All the training data is stored in a ! file, usually called hammie.db, and if you delete (or rename) this, ! then you will start training from scratch. If you are using the web ! interface for the POP3 proxy, the configuration page tells you what ! this file is called (and where it is) down towards the bottom of the ! page. !

      !
    14. ! !
    15. ! I can't use a web ! browser, so I can't configure pop3proxy/imapfilter. Also: how do I ! configure hammiefilter and the other applications that don't have a ! user interface? ! !

      ! You need to create a configuration file. This is in the 'standard' ! ini file format (originally created for Windows 3.1, I believe). ! You can find documentation on this format in the Python ! ConfigParser doc, but basically, it's just a text file: lines ! beginning with # are comments, sections start with a line like ! "[Section Name]", and options are set out within the appropriate ! section with lines like "opt = val" or "opt: val" (either is okay). ! Whitespace other than line endings is for the most part ignored, so ! you can make it look like whatever you like. You can see a list of ! what a configuration file of all the defaults would look like if you ! execute the following Python commands: !

      !     >>> from spambayes.Options import options
      !     >>> print options.display()
        
      !
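To make the file format concrete, here is a small sketch that parses a sample file with the standard library. It assumes a current Python 3, where the ConfigParser module is named configparser; the section and option names are placeholders, not real Spambayes options.

```python
import configparser

# A sample file in the format described above. "Example Section",
# "first_option" and "second_option" are hypothetical names.
SAMPLE = """\
# Lines beginning with # are comments.
[Example Section]
first_option = 1
second_option: some value
"""

parser = configparser.ConfigParser()
parser.read_string(SAMPLE)

# Both "opt = val" and "opt: val" are accepted; values come back as strings.
first = parser.get("Example Section", "first_option")
second = parser.get("Example Section", "second_option")
print(first, second)
```

For the real section and option names, use the options.display() output rather than guessing.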

      !
    16. ! !
    17. ! That's great, now I ! know what the format looks like, but what options do I need to ! set? ! !

      ! This depends on exactly what you want to do, and which application ! you are intending to use. The easiest thing is to execute the ! following Python commands: !

      !     >>> from spambayes.Options import options
      !     >>> print options.display_full()
        
      This will print out a complete list of the options, including a ! description of the option, and its default value. You can also look up ! a single section, if you know its name:
      ! !
      !     >>> print options.display_full("section_name")
        
      ! ! Or just a single option:
      ! !
      !     >>> print options.display_full("section_name", "option_name")
        
      ! If you want a list of all the sections, you can use this command:
      ! !
      !     >>> print options.sections()
        
      ! ! If you want a list of all the options, you can use this command:
      ! !
      !     >>> print options.options(prepend_section_name=False)
        
      !

      !
    18. ! !
    19. ! I've made a ! configuration file, but Spambayes is ignoring it. Now what? ! !

      ! Spambayes looks for your configuration file in three places - if it ! can't find it, then, obviously, your options will not be loaded. ! The first place that Spambayes checks is the environment variable ! BAYESCUSTOMIZE. You can set this to the path of your configuration ! file, wherever it is, and it will be loaded. You can also specify ! more than one file, separated by the appropriate path separator for ! your platform. This is the recommended method of specifying the ! location of the file, unless you do so via a user interface (as ! provided by the POP3 proxy, the Outlook plugin, and the IMAP ! filter). If Spambayes doesn't find anything in the BAYESCUSTOMIZE ! variable, then it checks the current working directory and your home ! directory for a bayescustomize.ini or .spambayesrc file ! (respectively). !
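The lookup order can be sketched roughly as follows. This mirrors the description above and is not the actual Spambayes code; the function name and its arguments are made up for illustration.

```python
import os


def find_config_files(environ=None, cwd=None, home=None):
    """Return the config files to load, in the order described in this answer."""
    environ = os.environ if environ is None else environ
    cwd = os.getcwd() if cwd is None else cwd
    home = os.path.expanduser("~") if home is None else home

    value = environ.get("BAYESCUSTOMIZE")
    if value:
        # Several files may be listed, separated by the platform path separator.
        return [name for name in value.split(os.pathsep) if name]

    # Fall back to the current working directory, then the home directory.
    candidates = [
        os.path.join(cwd, "bayescustomize.ini"),
        os.path.join(home, ".spambayesrc"),
    ]
    return [name for name in candidates if os.path.exists(name)]


demo = find_config_files(
    environ={"BAYESCUSTOMIZE": os.pathsep.join(["a.ini", "b.ini"])}
)
print(demo)
```

If the returned list is empty on your machine, none of the three places holds a configuration file, which would explain why your options are being ignored.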

      !
    20. ! !
    21. ! Why don't short words or ! long words show up in the clues? ! !

      ! Words less than 3 characters long are skipped, and words greater ! than 12 characters long are converted into a special 'long-word' ! token. These numbers (3 and 12) were determined by brute force ! testing, and produced the best overall results (including compared ! to no upper or lower limits). !
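As a rough illustration of the length limits, here is a simplified sketch. It is not the real Spambayes tokenizer, and the "long-word:N" token format is a placeholder, since this FAQ does not say what the special token looks like.

```python
def tokenize_words(text, min_len=3, max_len=12):
    """Yield lowercase word tokens, applying the length limits described above."""
    for word in text.lower().split():
        if len(word) < min_len:
            continue  # too short: skipped entirely
        if len(word) > max_len:
            # Placeholder token; the real format is an assumption here.
            yield "long-word:%d" % len(word)
        else:
            yield word


tokens = list(tokenize_words("Go buy supercalifragilistic pills now"))
print(tokens)
```

The two-letter word never reaches the classifier, and the very long word is replaced by a single stand-in token, which is why neither shows up as itself in the clues.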

      !
    22. ! !
    23. ! Why is the enable filter ! button grayed out in Outlook? ! !

      ! You need to have done these things to enable that button: !

        !
      1. ! Trained at least 5 ham and 5 spam !
      2. !
      3. ! Set at least one folder to watch !
      4. !
      5. ! Set folders to move spam to, and to move unsures to !
      6. !
      7. ! Changed the action to "copy" or "move", rather than "untouched" !
      8. !
      !

      !
    24. ! !
    25. ! Is there anything else I ! should know? ! !

      ! While Spambayes does an excellent job of classifying incoming mail, ! it is only as good as the data on which it was trained. Here are ! some tips to help you create a good training set: !

      !
        !
      • ! Don't use old mail. The characteristics of your email change over ! time, sometimes subtly, sometimes dramatically, so it's best to ! use very recent mail to train Spambayes. If you've abandoned an ! email address in the past because it was getting spammed heavily, ! there are probably some clues in mail sent to your old address ! which would bias Spambayes. !
      • !
      • ! Check and recheck your training collections. While you are ! manually classifying mail as spam or ham, it's easy to make a ! mistake and toss a message or ten in the wrong file. Such ! miscategorized mail will throw off the classifier. !
      • !
      !
    26. !
    ! !
  6. ! Development ! !
      !
    1. ! Hey! Why don't you ! implement cool tokenizer trick X? I think it would really foil those ! spammers! ! !

      ! Have you run your tokenizer trick against a set of messages to see ! if it actually works? Many times what seems like a good idea turns ! out not to help much, and sometimes even hurts. If you have a good ! idea, you've run it against a batch of messages and can prove that ! it helps, paste the code for your technique and the proof to the ! mailing list. If you're not a coder, but are really keen on your ! idea, post a feature request on the project page, and wait for ! someone else to code it for you (but make sure you do some testing ! when it's done). Otherwise, you will likely get a message from Tim ! Peters about why you need to test your idea :) Note that as a ! general rule, we've found that with the tokenizer, "stupid beats ! smart" — that is, very specialized tokenizer behavior usually ! produces worse results than a more general approach that just ! generates tokens and throws them at the classifier. !

      !
    2. ! !
    3. ! This software is great! ! I want to implement it for all my users. Are there plans to develop a ! server-side spambayes solution? ! !

      ! The problem with a server-side solution is that everyone has a ! different idea of what is spam - that's the whole strength of the ! bayesian-style filtering concept. If you are certain that ! all of your users would agree on what is spam and what is ! not, then this might work for you, but otherwise you really have to ! have individual databases for each user. Either way, you should be ! able to modify spambayes easily enough to fit into your setup. ! Please let the list know if you do have success in this area, and ! we'll update this answer. !

      !
    4. ! !
    5. ! Forget tokenizing words - ! you should use character n-grams! ! !

      ! This was quite carefully tested. Character 3-grams gave five times ! as many false positives, and twice as many false negatives as ! splitting on whitespace (words). Character 5-grams came fairly close ! to words with false positives, but the number of false negatives was ! worse than with 3-grams. n-grams also create many more unique ! tokens, which means much slower operation. In addition, it's much ! harder to figure out why a message scored as it did with ! n-grams. On the other hand, words are easy to understand. There ! was, however, one area where n-grams were much better: detecting ! spam in Asian languages. Since a 'word' in an Asian language message ! ends up being an entire line, words don't work very well at all. !
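To see why n-grams produce so many more unique tokens, compare the two approaches on a short string. This is only a toy comparison with made-up helper names; it says nothing about the false positive and false negative rates, which came from testing on real mail.

```python
def word_tokens(text):
    """Split on whitespace, as the word-based tokenizer does (roughly)."""
    return text.split()


def char_ngrams(text, n):
    """Return every run of n consecutive characters, spaces included."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


sample = "cheap pills cheap offer"
words = set(word_tokens(sample))
trigrams = set(char_ngrams(sample, 3))
print(len(words), "unique words,", len(trigrams), "unique 3-grams")
```

Even on four words the 3-gram set is several times larger than the word set, and a fragment like "ap " is much harder to interpret as a clue than "cheap".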

      !
    6. ! !
    7. ! The clues for my mail are all ! in lower case, but "FREE" is a much better clue than "free". Why do ! you force everything into lower case? ! !

      ! This was very carefully considered. On the one hand, removing ! case does hide information (and we're not really sure what it does ! to non-English languages), but on the other hand, keeping case makes the ! database a lot bigger, and requires more training. In the end, ! testing with case removed resulted in no change in the false ! positive rate, and a small reduction in the false negative rate, so ! that's what we do. There is one exception: we keep case in subject ! lines, because testing showed an improvement if we did that. !

      !
    8. !
    !
--- 1,867 ---- ! Title: SpamBayes: Frequently Asked Questions ! Author-Email: spambayes-dev@python.org Author: spambayes !

! Frequently Asked Questions !

!
    !
  1. Overview
    !
      !
    1. ! So what is Spambayes? !
    2. !
    3. ! What do I need to install ! Spambayes? !
    4. !
    5. ! Is there a "ten thousand foot ! view" that shows how this thing works? !
    6. !
    7. ! Where does all this stuff live? !
    8. !
  2. !
  3. Compatibility
    !
      !
    1. ! What version of Outlook does it ! work with? !
    2. !
    3. ! Does Spambayes work with Outlook ! Express? !
    4. !
    5. ! Do I have to have python installed to ! use Spambayes with Outlook? !
    6. !
    7. ! Forget Outlook, what clients will ! Spambayes work with in general? !
    8. !
    9. ! We have Outlook 2000 connecting to an ! Exchange 2000 server. Will spambayes work for us? !
    10. !
  4. !
  5. Using Spambayes
    !
      !
    1. ! How do I configure Spambayes? !
    2. !
    3. ! How do I train Spambayes (web ! method) !
    4. !
    5. ! How do I train Spambayes ! (forward/bounce method) !
    6. !
    7. ! How do I train Spambayes (command line ! method) !
    8. !
    9. ! I just got a spam, but the system said it ! was "unsure". Why couldn't it tell that it ! was spam - it's obvious? !
    10. !
    11. ! OK, I trained on that message. But I ! just got another one, and the stupid system still ! thinks it's unsure. Why did it ignore me??? !
    12. !
    13. ! I've mucked up my training and ! I want to start all over again, but there isn't an ! option for this anywhere. What do I do? !
    14. !
    15. ! I can't use a web browser, so I ! can't configure pop3proxy/imapfilter.
      ! Also: how do I configure hammiefilter and the other ! applications that don't have a user interface?
      !
    16. !
    17. ! That's great, now I know what ! the format looks like, but what options do I need to ! set? !
    18. !
    19. ! I've made a configuration ! file, but Spambayes is ignoring it. Now what? !
    20. !
    21. ! Why don't short words or long ! words show up in the clues? !
    22. !
    23. ! Is there anything else I should ! know? !
    24. !
  6. !
  7. Development
    !
      !
    1. ! Hey! Why don't you implement cool ! tokenizer trick X? I think it would really foil those ! spammers!
    2. ! This software is great! I want to ! implement it for all my users. Are there plans to develop a ! server-side spambayes solution?
    3. ! Forget tokenising words - you should use ! character n-grams!
    4. ! The clues for my mail are all in lower ! case, but "FREE" is a much better clue than ! "free". Why do you force everything into lower ! case?
    5. ! Why don't you provide the ability to ! bounce spam back to the sender?
    !
  8. !
!

! If you have any suggestions about other questions and answers ! that should be included here, please mail the list ! with them. !

!

! 1. Overview !

!

! 1a. So what is Spambayes? !

!

! Spambayes is a tool used to segregate unwanted mail (spam) from ! the mail you want (ham). Before Spambayes can be your spam filter ! of choice you need to train it on representative samples of email ! you receive. After it's been trained, you use Spambayes to ! classify new mail according to its spamminess and hamminess ! qualities. !

!

! To train Spambayes (which you don't need to do if you're ! going to be using the POP3 proxy to classify messages, but ! you'll get better results from the outset if you do) you need ! to save your incoming email for a while, segregating it into two ! piles, known spam and known ham (ham is our nickname for good ! mail). It's best to train on recent email, because your ! interests and the nature of what spam looks like change over ! time. Once you've collected a fair portion of each (anything ! is better than nothing, but it helps to have a couple hundred of ! each), you can tell Spambayes, "Here's my ham and my ! spam". It will then process that mail and save information ! about different patterns which appear in ham and spam. That ! information is then used during the filtering stage. See the ! "Command-line training" section below for details. !

!

! When Spambayes filters your email, it compares each unclassified ! message against the information it saved from training and makes ! a decision about whether it thinks the message qualifies as ham ! or spam, or if it's unsure about how to classify the message. ! It then adds its classification to the message, either by adding ! a header (X-Spambayes-Classification: spam|ham|unsure), modifying ! the To: or Subject: headers, or adding a "Spam" field ! to the message. Depending on which Spambayes application you are ! using, it may then filter this message for you, or you can set up ! your own filters (to file away suspected spam into its own mail ! folder, for example). !

!

! 1b. What do I need to install ! Spambayes? !

!

! Unless you are using the Outlook plugin, you must have a recent ! version of Python installed on your computer, version 2.2 or ! later. (Don't ask about backporting it to earlier versions of ! Python. It's almost a certainty this won't happen.) If ! you need to install Python on your system, check the Python ! download page for the version appropriate to your computer: !

!

! http://www.python.org/download/ !

!

! You also need version 2.4.3 or above of the Python ! "email" package. If you're running Python 2.2.2 or ! above, then you already have this. If not, you can download it ! from http://mimelib.sf.net ! and install it - unpack the archive, cd to the email-2.4.3 ! directory and type "python setup.py install" (YMMV on ! different platforms). This will install it into your Python ! site-packages directory. You'll also need to move aside the ! standard "email" library - go to your Python ! "Lib" directory and rename "email" to ! "email_old". !

!

! 1c. Is there a "ten thousand foot ! view" that shows how this thing works? !

!

! There are eight main components to the Spambayes system: !

!
    !
  1. A database. Loosely speaking, this is a collection of words ! and associated spam and ham probabilities. The database says ! "If a message contains the word 'Viagra' then ! there's a 98% chance that it's spam, and a 2% chance that ! it's ham." This database is created by training - you ! give it messages, tell it whether those messages are ham or spam, ! and it adjusts its probabilities accordingly. How to train it is ! covered below. By default it lives in a file called ! "hammie.db" or (for the Outlook plugin) ! "default_bayes_database". !
  2. !
  3. The tokeniser/classifier. This is the core engine of the ! system. The tokenizer splits emails into tokens (words, roughly ! speaking), and the classifier looks at those tokens to determine ! whether the message looks like spam or not. You don't use the ! tokeniser/classifier directly - it powers the other parts of the ! system. !
  4. !
  5. The POP3 proxy. This sits between your email client (Eudora, ! Outlook Express, etc) and your incoming email server, and adds ! the classification header to emails as you download them. A ! typical user's email setup looks like this: !
    !    +-----------------+                              +-------------+
    !    | Outlook Express |      Internet or intranet    |             |
    !    |  (or similar)   | <--------------------------> | POP3 server |
    !    |                 |                              |             |
    !    +-----------------+                              +-------------+
    ! 
    The POP3 server runs either at your ISP for internet mail, or ! somewhere on your internal network for corporate mail. The POP3 proxy ! sits in the middle and adds the classification header as you retrieve ! your email: !
    !    +-----------------+        +------------+        +-------------+
    !    | Outlook Express |        | Spambayes  |        |             |
    !    |  (or similar)   | <----> | POP3 proxy | <----> | POP3 server |
    !    |                 |        |            |        |             |
    !    +-----------------+        +------------+        +-------------+
    ! 
    So where you currently have your email client configured to talk ! to say, "pop3.my-isp.com", you instead configure the ! proxy to talk to "pop3.my-isp.com" and configure ! your email client to talk to the proxy. The POP3 proxy can live ! on your PC, or on the same machine as the POP3 server, or on a ! different machine entirely, it really doesn't matter. Say ! it's living on your PC, you'd configure your email ! client to talk to "localhost". You can configure the ! proxy to talk to multiple POP3 servers, if you have more than ! one email account. !
  6. !
  7. The SMTP proxy. This sits between your email client (Eudora, ! Outlook Express, etc) and your outgoing email server. Any mail ! sent to spambayes_spam@localhost or spambayes_ham@localhost is ! intercepted and trained appropriately. A typical user's email ! setup looks like this: !
    !    +-----------------+                              +-------------+
    !    | Outlook Express |      Internet or intranet    |             |
    !    |  (or similar)   | <--------------------------> | SMTP server |
    !    |                 |                              |             |
    !    +-----------------+                              +-------------+
    ! 
    The SMTP server runs either at your ISP for internet mail, or ! somewhere on your internal network for corporate mail. The SMTP proxy ! sits in the middle and checks for mail to train on as you send your ! email: !
    !    +-----------------+        +------------+        +-------------+
    !    | Outlook Express |        | Spambayes  |        |             |
    !    |  (or similar)   | <----> | SMTP proxy | <----> | SMTP server |
    !    |                 |        |            |        |             |
    !    +-----------------+        +------------+        +-------------+
    ! 
    So where you currently have your email client configured to talk ! to say, "smtp.my-isp.com", you instead configure the ! proxy to talk to "smtp.my-isp.com" and configure ! your email client to talk to the proxy. The SMTP proxy can live ! on your PC, or on the same machine as the SMTP server, or on a ! different machine entirely, it really doesn't matter. Say ! it's living on your PC, you'd configure your email ! client to talk to "localhost". You can configure the ! proxy to talk to multiple SMTP servers, if you have more than ! one email account. !
  8. !
  9. The web interface. This is a server that runs alongside the ! POP3 proxy, SMTP proxy, and IMAP filter (see below) and lets you ! control it through the web. You can upload emails to it for ! training or classification, query the probabilities database ! ("How many of my emails really do contain the word ! Viagra?"), find particular messages, and most importantly, ! train it on the emails you've received. When you start using ! the system, unless you train it using the Hammie script it will ! classify most things as Unsure, and often make mistakes. But it ! keeps copies of all the emails it's seen, and through the web ! interface you can train it by going through a list of all the ! emails you've received and checking a Ham/Spam box next to ! each one. After training on a few messages (say 20 spams and 20 ! hams), you'll find that it's getting it right most of the ! time. The web training interface automatically checks the ! Ham/Spam boxes according to what it thinks, so all you need to do ! is correct the odd mistake - it's very quick and easy. !
  10. !
  11. The Outlook plug-in. For Outlook 2000 and Outlook XP (2002) ! users (not Outlook Express) this lets you manage the whole thing ! from within Outlook. You set up a Ham folder and a Spam folder, ! and train it simply by dragging messages into those folders. ! Alternatively there are buttons to do the same thing. And it ! integrates into Outlook's filtering system to make it easy to ! file all the suspected spam into its own folder, for instance. !
  12. !
  13. The Hammie script. This does three jobs: command-line ! training, procmail filtering, and XML-RPC. See below for details ! of how to use Hammie for training, and how to use it as procmail ! filter. Hammie can also run as an XML-RPC server, so that a ! programmer can write code that uses a remote server to classify ! emails programmatically - see hammiesrv.py. !
  14. !
  15. The IMAP filter. This is a cross between the POP3 proxy and ! the Outlook plugin. If your mail sits on an IMAP server, you can ! use this to filter your mail. You can designate folders that ! contain mail to train as ham and folders that contain mail to ! train as spam, and the filter does this for you. You can also ! designate folders to filter, along with a folder for messages ! Spambayes is unsure about, and a folder for suspected spam. When ! new mail arrives, the filter will move mail to the appropriate ! location (ham is left in the original folder). !
  16. !
!
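To give a feel for what "a collection of words and associated spam and ham probabilities" means, here is a deliberately simplified toy. It is not the Spambayes classifier and does not use its scoring method; the class and method names are made up, and the probability is just a ratio of how often a word was seen in each pile.

```python
from collections import Counter


class ToyDatabase:
    """Toy word-count store; NOT the real Spambayes classifier."""

    def __init__(self):
        self.spam_counts = Counter()
        self.ham_counts = Counter()
        self.nspam = 0
        self.nham = 0

    def train(self, text, is_spam):
        """Record which words appeared in one spam or ham message."""
        words = set(text.lower().split())
        if is_spam:
            self.spam_counts.update(words)
            self.nspam += 1
        else:
            self.ham_counts.update(words)
            self.nham += 1

    def spam_probability(self, word):
        """Crude estimate that a message containing `word` is spam."""
        spam_ratio = self.spam_counts[word] / self.nspam if self.nspam else 0.0
        ham_ratio = self.ham_counts[word] / self.nham if self.nham else 0.0
        if spam_ratio + ham_ratio == 0:
            return 0.5  # never seen: no evidence either way
        return spam_ratio / (spam_ratio + ham_ratio)


db = ToyDatabase()
db.train("buy viagra now", is_spam=True)
db.train("viagra cheap", is_spam=True)
db.train("lunch now", is_spam=False)
for word in ("viagra", "now", "lunch", "unknown"):
    print(word, round(db.spam_probability(word), 2))
```

The point of the toy is only that training shifts the per-word numbers: a word seen solely in spam scores high, a word seen in both piles lands in between, and an unseen word tells the system nothing.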

! 1d. Where does all this stuff live? !

!

! The Hammie script is called hammie.py. The POP3 proxy lives in ! pop3proxy.py, and the smtpproxy lives in smtpproxy.py. The IMAP ! filter lives in imapfilter.py. The Outlook plug-in lives in the ! Outlook2000 subdirectory — see the README.txt in that ! directory for more information on that. !

!

! As well as these components, there's also a whole pile of ! utility scripts, test harnesses and so on — see README.txt ! and TESTING.txt in the spambayes distribution for more ! information. !

!

! 2. Compatibility !

!

! 2a. What version of Outlook does it ! work with? !

!

! The most up to date list of known compatible versions of Outlook ! may be found here. !

!

! 2b. Does Spambayes work with Outlook ! Express? !

!

! Outlook Express isn't a version of Outlook, it's a ! completely separate program (from the same company). Because they ! give it away for free, OE is a really stripped down program, and ! it's extremely difficult to create a plugin for it. !

!

! You can use pop3proxy and/or imapfilter with Outlook Express, ! however you must have either the alpha 3 release, or a recent CVS ! snapshot in order to do so (alpha 2 does not include all the ! necessary features). Because Outlook Express does not let you ! filter on arbitrary headers (like X-Spambayes-Classification), ! pop3proxy must add the classification to the "To:" ! line, or the "Subject" line. !

!

! Pop3proxy/imapfilter aren't quite as 'transparent' as ! the Outlook plugin, but they're still quite easy to ! use/setup, and they use the same core, so the results will be the ! same. !

!

! 2c. Do I have to have Python installed to use ! Spambayes with Outlook? !

!

! You should be able to download the Outlook plugin binary and ! install that, and that's all you need. !

!

! 2d. Forget Outlook, what clients will ! Spambayes work with in general? !

!

! Spambayes will work with most POP3 or IMAP compatible clients. ! How you implement depends on your local architecture. Users with ! access to procmail can just write a recipe that invokes spambayes ! like this: !

!
!  :0fw 
!  | /opt/spambayes/hammiefilter.py
  
!

! Followed by a recipe to check the results and take action: !

!
!  :0
!  * ^X-Spambayes-Classification: spam 
!  ${MAILDIR}/spam
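If you don't have procmail, the same header check can be done from a script. The following is a minimal sketch using the standard email package, assuming the message already carries the X-Spambayes-Classification header; the function name and the folder names it returns are made up for illustration.

```python
from email import message_from_string


def folder_for(raw_message):
    """Pick a destination folder from the X-Spambayes-Classification header."""
    msg = message_from_string(raw_message)
    label = (msg.get("X-Spambayes-Classification") or "").strip().lower()
    if label.startswith("spam"):
        return "spam"
    if label.startswith("unsure"):
        return "unsure"
    # Ham, or no header at all: leave the message where it is.
    return "inbox"


raw = (
    "From: someone@example.com\n"
    "X-Spambayes-Classification: spam\n"
    "Subject: hello\n"
    "\n"
    "body\n"
)
print(folder_for(raw))
```

Treating a missing header as "inbox" is the cautious choice: a message that was never classified should not be filed as spam.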
  

! Emacs and XEmacs both come with VM, one of a choice of several ! Emacs-based mail packages. Emacs is extensible using Emacs Lisp ! or Pymacs. This extensibility allows you to easily segregate your ! incoming mail for training purposes. Here's one such example. ! If you place the following code in your ~/.vm file:

!
!  (defun copy-to-spam ()
!    (interactive)
!    (vm-save-message (expand-file-name "~/tmp/newspam"))
!    (vm-undelete-message 1))
! 
!  (defun copy-to-nonspam ()
!    (interactive)
!    (vm-save-message (expand-file-name "~/tmp/newham"))
!    (vm-undelete-message 1))
! 
!  (define-key vm-mode-map "ls" 'copy-to-spam)
!  (define-key vm-summary-mode-map "ls" 'copy-to-spam)
!  (define-key vm-mode-map "lh" 'copy-to-nonspam)
!  (define-key vm-summary-mode-map "lh" 'copy-to-nonspam)
  
!

! "ls" will save a copy of the current message to ! ~/tmp/newspam and "lh" will save a copy of the current ! message to ~/tmp/newham. You can then use those files later as ! arguments to hammie.py for training. !

!

! Users limited to POP3/IMAP communications to the server can use ! the POP3 ! or IMAP ! proxy with the Spambayes ! source code. !

!

! 2e. We have Outlook 2000 connecting to an ! Exchange 2000 server. Will spambayes work for us? !

!

! It should, yes. There haven't been any problems reported ! using that combination. !

!

! 3. Using Spambayes !

!

! 3a. How do I configure Spambayes? !

!

! The system is configured through a file called ! "bayescustomize.ini". In here you can configure the ! name and type of your database, the POP3 server(s) you want to ! proxy to, the ports you want the proxy and the web interface to ! run on, and so on. You can also control details like how sure you ! want the system to be that a message really is spam before it marks ! it as such. The default values for all the options, and the ! documentation for them, all live in Options.py. !

!

! To change an option, create a bayescustomize.ini and add the ! option to that - don't edit Options.py. If you are using the ! POP3 proxy, SMTP proxy or IMAP filter, you can also change most ! of the options you will need via the web user ! interface. You will probably find this at ! <http://localhost:8880>. To configure the Outlook plugin, ! you should click on the Anti-Spam button on the toolbar. !

!

! To set up the POP3 and SMTP proxies (optional), run: !

!
!   pop3proxy.py -b
! 
!

! from the command line. The web interface should open in your ! default browser. You need to click on the "Configuration ! Link" to go to the setup page. The minimum you need to do to ! get started is enter the servers and ports information in the ! POP3 proxy and SMTP proxy sections. !

!

! The POP3 proxy is then ready for your email client to connect to ! it on port 110 and the SMTP proxy is ready for connections on ! port 25. You now need to configure your email client to talk to ! the proxies instead of the real email servers. Change your ! equivalent of "pop3.my-isp.com" to ! "localhost" (or to the name of the machine you're ! running the proxy on) in your email client's setup, and do ! the same with your equivalent of "smtp.my-isp.com". Hit ! "Get new email" and look at the headers of the emails ! (send yourself an email if you don't have any!) - there ! should be an X-Spambayes-Classification header there. It probably ! says "unsure", if you haven't done any training ! yet. You should be able to create a mail folder called ! "Suspected spam" and set up a filtering rule that puts ! emails with an "X-Spambayes-Classification: spam" ! heading into that folder. (Eventually we should publish ! instructions on how to do this in all the popular email clients). !
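
The filtering rule described above can be sketched in Python for anyone whose mail client lacks such rules. This is only an illustration: the function name folder_for and the default folder name "Inbox" are made up here, and the only thing taken from the FAQ is the X-Spambayes-Classification header and the "Suspected spam" folder.

```python
from email import message_from_string


def folder_for(raw_message, spam_folder="Suspected spam", default_folder="Inbox"):
    """Choose a folder name from the X-Spambayes-Classification header.

    Hypothetical helper: the folder names are assumptions for illustration.
    """
    msg = message_from_string(raw_message)
    verdict = (msg.get("X-Spambayes-Classification") or "").strip().lower()
    # Only messages the proxy marked as spam are diverted; "ham", "unsure"
    # and messages without the header stay in the default folder.
    if verdict.startswith("spam"):
        return spam_folder
    return default_folder


SAMPLE = (
    "From: someone@example.com\n"
    "Subject: hello\n"
    "X-Spambayes-Classification: spam\n"
    "\n"
    "body text\n"
)

if __name__ == "__main__":
    print(folder_for(SAMPLE))
```

Messages that come back as "unsure" are deliberately left alone here, since those are the ones worth reviewing and training on.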

!

! 3b. How do I train Spambayes (web ! method) !

!

! Follow the "Review messages" link and you'll see a ! list of the emails that the system has seen so far. Check the ! appropriate boxes and hit Train. The messages disappear ! (eventually you'll be able to get back to them, for instance ! to correct any training mistakes) and if you go back to the home ! page you'll see that the "Total emails trained" has ! increased. !

!

! Once you've done this on a few spams and a few hams, ! you'll find that the X-Spambayes-Classification header is ! getting it right most of the time. The more you train it the more ! accurate it gets. There's no need to train it on every ! message you receive, but you should train on a few spams and a ! few hams on a regular basis. You should also try to train it on ! about the same number of spams as hams. !

!

! You can train it on lots of messages in one go by either using ! the Hammie script as explained in the "Command-line ! training" section, or by giving messages to the web ! interface via the "Train" form on the Home page. You ! can train on individual messages (which is tedious) or using mbox ! files. !

!

! 3c. How do I train Spambayes ! (forward/bounce method) !

!

! Alternatively, when you receive an incorrectly classified ! message, you can forward it to the SMTP proxy for training. If ! the message should have been classified as spam, forward or ! bounce the message to spambayes_spam@localhost, and if the ! message should have been classified as ham, forward it to ! spambayes_ham@localhost. You can still review the training ! through the web interface, if you wish to do so. !

!

! Note that you must set (via the web interface) the "add mail ! id to" option in order to use this. You can also use this id ! to find a particular message via the web interface. !

!

! Note that some mail clients (particularly Outlook Express) do not ! forward all headers when you bounce, forward or redirect mail. ! For these clients, you will need to set (via the web interface) ! the "add mail id to" option to body, which will add a ! unique id to the body of each message you receive. !

!

! 3d. How do I train Spambayes (command line ! method) !

!

! Given a pair of Unix mailbox format files (each message starts ! with a line which begins with 'From '), one containing ! nothing but spam and the other containing nothing but ham, you ! can train Spambayes using a command like: !

!
!   hammie.py -g ~/tmp/newham -s ~/tmp/newspam
  
+

+ The above command is OS-centric (eg. UNIX, or Windows command + prompt). You can also use the web interface for training as + detailed above. +
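
Before training from a pair of mbox files it can help to check how many messages each one holds, since the advice above is to train on roughly equal numbers of ham and spam. A minimal sketch using the standard library mailbox module follows; the helper names count_messages and training_counts are invented for this example, and the paths are simply the ones used in the hammie.py command above.

```python
import mailbox


def count_messages(path):
    """Return the number of messages in a Unix mbox file."""
    # create=False makes a missing file an error instead of silently
    # creating an empty mailbox.
    box = mailbox.mbox(path, create=False)
    try:
        return len(box)
    finally:
        box.close()


def training_counts(ham_path, spam_path):
    """Return (ham_count, spam_count) so the balance can be inspected."""
    return count_messages(ham_path), count_messages(spam_path)


if __name__ == "__main__":
    import os
    import sys

    ham = os.path.expanduser("~/tmp/newham")
    spam = os.path.expanduser("~/tmp/newspam")
    if os.path.exists(ham) and os.path.exists(spam):
        print(training_counts(ham, spam))
    else:
        sys.stderr.write("training mailboxes not found\n")
```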

+

+ 3e. I just got a spam, but the system said it + was "unsure". Why couldn't it tell that it was spam + — it's obvious? +

+

+ It may be obvious to you, but the classifier only works on the + information it has been given. Maybe this is "new" + (you've never seen this particular flavour of spam before), + or maybe there aren't enough clues in the message which the + system is aware of as strong spam clues. +

+

+ 3f. OK, I trained on that message. But I + just got another one, and the stupid system still thinks + it's unsure. Why did it ignore me? +

+

+ It didn't, but you may need to train on a few more of this + type of message to get it classified as "spam". The + classification algorithm weights its results based on the number + of times it has seen a particular clue, so that clues unique to + this type of message may need a few more instances to become + "convincing". +
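
One way to picture why a clue seen only once is not yet "convincing" is a Robinson-style adjustment, which pulls the estimate for a rarely seen clue toward a neutral value. The sketch below is an assumption made for illustration, not a copy of the Spambayes classifier: the function name and the constants s (strength of the prior) and x (assumed probability for an unknown word) are illustrative.

```python
def adjusted_spam_prob(spam_count, ham_count, nspam, nham, s=0.45, x=0.5):
    """Estimate how spammy a clue is, discounted by how rarely it was seen.

    spam_count / ham_count: trained messages of each kind containing the clue.
    nspam / nham: total trained spam and ham messages.
    s, x: illustrative constants (assumptions, not taken from the FAQ).
    """
    n = spam_count + ham_count
    spam_ratio = spam_count / nspam if nspam else 0.0
    ham_ratio = ham_count / nham if nham else 0.0
    if n == 0 or spam_ratio + ham_ratio == 0:
        return x  # never seen: stay neutral
    p = spam_ratio / (spam_ratio + ham_ratio)
    # With small n the neutral prior x dominates; with large n the
    # observed ratio p dominates.
    return (s * x + n * p) / (s + n)


if __name__ == "__main__":
    for seen in (0, 1, 5):
        print(seen, adjusted_spam_prob(seen, 0, 10, 10))
```

Under these assumptions a clue seen once, only in spam, scores lower than the same clue seen five times, which matches the behaviour described in the answer above.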

+

+ 3g. I've mucked up my training and I + want to start all over again, but there isn't an option for + this anywhere. What do I do? +

+

+ Because training from scratch is a very rare occurrence, and + because deleting all your training information is something you + don't want to do by accident, there isn't an option for + this. However, you can quite simply do this manually. All the + training data is stored in a file, usually called hammie.db, and + if you delete (or rename) this, then you will start training from + scratch. If you are using the web interface for the POP3 proxy, + the configuration page tells you what this file is called (and + where it is) down towards the bottom of the page. +

+

+ 3h. I can't use a web browser, so I + can't configure pop3proxy/imapfilter.
+ Also: how do I configure hammiefilter and the other applications + that don't have a user interface?
+

+

+ You need to create a configuration file. This is in the + 'standard' ini file format (originally created for + Windows 3.1, I believe). You can find documentation on this + format in the Python + ConfigParser doc, but basically, it's just a text file: + lines beginning with # are comments, sections start with a line + like "[Section Name]", and options are set out within + the appropriate section with lines like "opt = val" or + "opt: val" (either is ok). Whitespace other than line + endings is for the most part ignored, so you can make it look + like whatever you like. You can see a list of what a + configuration file of all the defaults would look like if you + execute the following Python commands: +

+
+   >>> from spambayes.Options import options
+   >>> print options.display()
+ 
+
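
The ini format described above (comments starting with #, "[Section Name]" headers, and either "opt = val" or "opt: val") can be tried out with the standard library parser. This sketch uses the Python 3 module name configparser (the module was called ConfigParser in Python 2); the section and option names are placeholders, not real Spambayes options.

```python
import configparser

# Placeholder section and option names, chosen only to show the syntax.
SAMPLE = """\
# lines beginning with # are comments
[Example Section]
first_option = 1
second_option: some value
"""


def load_options(text):
    """Parse ini text into {section: {option: value}} with string values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser.items(name)) for name in parser.sections()}


if __name__ == "__main__":
    print(load_options(SAMPLE))
```

Both delimiter styles end up in the same dictionary, which is why the FAQ says either is ok.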

+ 3i. That's great, now I know what the + format looks like, but what options do I need to set? +

+

+ This depends on exactly what you want to do, and which + application you are intending to use. The easiest thing is to + execute the following Python commands: +

+
+   >>> from spambayes.Options import options
+   >>> print options.display_full()
+ 
+

This will print out a complete list of the options, including a ! description of the option, and its default value. You can also ! look up a single section, if you know its name: !

!
!   >>> print options.display_full("section_name")
  
!

! Or just a single option: !

!
!   >>> print options.display_full("section_name", "option_name")
  
!

! If you want a list of all the sections, you can use this command: !

!
!   >>> print options.sections()
  
!

! If you want a list of all the options, you can use this command: !

!
!   >>> print options.options(prepend_section_name=False)
  
!

! 3j. I've made a configuration file, ! but Spambayes is ignoring it. Now what? !

!

! Spambayes looks for your configuration file in three places - if ! it can't find it, then, obviously, your options will not be ! loaded. The first place that Spambayes checks is the environment ! variable BAYESCUSTOMIZE. You can set this to the path of your ! configuration file, wherever it is, and it will be loaded. You ! can also specify more than one file, separated by the appropriate ! path separator for your platform. This is the recommended method ! of specifying the location of the file, unless you do so via a ! user interface (as provided by the POP3 proxy, the Outlook ! plugin, and the IMAP filter). If Spambayes doesn't find ! anything in the BAYESCUSTOMIZE variable, then it checks the ! current working directory and your home directory for a ! bayescustomize.ini or .spambayesrc file (respectively). !
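
The lookup order described above can be sketched as follows. This is an illustration of the order only, not the code Spambayes runs: the function name candidate_config_files is made up, and whether the real code also checks that each file exists is not stated in the FAQ.

```python
import os


def candidate_config_files(environ, cwd, home):
    """Return the configuration files to try, in the order described above."""
    from_env = environ.get("BAYESCUSTOMIZE")
    if from_env:
        # Several files may be listed, separated by the platform's
        # path separator (":" on Unix, ";" on Windows).
        return [path for path in from_env.split(os.pathsep) if path]
    # Otherwise fall back to the working directory and the home directory.
    return [
        os.path.join(cwd, "bayescustomize.ini"),
        os.path.join(home, ".spambayesrc"),
    ]


if __name__ == "__main__":
    print(candidate_config_files(os.environ, os.getcwd(), os.path.expanduser("~")))
```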

!

! 3k. Why don't short words or long words ! show up in the clues? !

!

! Words less than 3 characters long are skipped, and words greater ! than 12 characters long are converted into a special ! 'long-word' token. These numbers (3 and 12) were ! determined by brute force testing, and produced the best overall ! results (including compared to no upper or lower limits). !
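
The length rule above can be sketched like this. It is a simplification under stated assumptions: the real tokenizer does much more than split on whitespace, and the literal token text "long-word" is taken from the wording of this answer rather than from the source code.

```python
def tokenize_words(text, min_len=3, max_len=12):
    """Yield word tokens, applying the length limits described above."""
    for word in text.split():
        if len(word) < min_len:
            continue  # too short to be a useful clue
        if len(word) > max_len:
            yield "long-word"  # placeholder for the special token
        else:
            yield word


if __name__ == "__main__":
    print(list(tokenize_words("an offer of unbelievablesavings now")))
```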

!

! Why is the enable filter button grayed ! out in Outlook? !

!

! You need to have done these things to enable that button: !

!
!
!   1. Trained at least 5 ham and 5 spam
!   2. Set at least one folder to watch
!   3. Set folders to move spam to, and to move unsures to
!   4. Changed the action to "copy" or "move", rather than "untouched"
!

! 3l. Is there anything else I should know? !

!

! While Spambayes does an excellent job of classifying incoming ! mail, it is only as good as the data on which it was trained. ! Here are some tips to help you create a good training set: !

!
!
!   • Don't use old mail. The characteristics of your email
!     change over time, sometimes subtly, sometimes dramatically, so
!     it's best to use very recent mail to train Spambayes. If
!     you've abandoned an email address in the past because it was
!     getting spammed heavily, there are probably some clues in mail
!     sent to your old address which would bias Spambayes.
!   • Check and recheck your training collections. While you are
!     manually classifying mail as spam or ham, it's easy to make a
!     mistake and toss a message or ten in the wrong file. Such
!     miscategorized mail will throw off the classifier.
!

! 4. Development !

!

! 4a. Hey! Why don't you implement cool ! tokenizer trick X? I think it would really foil those ! spammers! !

!

! Have you run your tokenizer trick against a set of messages to ! see if it actually works? Many times what seems like a good idea ! turns out not to help much, and sometimes even hurts. If you have ! a good idea, you've run it against a batch of messages and ! can prove that it helps, paste the code for your technique and ! the proof to the mailing list. If you're not a coder, but are ! really keen on your idea, post a feature request on the project ! page, and wait for someone else to code it for you (but make sure ! you do some testing when it's done). Otherwise, you will ! likely get a message from Tim Peters about why you need to test ! your idea :) Note that as a general rule, we've found that ! with the tokenizer, "stupid beats smart" — that ! is, very specialised tokenizer behaviour usually produces worse ! results than a more general approach that just generates tokens ! and throws them at the classifier. !

!

! 4b. This software is great! I want to ! implement it for all my users. Are there plans to develop a ! server-side spambayes solution? !

!

! The problem with a server-side solution is that everyone has a ! different idea of what is spam - that's the whole strength of ! the bayesian-style filtering concept. If you are certain that ! all of your users would agree on what is spam and what is ! not, then this might work for you, but otherwise you really have ! to have individual databases for each user. Either way, you ! should be able to modify spambayes easily enough to fit into your ! setup. Please let the list know if you do have success in this ! area, and we'll update this answer. !

!

! 4c. Forget tokenising words - you should use ! character n-grams! !

!

! This was quite carefully tested. Character 3-grams gave five ! times as many false positives, and twice as many false negatives ! as splitting on whitespace (words). Character 5-grams came fairly ! close to words with false positives, but the number of false ! negatives was worse than with 3-grams. n-grams also creates many ! more unique tokens, which means much slower operation. In ! addition, it's much harder to figure out why a message ! scored as it did with n-grams. On the other hand, words are easy ! to understand. There was, however, one area where n-grams were ! much better: detecting spam in Asian languages. Since a ! 'word' in an Asian language message ends up being an ! entire line, words don't work very well at all. !
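
The remark that n-grams create many more unique tokens than words is easy to see on a small sample. The sketch below compares the two on the same text; the helper names are invented for this example and the sample text is arbitrary, so it illustrates the trend rather than reproducing the project's tests.

```python
def char_ngrams(text, n):
    """Return every run of n consecutive characters in text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_token_counts(text, n=3):
    """Return (unique words, unique character n-grams) for the same text."""
    words = set(text.split())
    grams = set(char_ngrams(text, n))
    return len(words), len(grams)


if __name__ == "__main__":
    sample = "buy cheap pills now buy cheap pills today"
    print(unique_token_counts(sample))
```

Every distinct token needs its own entry in the training database, so the larger n-gram vocabulary translates directly into more storage and slower scoring.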

!

! 4d. The clues for my mail are all in lower case, ! but "FREE" is a much better clue than "free". ! Why do you force everything into lower case? !

!

! This was very carefully weighed up. On the positive side, ! removing case does hide information (and we're not really ! sure what it does to non-English languages), but on the negative ! side, it makes the database a lot bigger, and requires more ! training. In the end, testing with case removed resulted in no ! change in the false positive rate, and a small reduction in the ! false negative rate, so that's what we do. There is one ! exception: we keep case in subject lines, because testing showed ! an improvement if we did that. !

!

! 4e. Why don't you provide the ability to ! bounce spam back to the sender? !

!

! Most spammers these days don't accept incoming email, or ! (worse) forge the From and sender addresses, so it's unlikely ! that bouncing would do any good, and it may well do some innocent ! party much harm. !

From skip at pobox.com Wed May 28 21:31:11 2003 From: skip at pobox.com (Skip Montanaro) Date: Wed May 28 21:31:20 2003 Subject: [spambayes-dev] FAQ update In-Reply-To: <3ED53AAD.7060608@parducci.net> References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> Message-ID: <16085.25311.732508.773180@montanaro.dyndns.org> bill> 1. readded "why don't you bounce back spam?" I added this. bill> 2. remade the page w3c compliant (html 4.01) Can you explain in general what you did? I can't apply your patch as it stands because it would completely undo what I did to create version 1.15. bill> 3. skipped TIDY (manually conformed to format) Tidy's not normally a huge deal, but with all the nested lists it helps get all the
    's and
  1. 's lined up with the corresponding
's and 's. (And when I first programmed LISP on the CDC Cyber at Iowa I thought all the parens would drive me nuts. Parens are downright docile compared with HTML tags.) This exercise has convinced me this is a really bad way to maintain the FAQ. We either need to maintain it in another form which can be converted to something with a TOC and body as part of the ht2html/make process or switch to another technology altogether (faq wizard, blog, wiki). Any idea what, if anything could be run on SF? Inputs, give me inputs! My kingdom for an input! Skip From noreply at sourceforge.net Wed May 28 20:23:20 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Wed May 28 22:30:56 2003 Subject: [spambayes-dev] [ spambayes-Bugs-745292 ] Logs Show COM error Message-ID: Bugs item #745292, was opened at 2003-05-28 22:23 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745292&group_id=61702 Category: Outlook Group: v1.0 (example) Status: Open Resolution: None Priority: 5 Submitted By: Bryan Hunt (brhunt) Assigned to: Mark Hammond (mhammond) Summary: Logs Show COM error Initial Comment: I installed, configured and trained one day. Everything worked great. Next day, it says that I no longer have any items in the database. The "delete as spam" and "filter now" buttons no longer work. The log files show that there are COM errors. This looks similar to bug 689298. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745292&group_id=61702 From bill at parducci.net Wed May 28 22:12:49 2003 From: bill at parducci.net (bill parducci) Date: Thu May 29 00:19:46 2003 Subject: [spambayes-dev] FAQ update References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> <16085.25311.732508.773180@montanaro.dyndns.org> Message-ID: <3ED588C1.4020805@parducci.net> Skip Montanaro wrote: > bill> 2. remade the page w3c compliant (html 4.01) > > Can you explain in general what you did? I can't apply your patch as it > stands because it would completely undo what I did to create version 1.15. mostly it was misplaced "

" (they cannot contain "
" tags. the 
changes are not major, which is why i was surprised that diff responded 
thus. i think that if you open both versions in a browser you will only 
see the addition of the 'why not respond to spam' passage. i used the 
latest cvs version so ALL of your changes should be there.

>     bill> 3. skipped TIDY (manually conformed to format)
> 
> Tidy's not normally a huge deal, but with all the nested lists it helps get
> all the 
    's and
  1. 's lined up with the corresponding
's and > 's. (And when I first programmed LISP on the CDC Cyber at Iowa I > thought all the parens would drive me nuts. Parens are downright docile > compared with HTML tags.) either way works for me. personally, i like the indentation it provides. > This exercise has convinced me this is a really bad way to maintain the FAQ. > We either need to maintain it in another form which can be converted to > something with a TOC and body as part of the ht2html/make process or switch > to another technology altogether (faq wizard, blog, wiki). Any idea what, > if anything could be run on SF? i for one am not a big wiki fan because it takes the formatting syntax to a whole new level of obscurity. :) a blog might work, but then you have to read a whole thread to see what is going on. i noticed that sf.net uses php. perhaps we could whip up something that will take simple txt files and generate the necessary html. i do something like this for websites that i admin where users can update their content via e-mail. b From anthony at interlink.com.au Thu May 29 15:25:13 2003 From: anthony at interlink.com.au (Anthony Baxter) Date: Thu May 29 00:26:09 2003 Subject: [spambayes-dev] FAQ update In-Reply-To: <3ED588C1.4020805@parducci.net> Message-ID: <200305290425.h4T4PEf12646@localhost.localdomain> >>> bill parducci wrote > i noticed that sf.net uses php. perhaps we could whip up something that > will take simple txt files and generate the necessary html. i do > something like this for websites that i admin where users can update > their content via e-mail. The other approach would be to make a directory full of text or html files, one per question, and "assemble" the FAQ with the Makefile (use a simple python script). I can do this if people think it's the right thing to do. 
Anthony From skip at pobox.com Thu May 29 09:20:34 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu May 29 09:20:41 2003 Subject: [spambayes-dev] FAQ update In-Reply-To: <3ED588C1.4020805@parducci.net> References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> <16085.25311.732508.773180@montanaro.dyndns.org> <3ED588C1.4020805@parducci.net> Message-ID: <16086.2338.196846.547051@montanaro.dyndns.org> bill> i noticed that sf.net uses php. perhaps we could whip up something bill> that will take simple txt files and generate the necessary html. The most likely candidate in the Python world would be reStructured Text, the package used to do PEPs and the python-dev summary, among other things. It looks as if it can be used to generate HTML FAQs: http://docutils.sourceforge.net/FAQ.html I will investigate. Skip From bill at parducci.net Thu May 29 07:24:22 2003 From: bill at parducci.net (bill parducci) Date: Thu May 29 09:31:20 2003 Subject: [spambayes-dev] FAQ update References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> <16085.25311.732508.773180@montanaro.dyndns.org> <3ED588C1.4020805@parducci.net> <16086.2338.196846.547051@montanaro.dyndns.org> Message-ID: <3ED60A06.3010302@parducci.net> html writer module looks interesting as well: html 4.01 compliant output and stylesheet aware. b Skip Montanaro wrote: > bill> i noticed that sf.net uses php. perhaps we could whip up something > bill> that will take simple txt files and generate the necessary html. > > The most likely candidate in the Python world would be reStructured Text, > the package used to do PEPs and the python-dev summary, among other things. > It looks as if it can be used to generate HTML FAQs: > > http://docutils.sourceforge.net/FAQ.html > > I will investigate. 
> > Skip From skip at pobox.com Thu May 29 10:11:17 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu May 29 10:11:21 2003 Subject: [spambayes-dev] FAQ update In-Reply-To: <16086.2338.196846.547051@montanaro.dyndns.org> References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> <16085.25311.732508.773180@montanaro.dyndns.org> <3ED588C1.4020805@parducci.net> <16086.2338.196846.547051@montanaro.dyndns.org> Message-ID: <16086.5381.225847.348619@montanaro.dyndns.org> Skip> It looks as if it can be used to generate HTML FAQs: Skip> http://docutils.sourceforge.net/FAQ.html Skip> I will investigate. Okay, please check out http://spambayes.sf.net/faq-rest.txt http://spambayes.sf.net/faq-rest.html If people like it well enough, I will replace the current stuff with it. To generate the HTML from the reST file you'll have to have the Docutils code installed. It's the usual "python setup.py install" thing, though for some reason the actual front-end program isn't installed in /usr/local/bin. Skip From skip at pobox.com Thu May 29 10:17:24 2003 From: skip at pobox.com (Skip Montanaro) Date: Thu May 29 10:17:52 2003 Subject: [spambayes-dev] FAQ update In-Reply-To: <200305290425.h4T4PEf12646@localhost.localdomain> References: <3ED588C1.4020805@parducci.net> <200305290425.h4T4PEf12646@localhost.localdomain> Message-ID: <16086.5748.383124.408066@montanaro.dyndns.org> >> i noticed that sf.net uses php. perhaps we could whip up something >> that will take simple txt files and generate the necessary html. Anthony> The other approach would be to make a directory full of text or Anthony> html files, one per question, and "assemble" the FAQ with the Anthony> Makefile (use a simple python script). Let's try out preexisting solutions first. reST can be used to generate FAQs though it doesn't yet have specific support for them. 
I converted the current faq.ht to reST format and uploaded both it and the HTML generated from it to spambayes.sf.net: http://spambayes.sf.net/faq-rest.txt http://spambayes.sf.net/faq-rest.html If we're going to develop something I think it would be worthwhile to get in touch with the Docutils folks and discuss a FAQ generator using their code base. Skip From noreply at sourceforge.net Thu May 29 08:07:34 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu May 29 10:18:01 2003 Subject: [spambayes-dev] [ spambayes-Bugs-745518 ] Dragging multiple files doens't update stats Message-ID: Bugs item #745518, was opened at 2003-05-30 00:07 Message generated for change (Tracker Item Submitted) made by Item Submitter You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Dragging multiple files doens't update stats Initial Comment: >From the mailing list - unverified Using the latest binary for Windows 1.02a. If I select multiple messages in the possible spam folder and drag to the spam folder, the statistics displayed next to the "Train Now" button are not updated. I am unsure if the database is actually updated. Doing each message separately does update the statistics. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702 From noreply at sourceforge.net Thu May 29 08:10:05 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu May 29 10:18:02 2003 Subject: [spambayes-dev] [ spambayes-Bugs-745518 ] Dragging multiple files doens't update stats Message-ID: Bugs item #745518, was opened at 2003-05-30 00:07 Message generated for change (Comment added) made by mhammond You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702 Category: Outlook Group: None >Status: Pending Resolution: None Priority: 5 Submitted By: Mark Hammond (mhammond) Assigned to: Mark Hammond (mhammond) Summary: Dragging multiple files doens't update stats Initial Comment: >From the mailing list - unverified Using the latest binary for Windows 1.02a. If I select multiple messages in the possible spam folder and drag to the spam folder, the statistics displayed next to the "Train Now" button are not updated. I am unsure if the database is actually updated. Doing each message separately does update the statistics. ---------------------------------------------------------------------- >Comment By: Mark Hammond (mhammond) Date: 2003-05-30 00:10 Message: Logged In: YES user_id=14198 Works for me in CVS. Just did 2 "maybe" via dragging, and the dialog shows 2 additional spam. 
---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=745518&group_id=61702 From bill at parducci.net Thu May 29 08:43:50 2003 From: bill at parducci.net (bill parducci) Date: Thu May 29 10:43:54 2003 Subject: [spambayes-dev] FAQ update References: <3ED4E3E4.20508@parducci.net> <16085.10545.151192.514702@montanaro.dyndns.org> <3ED53AAD.7060608@parducci.net> <16085.25311.732508.773180@montanaro.dyndns.org> <3ED588C1.4020805@parducci.net> <16086.2338.196846.547051@montanaro.dyndns.org> <16086.5381.225847.348619@montanaro.dyndns.org> Message-ID: <3ED61CA6.9090101@parducci.net> Skip Montanaro wrote: > Okay, please check out > > http://spambayes.sf.net/faq-rest.txt > http://spambayes.sf.net/faq-rest.html very nice. even work with links & lynx. :) b From noreply at sourceforge.net Thu May 29 10:28:24 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Thu May 29 12:31:23 2003 Subject: [spambayes-dev] [ spambayes-Bugs-744380 ] W982E/Outlook 2000: exception on loading Message-ID: Bugs item #744380, was opened at 2003-05-27 09:51 Message generated for change (Comment added) made by jobbins You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Steve Clift (sclift) Assigned to: Mark Hammond (mhammond) Summary: W982E/Outlook 2000: exception on loading Initial Comment: Windows 98 2nd Edition Outlook 2000 SR-1 - Corporate or Workgroup SpamBayes throws an execption when loading. 
From the log file: SpamAddin - Connecting to Outlook pythoncom error: Failed to call the universal dispatcher Traceback (most recent call last): File "E:\src\pythonex\com\win32com\universal.py", line 170, in dispatch File "E:\src\pythonex\com\win32com\server\policy.py", line 322, in _InvokeEx_ File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_ File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_ File "E:\src\spambayes\Outlook2000\addin.py", line 655, in OnConnection File "E:\src\spambayes\Outlook2000\manager.py", line 475, in GetManager File "E:\src\spambayes\Outlook2000\manager.py", line 141, in __init__ File "E:\src\spambayes\Outlook2000\manager.py", line 182, in LocateDataDirectory File "E:\src\python-cvs\lib\ntpath.py", line 269, in isdir exceptions.LookupError: no codec search functions registered: can't find encoding ---------------------------------------------------------------------- Comment By: Larry Jobbins (jobbins) Date: 2003-05-29 09:28 Message: Logged In: YES user_id=788287 Looks similar to 725449 and 740893. ---------------------------------------------------------------------- Comment By: Larry Jobbins (jobbins) Date: 2003-05-27 21:26 Message: Logged In: YES user_id=788287 Same error. Installed Setup-002.exe from http://starship.python.net/crew/mhammond/spambayes/ Using Win98SE, Outlook 2000, all MS updates. Shows add-in, but won't stay checked, no icon appears. 
Install log looks same - pythoncom error: Failed to call the universal dispatcher, etc ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=744380&group_id=61702 From skip at pobox.com Fri May 30 12:34:54 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 12:35:00 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq Message-ID: <16087.34862.433359.570136@montanaro.dyndns.org> (spambayes-dev now seems to have enough people on it to form a quorum of sorts (26), so I'm excluding spambayes...) I posted URLs yesterday for a trial version of the Spambayes FAQ built using Docutils tools: http://spambayes.sourceforge.net/faq-rest.txt http://spambayes.sourceforge.net/faq-rest.html So far, the only person who responded was Bill Parducci. It's not surprising that he responded, because he's the other person who seems to be in FAQ maintenance hell at the moment. Still, I thought one or two other people would have responded. The Docutils version has the advantage that it will automagically generate the table of contents. Its main drawback is that there is no explicit faq generator script in Docutils, so the format of the questions is a bit constrained. Before I make an executive decision and simply adopt this new stuff (I believe it will be much easier to maintain than the status quo), I would like some feedback: * Does anyone have a problem that it doesn't follow the ht2html format for the spambayes site as a whole? I can probably worm around this by extracting the body of the faq to faq.ht then let the usual ht->html dependency work its magic. * Should anyone wanting to update the FAQ be required to install a recent version of Docutils or should I somehow worm around that? 
(This is especially disconcerting because when you install Docutils it doesn't install the html.py front-end script used to generate the faq.html file, so I can't really assume it will be in PATH.) If either of these is a big deal, please let me know. Skip From tim at fourstonesexpressions.com Fri May 30 12:53:51 2003 From: tim at fourstonesexpressions.com (Tim Stone) Date: Fri May 30 12:55:29 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq In-Reply-To: <16087.34862.433359.570136@montanaro.dyndns.org> References: <16087.34862.433359.570136@montanaro.dyndns.org> Message-ID: On Fri, 30 May 2003 11:34:54 -0500, Skip Montanaro wrote: > > (spambayes-dev now seems to have enough people on it to form a quorum of > sorts (26), so I'm excluding spambayes...) > > I posted URLs yesterday for a trial version of the Spambayes FAQ built > using > Docutils tools: > > http://spambayes.sourceforge.net/faq-rest.txt > http://spambayes.sourceforge.net/faq-rest.html > > So far, the only person who responded was Bill Parducci. It's not > surprising that he responded, because he's the other person who seems to > be > in FAQ maintenance hell at the moment. Still, I thought one or two other > people would have responded. I didn't get my -dev subscription to work till yesterday... not sure what happened... anyway, the faq looks very good. I see that the infoworld article brought quite a number of users, and subsequent questions. We'll need to mine the user list for faq regularly. > > The Docutils version has the advantage that it will automagically > generate > the table of contents. Its main drawback is that there is no explicit > faq > generator script in Docutils, so the format of the questions is a bit > constrained. 
> > Before I make an executive decision and simply adopt this new stuff (I > believe it will be much easier to maintain than the status quo), I would > like some feedback: > > * Does anyone have a problem that it doesn't follow the ht2html format > for the spambayes site as a whole? I don't have a problem with it. > I can probably worm around this by > extracting the body of the faq to faq.ht then let the usual ht->html > dependency work its magic. This would be more confusing, I would think. > > * Should anyone wanting to update the FAQ be required to install a > recent version of Docutils or should I somehow worm around that? Doesn't matter, as long as we can find out somewhere exactly what we gotta have and do to update faq. OOOORRR, we could assume that you and Bill will be more than happy to incorporate any q&a we send ya c'est moi - TimS From popiel at wolfskeep.com Fri May 30 11:04:16 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri May 30 13:04:21 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq In-Reply-To: Message from Skip Montanaro of "Fri, 30 May 2003 11:34:54 CDT." <16087.34862.433359.570136@montanaro.dyndns.org> References: <16087.34862.433359.570136@montanaro.dyndns.org> Message-ID: <20030530170416.CC7B52DE9C@cashew.wolfskeep.com> In message: <16087.34862.433359.570136@montanaro.dyndns.org> Skip Montanaro writes: > >(spambayes-dev now seems to have enough people on it to form a quorum of >sorts (26), so I'm excluding spambayes...) Reasonable. Heck, I'm close to dropping off the original spambayes list, for lack of interest in hearing about people fighting with Outlook. <.5 wink> >I posted URLs yesterday for a trial version of the Spambayes FAQ built using >Docutils tools: > > http://spambayes.sourceforge.net/faq-rest.txt > http://spambayes.sourceforge.net/faq-rest.html The markup in the words vs. n-grams question doesn't seem to work. Similarly the — in the cool tokenizer trick question. 
This is not saying that docutils is bad, just that there'll be some trivial cleanup. Overall, the docutils format seems reasonably coherent. Back with the editor hat on, the positive and negative seem to be reversed (or at least confusing) in the case-folding question. >So far, the only person who responded was Bill Parducci. It's not >surprising that he responded, because he's the other person who >seems to be in FAQ maintenance hell at the moment. Still, I thought >one or two other people would have responded. Sorry; I haven't had more than about 2 minutes to rub together. > * Does anyone have a problem that it doesn't follow the ht2html format > for the spambayes site as a whole? I can probably worm around this by > extracting the body of the faq to faq.ht then let the usual ht->html > dependency work its magic. Doesn't bug me; I'd say don't add the extra step. > * Should anyone wanting to update the FAQ be required to install a > recent version of Docutils or should I somehow worm around that? > (This is especially disconcerting because when you install Docutils it > doesn't install the html.py front-end script used to generate the > faq.html file, so I can't really assume it will be in PATH.) Making FAQ devs install a recent version seems reasonable... but put that requirement in the FAQ under 'Why can't I get Docutils to work on my locally-edited version of this FAQ?'. ;-) - Alex From skip at pobox.com Fri May 30 13:18:49 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 13:19:05 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq In-Reply-To: <20030530170416.CC7B52DE9C@cashew.wolfskeep.com> References: <16087.34862.433359.570136@montanaro.dyndns.org> <20030530170416.CC7B52DE9C@cashew.wolfskeep.com> Message-ID: <16087.37497.497330.922379@montanaro.dyndns.org> >> http://spambayes.sourceforge.net/faq-rest.txt >> http://spambayes.sourceforge.net/faq-rest.html Alex> The markup in the words vs. 
n-grams question doesn't seem to Alex> work. Similarly the — in the cool tokenizer trick question. I believe those were holdover bits of markup from faq.ht, not something Docutils did. I'll fix them. Thanks. Alex> Back with the editor hat on, the positive and negative seem to be Alex> reversed (or at least confusing) in the case-folding question. I'll take a look. >> * Does anyone have a problem that it doesn't follow the ht2html >> format Alex> Doesn't bug me; I'd say don't add the extra step. That's cool. The less work the better. Alex> Making FAQ devs install a recent version seems reasonable... but Alex> put that requirement in the FAQ under 'Why can't I get Docutils to Alex> work on my locally-edited version of this FAQ?'. ;-) I think that can be arranged. Thanks for the feedback. Skip From tim at fourstonesexpressions.com Fri May 30 13:31:28 2003 From: tim at fourstonesexpressions.com (Tim Stone) Date: Fri May 30 13:31:53 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value Message-ID: I'm working on this problem that crops up in pop3proxy and hammie where malformed headers cause the parser to raise an uncaught exception, rendering spambayes more helpless than a baby seal. It's no problem to catch the exception, but what to do with the message is the issue. I have two suggested approaches: 1. Assume that the mail is spam. 2. Add a new possible value to the classification header, something like 'unclassifiable'. Other alternatives, votes, or general booing and hissing? -- c'est moi - TimS From popiel at wolfskeep.com Fri May 30 11:35:06 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri May 30 13:35:09 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value In-Reply-To: Message from Tim Stone of "Fri, 30 May 2003 12:31:28 CDT."
References: Message-ID: <20030530173506.0F6992DE9C@cashew.wolfskeep.com> In message: Tim Stone writes: >I'm working on this problem that crops up in pop3proxy and hammie where >malformed headers cause the parser to raise an uncaught exception, >rendering spambayes more helpless than a baby seal. It's no problem to >catch the exception, but what to do with the message is the issue. I have >two suggested approaches: > >1. Assume that the mail is spam. > >2. Add a new possible value to the classification header, something like >'unclassifiable'. > >Other alternatives, votes, or general booing and hissing? 3. Mark it as unsure. - Alex From bill at parducci.net Fri May 30 15:00:34 2003 From: bill at parducci.net (bill parducci) Date: Fri May 30 17:07:39 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq References: <16087.34862.433359.570136@montanaro.dyndns.org> Message-ID: <3ED7C672.7080707@parducci.net> Tim Stone wrote: > Doesn't matter, as long as we can find out somewhere exactly what we > gotta have and do to update faq. OOOORRR, we could assume that you and > Bill will be more than happy to incorporate any q&a we send ya fine by me (i will just need to get whatever ultimate contraption is agreed upon installed locally :o) b From bill at parducci.net Fri May 30 15:13:56 2003 From: bill at parducci.net (bill parducci) Date: Fri May 30 17:20:56 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value References: Message-ID: <3ED7C994.3090709@parducci.net> Tim Stone wrote: > I'm working on this problem that crops up in pop3proxy and hammie where > malformed headers cause the parser to raise an uncaught exception, > rendering spambayes more helpless than a baby seal. It's no problem to > catch the exception, but what to do with the message is the issue. I > have two suggested approaches: > > 1. Assume that the mail is spam. > > 2. Add a new possible value to the classification header, something like > 'unclassifiable'. 
> > Other alternatives, votes, or general booing and hissing? would it be possible to catch the exception and move to the next line in the header (and/or payload) for parsing? if so, then you could at least make a highly informed guess as to the nature of the message by creating a 'malform token' that can be weighed like anything else (exchanging the contents of the malformed header entry with a single, 'reserved' token). in other words, let the stats decide if malformedness is a bad thing ;-) b From popiel at wolfskeep.com Fri May 30 15:38:18 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri May 30 17:38:22 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value In-Reply-To: Message from bill parducci of "Fri, 30 May 2003 14:13:56 PDT." <3ED7C994.3090709@parducci.net> References: <3ED7C994.3090709@parducci.net> Message-ID: <20030530213818.438042DE9C@cashew.wolfskeep.com> In message: <3ED7C994.3090709@parducci.net> bill parducci writes: >Tim Stone wrote: >> I'm working on this problem that crops up in pop3proxy and hammie where >> malformed headers cause the parser to raise an uncaught exception, >> rendering spambayes more helpless than a baby seal. It's no problem to >> catch the exception, but what to do with the message is the issue. I >> have two suggested approaches: >> >> 1. Assume that the mail is spam. >> >> 2. Add a new possible value to the classification header, something like >> 'unclassifiable'. >> >> Other alternatives, votes, or general booing and hissing? > >would it be possible to catch the exception and move to the next line in >the header (and/or payload) for parsing? Alas, we're at the wrong level to do that sort of thing. To do that level of granularity properly, we'd have to be in the guts of the parser... and while I'm sure that Barry would love for us to come up with a way to recover inside the parser, I think it's a bit out of scope for the something external to the parser. 
- Alex From noreply at sourceforge.net Fri May 30 15:44:02 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri May 30 19:22:45 2003 Subject: [spambayes-dev] [ spambayes-Bugs-740843 ] No Disk Error with Outlook 2000 on startup Message-ID: Bugs item #740843, was opened at 2003-05-20 18:39 Message generated for change (Comment added) made by portola You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=740843&group_id=61702 Category: Outlook Group: None Status: Open Resolution: None Priority: 5 Submitted By: Sam Snow (snowsam) Assigned to: Mark Hammond (mhammond) Summary: No Disk Error with Outlook 2000 on startup Initial Comment: After installing SpamBayes-Outlook-Setup-002.exe I am now getting an error dialog on Outlook startup. The box says: (Header) Inbox - Microsoft Outlook:OUTLOOK.EXE - No Disk (Body) There is no disk in the drive. Please insert a disk into drive \Device\Harddisk0\DR0. (Buttons) Cancel, Try Again, Continue I am able to click cancel or continue several times and then outlook goes ahead and opens up. I just installed this evening, so I am not sure if the filtering is still working correctly. I was able to train the program successfully. I am using Office 2000 SP3 on Win 2000. I will try to attach a jpg of the dialog box. My error log says the following: SpamAddin - Connecting to Outlook Loaded bayes database from 'C:\Documents and Settings\Snow1\Application Data\SpamBayes\default_bayes_database.db' Loaded message database from 'C:\Documents and Settings\Snow1\Application Data\SpamBayes\default_message_database.db' Bayes database initialized with 0 spam and 0 good messages Loaded databases in 4.64165ms AntiSpam: Watching for new messages in folder Inbox AntiSpam: Watching for new messages in folder Spam Processing 0 missed spam in folder 'Inbox' took 31.9599ms pythoncom error: Python error invoking COM method.
Traceback (most recent call last): File "E:\src\pythonex\com\win32com\server\policy.py", line 275, in _Invoke_ File "E:\src\pythonex\com\win32com\server\policy.py", line 280, in _invoke_ File "E:\src\pythonex\com\win32com\server\policy.py", line 601, in _invokeex_ File "E:\src\pythonex\com\win32com\server\policy.py", line 541, in _invokeex_ File "E:\src\spambayes\Outlook2000\addin.py", line 203, in OnItemAdd File "E:\src\spambayes\Outlook2000\addin.py", line 163, in ProcessMessage File "E:\src\spambayes\Outlook2000\filter.py", line 15, in filter_message File "E:\src\spambayes\Outlook2000\manager.py", line 440, in score File "e:\src\spambayes\spambayes\classifier.py", line 217, in chi2_spamprob File "e:\src\spambayes\spambayes\classifier.py", line 465, in _getclues File "e:\src\spambayes\spambayes\classifier.py", line 316, in probability exceptions.AssertionError: ---------------------------------------------------------------------- Comment By: Dennis Austin (portola) Date: 2003-05-30 14:44 Message: Logged In: YES user_id=787905 I have noted an additional piece of information. There is no alert first time I start Outlook after logging in. If Outlook is closed and reopened, the alert appears and requires three Cancel clicks before it goes away. (Unless I put a disk in the CD drive.) I'm using Outlook 2002 sp2 on Windows XP sp1. I have two CD drives on the secondary IDE channel and the error appears on the second drive. ---------------------------------------------------------------------- Comment By: Dennis Austin (portola) Date: 2003-05-27 10:16 Message: Logged In: YES user_id=787905 I also usually see this error when I start Outlook, although not every time. I also see it at the end of running the installer. In my configuration it shows up as "No disk in drive E:". E: is CD-ROM 1 on this machine. I can get past the error either by clicking Cancel several times, or by putting any old CD in the drive and clicking Try Again. 
The error does not seem to affect any function of the add-on. ---------------------------------------------------------------------- Comment By: Ferruccio Barletta (fgb) Date: 2003-05-25 07:40 Message: Logged In: YES user_id=786210 I may have found the root cause of this problem. When I brought up disk management on my notebook I noticed that my hard drive was Disk1 and the SD media drive was Disk0. When I disabled the SD drive and rebooted, the hard drive became Disk0 and the problem disappeared. ---------------------------------------------------------------------- Comment By: Ferruccio Barletta (fgb) Date: 2003-05-24 18:30 Message: Logged In: YES user_id=786210 I get the same error with Office 2002 SP1 on Windows XP SP1 ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=740843&group_id=61702 From skip at pobox.com Fri May 30 19:49:51 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 19:49:56 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value In-Reply-To: <20030530213818.438042DE9C@cashew.wolfskeep.com> References: <3ED7C994.3090709@parducci.net> <20030530213818.438042DE9C@cashew.wolfskeep.com> Message-ID: <16087.60959.324302.344655@montanaro.dyndns.org> >> would it be possible to catch the exception and move to the next line >> in the header (and/or payload) for parsing? Alex> Alas, we're at the wrong level to do that sort of thing. To do Alex> that level of granularity properly, we'd have to be in the guts of Alex> the parser... This is just a shot in the dark, but would it be possible to modify the email parser sufficiently so that it gave more detail about where it was when the error condition was detected (e.g., what body line number, header and/or MIME part)? That might allow Spambayes to tweak the message in the right spot and retry the parse. 
Skip From noreply at sourceforge.net Fri May 30 18:36:09 2003 From: noreply at sourceforge.net (SourceForge.net) Date: Fri May 30 20:47:00 2003 Subject: [spambayes-dev] [ spambayes-Bugs-706520 ] assert fails in classifier Message-ID: Bugs item #706520, was opened at 2003-03-19 12:46 Message generated for change (Comment added) made by leobru You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702 Category: None Group: None Status: Open Resolution: None Priority: 5 Submitted By: Adam Glass (adamglass) Assigned to: Nobody/Anonymous (nobody) Summary: assert fails in classifier Initial Comment: This morning, I noticed that my emails no longer had a X-Spambayes-Classification header, so I looked through my procmail logs, and sure enough, hammiefilter.py is giving a traceback when an assertion fails. This happens on all messages now; it is not specific to a single message, or intermittent. Therefore, I suspect my .hammiedb is corrupted... I can supply it to anyone who would like to investigate it for debugging purposes. I am using Spambayes 1.0a2, installed on a system with Python 2.2.1, with the new version of the email library (as per the install docs.) Please contact me if you require any further details. Example of how to generate the error follows, along with traceback: adam$ /usr/local/bin/hammiefilter.py -f -d $HOME/.hammiedb < example Traceback (most recent call last): File "/usr/local/bin/hammiefilter.py", line 179, in ? 
main() File "/usr/local/bin/hammiefilter.py", line 175, in main action(msg) File "/usr/local/bin/hammiefilter.py", line 113, in filter return h.filter(msg) File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 108, in filter prob, clues = self._scoremsg(msg, True) File "/usr/local/lib/python2.2/site-packages/spambayes/hammie.py", line 38, in _scoremsg return self.bayes.spamprob(tokenize(msg), evidence) File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 217, in chi2_spamprob clues = self._getclues(wordstream) File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 441, in _getclues prob = self.probability(record) File "/usr/local/lib/python2.2/site-packages/spambayes/classifier.py", line 304, in probability assert spamcount <= nspam AssertionError ---------------------------------------------------------------------- Comment By: Leonid (leobru) Date: 2003-05-30 17:36 Message: Logged In: YES user_id=790676 This happens, e.g., if a forced re-training was performed on a non-empty database, thus screwing up the message counts - this is for sure, I was bitten by it myself; or, potentially, if hammiefilter.py -t and mboxtrain.py were running at the same time ??? To avoid: do not do it (I do not use hammiefilter.py -t to be on the safe side). To fix, once it happens: start from scratch. Good to have in the next version: a database validator and corrector. ---------------------------------------------------------------------- You can respond by visiting: https://sourceforge.net/tracker/?func=detail&atid=498103&aid=706520&group_id=61702 From vanhorn at whidbey.com Fri May 30 18:56:33 2003 From: vanhorn at whidbey.com (G. 
Armour Van Horn) Date: Fri May 30 20:56:37 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq References: <16087.34862.433359.570136@montanaro.dyndns.org> Message-ID: <3ED7FDC1.8AA3E871@whidbey.com> Well, I didn't jump on it yesterday because I already thought the FAQ was pretty impressive. But now that you're laying on the guilt trip I went to check out the changes and got bonged by the HTML version: Not Found The requested URL /default.css was not found on this server. Apache/1.3.26 Server at spambayes.sourceforge.net Port 80 Van Skip Montanaro wrote: > (spambayes-dev now seems to have enough people on it to form a quorum of > sorts (26), so I'm excluding spambayes...) > > I posted URLs yesterday for a trial version of the Spambayes FAQ built using > Docutils tools: > > http://spambayes.sourceforge.net/faq-rest.txt > http://spambayes.sourceforge.net/faq-rest.html > > -- ---------------------------------------------------------- Sign up now for Quotes of the Day, a handful of quotations on a theme delivered every morning. Enlightenment! Daily, for free! mailto:twisted@whidbey.com?subject=Subscribe_QOTD For web hosting and maintenance, visit Van's home page: http://www.domainvanhorn.com/van/ ---------------------------------------------------------- -------------- next part -------------- An HTML attachment was scrubbed... URL: http://mail.python.org/pipermail/spambayes-dev/attachments/20030530/68ee90a4/attachment.htm From skip at pobox.com Fri May 30 21:18:48 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 21:18:48 2003 Subject: [spambayes-dev] Next to no feedback on the trial faq In-Reply-To: <3ED7FDC1.8AA3E871@whidbey.com> References: <16087.34862.433359.570136@montanaro.dyndns.org> <3ED7FDC1.8AA3E871@whidbey.com> Message-ID: <16088.760.32882.625210@montanaro.dyndns.org> Van> Well, I didn't jump on it yesterday because I already thought the Van> FAQ was pretty impressive. 
But now that you're laying on the guilt Van> trip I went to check out the changes and got bonged by the HTML Van> version: Van> Not Found Van> The requested URL /default.css was not found on this server. Van> Apache/1.3.26 Server at spambayes.sourceforge.net Port 80 Thanks, I forgot that Docutils expects its own css file. I use Safari, which for one reason or another didn't complain. I'll correct that. Skip From skip at pobox.com Fri May 30 21:40:22 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 21:40:21 2003 Subject: [spambayes-dev] New faq.txt Message-ID: <16088.2054.75594.38536@montanaro.dyndns.org> Folks, I just checked in faq.txt, default.css, Makefile and scripts/make.rules, and cvs removed faq.ht. Together, this means you should edit faq.txt now to update the faq. The last question (4.6) has some info about what you need to install to rebuild faq.html from faq.txt. Thanks for the feedback. We now return you to your regularly scheduled programming. Skip From popiel at wolfskeep.com Fri May 30 20:49:14 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri May 30 22:49:18 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value In-Reply-To: Message from Skip Montanaro of "Fri, 30 May 2003 18:49:51 CDT." <16087.60959.324302.344655@montanaro.dyndns.org> References: <3ED7C994.3090709@parducci.net> <20030530213818.438042DE9C@cashew.wolfskeep.com> <16087.60959.324302.344655@montanaro.dyndns.org> Message-ID: <20030531024914.2416C2DE9C@cashew.wolfskeep.com> In message: <16087.60959.324302.344655@montanaro.dyndns.org> Skip Montanaro writes: > > >> would it be possible to catch the exception and move to the next line > >> in the header (and/or payload) for parsing? > > Alex> Alas, we're at the wrong level to do that sort of thing. To do > Alex> that level of granularity properly, we'd have to be in the guts of > Alex> the parser... 
> >This is just a shot in the dark, but would it be possible to modify the >email parser sufficiently so that it gave more detail about where it was >when the error condition was detected (e.g., what body line number, header >and/or MIME part)? That might allow Spambayes to tweak the message in the >right spot and retry the parse. It is my belief that tweaking the message intelligently (as opposed to just forcing the entire body to be treated as plain text by blowing away the MIME headers) would require more intelligence than doing the parsing in the first place. After all, you'd be permuting the data to make it parse, which means that you understand all about the parsing. If we're going to get that smart, we might as well not use the email package... which has already been circled around a few times. Personally, I'd like to see us have a simpler parser which just understood headers vs. body, and didn't try to decode the individual headers (for charset, or anything like that). Ideally, we'd give this simple parser the message (as a string) and a list of headers to remove from the message, and it would return a modified message (again as a string). We could use this simpler parser both for blowing away the MIME headers (as alluded to above for dealing with malformed messages) and for annotating the message with the classification results (blow away all the classification headers, then prepend the new ones (properly formatted) to the message). Of course, that would take about two hours of work, and I'm lucky to get two consecutive minutes right now... - Alex From tim.one at comcast.net Sat May 31 00:06:49 2003 From: tim.one at comcast.net (Tim Peters) Date: Fri May 30 23:08:05 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification headervalue In-Reply-To: <20030531024914.2416C2DE9C@cashew.wolfskeep.com> Message-ID: [T. Alexander Popiel] > ... > Personally, I'd like to see us have a simpler parser which just > understood headers vs. 
body, and didn't try to decode the individual > headers (for charset, or anything like that). Ideally, we'd give > this simple parser the message (as a string) and a list of headers > to remove from the message, and it would return a modified message > (again as a string). We could use this simpler parser both for > blowing away the MIME headers (as alluded to above for dealing with > malformed messages) and for annotating the message with the > classification results (blow away all the classification headers, > then prepend the new ones (properly formatted) to the message). > > Of course, that would take about two hours of work, and I'm lucky > to get two consecutive minutes right now... I don't expect this would help. Decoding base64 and quoted-printable are important, but base64 if and only if it's a text section. In order to identify this stuff requires decoding the MIME structure too. Decoding charsets probably isn't important for *me*, because virtually all my ham is in 7-bit ASCII English, but for non-English users I can easily believe it's vital. Etc -- the email package does a lot of stuff, and it's valuable. As to fiddling damaged msgs to get them thru the parser, the next time just try it. I've had easy success with this every time I've seen it pop up in the Outlook client. Appending a newline is sometimes all it takes. In one case, it required falling back to a different base64 decoder, because the email pkg's decoder is too(!) forgiving. The reason this crap keeps popping up has been covered before: we don't have a chokepoint now for asking the email pkg to parse stuff, so workarounds are spread around the codebase. Of course this won't get fixed until someone who actually likes the email package makes time to make it fly . 
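A minimal sketch of the header-only helper described in the quoted text above: it takes the message as a string plus a list of header names to remove, and returns the modified message as a string. The function names `strip_headers` and `annotate` are hypothetical, not part of Spambayes or the email package, and the sketch assumes "\n" line endings and a blank line between headers and body. Nothing is decoded.

```python
def strip_headers(message, unwanted):
    """Return `message` with every header named in `unwanted` removed.

    Only the header/body split and header continuation lines are
    understood; header values and the body are never decoded.
    """
    unwanted = {name.lower() for name in unwanted}
    # Everything before the first blank line is treated as the header block.
    head, sep, body = message.partition("\n\n")
    kept = []
    dropping = False
    for line in head.split("\n"):
        if line[:1] in (" ", "\t"):
            # Continuation line: it shares the fate of the header before it.
            if not dropping:
                kept.append(line)
            continue
        name = line.partition(":")[0].strip().lower()
        dropping = name in unwanted
        if not dropping:
            kept.append(line)
    return "\n".join(kept) + sep + body


def annotate(message, header, value):
    """Drop any existing `header` lines, then prepend a fresh one."""
    return f"{header}: {value}\n" + strip_headers(message, [header])


raw = (
    "Subject: hi\n"
    "X-Spambayes-Classification: ham\n"
    "Content-Type: text/plain;\n"
    "\tcharset=us-ascii\n"
    "\n"
    "body: text\n"
)
print(strip_headers(raw, ["Content-Type"]))
print(annotate(raw, "X-Spambayes-Classification", "unsure"))
```

The same helper covers both uses mentioned in the thread: removing the MIME headers from a message the email package rejects, and replacing stale classification headers before adding new ones.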
From skip at pobox.com Fri May 30 23:14:52 2003 From: skip at pobox.com (Skip Montanaro) Date: Fri May 30 23:14:52 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification headervalue In-Reply-To: References: <20030531024914.2416C2DE9C@cashew.wolfskeep.com> Message-ID: <16088.7724.849970.104717@montanaro.dyndns.org> Tim> Of course this won't get fixed until someone who actually likes the Tim> email package makes time to make it fly . Maybe you could arrange to trip and spill your Dunkin' Donuts coffee on Barry every morning when you arrive at the office. Just apologize with something like, "Oh, sorry Barry. I was thinking about that email parsing problem in Spambayes again. I just can't seem to figure it out." After awhile he'll get the hint and solve the problem for you^H^H^Hus. ;-) Skip From tim at fourstonesexpressions.com Fri May 30 23:38:11 2003 From: tim at fourstonesexpressions.com (Tim Stone) Date: Fri May 30 23:38:37 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification header value In-Reply-To: <20030530173506.0F6992DE9C@cashew.wolfskeep.com> References: <20030530173506.0F6992DE9C@cashew.wolfskeep.com> Message-ID: >> Other alternatives, votes, or general booing and hissing? > > 3. Mark it as unsure. This will almost certainly cause someone to try to train it, which will again break things. We need some way to make sure we never look at this message again. > > - Alex > > -- c'est moi - TimS From popiel at wolfskeep.com Fri May 30 21:55:58 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Fri May 30 23:56:03 2003 Subject: [spambayes-dev] Proposed fourth X-Spambayes-Classification headervalue In-Reply-To: Message from Tim Peters of "Fri, 30 May 2003 23:06:49 EDT." References: Message-ID: <20030531035558.243A32DE9C@cashew.wolfskeep.com> In message: Tim Peters writes: >[T. Alexander Popiel] >> ... >> Personally, I'd like to see us have a simpler parser > >I don't expect this would help. 
Decoding base64 and quoted-printable are >important, but base64 if and only if it's a text section. In order to >identify this stuff requires decoding the MIME structure too. Decoding >charsets probably isn't important for *me*, because virtually all my ham is >in 7-bit ASCII English, but for non-English users I can easily believe it's >vital. Etc -- the email package does a lot of stuff, and it's valuable. I see I wasn't clear... I only want the simpler parser for handling the classification headers annotation and the cases where the email package barfs. Definitely keep the email package for what we've got it for now... because yes, it helps immensely. - Alex From bill at parducci.net Sat May 31 08:59:15 2003 From: bill at parducci.net (bill parducci) Date: Sat May 31 11:00:21 2003 Subject: [spambayes-dev] cvs resets References: <3E89A90A.3060600@parducci.net> <16009.43979.172138.854043@montanaro.dyndns.org> <3ED7CAFE.4000906@parducci.net> <16087.59891.631899.666778@montanaro.dyndns.org> Message-ID: <3ED8C343.1020204@parducci.net> i see this a lot: $ cvs up -Pd cvs [update aborted]: end of file from server (consult above messages if any) is the python cvs server overloaded frequently, or is it just me? (usually by the fifth or sixth try i get through.) thanks b From skip at pobox.com Sat May 31 12:19:30 2003 From: skip at pobox.com (Skip Montanaro) Date: Sat May 31 12:19:30 2003 Subject: [spambayes-dev] cvs resets In-Reply-To: <3ED8C343.1020204@parducci.net> References: <3E89A90A.3060600@parducci.net> <16009.43979.172138.854043@montanaro.dyndns.org> <3ED7CAFE.4000906@parducci.net> <16087.59891.631899.666778@montanaro.dyndns.org> <3ED8C343.1020204@parducci.net> Message-ID: <16088.54802.82666.585488@montanaro.dyndns.org> bill> i see this a lot: bill> $ cvs up -Pd bill> cvs [update aborted]: end of file from server (consult above messages if bill> any) SourceForge is busy a lot. 
It tends to give lower priority to anonymous CVS requests to allow developers continued access. Skip From popiel at wolfskeep.com Sat May 31 15:43:02 2003 From: popiel at wolfskeep.com (T. Alexander Popiel) Date: Sat May 31 17:43:07 2003 Subject: [spambayes-dev] More testing on the common db Message-ID: <20030531214302.2EBCD2DDF2@cashew.wolfskeep.com> Here are some more results from testing with the common db and my own private db:

Testing a selection of messages 4-9 months old:

Ham (2052 msgs):
          ham  unsure  spam
common   2011      36     5
popiel   2041       8     3

Spam (3838 msgs):
          ham  unsure  spam
common      5      53  3773
popiel      8      75  3748

Testing only the most recent 500 messages of each type:

Ham (500 msgs):
          ham  unsure  spam
common    488      11     1
popiel    495       5     0

Spam (500 msgs):
          ham  unsure  spam
common      1      21   478
popiel      1      10   489

I find it rather interesting that the common db did better on the old spam than my personal one did; I think this is evidence of mail mutations having a real effect on accuracy (since my personal db only contains info from the most recent 4 months), but it could also be attributable to other things... such as differences between Skip's training regime and my own. For the most recent mail, the personal db was a clear win over the common db. - Alex From matt at mondoinfo.com Sat May 31 22:19:41 2003 From: matt at mondoinfo.com (Matthew Dixon Cowles) Date: Sat May 31 22:19:47 2003 Subject: [spambayes-dev] Re: [Spambayes] Database cleaning? In-Reply-To: <20030531170037.10DB82DDF2@cashew.wolfskeep.com> References: <3ED6F33F.9050000@mailcom.com> <20030531170037.10DB82DDF2@cashew.wolfskeep.com> Message-ID: <1054430548.31.1335@sake.mondoinfo.com> [Alex Popiel on nonsense words in spam] > Yes, those words cause database pollution, and yes, they can be > weeded out with just a handful of lines of code... but it's hard to > tell which hapax legomena will be useless, and which will soon get > reinforced by other occurrences, so it's (IMNSHO) generally not > worth the hassle.
With an eye toward reducing the size of the database, I instrumented the classifier a while ago and found a very strong indication that that's true. Indeed, hapaxes often figured in scoring. I didn't bother to calculate exact numbers because the results were strong enough to persuade me that removing hapaxes wasn't a useful strategy. I tore that code out and instead hacked the classifier so that I could determine how soon after a word figures in scoring that it's used again. I think that the results are at least slightly interesting. Note that the histogram below is log scaled.

Unique tokens used for scoring  60627
Used Once                       17388

Days prev   Count  Histogram is log scaled
        0  903644  **************************************************
        1   27694  *************************************
        2   15121  ***********************************
        3    7024  ********************************
        4    4694  *******************************
        5    3634  ******************************
        6    3134  *****************************
        7    2443  ****************************
        8    1697  ***************************
        9    1340  **************************
       10     982  *************************
       11     801  ************************
       12     671  ************************
       13     871  *************************
       14     630  ************************
       15     494  ***********************
       16     374  **********************
       17     343  *********************
       18     227  ********************
       19     216  ********************
       20     199  *******************
       21     226  ********************
       22     126  ******************
       23     114  *****************
       24      55  ***************
       25      22  ***********
       26      49  **************

My mail may not be representative in ways that exaggerate the slope here. Specifically, I read postmaster, webmaster, etc addresses for several domains so it's common for me to get multiple copies of the same spam. Regards, Matt
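
For readers who want to draw a similar chart from their own counts, here is a minimal sketch of a log-scaled text histogram. It is not the code used to produce the figures above; the function name `log_histogram`, the 50-column width, and the choice of natural log scaled to the largest count are all assumptions made for illustration.

```python
import math


def log_histogram(counts, width=50):
    """Return one text row per (label, count) pair.

    Bar length is proportional to log(count), scaled so that the
    largest count fills `width` columns (an assumed convention).
    """
    peak = max(count for _, count in counts)
    # With a peak of 1 (or 0) every log is zero, so draw no bars at all.
    scale = width / math.log(peak) if peak > 1 else 0.0
    rows = []
    for label, count in counts:
        stars = round(math.log(count) * scale) if count > 0 else 0
        rows.append(f"{label:>9} {count:>7}  {'*' * stars}")
    return rows


sample = [(0, 903644), (1, 27694), (25, 22)]
for row in log_histogram(sample):
    print(row)
```

A log scale keeps the long tail visible: on a linear scale, every row after day 0 would round to a bar of one or two stars at most.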