From noreply at sourceforge.net Sat Jun 10 07:21:14 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Fri, 09 Jun 2006 22:21:14 -0700
Subject: [spambayes-bugs] [ spambayes-Patches-824651 ] Multibyte (CJK etc.) message support
Message-ID:

Patches item #824651, was opened at 2003-10-16 21:23
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Hatuka*nezumi (hatukanezumi)
Assigned to: Nobody/Anonymous (nobody)
Summary: Multibyte (CJK etc.) message support

Initial Comment:
Maybe this is also applicable to other East Asian languages.

o Unicode'ify text:

  For Japanese messages, for example, RFC 1468 recommends the
  ISO/IEC 2022 encoding scheme, with ASCII and a multibyte character
  set both designated to GL. The original tokenizer generates only
  bogus, meaningless text fragments for Japanese messages.

o Concatenate C/J lines.

  In Japanese (and maybe Chinese) messages, line folding often breaks
  'words'.

o Bigram of C/J characters.

  In Japanese (and often Chinese) messages, 'words' are not separated
  by characters such as whitespace. Tokenization into grammatical
  'words' would require heuristic algorithms using a large corpus.
  Instead of an expensive human-language parser, generate bigrams from
  each run of kanji (ideographs for C/J/K) or run of hiragana &
  katakana (syllabic letters for J).

  N.B.:

  - I believe the number of database items is roughly O(n^2) for
    bigrams, O(n^3) for trigrams, ... and O(n^i) for i-grams, where n
    is the size of the character set used. For katakana & hiragana n
    is approximately 100. For kanzi it is approx. 5000 (KS X 1001),
    7000 (JIS X 0208), or more (Chinese standards).
    For C/J messages, 3-or-more-grams will generate a very sparse and
    large database.

  - Words of a single kanzi should not be discarded by the tokenizer,
    since most basic kanzi words are of 1 or 2 characters. Words of a
    single hiragana/katakana may be discarded.

  - As far as I know, in Korean messages, phrases (not 'words' but
    similar) are often separated by whitespace. Thus a run of hangul
    (syllabic characters for K) may not be split into n-grams.

o Punctuation --- what is 'punctuation'?

  A lot of punctuation marks, spaces, signs and symbols registered in
  the Unicode Standard are added to punctuation_run_re (for
  compatibility, some of them overlap with subject_words_re), since
  many of them are also registered as punctuation or symbols in C/J/K
  character set standards.

Problems:

o sb_dbexpimp.py becomes incompatible.

o Only the BMP range is supported. Surrogates are not recognized.

o Tested with Japanese messages only, not with other East Asian
  messages.

o No batch tests. This only aims at Japanese support.

Configuration:

o To support Unicode, .spambayesrc must be set:

  [Tokenizer]
  replace_nonascii_chars: False

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2006-06-10 17:21

Message:
Logged In: YES
user_id=552329

The simple parts of this have been checked in. At the moment, that
doesn't include the tokenizer changes (or the unicode module) or a few
of the "server" changes. The non-tokenizer changes will probably be
checked in soon; it's not clear what we'll do about the tokenizer ones
(but at least this should make things simpler since there are fewer
differences).

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-25 17:24

Message:
Logged In: YES
user_id=529503

Auto-detect charset of message.

Some messages lack charset information (or fake it, in some spam).
Code was added to detect a suitable charset and to convert to Unicode.
Unicodedata compatibility module for Python < 2.3.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-11-10 16:12

Message:
Logged In: YES
user_id=529503

Estimation for Effect of per_language_corpus Option

I prepared 4 test sets from 7987 ham and 2364 spam, including:

                     ham   spam
    arabic:            1      1
    cyrillic:         26     63
    greek:             1      0
    hebrew:            5     10
    ja:             6438     85
    ko:                9     29
    thai:              1      2
    zh:                4    207
    other/unknown:  1502   1967
    TOTAL:          7987   2364

* Languages/scripts are determined by the main charset of each
  message.

Then I ran the test with:

  $ python timtest.py --ham-keep 500 --spam-keep 500 -n 4

with ham/spam cutoffs 0.5 / 0.95. Below is the average of 20 tests.

x-per_language_corpus: True

    ham:spam:   6000:6000
    fp total:   10
    fp %:       0.17
    fn total:   253
    fn %:       4.23
    unsure t:   947
    unsure %:   7.90
    real cost:  $543.15
    best cost:  $695.66
    h mean:     0.88
    h sdev:     7.38
    s mean:     81.68
    s sdev:     31.74
    mean diff:  80.80
    k:          2.07

x-per_language_corpus: False

    ham:spam:   6000:6000
    fp total:   24
    fp %:       0.40
    fn total:   81
    fn %:       1.36
    unsure t:   551
    unsure %:   4.60
    real cost:  $434.45
    best cost:  $584.04
    h mean:     3.00
    h sdev:     13.03
    s mean:     94.28
    s sdev:     19.51
    mean diff:  91.27
    k:          2.82

x-per_language_corpus decreases fp a little (24 to 10) but increases
fn and unsure much more, and the real cost rises from $434.45 to
$543.15. So the x-per_language_corpus feature shall be thrown away
(the database will be compatible with the original again).

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2004-10-06 02:35

Message:
Logged In: YES
user_id=529503

Update for 1.0-final.

- Normalize Unicode'ified texts by Normalization Form KC (NFKC).
- HTTP charset is fixed to UTF-8. Option [html_ui] http_charset
  was removed.
- Some bug fixes.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-12-08 16:34

Message:
Logged In: YES
user_id=529503

patch 1.0.

Per-language corpus.

Ham/spam ratios differ by the language of the message.
This affects performance.

NOTE: Format of corpus has been changed. It now contains per-language
nham/nspam info and wordinfo. PICKLE_VERSION is 6.

New configuration option:

  [Tokenizer] per_language_corpus

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-29 21:17

Message:
Logged In: YES
user_id=529503

o hammie.py / sb_filter.py / sb_xmlrpcserver.py:
  - clues in X-Spambayes-Evidence: header will be MIME header encoded.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-26 21:31

Message:
Logged In: YES
user_id=529503

server patch 1.0a7-0.6

o Dibbler performs HTTP charset conversion (to/from internal UTF-8).

o New configuration option: [html_ui] http_charset

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-11-26 13:13

Message:
Logged In: YES
user_id=552329

Added the sb_dbexpimp.py patch (v1.3). Will look at the rest,
shortly - thanks for your patience!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-11-12 01:01

Message:
Logged In: YES
user_id=529503

o db_expimp.py is incompatible again. It exports / imports data as
  UTF-8.

o Unicode'ified sb_server.py.
  - HTTP charset is UTF-8.
  - clues in X-Spambayes-Evidence will be MIME header encoded.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-29 23:29

Message:
Logged In: YES
user_id=529503

OK. I'll keep testing the code until it is added.

minor fix: 'replace_nonascii_chars' option works correctly, etc.
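
For readers following this item, the bigram idea from the initial
comment can be sketched roughly as below. This is an illustration
only, not code from the patch: the names CJ_RUN and tokenize_cj and
the character ranges are assumptions, only BMP code points are
covered, and hiragana and katakana are treated as one kind of run.

```python
import re
import unicodedata

# Hypothetical character classes (BMP only). The patch's real regular
# expressions are not shown in this thread.
CJ_RUN = re.compile(
    "(?P<kanji>[\u4e00-\u9fff]+)"   # CJK unified ideographs
    "|(?P<kana>[\u3040-\u30ff]+)"   # hiragana and katakana blocks
)


def tokenize_cj(text):
    """Return bigram tokens for each run of kanji or kana in text."""
    # NFKC folds compatibility forms such as half-width katakana.
    text = unicodedata.normalize("NFKC", text)
    tokens = []
    for match in CJ_RUN.finditer(text):
        run = match.group()
        if len(run) == 1:
            # Keep single-kanji words; drop single hiragana/katakana.
            if match.lastgroup == "kanji":
                tokens.append(run)
            continue
        # Overlapping two-character windows over the run.
        tokens.extend(run[i:i + 2] for i in range(len(run) - 1))
    return tokens


if __name__ == "__main__":
    print(tokenize_cj("東京都"))
```

A run of k characters yields k - 1 tokens, so the vocabulary is
bounded by n^2 pairs for a character set of size n, which is the
growth estimate given in the initial comment.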
----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2003-10-21 16:52

Message:
Logged In: YES
user_id=552329

Just a wee note to say thanks for this, and that someone will get to
looking at adding this in, but everyone's pretty busy with other stuff
at the moment!

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-19 23:02

Message:
Logged In: YES
user_id=529503

fix for Korean messages. Hangul phrases/words can be of 1 or 2 chars.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 22:19

Message:
Logged In: YES
user_id=529503

minor fix.

----------------------------------------------------------------------

Comment By: Hatuka*nezumi (hatukanezumi)
Date: 2003-10-17 01:52

Message:
Logged In: YES
user_id=529503

> ISO/IEC 2022 encoding scheme, with ASCII and
> multibyte character set both designated to GL,

Not 'designate'. 'Invoke' is correct.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498105&aid=824651&group_id=61702

From noreply at sourceforge.net Mon Jun 12 17:36:08 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 12 Jun 2006 08:36:08 -0700
Subject: [spambayes-bugs] [ spambayes-Bugs-1449691 ] Cannot see child folders in "Browse..." to set training.
Message-ID:

Bugs item #1449691, was opened at 2006-03-14 10:57
Message generated for change (Settings changed) made by sunapee
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1449691&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Outlook
Group: 1.0.4
>Status: Open
>Resolution: Wont Fix
>Priority: 6
Submitted By: Sunapee (sunapee)
Assigned to: Nobody/Anonymous (nobody)
Summary: Cannot see child folders in "Browse..." to set training.

Initial Comment:
Outlook 2003 on Exchange Server 2000
SpamBayes v 1.0.4

I have been using SpamBayes for a long time, with no problems.
Yesterday it started to act up and I uninstalled it, removed its
toolbar in Outlook, and reinstalled it.

The install seemed to go fine, but when I try to select the folders to
train on, I cannot get my mailbox to expand. When I click on the +
sign, nothing happens. When I try to check my in box, I am told that
it is a top level folder and to select a child folder. But I cannot
get to my child folders....

Any help would be appreciated!

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2006-04-02 06:31

Message:
Logged In: YES
user_id=552329

This was a problem with the version of pywin32 at the time, and has
been since fixed (so 1.1a2 will include the fix, or if there is a
1.0.5 that will include it).

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1449691&group_id=61702

From noreply at sourceforge.net Mon Jun 12 17:40:17 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 12 Jun 2006 08:40:17 -0700
Subject: [spambayes-bugs] [ spambayes-Bugs-1449691 ] Cannot see child folders in "Browse..." to set training.
Message-ID:

Bugs item #1449691, was opened at 2006-03-14 10:57
Message generated for change (Comment added) made by sunapee
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1449691&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: Outlook
Group: 1.0.4
Status: Open
Resolution: Wont Fix
Priority: 6
Submitted By: Sunapee (sunapee)
Assigned to: Nobody/Anonymous (nobody)
Summary: Cannot see child folders in "Browse..." to set training.

Initial Comment:
Outlook 2003 on Exchange Server 2000
SpamBayes v 1.0.4

I have been using SpamBayes for a long time, with no problems.
Yesterday it started to act up and I uninstalled it, removed its
toolbar in Outlook, and reinstalled it.

The install seemed to go fine, but when I try to select the folders to
train on, I cannot get my mailbox to expand. When I click on the +
sign, nothing happens. When I try to check my in box, I am told that
it is a top level folder and to select a child folder. But I cannot
get to my child folders....

Any help would be appreciated!

----------------------------------------------------------------------

>Comment By: Sunapee (sunapee)
Date: 2006-06-12 10:40

Message:
Logged In: YES
user_id=1475290

I just re-opened this after I installed "spambayes-1.1a2.exe" and
still have the problem. Prior to installing, I uninstalled the old
version and deleted my spambayes directory under Program Files and the
one for my user in Documents & Settings.

----------------------------------------------------------------------

Comment By: Tony Meyer (anadelonbrin)
Date: 2006-04-02 06:31

Message:
Logged In: YES
user_id=552329

This was a problem with the version of pywin32 at the time, and has
been since fixed (so 1.1a2 will include the fix, or if there is a
1.0.5 that will include it).

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1449691&group_id=61702

From noreply at sourceforge.net Mon Jun 26 17:21:16 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Mon, 26 Jun 2006 08:21:16 -0700
Subject: [spambayes-bugs] [ spambayes-Bugs-1512794 ] web interface does not display scores
Message-ID:

Bugs item #1512794, was opened at 2006-06-26 17:21
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1512794&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: 1.0.4
Status: Open
Resolution: None
Priority: 5
Submitted By: Alexandre Fayolle (afayolle)
Assigned to: Nobody/Anonymous (nobody)
Summary: web interface does not display scores

Initial Comment:
Hi,

I'm the debian maintainer of the spambayes package, and I received the
following bug report (available online at
http://bugs.debian.org/374496)

Package: spambayes
Version: 1.0.4-1
Severity: normal

The display_score configuration option states that it will show a
column in the web interface scoring page, that indicates the score of
a message. It also states it has a dependency: "Note that in order to
use this option, you must also enable the option to include the score
in the message headers."

With both of these options enabled, if you access the web interface to
score messages, it displays a question mark in the Score column,
instead of the score of the message.

Examining the message shows that it does get a header containing a
score (this is the only spambayes related header I could find):

X-Spambayes-Classification: ham; 0.00

The spambayes configuration is as follows:

[html_ui]
allow_remote_connections:192.168.0.*
display_score=True
http_authentication:Basic
http_password:*********
http_user_name:*********
display_headers:Subject From
spam_discard_level:95.0
default_ham_action:discard
default_spam_action:discard

[Storage]
persistent_storage_file=/home/paul/.spambayes/hammiedb
messageinfo_storage_file=/home/paul/.spambayes/spambayes.messageinfo.db

[Headers]
include_score=True

[Tokenizer]
mine_received_headers=True

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1512794&group_id=61702

From noreply at sourceforge.net Thu Jun 29 18:05:53 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu, 29 Jun 2006 09:05:53 -0700
Subject: [spambayes-bugs] [ spambayes-Bugs-1514409 ] sb_filter.py problem with spam using broken MIME
Message-ID:

Bugs item #1514409, was opened at 2006-06-29 18:05
Message generated for change (Tracker Item Submitted) made by Item Submitter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1514409&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Alexandre Fayolle (afayolle)
Assigned to: Nobody/Anonymous (nobody)
Summary: sb_filter.py problem with spam using broken MIME

Initial Comment:
Hi,

I received the following bug report on the Debian package of spambayes
(http://bugs.debian.org/375990):

Package: spambayes
Version: 1.0.3-1
Severity: normal

sb_filter.py does not deal well with broken MIME messages.
I've been receiving spams with broken headers lately:

From aayeaouoauyuaoea at cyber.net.pk Thu Jun 29 16:32:36 2006
...
Subject: Incredible ...
Date: Thu, 29 Jun 2006 06:32:29 -0800
MIME-Version: 1.0
Content-Type: multipart/related;

        boundary="----=_NextPart_000_000B_01C69AEB.282715C0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
[etc]

Note the extra blank line in the Content-Type header.

When this message gets filtered through sb_filter.py, most of the
headers are discarded, and I get

X-Spambayes-Classification: spam; 0.94
        boundary="----=_NextPart_000_000B_01C69AEB.282715C0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
[etc]

I use sb_filter.py from procmail, which then appends the email to a
Unix mbox file. Since the 'From ' pseudo-header was removed by
sb_filter.py, the remaining text gets concatenated to the last email
in the mailbox.

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1514409&group_id=61702

From noreply at sourceforge.net Fri Jun 30 03:00:07 2006
From: noreply at sourceforge.net (SourceForge.net)
Date: Thu, 29 Jun 2006 18:00:07 -0700
Subject: [spambayes-bugs] [ spambayes-Bugs-1514409 ] sb_filter.py problem with spam using broken MIME
Message-ID:

Bugs item #1514409, was opened at 2006-06-30 04:05
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1514409&group_id=61702

Please note that this message will contain a full copy of the comment
thread, including the initial issue submission, for this request,
not just the latest update.

Category: None
Group: None
>Status: Closed
>Resolution: Wont Fix
Priority: 5
Submitted By: Alexandre Fayolle (afayolle)
>Assigned to: Tony Meyer (anadelonbrin)
Summary: sb_filter.py problem with spam using broken MIME

Initial Comment:
Hi,

I received the following bug report on the Debian package of spambayes
(http://bugs.debian.org/375990):

Package: spambayes
Version: 1.0.3-1
Severity: normal

sb_filter.py does not deal well with broken MIME messages.
I've been receiving spams with broken headers lately:

From aayeaouoauyuaoea at cyber.net.pk Thu Jun 29 16:32:36 2006
...
Subject: Incredible ...
Date: Thu, 29 Jun 2006 06:32:29 -0800
MIME-Version: 1.0
Content-Type: multipart/related;

        boundary="----=_NextPart_000_000B_01C69AEB.282715C0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
[etc]

Note the extra blank line in the Content-Type header.

When this message gets filtered through sb_filter.py, most of the
headers are discarded, and I get

X-Spambayes-Classification: spam; 0.94
        boundary="----=_NextPart_000_000B_01C69AEB.282715C0"
X-Priority: 3
X-MSMail-Priority: Normal
X-Mailer: Microsoft Outlook Express 6.00.2800.1506
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2800.1506
[etc]

I use sb_filter.py from procmail, which then appends the email to a
Unix mbox file. Since the 'From ' pseudo-header was removed by
sb_filter.py, the remaining text gets concatenated to the last email
in the mailbox.

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2006-06-30 13:00

Message:
Logged In: YES
user_id=552329

There are too many ways for mail to be malformed for it to be worth
trying to handle them in SpamBayes. If you use version 3 or above of
the Python email package (this is included in Python 2.4+, or you can
install it separately) then it handles all of these problems.
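
As a rough illustration of the point about the email package, the
sketch below feeds a message shaped like the one in the report to the
parser in a current Python 3 standard library (a descendant of the
version 3 package mentioned above). The sample text in RAW is an
assumption modelled on the report, not the original spam, and the
sketch says nothing about what sb_filter.py itself does with the
result.

```python
import email
from email import errors

# Hypothetical sample: a blank line splits the Content-Type header, so
# the boundary parameter and the headers after it fall into the body.
RAW = (
    "Subject: Incredible\n"
    "MIME-Version: 1.0\n"
    "Content-Type: multipart/related;\n"
    "\n"
    '\tboundary="----=_NextPart_000_000B_01C69AEB.282715C0"\n'
    "X-Priority: 3\n"
    "\n"
    "body text\n"
)

# The parser does not raise on this input; it records defects instead.
msg = email.message_from_string(RAW)

# Headers before the stray blank line are still available.
subject = msg["Subject"]

# The boundary was never seen as a header parameter.
boundary = msg.get_boundary()

# Names of the problems the parser noticed while reading the message.
defect_names = [type(d).__name__ for d in msg.defects]

if __name__ == "__main__":
    print(subject, boundary, defect_names)
```

The parser cannot recover the lost boundary, but it keeps the headers
it did read and reports the damage through msg.defects, so a caller
can decide what to do with such a message.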

----------------------------------------------------------------------

You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=498103&aid=1514409&group_id=61702