[Spambayes] Tokenizing ideas (images, attachments)

Meyer, Tony T.A.Meyer at massey.ac.nz
Wed Aug 27 20:45:46 EDT 2003


> Yeah, I read that FAQ, I'm currently just learning Python.

All the better to practice with <wink>.  Seriously, if you are confident
that an improvement will have a significant effect, then write it up in
detail and submit it as a feature request.  Someone will get around to
trying it and posting a patch.  Unless it's an amazing idea, you have to
be willing to run the tests as well, though, otherwise we end up with
the (common at the moment) situation where we have inconclusive data
about an option because only a couple of people have done the testing.

> I don't see any url:tokens, I use the Outlook plugin, perhaps the problem
> is there, it does not use the HTMLBody property?

In your 'show clues', I don't see a body at all!  See my 'show clues'
for your message at the end.  If you go right down to the bottom you'll
see some url:tokens.  What version of the plug-in are you using?  The
'message stream' bit of the 'show clues' should match what you see in
Outlook.
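
For illustration, here is a rough sketch of how a URL could be broken
into the proto: and url: tokens that show up in the clues.  This is not
the SpamBayes tokenizer itself; the function name and the splitting rule
are assumptions made up for the example.

```python
import re

def sketch_url_tokens(text):
    """Yield proto:/url: style tokens for each http(s)/ftp URL in text.

    Hypothetical helper for illustration only, not SpamBayes code.
    """
    pattern = r"(https?|ftp)://([^\s<>\"']+)"
    for proto, rest in re.findall(pattern, text, flags=re.I):
        yield "proto:" + proto.lower()
        # Split the remainder on common URL punctuation and drop empties.
        for piece in re.split(r"[/.:?&=;]+", rest):
            if piece:
                yield "url:" + piece.lower()

tokens = list(sketch_url_tokens(
    "http://mail.python.org/mailman/listinfo/spambayes"))
print(tokens)
```

Run on the list footer URL, this yields proto:http plus url:mail,
url:python, url:org, url:mailman, url:listinfo and url:spambayes, which
are the sort of tokens to look for at the bottom of the list.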

> Btw, what does header:Received:1 and header:User-Agent:1 mean? Does
> SpamBayes have an internal black list?

No, there's no black or white listing.

> Also Date:1, From:1,
> MIME-Version:1 etc, what do they mean? :-)

One set of tokens generated is simply the number of times each header
line appears in the message.  Tokens like header:Received:1 and
header:Date:1 are just saying that each of those headers is there once.
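
A minimal sketch of the idea, using the standard library email module.
The helper name is hypothetical and this is not how SpamBayes itself is
written; it only shows how counting header lines gives tokens of the
form header:Name:count.

```python
from collections import Counter
from email import message_from_string

def sketch_header_count_tokens(raw_message):
    """Yield one 'header:<Name>:<count>' token per distinct header name.

    Hypothetical helper for illustration only, not SpamBayes code.
    """
    msg = message_from_string(raw_message)
    # msg.items() keeps repeated headers, so Counter sees each occurrence.
    counts = Counter(name for name, _value in msg.items())
    for name, count in sorted(counts.items()):
        yield "header:%s:%d" % (name, count)

raw = (
    "Received: from a.example by b.example\n"
    "Received: from c.example by a.example\n"
    "From: someone@example.com\n"
    "Date: Wed, 27 Aug 2003 10:22:54 +0300\n"
    "\n"
    "body\n"
)
header_tokens = list(sketch_header_count_tokens(raw))
print(header_tokens)
```

A message that has come through a mailing list picks up several
Received: lines, so it gets something like header:Received:7 rather than
header:Received:1.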

=Tony Meyer

---

Spam Score: 0% (0)

word                                spamprob         #ham  #spam
'*H*'                               1                   -      -
'*S*'                               0                   -      -
'content-type:text/plain subject:Spambayes' 0.00195626        159      1
'subject:: ['                       0.00298211         75      0
'subject:Spambayes'                 0.00393682        167      5
'&gt;'                              0.00520659         59      1
'email addr:python.org'             0.00536099        171      8
'the main'                          0.00542823         41      0
'aug'                               0.00612213         50      1
'1.0'                               0.00693374         32      0
'the problem'                       0.00820669         37      1
'[spambayes]'                       0.00884086         25      0
'spambayes'                         0.00934694         69      5
'with microsoft'                    0.0104895          21      0
'suggested'                         0.0110024          20      0
'tony'                              0.0114217          41      3
'url:mailman'                       0.0119871         184     23
'skip:_ 40'                         0.0122475         180     23
'url:python'                        0.0126601         174     23
'outlook'                           0.0129152          56      6
'meyer'                             0.0135191          22      1
'proto:http url:mail'               0.0136382         174     25
'url:spambayes proto:http'          0.0136778          16      0
'meyer,'                            0.0145631          15      0
'worthwhile'                        0.0145631          15      0
'(this'                             0.0155304          19      1
'the url'                           0.0155709          14      0
'perhaps'                           0.0162613          44      6
'check the'                         0.0163408          18      1
'(see'                              0.0174785          31      4
'message-----'                      0.0182364          25      3
'empty'                             0.0182454          16      1
'url:listinfo'                      0.0191548         186     39
'skip:f 20'                         0.0195131          19      2
"didn't have"                       0.0196507          11      0
'sent:'                             0.0197086          23      3
'main'                              0.0197377          62     12
're:'                               0.0198445          66     13
'internal'                          0.0206047          26      4
'(or'                               0.0208107          38      7
'header:Received:7 header:From:1'   0.0217905         147     35
'url'                               0.0219612          32      6
'microsoft'                         0.0222952          43      9
'url:spambayes'                     0.0236263          26      5
'subject:] '                        0.0238125         250     67
'header:Received:7'                 0.0245479         151     41
'like the'                          0.0258482          50     13
"but didn't"                        0.0266272           8      0
'skip:s 10 skip:c 10'               0.0266272           8      0
'look the'                          0.0275098          19      4
'sender:addr:spambayes-bounces'     0.0275098          19      4
'email name:spambayes'              0.0275637          22      5
':-)'                               0.0280665          10      1
'skip:s 20'                         0.0285679         126     40
'version'                           0.0296442         101     33
'skip:- 10'                         0.0299704          57     18
'though, and'                       0.0302013           7      0
'url.'                              0.0302013           7      0
'containing'                        0.0307464          25      7
'the message'                       0.0307464          25      7
'sender:no real name:2**0 reply-to:none' 0.0307997         186     65
'not that'                          0.0308363           9      1
'to:addr:spambayes'                 0.0315122          19      5
'to:addr:python.org to:no real name:2**0' 0.0318036         153     55
'sender:addr:python.org'            0.0328382         174     65
'tue,'                              0.0342154           8      1
"don't see"                         0.0348837           6      0
'enough people'                     0.0348837           6      0
'for message'                       0.0348837           6      0
'reason why'                        0.0348837           6      0
"skip:' 20"                         0.0348837           6      0
'url,'                              0.0348837           6      0
'why not'                           0.0368007          16      5
'use the'                           0.0389043          65     28
'related'                           0.0389254          41     17
'python.'                           0.0412844           5      0
"there's reason"                    0.0412844           5      0
'case,'                             0.0415825          24     10
'testing'                           0.0417589          36     16
'problem'                           0.0419707          64     30
'this,'                             0.0429501          31     14
'skip:( 10'                         0.04336            93     46
'=tony'                             0.04384             6      1
'faq'                               0.04384             6      1
'have enough'                       0.04384             6      1
'skip:& 70'                         0.04384             6      1
'yeah,'                             0.04384             6      1
'skip:c 20'                         0.0444757          45     22
'skip:x 20'                         0.0456534           9      3
'(as'                               0.0456727          29     14
'into the'                          0.0460221          69     36
'date:'                             0.0461054          16      7
'the one'                           0.0492735          25     13
'what they'                         0.0498514          13      6
'email name:[mailto:t.a.meyer'      0.0505618           4      0
'fine,'                             0.0505618           4      0
'ones,'                             0.0505618           4      0
'integrate'                         0.05104             5      1
'2.0'                               0.0553604          16      9
'headers'                           0.0554278          13      7
'there,'                            0.0599594          16     10
'not use'                           0.0611158           4      1
"i'm"                               0.0648698          91     70
'(look'                             0.0652174           3      0
'one you'                           0.0652174           3      0
'see any'                           0.0652174           3      0
'token'                             0.0652174           3      0
'spam'                              0.0669622          51     40
'image'                             0.0673626          25     19
"skip:' 10"                         0.0736107          16     13
'you look'                          0.0772416          12     10
'skip:s 10'                         0.0780672         232    221
'url:html proto:http'               0.0792337          65     62
'should'                            0.0842324         153    158
'skip:u 10'                         0.0868949          94    100
'skip:g 10'                         0.0893932          65     71
'(utc)'                             0.0918367           2      0
'+0000'                             0.0918367           2      0
'151'                               0.0918367           2      0
'[...]'                             0.0918367           2      0
'attachments)'                      0.0918367           2      0
'empty,'                            0.0918367           2      0
'faq,'                              0.0918367           2      0
'mail internet'                     0.0918367           2      0
'not many'                          0.0918367           2      0
'read that'                         0.0918367           2      0
'subject:images'                    0.0918367           2      0
'tokenize'                          0.0918367           2      0
'tokenizing'                        0.0918367           2      0
'url:faq url:html'                  0.0918367           2      0
'url:net url:faq'                   0.0918367           2      0
'clues'                             0.0923291           3      2
'used'                              0.0963302          88    105
'101'                               0.102015            2      1
'times the'                         0.102015            2      1
'skip:d 10'                         0.102175          122    156
'and skip:h 10'                     0.103502            4      4
'unique'                            0.103877           26     33
'but'                               0.10437           198    260
'skip:2 10'                         0.107069            6      7
'that any'                          0.112918            7      9
'it.'                               0.115918           91    134
'does'                              0.118972           89    135
'information the'                   0.120233           11     16
'this out'                          0.12355             2      2
'black'                             0.124805           13     20
'this message'                      0.878774           16   1314
'170'                               0.908163            0      2
'message-id:'                       0.908163            0      2
'subject:ideas'                     0.969799            0      7
'skip:c 10 text/html'               0.998849            0    195
'text/html'                         0.998849            0    195
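
The '*H*' and '*S*' rows and the overall score fit together in a simple
way.  The sketch below assumes the score is (S - H + 1) / 2, which
matches both sets of figures quoted in this message; it does not show
how H and S themselves are worked out, and the function name is made up
for the example.

```python
def sketch_combined_score(h, s):
    """Combine the '*H*' and '*S*' figures into a single score in [0, 1].

    Assumes score = (S - H + 1) / 2; illustration only.
    """
    return (s - h + 1.0) / 2.0

# Figures from the table above: H = 1, S = 0.
tony_score = sketch_combined_score(1.0, 0.0)
# Figures from the quoted 'show clues' further down.
harri_score = sketch_combined_score(5.3187e-05, 0.995817)

print("%.6f" % tony_score)   # prints 0.000000
print("%.6f" % harri_score)  # prints 0.997882
```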

Message Stream:


X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from its-xchg5.massey.ac.nz ([130.123.129.15]) by
	its-xchg4.massey.ac.nz with Microsoft SMTPSVC(5.0.2195.5329); 
	Wed, 27 Aug 2003 19:23:09 +1200
Received: from its-mail1.massey.ac.nz ([130.123.128.11]) by
	its-xchg5.massey.ac.nz with Microsoft SMTPSVC(5.0.2195.5329); 
	Wed, 27 Aug 2003 19:23:09 +1200
Received: from its-mm1.massey.ac.nz (its-mm1.massey.ac.nz [130.123.128.45])
	by its-mail1.massey.ac.nz (8.9.3/8.9.3) with ESMTP id TAA13247;
	Wed, 27 Aug 2003 19:23:09 +1200 (NZST)
Received: from mu-relay1.massey.ac.nz (Not Verified[130.123.2.98]) by
	its-mm1.massey.ac.nz with NetIQ MailMarshal
	id <B001b4b27c>; Wed, 27 Aug 2003 19:23:08 +1200
Received: from mail.python.org (mail.python.org [12.155.117.29])
	by mu-relay1.massey.ac.nz (Postfix) with ESMTP id A535B3763E
	for <t.a.meyer at massey.ac.nz>; Wed, 27 Aug 2003 19:23:07 +1200 (NZST)
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
	by mail.python.org with esmtp (Exim 4.05)
	id 19rudy-0005kY-01; Wed, 27 Aug 2003 03:23:02 -0400
Received: from postman.wicom.com ([62.236.218.18] helo=postman.merlin.fi)
	by mail.python.org with esmtp (Exim 4.05) id 19rudu-0005kH-00
	for spambayes at python.org; Wed, 27 Aug 2003 03:22:58 -0400
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)
Date: Wed, 27 Aug 2003 10:22:54 +0300
Message-ID: <FC2AA30A6C037640BBDC45785F0E9B26076325 at postman.wicom.com>
From: "Harri Pesonen" <harri.pesonen at wicom.com>
To: <spambayes at python.org>
X-Spam-Status: OK (default 0.000)
X-BeenThere: spambayes at python.org
X-Mailman-Version: 2.1.2
Precedence: list
List-Id: Discussion list for Pythonic Bayesian classifier
	<spambayes.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/spambayes>,
	<mailto:spambayes-request at python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/spambayes>
List-Post: <mailto:spambayes at python.org>
List-Help: <mailto:spambayes-request at python.org?subject=help>
List-Subscribe: <http://mail.python.org/mailman/listinfo/spambayes>,
	<mailto:spambayes-request at python.org?subject=subscribe>
Sender: spambayes-bounces at python.org
Errors-To: spambayes-bounces at python.org
Return-Path: spambayes-bounces at python.org
X-OriginalArrivalTime: 27 Aug 2003 07:23:09.0557 (UTC)
	FILETIME=[11182A50:01C36C6C]

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.0.6396.0">
<TITLE>RE: [Spambayes] Tokenizing ideas (images, attachments)</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->

<P><FONT SIZE=2>Yeah, I read that FAQ, I'm currently just learning
Python. I don't see</FONT>

<BR><FONT SIZE=2>any url:tokens, I use the Outlook plugin, perhaps the
problem is there,</FONT>

<BR><FONT SIZE=2>it does not use the HTMLBody property?</FONT>
</P>

<P><FONT SIZE=2>Btw, what does header:Received:1 and header:User-Agent:1
mean? Does</FONT>

<BR><FONT SIZE=2>SpamBayes have an internal black list? Also Date:1,
From:1,</FONT>

<BR><FONT SIZE=2>MIME-Version:1 etc, what do they mean? :-)</FONT>
</P>

<P><FONT SIZE=2>Spam Score: 0.997882</FONT>
</P>

<P><FONT
SIZE=2>word&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
spamprob&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; #ham&nbsp;
#spam</FONT>

<BR><FONT
SIZE=2>'*H*'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
5.3187e-005&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
-&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -</FONT>

<BR><FONT
SIZE=2>'*S*'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.995817&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; -&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; -</FONT>

<BR><FONT
SIZE=2>'subjectcharset:iso-8859-1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;
0.15272&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
; 57&nbsp;&nbsp;&nbsp;&nbsp; 12</FONT>

<BR><FONT SIZE=2>'subject: -
'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.241103&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
64&nbsp;&nbsp;&nbsp;&nbsp; 24</FONT>

<BR><FONT
SIZE=2>'reply-to:none'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.340317&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
349&nbsp;&nbsp;&nbsp; 214</FONT>

<BR><FONT
SIZE=2>'header:Date:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.616644&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
232&nbsp;&nbsp;&nbsp; 444</FONT>

<BR><FONT
SIZE=2>'header:From:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.617705&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
232&nbsp;&nbsp;&nbsp; 446</FONT>

<BR><FONT
SIZE=2>'header:MIME-Version:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.61773&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
207&nbsp;&nbsp;&nbsp; 398</FONT>

<BR><FONT SIZE=2>'to:no real
name:2**0'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;
0.648347&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
170&nbsp;&nbsp;&nbsp; 373</FONT>

<BR><FONT
SIZE=2>'header:Return-Path:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.680694&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
175&nbsp;&nbsp;&nbsp; 444</FONT>

<BR><FONT
SIZE=2>'header:Message-ID:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.6998&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
151&nbsp;&nbsp;&nbsp; 419</FONT>

<BR><FONT
SIZE=2>'to:addr:merlin.fi'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.723659&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
101&nbsp;&nbsp;&nbsp; 315</FONT>

<BR><FONT
SIZE=2>'subject:!'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;
0.733392&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
13&nbsp;&nbsp;&nbsp;&nbsp; 43</FONT>

<BR><FONT
SIZE=2>'from:addr:aboydhd'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'from:addr:merlin.net.au'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p;&nbsp;&nbsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT SIZE=2>'from:name:amalia
boyd'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&
nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'message-id:@merlin.net.au'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'subject:Chance'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&
nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'subject:Last'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb
sp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb
sp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'subject:blowout'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT
SIZE=2>'subject:inventory'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.82569&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 1</FONT>

<BR><FONT SIZE=2>'subject:&nbsp;
'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp
;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.916657&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; 4&nbsp;&nbsp;&nbsp;&nbsp; 55</FONT>

<BR><FONT
SIZE=2>'subject:Citrate'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.924304&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3</FONT>

<BR><FONT
SIZE=2>'subject:Sildenafil'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nb
sp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.924304&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; 0&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; 3</FONT>

<BR><FONT
SIZE=2>'header:Received:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.958977&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; 5&nbsp;&nbsp;&nbsp; 145</FONT>

<BR><FONT
SIZE=2>'header:User-Agent:1'&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&n
bsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;
0.986969&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbs
p; 0&nbsp;&nbsp;&nbsp;&nbsp; 20</FONT>
</P>

<P><FONT SIZE=2>Message Stream:</FONT>
</P>

<P><FONT SIZE=2>X-MS-Mail-Gibberish: Microsoft Mail Internet Headers
Version 2.0</FONT>

<BR><FONT SIZE=2>Received: from thing.de ([67.122.162.175]) by
postman.merlin.fi with</FONT>

<BR><FONT SIZE=2>Microsoft</FONT>

<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT
SIZE=2>SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300</FONT>

<BR><FONT SIZE=2>User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us)
Gecko/25250101</FONT>

<BR><FONT SIZE=2>From: &quot;Amalia Boyd&quot;
&lt;aboydhd at merlin.net.au&gt;</FONT>

<BR><FONT SIZE=2>Date: Tue, 26 Aug 2003 18:33:58 +0000</FONT>

<BR><FONT SIZE=2>Message-ID:
&lt;3F4BA816.730F2157 at merlin.net.au&gt;</FONT>

<BR><FONT SIZE=2>To: harri.pesonen at merlin.fi</FONT>

<BR><FONT SIZE=2>MIME-Version: 1.0</FONT>

<BR><FONT SIZE=2>Subject:</FONT>

<BR><FONT
SIZE=2>=?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludm
VudG9ye</FONT>

<BR><FONT SIZE=2>SBibG93b3V0IQ==?=</FONT>

<BR><FONT SIZE=2>Content-Type: text/html</FONT>

<BR><FONT SIZE=2>Content-Transfer-Encoding: 8bit</FONT>

<BR><FONT SIZE=2>Return-Path: aboydhd at merlin.net.au</FONT>

<BR><FONT SIZE=2>X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988
(UTC)</FONT>

<BR>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <FONT
SIZE=2>FILETIME=[A637F9C0:01C36C22]</FONT>
</P>

<P><FONT SIZE=2>Message Tokens:</FONT>
</P>

<P><FONT SIZE=2>33 unique tokens</FONT>
</P>

<P><FONT SIZE=2>'cc:none'</FONT>

<BR><FONT SIZE=2>'content-type:text/plain'</FONT>

<BR><FONT SIZE=2>'from:addr:aboydhd'</FONT>

<BR><FONT SIZE=2>'from:addr:merlin.net.au'</FONT>

<BR><FONT SIZE=2>'from:name:amalia boyd'</FONT>

<BR><FONT SIZE=2>'header:Date:1'</FONT>

<BR><FONT SIZE=2>'header:From:1'</FONT>

<BR><FONT SIZE=2>'header:MIME-Version:1'</FONT>

<BR><FONT SIZE=2>'header:Message-ID:1'</FONT>

<BR><FONT SIZE=2>'header:Received:1'</FONT>

<BR><FONT SIZE=2>'header:Return-Path:1'</FONT>

<BR><FONT SIZE=2>'header:Subject:1'</FONT>

<BR><FONT SIZE=2>'header:To:1'</FONT>

<BR><FONT SIZE=2>'header:User-Agent:1'</FONT>

<BR><FONT SIZE=2>'message-id:@merlin.net.au'</FONT>

<BR><FONT SIZE=2>'reply-to:none'</FONT>

<BR><FONT SIZE=2>'sender:none'</FONT>

<BR><FONT SIZE=2>'subject: '</FONT>

<BR><FONT SIZE=2>'subject: '</FONT>

<BR><FONT SIZE=2>'subject: - '</FONT>

<BR><FONT SIZE=2>'subject:!'</FONT>

<BR><FONT SIZE=2>'subject:Chance'</FONT>

<BR><FONT SIZE=2>'subject:Citrate'</FONT>

<BR><FONT SIZE=2>'subject:Last'</FONT>

<BR><FONT SIZE=2>'subject:Sildenafil'</FONT>

<BR><FONT SIZE=2>'subject:blowout'</FONT>

<BR><FONT SIZE=2>'subject:inventory'</FONT>

<BR><FONT SIZE=2>'subjectcharset:iso-8859-1'</FONT>

<BR><FONT SIZE=2>'to:2**0'</FONT>

<BR><FONT SIZE=2>'to:addr:harri.pesonen'</FONT>

<BR><FONT SIZE=2>'to:addr:merlin.fi'</FONT>

<BR><FONT SIZE=2>'to:no real name:2**0'</FONT>

<BR><FONT SIZE=2>'x-mailer:none'</FONT>
</P>

<P><FONT SIZE=2>-----Original Message-----</FONT>

<BR><FONT SIZE=2>From: Meyer, Tony [<A
HREF="mailto:T.A.Meyer at massey.ac.nz">mailto:T.A.Meyer at massey.ac.nz</A>]
</FONT>

<BR><FONT SIZE=2>Sent: 27. elokuuta 2003 10:08</FONT>

<BR><FONT SIZE=2>To: Harri Pesonen; spambayes at python.org</FONT>

<BR><FONT SIZE=2>Subject: RE: [Spambayes] Tokenizing ideas (images,
attachments)</FONT>
</P>
<BR>

<P><FONT SIZE=2>&gt; Why not tokenize image URLs?</FONT>

<BR><FONT SIZE=2>[...]</FONT>

<BR><FONT SIZE=2>&gt; While SpamBayes detected this message just
fine,</FONT>
</P>

<P><FONT SIZE=2>There's a reason why not ;)</FONT>
</P>

<P><FONT SIZE=2>&gt; Many times the message is empty or almost</FONT>

<BR><FONT SIZE=2>&gt; empty, containing only an image URL.</FONT>
</P>

<P><FONT SIZE=2>Not that any URL, including image ones, is
tokenized.&nbsp; If you look at</FONT>

<BR><FONT SIZE=2>the clues for a message like the one you used as an
example, you should</FONT>

<BR><FONT SIZE=2>see some url: tokens.</FONT>
</P>

<P><FONT SIZE=2>It has been suggested that tokenizing (textual)
information at the end</FONT>

<BR><FONT SIZE=2>of the URL would be worthwhile (this includes a token
if the URL 404s).</FONT>

<BR><FONT SIZE=2>We tested this out (look at the urlslurper.py file),
but didn't have</FONT>

<BR><FONT SIZE=2>enough people testing to integrate it into the main
code (as a</FONT>

<BR><FONT SIZE=2>default-to-off option).&nbsp; Death2Spam (see the
related page) does this,</FONT>

<BR><FONT SIZE=2>though, and Richard swears by it.</FONT>
</P>

<P><FONT SIZE=2>In any case, the best thing is to try these (or any
other) ideas out.</FONT>

<BR><FONT SIZE=2>See FAQ 6.1:</FONT>
</P>

<P><FONT
SIZE=2>&lt;file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-imple
ment-cool-</FONT>

<BR><FONT SIZE=2>tokenizer-trick-x&gt;</FONT>
</P>

<P><FONT SIZE=2>=Tony Meyer</FONT>
</P>

<P><FONT SIZE=2>_______________________________________________</FONT>

<BR><FONT SIZE=2>Spambayes at python.org</FONT>

<BR><FONT SIZE=2><A
HREF="http://mail.python.org/mailman/listinfo/spambayes">http://mail.pyt
hon.org/mailman/listinfo/spambayes</A></FONT>

<BR><FONT SIZE=2>Check the FAQ before asking: <A
HREF="http://spambayes.sf.net/faq.html">http://spambayes.sf.net/faq.html
</A></FONT>
</P>

</BODY>
</HTML>
Yeah, I read that FAQ, I'm currently just learning Python. I don't see
any url:tokens, I use the Outlook plugin, perhaps the problem is there,
it does not use the HTMLBody property?

Btw, what does header:Received:1 and header:User-Agent:1 mean? Does
SpamBayes have an internal black list? Also Date:1, From:1,
MIME-Version:1 etc, what do they mean? :-)

Spam Score: 0.997882

word                                spamprob         #ham  #spam
'*H*'                               5.3187e-005         -      -
'*S*'                               0.995817            -      -
'subjectcharset:iso-8859-1'         0.15272            57     12
'subject: - '                       0.241103           64     24
'reply-to:none'                     0.340317          349    214
'header:Date:1'                     0.616644          232    444
'header:From:1'                     0.617705          232    446
'header:MIME-Version:1'             0.61773           207    398
'to:no real name:2**0'              0.648347          170    373
'header:Return-Path:1'              0.680694          175    444
'header:Message-ID:1'               0.6998            151    419
'to:addr:merlin.fi'                 0.723659          101    315
'subject:!'                         0.733392           13     43
'from:addr:aboydhd'                 0.82569             0      1
'from:addr:merlin.net.au'           0.82569             0      1
'from:name:amalia boyd'             0.82569             0      1
'message-id:@merlin.net.au'         0.82569             0      1
'subject:Chance'                    0.82569             0      1
'subject:Last'                      0.82569             0      1
'subject:blowout'                   0.82569             0      1
'subject:inventory'                 0.82569             0      1
'subject:  '                        0.916657            4     55
'subject:Citrate'                   0.924304            0      3
'subject:Sildenafil'                0.924304            0      3
'header:Received:1'                 0.958977            5    145
'header:User-Agent:1'               0.986969            0     20

Message Stream:

X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from thing.de ([67.122.162.175]) by postman.merlin.fi with
Microsoft
	SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300
User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101
From: "Amalia Boyd" <aboydhd at merlin.net.au>
Date: Tue, 26 Aug 2003 18:33:58 +0000
Message-ID: <3F4BA816.730F2157 at merlin.net.au>
To: harri.pesonen at merlin.fi
MIME-Version: 1.0
Subject:
=?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludmVudG9ye
SBibG93b3V0IQ==?=
Content-Type: text/html
Content-Transfer-Encoding: 8bit
Return-Path: aboydhd at merlin.net.au
X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988 (UTC)
	FILETIME=[A637F9C0:01C36C22]

Message Tokens:

33 unique tokens

'cc:none'
'content-type:text/plain'
'from:addr:aboydhd'
'from:addr:merlin.net.au'
'from:name:amalia boyd'
'header:Date:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'message-id:@merlin.net.au'
'reply-to:none'
'sender:none'
'subject: '
'subject: '
'subject: - '
'subject:!'
'subject:Chance'
'subject:Citrate'
'subject:Last'
'subject:Sildenafil'
'subject:blowout'
'subject:inventory'
'subjectcharset:iso-8859-1'
'to:2**0'
'to:addr:harri.pesonen'
'to:addr:merlin.fi'
'to:no real name:2**0'
'x-mailer:none'

-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer at massey.ac.nz] 
Sent: 27. elokuuta 2003 10:08
To: Harri Pesonen; spambayes at python.org
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)


> Why not tokenize image URLs?
[...]
> While SpamBayes detected this message just fine,

There's a reason why not ;)

> Many times the message is empty or almost
> empty, containing only an image URL.

Not that any URL, including image ones, is tokenized.  If you look at
the clues for a message like the one you used as an example, you should
see some url: tokens.

It has been suggested that tokenizing (textual) information at the end
of the URL would be worthwhile (this includes a token if the URL 404s).
We tested this out (look at the urlslurper.py file), but didn't have
enough people testing to integrate it into the main code (as a
default-to-off option).  Death2Spam (see the related page) does this,
though, and Richard swears by it.

In any case, the best thing is to try these (or any other) ideas out.
See FAQ 6.1:

<file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-implement-cool-tokenizer-trick-x>

=Tony Meyer

_______________________________________________
Spambayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html

Message Tokens:

313 unique tokens

'"amalia'
'#ham'
'#spam'
'&gt;'
'&quot;amalia'
"'*h*'"
"'*s*'"
"'cc:none'"
"'subject:"
"'subject:!'"
"'to:2**0'"
"'to:no"
'(as'
'(images,'
'(look'
'(or'
'(see'
'(textual)'
'(this'
'(utc)'
'(windows;'
'+0000'
'+0300'
'0.15272'
'0.241103'
'0.340317'
'0.616644'
'0.617705'
'0.61773'
'0.648347'
'0.680694'
'0.6998'
'0.723659'
'0.733392'
'0.82569'
'0.916657'
'0.924304'
'0.958977'
'0.986969'
'0.995817'
'0.997882'
'01:37:36'
'1.0'
'101'
'10:08'
'145'
'151'
'170'
'175'
'18:33:58'
'2.0'
'2003'
'207'
'214'
'232'
'27.'
'315'
'349'
'373'
'398'
'404s).'
'419'
'444'
'446'
'5.3187e-005'
'6.1:'
'8bit'
':-)'
'=tony'
'[...]'
'[spambayes]'
'almost'
'also'
'and'
'any'
'asking:'
'attachments)'
'aug'
'been'
'before'
'best'
'black'
'boyd"'
'boyd&quot;'
"boyd'"
'btw,'
'but'
'case,'
'cc:none'
'check'
'clues'
'code'
'containing'
'content-type:text/plain'
'currently'
'date:'
'date:1,'
'death2spam'
'detected'
"didn't"
'does'
"don't"
'elokuuta'
'email addr:massey.ac.nz]'
'email addr:merlin.fi'
'email addr:merlin.net.au'
'email addr:merlin.net.au&gt;'
"email addr:merlin.net.au'"
'email addr:python.org'
'email name:&lt;3f4ba816.730f2157'
'email name:&lt;aboydhd'
"email name:'message-id:"
'email name:[mailto:t.a.meyer'
'email name:aboydhd'
'email name:harri.pesonen'
'email name:spambayes'
'empty'
'empty,'
'en-us)'
'end'
'enough'
'etc,'
'example,'
'faq'
'faq,'
'file),'
'fine,'
'for'
'from'
'from:'
'from:1,'
'from:addr:harri.pesonen'
'from:addr:wicom.com'
'from:name:harri pesonen'
'harri'
'has'
'have'
'header:Date:1'
'header:Errors-To:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:7'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'headers'
'htmlbody'
"i'm"
'ideas'
'image'
'includes'
'including'
'information'
'integrate'
'internal'
'internet'
'into'
'it.'
'just'
'learning'
'like'
'list?'
'look'
'mail'
'main'
'many'
'mean?'
'message'
'message-----'
'message-id:'
'message-id:@postman.wicom.com'
'meyer'
'meyer,'
'microsoft'
"name:2**0'"
'not'
'nt4.0;'
'one'
'ones,'
'only'
'option).'
'other)'
'out'
'out.'
'outlook'
'page)'
'people'
'perhaps'
'pesonen;'
'plugin,'
'problem'
'property?'
'proto:http'
'python.'
're:'
'read'
'real'
'reason'
'received:'
'related'
'reply-to:none'
'return-path:'
'richard'
'score:'
'see'
'sender:addr:python.org'
'sender:addr:spambayes-bounces'
'sender:no real name:2**0'
'sent:'
'should'
'skip:& 70'
"skip:' 10"
"skip:' 20"
'skip:( 10'
'skip:- 10'
'skip:2 10'
'skip:= 70'
'skip:_ 40'
'skip:c 10'
'skip:c 20'
'skip:d 10'
'skip:f 20'
'skip:g 10'
'skip:h 10'
'skip:m 10'
'skip:p 10'
'skip:s 10'
'skip:s 20'
'skip:t 20'
'skip:u 10'
'skip:x 20'
'some'
'spam'
'spambayes'
'spamprob'
'stream:'
'subject:'
'subject: '
'subject: ('
'subject:)'
'subject:, '
'subject:: ['
'subject:Spambayes'
'subject:Tokenizing'
'subject:] '
'subject:attachments'
'subject:ideas'
'subject:images'
'suggested'
'swears'
'tested'
'testing'
'text/html'
'that'
'the'
"there's"
'there,'
'these'
'they'
'thing'
'thing.de'
'this'
'this,'
'though,'
'times'
'to:'
'to:2**0'
'to:addr:python.org'
'to:addr:spambayes'
'to:no real name:2**0'
'token'
'tokenize'
'tokenized.'
'tokenizing'
'tokens'
'tokens.'
'tokens:'
'tony'
'try'
'tue,'
'unique'
'url'
'url,'
'url.'
'url:'
'url:faq'
'url:html'
'url:listinfo'
'url:mail'
'url:mailman'
'url:net'
'url:org'
'url:python'
'url:sf'
'url:spambayes'
'url:tokens,'
'urls?'
'use'
'used'
'user-agent:'
'version'
'wed,'
'what'
'while'
'why'
'with'
'word'
'worthwhile'
'would'
'x-mailer:none'
'yeah,'
'you'


