[Spambayes] Tokenizing ideas (images, attachments)
Meyer, Tony
T.A.Meyer at massey.ac.nz
Wed Aug 27 20:45:46 EDT 2003
> Yeah, I read that FAQ, I'm currently just learning Python.
All the better to practice with <wink>. Seriously, if you are confident
that an improvement will have a significant effect, then write it up in
detail and submit it as a feature request. Someone will get around to
trying it and posting a patch. Unless it's an amazing idea, you have to
be willing to run the tests as well, though, otherwise we end up with
the (common at the moment) situation where we have inconclusive data
about an option because only a couple of people have done the testing.
> I don't see any url:tokens, I use the Outlook plugin, perhaps the problem
> is there, it does not use the HTMLBody property?
In your 'show clues', I don't see a body at all! See my 'show clues'
for your message at the end. If you go right down to the bottom you'll
see some url:tokens. What version of the plug-in are you using? The
'message stream' bit of the 'show clues' should match what you see in
Outlook.
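
As a rough illustration of where tokens such as 'proto:http', 'url:mail',
'url:python' and 'url:listinfo' in the clues come from, here is a minimal
Python sketch. It is not the SpamBayes tokenizer: the function name and
both regular expressions are assumptions made for this example, and the
real code may split URLs differently.

```python
import re

# Hypothetical patterns for illustration only (not taken from SpamBayes).
URL_RE = re.compile(r"(https?|ftp)://([^\s<>\"']+)", re.IGNORECASE)
SEP_RE = re.compile(r"[^A-Za-z0-9]+")


def url_tokens(text):
    """Return proto:/url: style tokens for every URL found in text."""
    tokens = []
    for proto, rest in URL_RE.findall(text):
        tokens.append("proto:" + proto.lower())
        # Break the remainder of the URL on any non-alphanumeric character.
        for piece in SEP_RE.split(rest):
            if piece:
                tokens.append("url:" + piece.lower())
    return tokens


if __name__ == "__main__":
    print(url_tokens("http://mail.python.org/mailman/listinfo/spambayes"))
```

If the message body never reaches the tokenizer, no URL is found and none
of these tokens can appear, which would be consistent with a 'show clues'
that has no body.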
> Btw, what does header:Received:1 and header:User-Agent:1 mean? Does
> SpamBayes have an internal black list?
No, there's no black or white listing.
> Also Date:1, From:1,
> MIME-Version:1 etc, what do they mean? :-)
One set of tokens simply records how many times each header line appears
in the message. The tokens you list all say that the header in question
appears exactly once.
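
A minimal sketch of the idea in Python, using only the standard library.
The function name is hypothetical and this is not the SpamBayes code
itself; it just shows how counting header lines gives tokens of the form
header:Name:count.

```python
from collections import Counter
from email import message_from_string


def header_count_tokens(raw_message):
    """Return one 'header:<name>:<count>' token per distinct header name."""
    msg = message_from_string(raw_message)
    # msg.keys() lists every header line, including repeated names.
    counts = Counter(msg.keys())
    return sorted("header:%s:%d" % (name, n) for name, n in counts.items())


if __name__ == "__main__":
    raw = (
        "Received: from a.example by b.example\n"
        "Received: from c.example by a.example\n"
        "From: someone@example.com\n"
        "Date: Wed, 27 Aug 2003 10:22:54 +0300\n"
        "\n"
        "body text\n"
    )
    for token in header_count_tokens(raw):
        print(token)
```

A message that passed through seven mail servers would produce
header:Received:7 in the same way, as in the clues for the list copy of
your message.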
=Tony Meyer
---
Spam Score: 0% (0)
word spamprob #ham #spam
'*H*' 1 - -
'*S*' 0 - -
'content-type:text/plain subject:Spambayes' 0.00195626 159 1
'subject:: [' 0.00298211 75 0
'subject:Spambayes' 0.00393682 167 5
'>' 0.00520659 59 1
'email addr:python.org' 0.00536099 171 8
'the main' 0.00542823 41 0
'aug' 0.00612213 50 1
'1.0' 0.00693374 32 0
'the problem' 0.00820669 37 1
'[spambayes]' 0.00884086 25 0
'spambayes' 0.00934694 69 5
'with microsoft' 0.0104895 21 0
'suggested' 0.0110024 20 0
'tony' 0.0114217 41 3
'url:mailman' 0.0119871 184 23
'skip:_ 40' 0.0122475 180 23
'url:python' 0.0126601 174 23
'outlook' 0.0129152 56 6
'meyer' 0.0135191 22 1
'proto:http url:mail' 0.0136382 174 25
'url:spambayes proto:http' 0.0136778 16 0
'meyer,' 0.0145631 15 0
'worthwhile' 0.0145631 15 0
'(this' 0.0155304 19 1
'the url' 0.0155709 14 0
'perhaps' 0.0162613 44 6
'check the' 0.0163408 18 1
'(see' 0.0174785 31 4
'message-----' 0.0182364 25 3
'empty' 0.0182454 16 1
'url:listinfo' 0.0191548 186 39
'skip:f 20' 0.0195131 19 2
"didn't have" 0.0196507 11 0
'sent:' 0.0197086 23 3
'main' 0.0197377 62 12
're:' 0.0198445 66 13
'internal' 0.0206047 26 4
'(or' 0.0208107 38 7
'header:Received:7 header:From:1' 0.0217905 147 35
'url' 0.0219612 32 6
'microsoft' 0.0222952 43 9
'url:spambayes' 0.0236263 26 5
'subject:] ' 0.0238125 250 67
'header:Received:7' 0.0245479 151 41
'like the' 0.0258482 50 13
"but didn't" 0.0266272 8 0
'skip:s 10 skip:c 10' 0.0266272 8 0
'look the' 0.0275098 19 4
'sender:addr:spambayes-bounces' 0.0275098 19 4
'email name:spambayes' 0.0275637 22 5
':-)' 0.0280665 10 1
'skip:s 20' 0.0285679 126 40
'version' 0.0296442 101 33
'skip:- 10' 0.0299704 57 18
'though, and' 0.0302013 7 0
'url.' 0.0302013 7 0
'containing' 0.0307464 25 7
'the message' 0.0307464 25 7
'sender:no real name:2**0 reply-to:none' 0.0307997 186 65
'not that' 0.0308363 9 1
'to:addr:spambayes' 0.0315122 19 5
'to:addr:python.org to:no real name:2**0' 0.0318036 153 55
'sender:addr:python.org' 0.0328382 174 65
'tue,' 0.0342154 8 1
"don't see" 0.0348837 6 0
'enough people' 0.0348837 6 0
'for message' 0.0348837 6 0
'reason why' 0.0348837 6 0
"skip:' 20" 0.0348837 6 0
'url,' 0.0348837 6 0
'why not' 0.0368007 16 5
'use the' 0.0389043 65 28
'related' 0.0389254 41 17
'python.' 0.0412844 5 0
"there's reason" 0.0412844 5 0
'case,' 0.0415825 24 10
'testing' 0.0417589 36 16
'problem' 0.0419707 64 30
'this,' 0.0429501 31 14
'skip:( 10' 0.04336 93 46
'=tony' 0.04384 6 1
'faq' 0.04384 6 1
'have enough' 0.04384 6 1
'skip:& 70' 0.04384 6 1
'yeah,' 0.04384 6 1
'skip:c 20' 0.0444757 45 22
'skip:x 20' 0.0456534 9 3
'(as' 0.0456727 29 14
'into the' 0.0460221 69 36
'date:' 0.0461054 16 7
'the one' 0.0492735 25 13
'what they' 0.0498514 13 6
'email name:[mailto:t.a.meyer' 0.0505618 4 0
'fine,' 0.0505618 4 0
'ones,' 0.0505618 4 0
'integrate' 0.05104 5 1
'2.0' 0.0553604 16 9
'headers' 0.0554278 13 7
'there,' 0.0599594 16 10
'not use' 0.0611158 4 1
"i'm" 0.0648698 91 70
'(look' 0.0652174 3 0
'one you' 0.0652174 3 0
'see any' 0.0652174 3 0
'token' 0.0652174 3 0
'spam' 0.0669622 51 40
'image' 0.0673626 25 19
"skip:' 10" 0.0736107 16 13
'you look' 0.0772416 12 10
'skip:s 10' 0.0780672 232 221
'url:html proto:http' 0.0792337 65 62
'should' 0.0842324 153 158
'skip:u 10' 0.0868949 94 100
'skip:g 10' 0.0893932 65 71
'(utc)' 0.0918367 2 0
'+0000' 0.0918367 2 0
'151' 0.0918367 2 0
'[...]' 0.0918367 2 0
'attachments)' 0.0918367 2 0
'empty,' 0.0918367 2 0
'faq,' 0.0918367 2 0
'mail internet' 0.0918367 2 0
'not many' 0.0918367 2 0
'read that' 0.0918367 2 0
'subject:images' 0.0918367 2 0
'tokenize' 0.0918367 2 0
'tokenizing' 0.0918367 2 0
'url:faq url:html' 0.0918367 2 0
'url:net url:faq' 0.0918367 2 0
'clues' 0.0923291 3 2
'used' 0.0963302 88 105
'101' 0.102015 2 1
'times the' 0.102015 2 1
'skip:d 10' 0.102175 122 156
'and skip:h 10' 0.103502 4 4
'unique' 0.103877 26 33
'but' 0.10437 198 260
'skip:2 10' 0.107069 6 7
'that any' 0.112918 7 9
'it.' 0.115918 91 134
'does' 0.118972 89 135
'information the' 0.120233 11 16
'this out' 0.12355 2 2
'black' 0.124805 13 20
'this message' 0.878774 16 1314
'170' 0.908163 0 2
'message-id:' 0.908163 0 2
'subject:ideas' 0.969799 0 7
'skip:c 10 text/html' 0.998849 0 195
'text/html' 0.998849 0 195
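
For readers wondering how the spamprob column relates to the #ham and
#spam counts, the following Python sketch may help. It assumes a
Robinson-style adjustment with a prior of 0.5 and a strength of 0.45;
those two values, the function name and the training totals are
assumptions for illustration, not something stated in this message. With
those assumptions the rows above for tokens seen in only one class (for
example 'subject:ideas' with 0 ham and 7 spam) are reproduced. Rows with
both counts non-zero also depend on the total number of trained ham and
spam messages, which is not shown here.

```python
def spamprob(ham_count, spam_count, total_ham, total_spam,
             strength=0.45, prior=0.5):
    """Estimate a token's spam probability from its training counts.

    strength and prior are assumed values, chosen because they match the
    one-class rows in the table; they are not taken from the message.
    """
    n = ham_count + spam_count
    if n == 0:
        # A token never seen in training just gets the prior.
        return prior
    ham_ratio = ham_count / total_ham
    spam_ratio = spam_count / total_spam
    raw = spam_ratio / (ham_ratio + spam_ratio)
    # Pull the raw estimate toward the prior; less so as evidence grows.
    return (strength * prior + n * raw) / (strength + n)


if __name__ == "__main__":
    # The totals cancel out when one of the two counts is zero.
    print(round(spamprob(0, 7, 100, 100), 6))  # 'subject:ideas' row
    print(round(spamprob(2, 0, 100, 100), 7))  # 'tokenize' row
```
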
Message Stream:
X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from its-xchg5.massey.ac.nz ([130.123.129.15]) by
its-xchg4.massey.ac.nz with Microsoft SMTPSVC(5.0.2195.5329);
Wed, 27 Aug 2003 19:23:09 +1200
Received: from its-mail1.massey.ac.nz ([130.123.128.11]) by
its-xchg5.massey.ac.nz with Microsoft SMTPSVC(5.0.2195.5329);
Wed, 27 Aug 2003 19:23:09 +1200
Received: from its-mm1.massey.ac.nz (its-mm1.massey.ac.nz
[130.123.128.45])
by its-mail1.massey.ac.nz (8.9.3/8.9.3) with ESMTP id TAA13247;
Wed, 27 Aug 2003 19:23:09 +1200 (NZST)
Received: from mu-relay1.massey.ac.nz (Not Verified[130.123.2.98]) by
its-mm1.massey.ac.nz with NetIQ MailMarshal
id <B001b4b27c>; Wed, 27 Aug 2003 19:23:08 +1200
Received: from mail.python.org (mail.python.org [12.155.117.29])
by mu-relay1.massey.ac.nz (Postfix) with ESMTP id A535B3763E
for <t.a.meyer at massey.ac.nz>; Wed, 27 Aug 2003 19:23:07 +1200
(NZST)
Received: from localhost.localdomain ([127.0.0.1] helo=mail.python.org)
by mail.python.org with esmtp (Exim 4.05)
id 19rudy-0005kY-01; Wed, 27 Aug 2003 03:23:02 -0400
Received: from postman.wicom.com ([62.236.218.18]
helo=postman.merlin.fi)
by mail.python.org with esmtp (Exim 4.05) id 19rudu-0005kH-00
for spambayes at python.org; Wed, 27 Aug 2003 03:22:58 -0400
MIME-Version: 1.0
Content-Type: text/plain;
charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)
Date: Wed, 27 Aug 2003 10:22:54 +0300
Message-ID: <FC2AA30A6C037640BBDC45785F0E9B26076325 at postman.wicom.com>
From: "Harri Pesonen" <harri.pesonen at wicom.com>
To: <spambayes at python.org>
X-Spam-Status: OK (default 0.000)
X-BeenThere: spambayes at python.org
X-Mailman-Version: 2.1.2
Precedence: list
List-Id: Discussion list for Pythonic Bayesian classifier
<spambayes.python.org>
List-Unsubscribe: <http://mail.python.org/mailman/listinfo/spambayes>,
<mailto:spambayes-request at python.org?subject=unsubscribe>
List-Archive: <http://mail.python.org/pipermail/spambayes>
List-Post: <mailto:spambayes at python.org>
List-Help: <mailto:spambayes-request at python.org?subject=help>
List-Subscribe: <http://mail.python.org/mailman/listinfo/spambayes>,
<mailto:spambayes-request at python.org?subject=subscribe>
Sender: spambayes-bounces at python.org
Errors-To: spambayes-bounces at python.org
Return-Path: spambayes-bounces at python.org
X-OriginalArrivalTime: 27 Aug 2003 07:23:09.0557 (UTC)
FILETIME=[11182A50:01C36C6C]
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<META NAME="Generator" CONTENT="MS Exchange Server version 6.0.6396.0">
<TITLE>RE: [Spambayes] Tokenizing ideas (images, attachments)</TITLE>
</HEAD>
<BODY>
<!-- Converted from text/plain format -->
<P><FONT SIZE=2>Yeah, I read that FAQ, I'm currently just learning
Python. I don't see</FONT>
<BR><FONT SIZE=2>any url:tokens, I use the Outlook plugin, perhaps the
problem is there,</FONT>
<BR><FONT SIZE=2>it does not use the HTMLBody property?</FONT>
</P>
<P><FONT SIZE=2>Btw, what does header:Received:1 and header:User-Agent:1
mean? Does</FONT>
<BR><FONT SIZE=2>SpamBayes have an internal black list? Also Date:1,
From:1,</FONT>
<BR><FONT SIZE=2>MIME-Version:1 etc, what do they mean? :-)</FONT>
</P>
<P><FONT SIZE=2>Spam Score: 0.997882</FONT>
</P>
<P><FONT
SIZE=2>word &
nbsp; &
nbsp;
spamprob #ham
#spam</FONT>
<BR><FONT
SIZE=2>'*H*'
5.3187e-005
- -</FONT>
<BR><FONT
SIZE=2>'*S*'
0.995817 &nbs
p; - -</FONT>
<BR><FONT
SIZE=2>'subjectcharset:iso-8859-1' &n
bsp;
0.15272  
; 57 12</FONT>
<BR><FONT SIZE=2>'subject: -
'  
;
0.241103
64 24</FONT>
<BR><FONT
SIZE=2>'reply-to:none' &n
bsp;
0.340317
349 214</FONT>
<BR><FONT
SIZE=2>'header:Date:1' &n
bsp;
0.616644
232 444</FONT>
<BR><FONT
SIZE=2>'header:From:1' &n
bsp;
0.617705
232 446</FONT>
<BR><FONT
SIZE=2>'header:MIME-Version:1'
0.61773
207 398</FONT>
<BR><FONT SIZE=2>'to:no real
name:2**0' &n
bsp;
0.648347
170 373</FONT>
<BR><FONT
SIZE=2>'header:Return-Path:1' &
nbsp;
0.680694
175 444</FONT>
<BR><FONT
SIZE=2>'header:Message-ID:1' &n
bsp;
0.6998
151 419</FONT>
<BR><FONT
SIZE=2>'to:addr:merlin.fi' &nbs
p;
0.723659
101 315</FONT>
<BR><FONT
SIZE=2>'subject:!'
0.733392
13 43</FONT>
<BR><FONT
SIZE=2>'from:addr:aboydhd' &nbs
p;
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'from:addr:merlin.net.au' &nbs
p;
0.82569  
; 0 1</FONT>
<BR><FONT SIZE=2>'from:name:amalia
boyd' &
nbsp;
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'message-id:@merlin.net.au' &n
bsp;
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'subject:Chance' &
nbsp;
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'subject:Last' &nb
sp; &nb
sp;
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'subject:blowout'
0.82569  
; 0 1</FONT>
<BR><FONT
SIZE=2>'subject:inventory' &nbs
p;
0.82569  
; 0 1</FONT>
<BR><FONT SIZE=2>'subject:
'  
;
0.916657 &nbs
p; 4 55</FONT>
<BR><FONT
SIZE=2>'subject:Citrate'
0.924304 &nbs
p; 0 3</FONT>
<BR><FONT
SIZE=2>'subject:Sildenafil' &nb
sp;
0.924304 &nbs
p; 0 3</FONT>
<BR><FONT
SIZE=2>'header:Received:1' &nbs
p;
0.958977 &nbs
p; 5 145</FONT>
<BR><FONT
SIZE=2>'header:User-Agent:1' &n
bsp;
0.986969 &nbs
p; 0 20</FONT>
</P>
<P><FONT SIZE=2>Message Stream:</FONT>
</P>
<P><FONT SIZE=2>X-MS-Mail-Gibberish: Microsoft Mail Internet Headers
Version 2.0</FONT>
<BR><FONT SIZE=2>Received: from thing.de ([67.122.162.175]) by
postman.merlin.fi with</FONT>
<BR><FONT SIZE=2>Microsoft</FONT>
<BR> <FONT
SIZE=2>SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300</FONT>
<BR><FONT SIZE=2>User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us)
Gecko/25250101</FONT>
<BR><FONT SIZE=2>From: "Amalia Boyd"
<aboydhd at merlin.net.au></FONT>
<BR><FONT SIZE=2>Date: Tue, 26 Aug 2003 18:33:58 +0000</FONT>
<BR><FONT SIZE=2>Message-ID:
<3F4BA816.730F2157 at merlin.net.au></FONT>
<BR><FONT SIZE=2>To: harri.pesonen at merlin.fi</FONT>
<BR><FONT SIZE=2>MIME-Version: 1.0</FONT>
<BR><FONT SIZE=2>Subject:</FONT>
<BR><FONT
SIZE=2>=?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludm
VudG9ye</FONT>
<BR><FONT SIZE=2>SBibG93b3V0IQ==?=</FONT>
<BR><FONT SIZE=2>Content-Type: text/html</FONT>
<BR><FONT SIZE=2>Content-Transfer-Encoding: 8bit</FONT>
<BR><FONT SIZE=2>Return-Path: aboydhd at merlin.net.au</FONT>
<BR><FONT SIZE=2>X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988
(UTC)</FONT>
<BR> <FONT
SIZE=2>FILETIME=[A637F9C0:01C36C22]</FONT>
</P>
<P><FONT SIZE=2>Message Tokens:</FONT>
</P>
<P><FONT SIZE=2>33 unique tokens</FONT>
</P>
<P><FONT SIZE=2>'cc:none'</FONT>
<BR><FONT SIZE=2>'content-type:text/plain'</FONT>
<BR><FONT SIZE=2>'from:addr:aboydhd'</FONT>
<BR><FONT SIZE=2>'from:addr:merlin.net.au'</FONT>
<BR><FONT SIZE=2>'from:name:amalia boyd'</FONT>
<BR><FONT SIZE=2>'header:Date:1'</FONT>
<BR><FONT SIZE=2>'header:From:1'</FONT>
<BR><FONT SIZE=2>'header:MIME-Version:1'</FONT>
<BR><FONT SIZE=2>'header:Message-ID:1'</FONT>
<BR><FONT SIZE=2>'header:Received:1'</FONT>
<BR><FONT SIZE=2>'header:Return-Path:1'</FONT>
<BR><FONT SIZE=2>'header:Subject:1'</FONT>
<BR><FONT SIZE=2>'header:To:1'</FONT>
<BR><FONT SIZE=2>'header:User-Agent:1'</FONT>
<BR><FONT SIZE=2>'message-id:@merlin.net.au'</FONT>
<BR><FONT SIZE=2>'reply-to:none'</FONT>
<BR><FONT SIZE=2>'sender:none'</FONT>
<BR><FONT SIZE=2>'subject: '</FONT>
<BR><FONT SIZE=2>'subject: '</FONT>
<BR><FONT SIZE=2>'subject: - '</FONT>
<BR><FONT SIZE=2>'subject:!'</FONT>
<BR><FONT SIZE=2>'subject:Chance'</FONT>
<BR><FONT SIZE=2>'subject:Citrate'</FONT>
<BR><FONT SIZE=2>'subject:Last'</FONT>
<BR><FONT SIZE=2>'subject:Sildenafil'</FONT>
<BR><FONT SIZE=2>'subject:blowout'</FONT>
<BR><FONT SIZE=2>'subject:inventory'</FONT>
<BR><FONT SIZE=2>'subjectcharset:iso-8859-1'</FONT>
<BR><FONT SIZE=2>'to:2**0'</FONT>
<BR><FONT SIZE=2>'to:addr:harri.pesonen'</FONT>
<BR><FONT SIZE=2>'to:addr:merlin.fi'</FONT>
<BR><FONT SIZE=2>'to:no real name:2**0'</FONT>
<BR><FONT SIZE=2>'x-mailer:none'</FONT>
</P>
<P><FONT SIZE=2>-----Original Message-----</FONT>
<BR><FONT SIZE=2>From: Meyer, Tony [<A
HREF="mailto:T.A.Meyer at massey.ac.nz">mailto:T.A.Meyer at massey.ac.nz</A>]
</FONT>
<BR><FONT SIZE=2>Sent: 27. elokuuta 2003 10:08</FONT>
<BR><FONT SIZE=2>To: Harri Pesonen; spambayes at python.org</FONT>
<BR><FONT SIZE=2>Subject: RE: [Spambayes] Tokenizing ideas (images,
attachments)</FONT>
</P>
<BR>
<P><FONT SIZE=2>> Why not tokenize image URLs?</FONT>
<BR><FONT SIZE=2>[...]</FONT>
<BR><FONT SIZE=2>> While SpamBayes detected this message just
fine,</FONT>
</P>
<P><FONT SIZE=2>There's a reason why not ;)</FONT>
</P>
<P><FONT SIZE=2>> Many times the message is empty or almost</FONT>
<BR><FONT SIZE=2>> empty, containing only an image URL.</FONT>
</P>
<P><FONT SIZE=2>Not that any URL, including image ones, is
tokenized. If you look at</FONT>
<BR><FONT SIZE=2>the clues for a message like the one you used as an
example, you should</FONT>
<BR><FONT SIZE=2>see some url: tokens.</FONT>
</P>
<P><FONT SIZE=2>It has been suggested that tokenizing (textual)
information at the end</FONT>
<BR><FONT SIZE=2>of the URL would be worthwhile (this includes a token
if the URL 404s).</FONT>
<BR><FONT SIZE=2>We tested this out (look at the urlslurper.py file),
but didn't have</FONT>
<BR><FONT SIZE=2>enough people testing to integrate it into the main
code (as a</FONT>
<BR><FONT SIZE=2>default-to-off option). Death2Spam (see the
related page) does this,</FONT>
<BR><FONT SIZE=2>though, and Richard swears by it.</FONT>
</P>
<P><FONT SIZE=2>In any case, the best thing is to try these (or any
other) ideas out.</FONT>
<BR><FONT SIZE=2>See FAQ 6.1:</FONT>
</P>
<P><FONT
SIZE=2><file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-imple
ment-cool-</FONT>
<BR><FONT SIZE=2>tokenizer-trick-x></FONT>
</P>
<P><FONT SIZE=2>=Tony Meyer</FONT>
</P>
<P><FONT SIZE=2>_______________________________________________</FONT>
<BR><FONT SIZE=2>Spambayes at python.org</FONT>
<BR><FONT SIZE=2><A
HREF="http://mail.python.org/mailman/listinfo/spambayes">http://mail.pyt
hon.org/mailman/listinfo/spambayes</A></FONT>
<BR><FONT SIZE=2>Check the FAQ before asking: <A
HREF="http://spambayes.sf.net/faq.html">http://spambayes.sf.net/faq.html
</A></FONT>
</P>
</BODY>
</HTML>
Yeah, I read that FAQ, I'm currently just learning Python. I don't see
any url:tokens, I use the Outlook plugin, perhaps the problem is there,
it does not use the HTMLBody property?
Btw, what does header:Received:1 and header:User-Agent:1 mean? Does
SpamBayes have an internal black list? Also Date:1, From:1,
MIME-Version:1 etc, what do they mean? :-)
Spam Score: 0.997882
word spamprob #ham #spam
'*H*' 5.3187e-005 - -
'*S*' 0.995817 - -
'subjectcharset:iso-8859-1' 0.15272 57 12
'subject: - ' 0.241103 64 24
'reply-to:none' 0.340317 349 214
'header:Date:1' 0.616644 232 444
'header:From:1' 0.617705 232 446
'header:MIME-Version:1' 0.61773 207 398
'to:no real name:2**0' 0.648347 170 373
'header:Return-Path:1' 0.680694 175 444
'header:Message-ID:1' 0.6998 151 419
'to:addr:merlin.fi' 0.723659 101 315
'subject:!' 0.733392 13 43
'from:addr:aboydhd' 0.82569 0 1
'from:addr:merlin.net.au' 0.82569 0 1
'from:name:amalia boyd' 0.82569 0 1
'message-id:@merlin.net.au' 0.82569 0 1
'subject:Chance' 0.82569 0 1
'subject:Last' 0.82569 0 1
'subject:blowout' 0.82569 0 1
'subject:inventory' 0.82569 0 1
'subject: ' 0.916657 4 55
'subject:Citrate' 0.924304 0 3
'subject:Sildenafil' 0.924304 0 3
'header:Received:1' 0.958977 5 145
'header:User-Agent:1' 0.986969 0 20
Message Stream:
X-MS-Mail-Gibberish: Microsoft Mail Internet Headers Version 2.0
Received: from thing.de ([67.122.162.175]) by postman.merlin.fi with
Microsoft
SMTPSVC(5.0.2195.6713); Wed, 27 Aug 2003 01:37:36 +0300
User-Agent: Mozilla/5.001 (windows; U; NT4.0; en-us) Gecko/25250101
From: "Amalia Boyd" <aboydhd at merlin.net.au>
Date: Tue, 26 Aug 2003 18:33:58 +0000
Message-ID: <3F4BA816.730F2157 at merlin.net.au>
To: harri.pesonen at merlin.fi
MIME-Version: 1.0
Subject:
=?iso-8859-1?b?TGFzdCBDaGFuY2UgLSBTaWxkZW5hZmlsIENpdHJhdGUgIGludmVudG9ye
SBibG93b3V0IQ==?=
Content-Type: text/html
Content-Transfer-Encoding: 8bit
Return-Path: aboydhd at merlin.net.au
X-OriginalArrivalTime: 26 Aug 2003 22:37:36.0988 (UTC)
FILETIME=[A637F9C0:01C36C22]
Message Tokens:
33 unique tokens
'cc:none'
'content-type:text/plain'
'from:addr:aboydhd'
'from:addr:merlin.net.au'
'from:name:amalia boyd'
'header:Date:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:1'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'header:User-Agent:1'
'message-id:@merlin.net.au'
'reply-to:none'
'sender:none'
'subject: '
'subject: '
'subject: - '
'subject:!'
'subject:Chance'
'subject:Citrate'
'subject:Last'
'subject:Sildenafil'
'subject:blowout'
'subject:inventory'
'subjectcharset:iso-8859-1'
'to:2**0'
'to:addr:harri.pesonen'
'to:addr:merlin.fi'
'to:no real name:2**0'
'x-mailer:none'
-----Original Message-----
From: Meyer, Tony [mailto:T.A.Meyer at massey.ac.nz]
Sent: 27. elokuuta 2003 10:08
To: Harri Pesonen; spambayes at python.org
Subject: RE: [Spambayes] Tokenizing ideas (images, attachments)
> Why not tokenize image URLs?
[...]
> While SpamBayes detected this message just fine,
There's a reason why not ;)
> Many times the message is empty or almost
> empty, containing only an image URL.
Not that any URL, including image ones, is tokenized. If you look at
the clues for a message like the one you used as an example, you should
see some url: tokens.
It has been suggested that tokenizing (textual) information at the end
of the URL would be worthwhile (this includes a token if the URL 404s).
We tested this out (look at the urlslurper.py file), but didn't have
enough people testing to integrate it into the main code (as a
default-to-off option). Death2Spam (see the related page) does this,
though, and Richard swears by it.
In any case, the best thing is to try these (or any other) ideas out.
See FAQ 6.1:
<file:///D:/cvs/spambayes/website/faq.html#why-don-t-you-implement-cool-
tokenizer-trick-x>
=Tony Meyer
_______________________________________________
Spambayes at python.org
http://mail.python.org/mailman/listinfo/spambayes
Check the FAQ before asking: http://spambayes.sf.net/faq.html
Message Tokens:
313 unique tokens
'"amalia'
'#ham'
'#spam'
'>'
'"amalia'
"'*h*'"
"'*s*'"
"'cc:none'"
"'subject:"
"'subject:!'"
"'to:2**0'"
"'to:no"
'(as'
'(images,'
'(look'
'(or'
'(see'
'(textual)'
'(this'
'(utc)'
'(windows;'
'+0000'
'+0300'
'0.15272'
'0.241103'
'0.340317'
'0.616644'
'0.617705'
'0.61773'
'0.648347'
'0.680694'
'0.6998'
'0.723659'
'0.733392'
'0.82569'
'0.916657'
'0.924304'
'0.958977'
'0.986969'
'0.995817'
'0.997882'
'01:37:36'
'1.0'
'101'
'10:08'
'145'
'151'
'170'
'175'
'18:33:58'
'2.0'
'2003'
'207'
'214'
'232'
'27.'
'315'
'349'
'373'
'398'
'404s).'
'419'
'444'
'446'
'5.3187e-005'
'6.1:'
'8bit'
':-)'
'=tony'
'[...]'
'[spambayes]'
'almost'
'also'
'and'
'any'
'asking:'
'attachments)'
'aug'
'been'
'before'
'best'
'black'
'boyd"'
'boyd"'
"boyd'"
'btw,'
'but'
'case,'
'cc:none'
'check'
'clues'
'code'
'containing'
'content-type:text/plain'
'currently'
'date:'
'date:1,'
'death2spam'
'detected'
"didn't"
'does'
"don't"
'elokuuta'
'email addr:massey.ac.nz]'
'email addr:merlin.fi'
'email addr:merlin.net.au'
'email addr:merlin.net.au>'
"email addr:merlin.net.au'"
'email addr:python.org'
'email name:<3f4ba816.730f2157'
'email name:<aboydhd'
"email name:'message-id:"
'email name:[mailto:t.a.meyer'
'email name:aboydhd'
'email name:harri.pesonen'
'email name:spambayes'
'empty'
'empty,'
'en-us)'
'end'
'enough'
'etc,'
'example,'
'faq'
'faq,'
'file),'
'fine,'
'for'
'from'
'from:'
'from:1,'
'from:addr:harri.pesonen'
'from:addr:wicom.com'
'from:name:harri pesonen'
'harri'
'has'
'have'
'header:Date:1'
'header:Errors-To:1'
'header:From:1'
'header:MIME-Version:1'
'header:Message-ID:1'
'header:Received:7'
'header:Return-Path:1'
'header:Subject:1'
'header:To:1'
'headers'
'htmlbody'
"i'm"
'ideas'
'image'
'includes'
'including'
'information'
'integrate'
'internal'
'internet'
'into'
'it.'
'just'
'learning'
'like'
'list?'
'look'
'mail'
'main'
'many'
'mean?'
'message'
'message-----'
'message-id:'
'message-id:@postman.wicom.com'
'meyer'
'meyer,'
'microsoft'
"name:2**0'"
'not'
'nt4.0;'
'one'
'ones,'
'only'
'option).'
'other)'
'out'
'out.'
'outlook'
'page)'
'people'
'perhaps'
'pesonen;'
'plugin,'
'problem'
'property?'
'proto:http'
'python.'
're:'
'read'
'real'
'reason'
'received:'
'related'
'reply-to:none'
'return-path:'
'richard'
'score:'
'see'
'sender:addr:python.org'
'sender:addr:spambayes-bounces'
'sender:no real name:2**0'
'sent:'
'should'
'skip:& 70'
"skip:' 10"
"skip:' 20"
'skip:( 10'
'skip:- 10'
'skip:2 10'
'skip:= 70'
'skip:_ 40'
'skip:c 10'
'skip:c 20'
'skip:d 10'
'skip:f 20'
'skip:g 10'
'skip:h 10'
'skip:m 10'
'skip:p 10'
'skip:s 10'
'skip:s 20'
'skip:t 20'
'skip:u 10'
'skip:x 20'
'some'
'spam'
'spambayes'
'spamprob'
'stream:'
'subject:'
'subject: '
'subject: ('
'subject:)'
'subject:, '
'subject:: ['
'subject:Spambayes'
'subject:Tokenizing'
'subject:] '
'subject:attachments'
'subject:ideas'
'subject:images'
'suggested'
'swears'
'tested'
'testing'
'text/html'
'that'
'the'
"there's"
'there,'
'these'
'they'
'thing'
'thing.de'
'this'
'this,'
'though,'
'times'
'to:'
'to:2**0'
'to:addr:python.org'
'to:addr:spambayes'
'to:no real name:2**0'
'token'
'tokenize'
'tokenized.'
'tokenizing'
'tokens'
'tokens.'
'tokens:'
'tony'
'try'
'tue,'
'unique'
'url'
'url,'
'url.'
'url:'
'url:faq'
'url:html'
'url:listinfo'
'url:mail'
'url:mailman'
'url:net'
'url:org'
'url:python'
'url:sf'
'url:spambayes'
'url:tokens,'
'urls?'
'use'
'used'
'user-agent:'
'version'
'wed,'
'what'
'while'
'why'
'with'
'word'
'worthwhile'
'would'
'x-mailer:none'
'yeah,'
'you'