[Spambayes] Chi**2 results

Rob Hooft rob@hooft.net
Sat, 12 Oct 2002 23:00:58 +0200


---------------------- multipart/mixed attachment
Here are my chi results. I am amazed by the high cutoff it is advising me 
to use! This feels very good. On the FP side, the bad messages are:
  * a yahoo account created to correct incorrect listings in their
    database
  * A problem with my Linux Journal subscription
  * Indian student applying for a course
  * Amazon.com membership update
  * Red Cross blood drive announcement

Which is 5 out of 16000; but I have to admit that even missing 4 out of 
these 5 would not have been too costly.

The middle ground is amazingly empty! I'd almost want to set my cutoff 
at 0.99 or 0.995! One thing that does bother me a bit is that some words 
are very highly correlated in whether they occur together in a message, 
and the classifier has no way of finding this out. E.g. all the "bad 
jokes" I'm referring to in the attachment were sent by a friend of mine 
who uses a very strange way of forwarding that modifies the "From:" line:

   From: callaway@indigo.picower.edu (David Callaway) (by way of Pieter Stouten)


Which results in these highly correlated clues:

prob('from:pieter') = 0.00151566
prob('message-id:@[158.117.170.103]') = 0.00306331
prob('x-mailer:eudora pro 3.1 for macintosh') = 0.00474183
prob('from:stouten)') = 0.0115681
prob('from:way') = 0.012894
prob('from:(by') = 0.0167286
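
To illustrate why correlated clues matter, here is a minimal sketch of 
chi-squared combining as it is commonly described for this project. It is 
an assumption that this matches the classifier that produced the numbers 
above; the names chi2Q and chi_combine and the six 0.99 "spam clues" are 
made up for the example. Only the six ham probabilities are taken from the 
list above.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v only."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Rounding can push the sum slightly above 1.0.
    return min(total, 1.0)

def chi_combine(probs):
    """Combine per-clue spam probabilities into one score in [0, 1]."""
    n = len(probs)
    # S is large when the clues look spammy, H when they look hammy.
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0

# The six clues from the list above; they all stem from one "From:" line.
correlated_ham_clues = [0.00151566, 0.00306331, 0.00474183,
                        0.0115681, 0.012894, 0.0167286]
# Hypothetical spammy clues, not taken from any real message.
spam_clues = [0.99] * 6

spam_only = chi_combine(spam_clues)
mixed = chi_combine(spam_clues + correlated_ham_clues)
print("spam clues only:        %.4f" % spam_only)
print("plus 6 correlated clues: %.4f" % mixed)
```

The combining treats every clue as independent evidence, so one unusual 
"From:" line counts six times and is enough to pull an otherwise spammy 
set of clues back into the middle ground.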

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/

---------------------- multipart/mixed attachment
[TestDriver]
pickle_basename = class
save_trained_pickles = False
show_histograms = True
show_ham_lo = 0.40
show_best_discriminators = 50
nbuckets = 200
show_ham_hi = 0.80
spam_cutoff = 0.70
spam_directories = Data/Spam/Set%d
show_spam_lo = 0.40
show_false_negatives = True
ham_directories = Data/Ham/Set%d
compute_best_cutoffs_from_histograms = True
show_false_positives = True
best_cutoff_fp_weight = 10
show_spam_hi = 0.80
save_histogram_pickles = False
show_charlimit = 100000

[CV Driver]
build_each_classifier_from_scratch = False

[Tokenizer]
mine_received_headers = False
octet_prefix_size = 5
generate_long_skips = True
count_all_header_lines = False
check_octets = False
ignore_redundant_html = False
basic_header_tokenize = False
safe_headers = abuse-reports-to
	date
	errors-to
	from
	importance
	in-reply-to
	message-id
	mime-version
	organization
	received
	reply-to
	return-path
	subject
	to
	user-agent
	x-abuse-info
	x-complaints-to
	x-face
basic_header_skip = received
	date
	x-.*
basic_header_tokenize_only = False
retain_pure_html_tags = False

-> <stat> Ham scores for all runs: 16000 items; mean 0.57; sdev 5.03
-> <stat> min -2.22045e-13; median 9.99201e-14; max 100
* = 253 items
 0.0 15408 *************************************************************
 0.5   115 *
 1.0    59 *
 1.5    27 *
 2.0    25 *
 2.5    19 *
 3.0    27 *
 3.5     9 *
 4.0     6 *
 4.5     8 *
 5.0    12 *
 5.5     7 *
 6.0     4 *
 6.5     8 *
 7.0     6 *
 7.5     4 *
 8.0     6 *
 8.5     5 *
 9.0     4 *
 9.5    12 *
10.0     9 *
10.5     6 *
11.0     3 *
11.5     1 *
12.0     6 *
12.5     4 *
13.0     1 *
13.5     1 *
14.0     2 *
14.5     6 *
15.0     3 *
15.5     2 *
16.0     3 *
16.5     5 *
17.0     4 *
17.5     5 *
18.0     1 *
18.5     2 *
19.0     2 *
19.5     2 *
20.0     7 *
20.5     1 *
21.0     4 *
21.5     2 *
22.0     4 *
22.5     5 *
23.0     2 *
23.5     3 *
24.0     1 *
24.5     3 *
25.0     2 *
25.5     1 *
26.0     2 *
26.5     1 *
27.0     0 
27.5     1 *
28.0     2 *
28.5     0 
29.0     2 *
29.5     2 *
30.0     4 *
30.5     3 *
31.0     1 *
31.5     1 *
32.0     1 *
32.5     2 *
33.0     0 
33.5     1 *
34.0     2 *
34.5     1 *
35.0     1 *
35.5     2 *
36.0     0 
36.5     2 *
37.0     0 
37.5     6 *
38.0     2 *
38.5     1 *
39.0     4 *
39.5     0 
40.0     2 * Someone replying to a spam on a mailinglist (NO Fwd:!);
	     Bad joke
40.5     2 * Official company press release; Bad joke

Bruker AXS Announces Appointment of Laura Francis as New Chief Financial Officer
<http://cbs.marketwatch.com/tools/quotes/newsarticle.asp?guid={B0A78D89-ADE4
-4E54-B3ED-28553C959466}&siteid=mktw&dist=nbs> 
3/25/2002 9:03:00 AM MADISON, Wis., Mar 25, 2002 (BUSINESS WIRE)
Bruker AXS Inc., a leading global provider of advanced X-ray solutions for
life and advanced materials sciences, today announced that it has appointed
Laura Francis as its new Chief Financial Officer, effective April 8, 2002.
Ms. Francis will also be responsible for investor relations.

41.0     2 * Bad joke; ISP helpdesk reply (payment related)
41.5     2 * Internic regret; Unsubscribe confirmation commercial mailing list.

       We regret to inform you that we were unable to accept your
  credit card payment for the domain names listed below, in the amount 
  of $100.00.  To determine the specific reason your credit card
  was not accepted please contact your credit card company as we
  do not receive that information.  For accounting purposes, we can
  not reflect a paid status for this domain name.  Please resubmit 
  payment by calling (703)742-4777, or by sending a check to the
  address listed on your invoice.  If you submit a check, please
  ensure that the domain name and invoice number are listed as
  references.  We apologize for any inconvenience, and hope this   
  matter can be resolved as quickly as possible.
 
  Thank you,

  Jill Dodson 
  InterNIC Registration Services

42.0     0 
42.5     4 * Unsubscribe from commercial newsletter;
             Bad joke; Bad joke; Someone mass-asking for help
43.0     1 * My wife sending me a link to a housing service.
43.5     2 * Happy birthday via WBW; Happy birthday via WBW.
44.0     2 * Press release International Court of Justice (Nigeria....);
             Linux journal autoreply
44.5     3 * Linux journal autoreply; Bad joke; Bad joke.
45.0     2 * Bad joke; Bad Joke.
45.5     2 * Customer license request; Bad joke

------=_NextPart_000_0005_01C00135.2727FB40
Content-Type: text/plain;
	charset="ks_c_5601-1987"
Content-Transfer-Encoding: base64

SGksDQpJIGhhdmUgYSBxdWVzdGlvbi4uDQpEbyBJIG5lZWQgdG8gZ2V0IG5ldyBsaWNlbnNlIGlm
IEkgdXBncmFkZSB0aGUgY29sbGVjdCBzb2Z0d2FyZT8NCkkgaGF2ZSB1cGdyYWRlZCBpdCBqdXN0
IG5vdywgYnV0IGl0IGRvZXNuJ3Qgc2VlbSB0byBiZSBjb25uZWN0ZWQgdG8gQ0NEIGNvbnRyb2xs
ZXIuDQpTbyBJJ20gdXNpbmcgdGhlIG9sZCB2ZXJzaW9uLg0KDQpQbGVhc2UsIHRlbGwgbWUgd2hh
dCBzaG9sZCBJIGRvLg0KDQogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgIHNp
bmNlcmVseSwNCiAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgRG9uZyBN
b2sgU2hpbg0KICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICBTZW91bCBO
YXRpb25hbCBVbml2ZXJzaXR5DQo=

46.0     3 * Bad joke; Shogi mailing list posting; Bad joke
46.5     1 * My boss asking me for help with a SPAM on his mailing list (Fwd)
47.0     0 
47.5     2 * Bad joke; Happy birthday via WBW
48.0     4 * Bad joke; Company press release; Bad joke;
	     Colleague notifying me of an important new web service.


The following press release went out earlier this morning.

Bruker AXS Acquires MAC Science to Further Penetrate Japanese Life Science
and Materials Research Markets

48.5     4 * Conference invitation; Happy birthday from WBW; 
             Company sales budget (in German); 
	     Press release International Court of Justice (Congo, Burundi,...)
49.0     2 * Headhunter hunting me via mailing; Bad joke.
49.5     3 * Shogi mailing list posting; 
             My boss asking me how to deal with a spam message (Fwd:);
	     Shogi mailing list posting announcing a tournament
50.0     1 * Customer sending 1.44MB binary file as text/plain attachment :-)
50.5     0 
51.0     2 * Bad joke; Bad joke.
51.5     0 
52.0     0 
52.5     1 * Bad joke.
53.0     0 
53.5     1 * ECA Annual fee reminder

Dear ECA Members

Just to remember those that have not paid the 2001 annual fee can do
that in Krakow.
It is easier and "cheaper".

54.0     0 
54.5     0 
55.0     1 * Python Professional Services Europe (PPSE) announcement.
55.5     0 
56.0     0 
56.5     0 
57.0     1 * Bad joke.
57.5     0 
58.0     0 
58.5     1 * Bad joke.
59.0     1 * Happy birthday via WBW
59.5     0 
60.0     1 * ISP Newsletter in German
60.5     2 * Happy birthday via WBW; request for information

Hi, 
I see you name im the web museum, in M. C. Escher page, OK, I am
building a site about ilusions, can you help me in this? I use 2
pictures of the artist in my page this is ok? Anyway you known other
images that I can use in my work? Please visit my page at:

http://www.geocities.com/SoHo/Studios/4762/

61.0     0 
61.5     0 
62.0     0 
62.5     0 
63.0     1 * Happy birthday via WBW
63.5     1 * Bad joke
64.0     0 
64.5     0 
65.0     2 * Bad joke; Bad joke.
65.5     0 
66.0     0 
66.5     0 
67.0     0 
67.5     2 * Free copy of Caldera linux; Web-site registration code.

Linux Developer:

We greatly appreciate the contribution you have made to the Linux
community and, to demonstrate that appreciation, we would like to send 
you a free copy of our latest Linux-based product, OpenLinux Standard 1.1.

68.0     1 * Happy birthday via WBW
68.5     0 
69.0     0 
69.5     0 
70.0     0 
70.5     0 
71.0     1 * Happy birthday via WBW
71.5     0 
72.0     0 
72.5     2 * Self-reminder of a bug in a program; Auto-reply to a web request

fields can start with </PRE>

73.0     0 
73.5     0 
74.0     0 
74.5     0 
75.0     0 
75.5     0 
76.0     0 
76.5     1 * Happy birthday via WBW
77.0     0 
77.5     0 
78.0     0 
78.5     1 * Four11 directory listing announcement
79.0     0 
79.5     0 
80.0     0 
80.5     1 *
81.0     0 
81.5     0 
82.0     0 
82.5     0 
83.0     0 
83.5     0 
84.0     0 
84.5     0 
85.0     1 *
85.5     0 
86.0     0 
86.5     1 *
87.0     0 
87.5     0 
88.0     0 
88.5     0 
89.0     0 
89.5     0 
90.0     0 
90.5     0 
91.0     1 *
91.5     0 
92.0     0 
92.5     0 
93.0     0 
93.5     0 
94.0     0 
94.5     0 
95.0     0 
95.5     0 
96.0     0 
96.5     1 *
97.0     1 *
97.5     0 
98.0     1 *
98.5     0 
99.0     1 *
99.5     7 *

-> <stat> Spam scores for all runs: 5600 items; mean 99.35; sdev 5.40
-> <stat> min 4.22602e-09; median 100; max 100
* = 89 items
 0.0    3 *
 0.5    0 
 1.0    0 
 1.5    1 *
 2.0    0 
 2.5    0 
 3.0    0 
 3.5    0 
 4.0    0 
 4.5    1 *
 5.0    0 
 5.5    0 
 6.0    1 *
 6.5    0 
 7.0    0 
 7.5    0 
 8.0    0 
 8.5    0 
 9.0    0 
 9.5    0 
10.0    0 
10.5    0 
11.0    0 
11.5    0 
12.0    0 
12.5    0 
13.0    0 
13.5    0 
14.0    0 
14.5    0 
15.0    0 
15.5    0 
16.0    0 
16.5    0 
17.0    0 
17.5    0 
18.0    0 
18.5    0 
19.0    0 
19.5    0 
20.0    0 
20.5    1 *
21.0    0 
21.5    0 
22.0    0 
22.5    0 
23.0    0 
23.5    0 
24.0    0 
24.5    0 
25.0    0 
25.5    0 
26.0    0 
26.5    0 
27.0    0 
27.5    0 
28.0    0 
28.5    0 
29.0    0 
29.5    0 
30.0    0 
30.5    0 
31.0    0 
31.5    1 *
32.0    0 
32.5    0 
33.0    1 *
33.5    0 
34.0    1 *
34.5    0 
35.0    0 
35.5    0 
36.0    0 
36.5    0 
37.0    0 
37.5    0 
38.0    0 
38.5    0 
39.0    0 
39.5    1 *
40.0    1 * "we would like to send you our information". May be misclassified.
40.5    0 
41.0    0 
41.5    0 
42.0    0 
42.5    0 
43.0    0 
43.5    0 
44.0    0 
44.5    1 * ObjectSpace C++ product announcement
45.0    0 
45.5    0 
46.0    0 
46.5    1 * Webcounter

 You tried other counters now try something AMAZING!!!

 ONE CODE FOR ALL YOUR PAGES AND DOMAINS!!!

http://www.freewebcounter.com


1.  View your full raw log files and perform tracerouts from the hosts!
2.  See the every page the person visited in order!
3.  Top 50 full search phrase used to find your site!
4.  All countries
5.  unique visits / page views.
6.  visites by day/week/month/year.
7.  Top 50 browser agents.
8.  Emails.

47.0    0 
47.5    0 
48.0    0 
48.5    1 * Character analysis

This is  a: Commercial Electronic Mail Message. It is TOTALLY   LEGAL  (Washington.' Law; chapter 
19.190 RCW)
and with U.S. Federal requirements for commercial email under bill: S 1618 Title 111 section 301 
paragraph (a) (2) (C) because it includes a removal mechanism.     To be removed:the list: please see below.

49.0    0 
49.5    3 * Translation company based in Beijing; Distance education IT school;
            Anti-AIDS medicine from Beijing
50.0    7 * HTML-only with image maps; How to juggle women (book); 
            Far east spam; HTML only far east spam; conference announcement;
            Conference announcement; Hunza diet bread; Tim's hometown stories;
            Far east spam
50.5    2 * Web advertising
51.0    1 * Far east spam
51.5    3 * Far east spam; Get rich via python mailinglist; empty message
52.0    0 
52.5    0 
53.0    1 * Hyper porn
            YIKES: prob('subject:porn') = 0.696523 only!

From: HairyKevin <HairyKevin@aol.com>
Return-path: <HairyKevin@aol.com>
To: HairyKevin@aol.com
Subject: hyper porn
Date: Sun, 24 May 1998 15:11:51 EDT
Organization: AOL (http://www.aol.com)
Mime-Version: 1.0
Content-type: text/plain; charset=US-ASCII
Content-transfer-encoding: 7bit

<a href="http://sex4free.dyn.ml.org/index.html">click here</a>

53.5    0 
54.0    0 
54.5    0 
55.0    2 * Web hosting (German); Internet programming offered
55.5    0 
56.0    0 
56.5    0 
57.0    0 
57.5    0 
58.0    2 * Affengeil; Far east spam

AFFENGEIL !!!!
002.45.29.65.83
... Ruf an!

58.5    0 
59.0    1 * Microsoft office training
59.5    0 
60.0    0 
60.5    0 
61.0    0 
61.5    0 
62.0    1 * here is the picture of me that you asked for... 
62.5    1 * "E-bay auction" spam (Congradulations (sic) on your selling)
63.0    1 * Dahanut newsletter
63.5    1 * "My friend is going out with this girl"
64.0    0 
64.5    0 
65.0    0 
65.5    0 
66.0    0 
66.5    1 * Make a million
67.0    0 
67.5    0 
68.0    0 
68.5    0 
69.0    0 
69.5    2 * Both same mailinglist removal confirmation. MISCLASSIFIED.
70.0    0 
70.5    0 
71.0    0 
71.5    0 
72.0    1 * Spanish, HTML only.
72.5    0 
73.0    0 
73.5    0 
74.0    0 
74.5    0 
75.0    1 * Happy birthday via WBW with commercial appendix
75.5    0 
76.0    1 * Christian site in Jerusalem
76.5    1 * "Below is the result of your feedback form."
77.0    0 
77.5    0 
78.0    2 * Diet science (close to the biology I worked in for a while);
            Medical website on-line announcement
78.5    1 * Medical conference announcement
79.0    0 
79.5    1 * "I have attached my web page with new photos!"
80.0    1 *
80.5    0 
81.0    2 *
81.5    4 *
82.0    2 *
82.5    2 *
83.0    0 
83.5    1 *
84.0    1 *
84.5    1 *
85.0    2 *
85.5    1 *
86.0    1 *
86.5    1 *
87.0    0 
87.5    5 *
88.0    0 
88.5    0 
89.0    1 *
89.5    1 *
90.0    1 *
90.5    2 *
91.0    2 *
91.5    2 *
92.0    2 *
92.5    3 *
93.0   36 *
93.5    1 *
94.0    4 *
94.5    4 *
95.0    5 *
95.5    6 *
96.0    9 *
96.5    8 *
97.0    4 *
97.5    7 *
98.0    7 *
98.5   15 *
99.0   26 *
99.5 5378 *************************************************************
-> best cutoff for all runs: 0.87
->     with weighted total 10*12 fp + 71 fn = 191
->     fp rate 0.075%  fn rate 1.27%
->     matched at 0.875 with 12 fp & 71 fn; fp rate 0.075%; fn rate 1.27%
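
The arithmetic behind the lines above can be checked with a short sketch. 
It assumes that a cutoff is chosen by minimizing fp_weight * fp + fn over 
histogram bucket boundaries, which is what the reported "weighted total" 
suggests. The function names and the four-bucket toy histograms are 
hypothetical, and the real test driver may break ties differently.

```python
def cutoff_cost(n_fp, n_fn, fp_weight=10):
    """Weighted total: each false positive counts fp_weight times."""
    return fp_weight * n_fp + n_fn

def best_cutoff(ham_buckets, spam_buckets, fp_weight=10):
    """Scan bucket boundaries and return (cost, cutoff, fp, fn).

    Bucket i covers scores in [i/n, (i+1)/n). With the cutoff at i/n,
    ham in buckets >= i is a false positive and spam in buckets < i is
    a false negative. Ties go to the lowest cutoff in this sketch.
    """
    n = len(ham_buckets)
    fp, fn = sum(ham_buckets), 0
    best = None
    for i in range(n + 1):
        cost = cutoff_cost(fp, fn, fp_weight)
        if best is None or cost < best[0]:
            best = (cost, i / n, fp, fn)
        if i < n:
            # Moving the cutoff past bucket i: its ham is no longer
            # flagged, its spam is now missed.
            fp -= ham_buckets[i]
            fn += spam_buckets[i]
    return best

# Figures reported above: 12 fp out of 16000 ham, 71 fn out of 5600 spam.
total = cutoff_cost(12, 71, fp_weight=10)
fp_rate = 100.0 * 12 / 16000
fn_rate = 100.0 * 71 / 5600
print("weighted total %d, fp rate %.3f%%, fn rate %.2f%%"
      % (total, fp_rate, fn_rate))

# Toy histograms (made up) to show the scan itself.
toy = best_cutoff([90, 8, 1, 1], [1, 2, 7, 90], fp_weight=10)
print("toy best cutoff:", toy)
```

With best_cutoff_fp_weight = 10, one false positive costs as much as ten 
false negatives, which is why the scan settles on such a high cutoff.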

---------------------- multipart/mixed attachment--