[Spambayes] Mining the headers
Tim Peters
tim.one@comcast.net
Mon Oct 28 01:07:01 2002
About:
[Tokenizer]
generate_time_buckets: True
extract_dow: True
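Presumably extract_dow maps a message's Date header to the 'dow:' tokens
discussed below. A minimal sketch of the idea (the function name is mine, and
the Mon=0..Sun=6 numbering is an assumption -- though it's at least consistent
with 'dow:5'/'dow:6' being the weekend days in the table):

```python
import email.utils
import time

def dow_token(date_header):
    """Map an RFC 2822 Date header to a day-of-week token.

    Hypothetical sketch of what extract_dow does; numbering assumed
    to follow time.struct_time (Mon=0 .. Sun=6).
    """
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        return 'dow:invalid'
    try:
        ts = email.utils.mktime_tz(parsed)  # epoch seconds, tz-corrected
    except (ValueError, OverflowError):
        return 'dow:invalid'
    return 'dow:%d' % time.gmtime(ts).tm_wday

dow_token('Mon, 28 Oct 2002 01:07:01 -0500')  # -> 'dow:0'
```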
Across my c.l.py test (10-fold cv; mixed source; 20,000 c.l.py ham + 14,000
bruceg spam), enabling these didn't change the FP, FN, or unsure rates -- but
then there's nothing that's ever going to get rid of my 2 remaining FP and 2
remaining FN.
There's evidence that bruceg got spam more often on weekends than c.l.py got
ham on weekends -- mostly because c.l.py traffic drops on weekends. Here are
the tokens, in (roughly) decreasing order of spamprob, from just 1 of the 10
classifiers built during the test:
token           #ham  #spam  spamprob
'dow:invalid'     57    426  0.913973614869
'dow:5'         1611   1562  0.580722462655
'dow:6'         1599   1413  0.557982096725
'dow:1'         2982   1701  0.449007117316
'dow:0'         2738   1480  0.435736703465
'dow:3'         3067   1661  0.436204501487
'dow:4'         2860   1535  0.433990360589
'dow:2'         3086   1642  0.431861726944
NOTE: Since I'm running with the default robinson_minimum_prob_strength ==
0.1, all words with spamprob between 0.4 and 0.6 are ignored. Therefore
only the 'dow:invalid' token *could* have had an effect on this test.
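Those spamprobs look like Gary Robinson's smoothed word probability. A sketch,
assuming each fold trains on 9/10 of the corpus (18,000 ham, 12,600 spam) and
the usual SpamBayes smoothing defaults s=0.45, x=0.5 (both assumptions),
reproduces the 'dow:invalid' figure and shows why the other dow tokens fall
inside the ignored band:

```python
def spamprob(hamcount, spamcount, nham=18000, nspam=12600, s=0.45, x=0.5):
    # Gary Robinson's f(w): the count ratio, smoothed toward the
    # unknown-word prob x with strength s.  The 18000/12600 training
    # sizes and s/x values are my assumptions, not from the post.
    hamratio = hamcount / float(nham)
    spamratio = spamcount / float(nspam)
    p = spamratio / (hamratio + spamratio)
    n = hamcount + spamcount
    return (s * x + n * p) / (s + n)

def strong_enough(prob, minimum_prob_strength=0.1):
    # Tokens whose spamprob is within minimum_prob_strength of 0.5
    # carry too little evidence and are skipped when scoring.
    return abs(prob - 0.5) >= minimum_prob_strength

print(spamprob(57, 426))     # 'dow:invalid' -> ~0.9140
print(spamprob(1611, 1562))  # 'dow:5'       -> ~0.5807
print(strong_enough(spamprob(57, 426)))     # True: this one counts
print(strong_enough(spamprob(1611, 1562)))  # False: ignored
```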
Time buckets show higher spamprobs during the hours when most of America is
asleep. Again, this appears to have more to do with the drop in c.l.py
traffic at those hours than with an increase in spam -- but for purposes of
prediction, any regularity in spam *or* ham is exploitable:
hh.mm   #ham  #spam  spamprob
0.00 133 66 0.415
0.10 108 74 0.495
0.20 114 81 0.504
0.30 93 85 0.566
0.40 103 87 0.547
0.50 102 93 0.566
1.00 82 62 0.519
1.10 85 89 0.599
1.20 83 70 0.546
-------------------------- above .60 starting roughly here
1.30 79 89 0.616
1.40 106 84 0.531
1.50 74 88 0.629
2.00 60 65 0.607
2.10 67 99 0.678
2.20 60 76 0.644
2.30 79 89 0.616
2.40 45 75 0.703
2.50 81 81 0.588
3.00 55 67 0.635
3.10 58 99 0.709
3.20 52 66 0.644
3.30 66 81 0.636
3.40 64 81 0.643
3.50 62 81 0.651
4.00 45 68 0.683
4.10 47 53 0.616
4.20 45 57 0.643
4.30 45 85 0.729
4.40 56 49 0.555
4.50 46 57 0.638
5.00 32 83 0.786
5.10 47 77 0.700
5.20 42 56 0.655
5.30 50 49 0.583
5.40 44 55 0.640
5.50 48 63 0.652
6.00 52 76 0.676
6.10 46 48 0.598
6.20 42 57 0.659
6.30 53 59 0.613
6.40 56 52 0.570
6.50 41 65 0.693
7.00 49 56 0.620
-------------------------- and ending roughly here
7.10 58 53 0.566
7.20 69 50 0.509
7.30 75 64 0.549
7.40 83 65 0.528
7.50 94 57 0.464
8.00 97 48 0.414
8.10 113 69 0.466
8.20 109 76 0.499
8.30 141 70 0.415
8.40 112 50 0.390
8.50 117 58 0.415
9.00 120 55 0.396
9.10 137 57 0.373
9.20 154 57 0.346
9.30 171 57 0.323
9.40 141 55 0.358
9.50 170 49 0.292
10.00 159 81 0.421
10.10 182 76 0.374
10.20 200 73 0.343
10.30 176 69 0.359
10.40 132 81 0.467
10.50 163 67 0.370
11.00 184 92 0.417
11.10 174 66 0.352
11.20 181 58 0.314
11.30 169 73 0.382
11.40 170 69 0.367
11.50 167 60 0.339
12.00 191 95 0.416
12.10 182 73 0.365
12.20 128 63 0.413
12.30 156 70 0.391
12.40 153 82 0.434
12.50 170 106 0.471
13.00 157 78 0.415
13.10 149 77 0.425
13.20 160 82 0.423
13.30 140 71 0.420
13.40 172 66 0.354
13.50 192 64 0.323
14.00 169 99 0.456
14.10 170 90 0.431
14.20 203 69 0.327
14.30 168 89 0.431
14.40 192 78 0.367
14.50 199 63 0.312
15.00 200 68 0.327
15.10 195 59 0.302
15.20 183 71 0.357
15.30 198 67 0.326
15.40 193 78 0.366
15.50 195 60 0.306
16.00 181 81 0.390
16.10 176 72 0.369
16.20 194 98 0.419
16.30 177 70 0.361
16.40 175 81 0.398
16.50 185 88 0.405
17.00 187 79 0.377
17.10 167 67 0.365
17.20 165 80 0.409
17.30 185 74 0.364
17.40 179 82 0.396
17.50 166 90 0.437
18.00 136 65 0.406
18.10 141 77 0.438
18.20 165 65 0.360
18.30 168 74 0.386
18.40 148 96 0.481
18.50 144 70 0.410
19.00 140 83 0.459
19.10 130 93 0.505
19.20 139 67 0.408
19.30 111 79 0.504
19.40 129 64 0.415
19.50 128 80 0.472
20.00 121 71 0.456
20.10 129 71 0.440
20.20 124 63 0.421
20.30 127 95 0.517
20.40 140 86 0.467
20.50 131 78 0.460
21.00 142 84 0.458
21.10 144 87 0.463
21.20 143 79 0.441
21.30 143 82 0.450
21.40 142 93 0.483
21.50 132 98 0.515
22.00 135 70 0.426
22.10 133 69 0.426
22.20 136 98 0.507
22.30 121 95 0.529
22.40 128 82 0.478
22.50 117 70 0.461
23.00 121 60 0.415
23.10 142 61 0.381
23.20 146 74 0.420
23.30 115 77 0.489
23.40 107 71 0.487
23.50 114 80 0.501
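The 10-minute buckets above could be produced along these lines (a
hypothetical sketch: the token spelling is assumed, and it simply uses the
hour and minute exactly as written in the Date header):

```python
import email.utils

def time_bucket_token(date_header):
    # Bucket the send time into 10-minute slots, matching the hh.m
    # granularity of the table above (hypothetical token format;
    # uses the sender's local time as written in the header).
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        return 'time:invalid'
    hour, minute = parsed[3], parsed[4]
    if not (0 <= hour < 24 and 0 <= minute < 60):
        return 'time:invalid'
    return 'time:%02d:%d' % (hour, minute // 10)

time_bucket_token('Mon, 28 Oct 2002 01:07:01 -0500')  # -> 'time:01:0'
```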
So, overall, this data contains mild indicators in both the day-of-week and
time-bucket tokens. The c.l.py test was already getting all the info it
could use, though, and these clues are too mild to make much of a dent in a
chi-combined score unless there are very few total clues.
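To see why mild clues barely move the score, here's a rough sketch of
chi-squared combining (the idea behind SpamBayes's chi scheme, not the
production code -- the real classifier also guards against underflow in the
probability products). A lone 0.58 clue yields a final score of exactly 0.58,
deep in unsure territory, while a handful of strong clues drives the score to
an extreme:

```python
import math

def chi2Q(x2, v):
    # Survival function of the chi-squared distribution for even
    # degrees of freedom v, via the standard series expansion.
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_score(probs):
    # Test hammy and spammy evidence separately, then fold the two
    # tail probabilities into one score in [0, 1].
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * math.log(math.prod(1.0 - p for p in probs)), 2 * n)
    H = 1.0 - chi2Q(-2.0 * math.log(math.prod(probs)), 2 * n)
    return (S - H + 1.0) / 2.0

print(chi_score([0.58]))       # one mild clue: score stays at 0.58
print(chi_score([0.99] * 10))  # ten strong clues: score near 1.0
```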