[Spambayes] Mining the headers

Tim Peters tim.one@comcast.net
Mon Oct 28 01:07:01 2002


About:

[Tokenizer]
generate_time_buckets: True
extract_dow: True

Across my c.l.py test (10-fold cv; mixed source; 20,000 c.l.py ham + 14,000
bruceg spam), enabling these options didn't change the FP, FN or unsure
rates, but there's nothing that's ever going to get rid of my 2 remaining
FP and 2 remaining FN.
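For concreteness, here's a sketch of the kind of tokens these options
generate from a Date header.  This is hypothetical: the real spambayes
tokenizer differs in detail, the spelling of the time-bucket token isn't
shown in this message, and whether 'dow:0' means Sunday or Monday isn't
stated -- Python's weekday() convention (Monday == 0) is used below.

```python
# Sketch of day-of-week and 10-minute time-bucket token generation.
# Hypothetical helper; the real spambayes tokenizer differs in detail.
import email.utils
from datetime import datetime

def time_tokens(date_header):
    parsed = email.utils.parsedate_tz(date_header)
    if parsed is None:
        yield 'dow:invalid'          # unparseable Date header
        return
    try:
        dt = datetime(*parsed[:6])
    except ValueError:
        yield 'dow:invalid'          # parsed, but fields out of range
        return
    yield 'dow:%d' % dt.weekday()    # Monday == 0 in Python
    # Illustrative spelling for a 10-minute bucket, matching the hh.m0
    # layout of the table below:
    yield 'time:%02d.%d0' % (dt.hour, dt.minute // 10)

print(list(time_tokens('Mon, 28 Oct 2002 01:07:01 -0500')))
# -> ['dow:0', 'time:01.00']
```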

There's evidence that bruceg got spam more often on weekends than c.l.py got
ham on weekends, mostly because c.l.py traffic drops on weekends rather than
because spam volume rises.  Here they are in decreasing order of spamprob,
but from just 1 of the 10 classifiers built during the test:

token         #h   #s  spamprob
'dow:invalid' 57  426 0.913973614869
'dow:5'     1611 1562 0.580722462655
'dow:6'     1599 1413 0.557982096725
'dow:1'     2982 1701 0.449007117316
'dow:3'     3067 1661 0.436204501487
'dow:0'     2738 1480 0.435736703465
'dow:4'     2860 1535 0.433990360589
'dow:2'     3086 1642 0.431861726944
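The spamprob column can be approximately reproduced from the ham and spam
counts, assuming Robinson's raw ratio-of-ratios formula and the full corpus
sizes.  (Each cross-validation classifier really trains on only 9/10 of the
data, and the probability also gets a small Bayesian adjustment, so the
later decimal places differ.)

```python
# Approximate reconstruction of the spamprob column from the counts,
# using the ratio formula p = (s/S) / (h/H + s/S).  Assumes the full
# corpus sizes; each cv classifier actually trains on 9/10 of the data,
# so the trailing digits differ slightly.
NHAM, NSPAM = 20000, 14000

def spamprob(hamcount, spamcount, nham=NHAM, nspam=NSPAM):
    hamratio = hamcount / nham
    spamratio = spamcount / nspam
    return spamratio / (hamratio + spamratio)

# 'dow:5' appeared in 1611 hams and 1562 spams:
print(round(spamprob(1611, 1562), 3))   # -> 0.581, vs 0.580722... above
```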

NOTE:  Since I'm running with the default robinson_minimum_prob_strength ==
0.1, all words with spamprob between 0.4 and 0.6 are ignored.  Therefore
only the 'dow:invalid' token *could* have had an effect on this test.

Time buckets show higher spamprobs in the hours when most of America is
asleep.  Again this appears to have more to do with the drop in c.l.py
traffic during those hours than with an increase in spam then -- but for
purposes of prediction, any regularity in spam *or* ham is exploitable:

hh:mm   #h   #s  spamprob
 0.00  133   66  0.415
 0.10  108   74  0.495
 0.20  114   81  0.504
 0.30   93   85  0.566
 0.40  103   87  0.547
 0.50  102   93  0.566
 1.00   82   62  0.519
 1.10   85   89  0.599
 1.20   83   70  0.546
-------------------------- above .60 starting roughly here
 1.30   79   89  0.616
 1.40  106   84  0.531
 1.50   74   88  0.629
 2.00   60   65  0.607
 2.10   67   99  0.678
 2.20   60   76  0.644
 2.30   79   89  0.616
 2.40   45   75  0.703
 2.50   81   81  0.588
 3.00   55   67  0.635
 3.10   58   99  0.709
 3.20   52   66  0.644
 3.30   66   81  0.636
 3.40   64   81  0.643
 3.50   62   81  0.651
 4.00   45   68  0.683
 4.10   47   53  0.616
 4.20   45   57  0.643
 4.30   45   85  0.729
 4.40   56   49  0.555
 4.50   46   57  0.638
 5.00   32   83  0.786
 5.10   47   77  0.700
 5.20   42   56  0.655
 5.30   50   49  0.583
 5.40   44   55  0.640
 5.50   48   63  0.652
 6.00   52   76  0.676
 6.10   46   48  0.598
 6.20   42   57  0.659
 6.30   53   59  0.613
 6.40   56   52  0.570
 6.50   41   65  0.693
 7.00   49   56  0.620
-------------------------- and ending roughly here
 7.10   58   53  0.566
 7.20   69   50  0.509
 7.30   75   64  0.549
 7.40   83   65  0.528
 7.50   94   57  0.464
 8.00   97   48  0.414
 8.10  113   69  0.466
 8.20  109   76  0.499
 8.30  141   70  0.415
 8.40  112   50  0.390
 8.50  117   58  0.415
 9.00  120   55  0.396
 9.10  137   57  0.373
 9.20  154   57  0.346
 9.30  171   57  0.323
 9.40  141   55  0.358
 9.50  170   49  0.292
10.00  159   81  0.421
10.10  182   76  0.374
10.20  200   73  0.343
10.30  176   69  0.359
10.40  132   81  0.467
10.50  163   67  0.370
11.00  184   92  0.417
11.10  174   66  0.352
11.20  181   58  0.314
11.30  169   73  0.382
11.40  170   69  0.367
11.50  167   60  0.339
12.00  191   95  0.416
12.10  182   73  0.365
12.20  128   63  0.413
12.30  156   70  0.391
12.40  153   82  0.434
12.50  170  106  0.471
13.00  157   78  0.415
13.10  149   77  0.425
13.20  160   82  0.423
13.30  140   71  0.420
13.40  172   66  0.354
13.50  192   64  0.323
14.00  169   99  0.456
14.10  170   90  0.431
14.20  203   69  0.327
14.30  168   89  0.431
14.40  192   78  0.367
14.50  199   63  0.312
15.00  200   68  0.327
15.10  195   59  0.302
15.20  183   71  0.357
15.30  198   67  0.326
15.40  193   78  0.366
15.50  195   60  0.306
16.00  181   81  0.390
16.10  176   72  0.369
16.20  194   98  0.419
16.30  177   70  0.361
16.40  175   81  0.398
16.50  185   88  0.405
17.00  187   79  0.377
17.10  167   67  0.365
17.20  165   80  0.409
17.30  185   74  0.364
17.40  179   82  0.396
17.50  166   90  0.437
18.00  136   65  0.406
18.10  141   77  0.438
18.20  165   65  0.360
18.30  168   74  0.386
18.40  148   96  0.481
18.50  144   70  0.410
19.00  140   83  0.459
19.10  130   93  0.505
19.20  139   67  0.408
19.30  111   79  0.504
19.40  129   64  0.415
19.50  128   80  0.472
20.00  121   71  0.456
20.10  129   71  0.440
20.20  124   63  0.421
20.30  127   95  0.517
20.40  140   86  0.467
20.50  131   78  0.460
21.00  142   84  0.458
21.10  144   87  0.463
21.20  143   79  0.441
21.30  143   82  0.450
21.40  142   93  0.483
21.50  132   98  0.515
22.00  135   70  0.426
22.10  133   69  0.426
22.20  136   98  0.507
22.30  121   95  0.529
22.40  128   82  0.478
22.50  117   70  0.461
23.00  121   60  0.415
23.10  142   61  0.381
23.20  146   74  0.420
23.30  115   77  0.489
23.40  107   71  0.487
23.50  114   80  0.501

So, overall, in this data there are mild indicators based on the DOW and
time-bucket tokens.  The c.l.py test was already getting all the info it
could use, though, and these clues are too mild to make much of a dent in a
chi-score unless there are very few total clues.
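To see why, here's a sketch of chi-squared combining in the style spambayes
uses (accumulate ln p and ln(1-p), feed each sum through the chi-squared
survival function, average the two directions).  Details of the real
implementation may differ; the point is that adding one clue near 0.58 to a
handful of strong clues barely moves the combined score.

```python
# Sketch of chi-squared combining: a weak clue (spamprob near 0.5)
# barely moves the score when stronger clues are present.
from math import exp, log as ln

def chi2Q(x2, v):
    """Probability that a chi-squared variable with v (even) degrees of
    freedom exceeds x2, via the standard series for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def chi_score(probs):
    """Combine per-word spamprobs into one score in [0, 1]."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(ln(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(ln(p) for p in probs), 2 * n)
    return (S + (1.0 - H)) / 2.0

strong = [0.99, 0.98, 0.97, 0.96, 0.95]   # five strong spam clues
print(chi_score(strong))
print(chi_score(strong + [0.58]))         # the weak clue changes little
```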