[Spambayes] more date field mining
Skip Montanaro
skip@pobox.com
Tue, 1 Oct 2002 01:10:00 -0500
I have now modified the Tokenizer class thus:

    import re
    import time

    class Tokenizer:

        date_hms_re = re.compile(r' (?P<hour>[0-9][0-9]):'
                                 r'(?P<minute>[0-9][0-9]):'
                                 r'(?P<second>[0-9][0-9]) ')

        date_formats = ("%a, %d %b %Y %H:%M:%S (%Z)",
                        "%a, %d %b %Y %H:%M:%S %Z",
                        "%d %b %Y %H:%M:%S (%Z)",
                        "%d %b %Y %H:%M:%S %Z")
        ...

        def tokenize_headers(self, msg):
            # Special tagging of header lines and MIME metadata.
            ...
            if options.mine_date_headers:
                for header in msg.get_all("date", ()):
                    mat = self.date_hms_re.search(header)
                    # return the time in Date: headers arranged in
                    # ten-minute buckets (six per hour)
                    if mat is not None:
                        h = int(mat.group('hour'))
                        bucket = int(mat.group('minute')) // 10
                        yield 'time:%02d:%d' % (h, bucket)
                    # extract the day of the week
                    for fmt in self.date_formats:
                        try:
                            timetuple = time.strptime(header, fmt)
                        except ValueError:
                            pass
                        else:
                            yield 'dow:%d' % timetuple[6]
                            # stop at the first format that parses; without
                            # this break the for/else below always fires
                            break
                    else:
                        # no format matched
                        yield 'dow:invalid'

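
For trying individual headers by hand, here is a minimal standalone sketch of the same token logic. The names DATE_HMS_RE, DATE_FORMATS and date_tokens are made up for this illustration and are not part of the Tokenizer class; the sketch assumes a current Python 3 and an English/C locale, and it stops at the first format that parses.

```python
import re
import time

# Same pattern and formats as in the Tokenizer excerpt above.
DATE_HMS_RE = re.compile(r' (?P<hour>[0-9][0-9]):'
                         r'(?P<minute>[0-9][0-9]):'
                         r'(?P<second>[0-9][0-9]) ')

DATE_FORMATS = ("%a, %d %b %Y %H:%M:%S (%Z)",
                "%a, %d %b %Y %H:%M:%S %Z",
                "%d %b %Y %H:%M:%S (%Z)",
                "%d %b %Y %H:%M:%S %Z")


def date_tokens(header):
    """Yield time-of-day and day-of-week tokens for one Date: header value.

    Hypothetical helper for illustration only.
    """
    mat = DATE_HMS_RE.search(header)
    if mat is not None:
        hour = int(mat.group('hour'))
        bucket = int(mat.group('minute')) // 10   # 0..5, ten minutes each
        yield 'time:%02d:%d' % (hour, bucket)
    for fmt in DATE_FORMATS:
        try:
            timetuple = time.strptime(header, fmt)
        except ValueError:
            continue
        yield 'dow:%d' % timetuple[6]   # tm_wday, Monday == 0
        break
    else:
        # Reached only when no format parsed the header.
        yield 'dow:invalid'


if __name__ == '__main__':
    for value in ("Tue, 1 Oct 2002 01:10:00 GMT",
                  "Tue, 1 Oct 2002 01:10:00 -0500"):
        print(value, '->', list(date_tokens(value)))
```

One thing this makes visible: Python's strptime accepts only a few zone names for %Z (UTC, GMT and the local zone names), so a header ending in a numeric offset such as -0500 matches none of the four formats and falls through to dow:invalid.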
Times and days of the week seem like they should be pretty distinct clues,
so I should probably analyze them separately using two options. Still, here
are my initial results using this coarser-grained scheme, with both kinds of
token enabled by the single option:
cutoffs -> times
-> <stat> tested 200 hams & 200 spams against 1800 hams & 1800 spams
...

false positive percentages
    1.000  1.000  tied
    1.500  1.500  tied
    1.000  1.000  tied
    1.000  1.500  lost   +50.00%
    1.000  1.000  tied
    1.500  1.500  tied
    3.500  3.500  tied
    1.500  1.500  tied
    1.500  1.500  tied
    1.500  2.000  lost   +33.33%

won   0 times
tied  8 times
lost  2 times

total unique fp went from 30 to 32 lost    +6.67%
mean fp % went from 1.5 to 1.6 lost    +6.67%

false negative percentages
    0.500  0.500  tied
    1.500  1.500  tied
    0.500  0.500  tied
    0.500  0.500  tied
    2.000  2.000  tied
    0.000  0.000  tied
    1.000  1.500  lost   +50.00%
    1.000  1.000  tied
    0.000  0.000  tied
    1.500  1.500  tied

won   0 times
tied  9 times
lost  1 times

total unique fn went from 17 to 18 lost    +5.88%
mean fn % went from 0.85 to 0.9 lost    +5.88%

ham mean                       ham sdev
  20.82   21.05   +1.10%        6.43    6.47   +0.62%
  21.86   22.00   +0.64%        6.63    6.61   -0.30%
  21.38   21.56   +0.84%        6.49    6.57   +1.23%
  21.96   22.13   +0.77%        6.26    6.27   +0.16%
  21.51   21.73   +1.02%        6.72    6.73   +0.15%
  21.66   21.88   +1.02%        6.98    7.01   +0.43%
  21.45   21.62   +0.79%        7.66    7.59   -0.91%
  21.74   21.93   +0.87%        6.69    6.67   -0.30%
  21.71   21.88   +0.78%        7.44    7.43   -0.13%
  21.87   22.01   +0.64%        5.93    5.93   +0.00%

ham mean and sdev for all runs
  21.60   21.78   +0.83%        6.75    6.75   +0.00%

spam mean                      spam sdev
  74.10   73.79   -0.42%       12.99   12.71   -2.16%
  72.47   72.11   -0.50%       13.92   13.63   -2.08%
  74.05   73.75   -0.41%       13.00   12.80   -1.54%
  74.00   73.68   -0.43%       12.27   12.03   -1.96%
  72.43   72.06   -0.51%       13.73   13.33   -2.91%
  72.68   72.35   -0.45%       13.27   13.04   -1.73%
  72.57   72.29   -0.39%       13.03   12.84   -1.46%
  71.50   71.26   -0.34%       12.12   11.95   -1.40%
  73.25   72.92   -0.45%       12.67   12.39   -2.21%
  73.02   72.73   -0.40%       12.44   12.24   -1.61%

spam mean and sdev for all runs
  73.01   72.69   -0.44%       12.98   12.73   -1.93%

ham/spam mean difference: 51.41 50.91 -0.50

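
As a reading aid for the output above: the percentage after "lost" is the change relative to the first ("cutoffs") column, and the mean difference is spam mean minus ham mean. The sketch below reproduces that arithmetic. The helpers verdict and mean_gap are hypothetical names written for this note, not code from the comparison script, and the won/lost labelling assumes lower error rates are better.

```python
def verdict(before, after):
    """Describe the change in an error rate, where lower is better.

    Hypothetical helper: returns 'tied', or 'won'/'lost' followed by the
    signed change relative to `before`, as a percentage.
    """
    if after == before:
        return 'tied'
    label = 'won' if after < before else 'lost'
    if before == 0:
        # Relative change is undefined against a zero baseline.
        return label
    pct = (after - before) * 100.0 / before
    return '%s %+.2f%%' % (label, pct)


def mean_gap(ham_mean, spam_mean):
    """Separation between the spam and ham score means, to two decimals."""
    return round(spam_mean - ham_mean, 2)


if __name__ == '__main__':
    print(verdict(30, 32))           # total unique fp
    print(verdict(17, 18))           # total unique fn
    print(mean_gap(21.60, 73.01))    # before
    print(mean_gap(21.78, 72.69))    # after
```

Read this way, the two extra false positives out of 30 give the +6.67%, the one extra false negative out of 17 gives the +5.88%, and the gap between the ham and spam means narrows from 51.41 to 50.91.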
I'll try it with a more fine-grained set of options tomorrow after a little
snooze.
Skip