[Spambayes] Frequency distribution for wordinfo counts?
Tim Peters
tim.one at comcast.net
Sun Feb 15 00:12:08 EST 2004
[Brad Clements]
> I'd like to get feedback from folks on the distribution of nham and
> nspam counts in their wordinfo databases.
>
> For example, I used sb_dbexpimp to dump my dbm based storage, then
> loaded it into excel and did a histogram on nham and nspam.
>
> ...
> Anyway, what I'm interested in is the number of tokens whose nspam
> or nham count is greater than 255 vs the total number of tokens and
> ham and spam count.
Here's mine today; I'm using bigrams:
nham 674 nspam 621
ham counts
value #times cumm % cumm%
0 107422 107422 47.24 47.24
1 103431 210853 45.49 92.73
2 8304 219157 3.65 96.38
3 2797 221954 1.23 97.61
4 1508 223462 0.66 98.28
5 852 224314 0.37 98.65
6 513 224827 0.23 98.88
7 359 225186 0.16 99.03
8 259 225445 0.11 99.15
9 203 225648 0.09 99.24
10 192 225840 0.08 99.32
11 117 225957 0.05 99.37
12 139 226096 0.06 99.43
13 90 226186 0.04 99.47
14 90 226276 0.04 99.51
15 91 226367 0.04 99.55
16 77 226444 0.03 99.59
17 66 226510 0.03 99.62
18 63 226573 0.03 99.64
19 52 226625 0.02 99.67
20 30 226655 0.01 99.68
21 34 226689 0.01 99.70
22 39 226728 0.02 99.71
23 31 226759 0.01 99.73
24 30 226789 0.01 99.74
25 25 226814 0.01 99.75
26 16 226830 0.01 99.76
27 20 226850 0.01 99.77
28 20 226870 0.01 99.77
29 22 226892 0.01 99.78
30 18 226910 0.01 99.79
31 14 226924 0.01 99.80
32 16 226940 0.01 99.81
33 16 226956 0.01 99.81
34 13 226969 0.01 99.82
35 12 226981 0.01 99.82
36 9 226990 0.00 99.83
37 9 226999 0.00 99.83
38 9 227008 0.00 99.84
39 12 227020 0.01 99.84
40 6 227026 0.00 99.84
41 9 227035 0.00 99.85
42 7 227042 0.00 99.85
43 2 227044 0.00 99.85
44 5 227049 0.00 99.85
45 4 227053 0.00 99.86
46 8 227061 0.00 99.86
47 5 227066 0.00 99.86
48 8 227074 0.00 99.86
49 5 227079 0.00 99.87
50 4 227083 0.00 99.87
51 5 227088 0.00 99.87
52 9 227097 0.00 99.87
53 6 227103 0.00 99.88
54 5 227108 0.00 99.88
55 8 227116 0.00 99.88
56 4 227120 0.00 99.88
58 1 227121 0.00 99.89
59 5 227126 0.00 99.89
61 4 227130 0.00 99.89
62 2 227132 0.00 99.89
63 5 227137 0.00 99.89
64 3 227140 0.00 99.89
65 2 227142 0.00 99.89
66 1 227143 0.00 99.89
67 1 227144 0.00 99.90
68 3 227147 0.00 99.90
69 2 227149 0.00 99.90
70 3 227152 0.00 99.90
71 3 227155 0.00 99.90
72 2 227157 0.00 99.90
73 3 227160 0.00 99.90
74 4 227164 0.00 99.90
75 1 227165 0.00 99.90
76 3 227168 0.00 99.91
77 3 227171 0.00 99.91
78 2 227173 0.00 99.91
79 1 227174 0.00 99.91
80 3 227177 0.00 99.91
81 2 227179 0.00 99.91
82 4 227183 0.00 99.91
83 5 227188 0.00 99.91
84 2 227190 0.00 99.92
85 4 227194 0.00 99.92
87 2 227196 0.00 99.92
88 4 227200 0.00 99.92
89 2 227202 0.00 99.92
90 3 227205 0.00 99.92
91 3 227208 0.00 99.92
92 2 227210 0.00 99.92
93 1 227211 0.00 99.92
94 2 227213 0.00 99.93
95 3 227216 0.00 99.93
96 1 227217 0.00 99.93
98 1 227218 0.00 99.93
100 8 227226 0.00 99.93
101 2 227228 0.00 99.93
102 1 227229 0.00 99.93
103 3 227232 0.00 99.93
104 1 227233 0.00 99.93
105 3 227236 0.00 99.94
107 1 227237 0.00 99.94
108 2 227239 0.00 99.94
109 1 227240 0.00 99.94
111 1 227241 0.00 99.94
114 2 227243 0.00 99.94
115 5 227248 0.00 99.94
118 2 227250 0.00 99.94
120 3 227253 0.00 99.94
123 2 227255 0.00 99.94
125 1 227256 0.00 99.94
126 1 227257 0.00 99.95
127 1 227258 0.00 99.95
129 1 227259 0.00 99.95
132 2 227261 0.00 99.95
133 1 227262 0.00 99.95
135 1 227263 0.00 99.95
138 1 227264 0.00 99.95
139 1 227265 0.00 99.95
140 1 227266 0.00 99.95
142 3 227269 0.00 99.95
144 1 227270 0.00 99.95
149 4 227274 0.00 99.95
153 1 227275 0.00 99.95
155 1 227276 0.00 99.95
156 2 227278 0.00 99.95
157 1 227279 0.00 99.95
158 2 227281 0.00 99.96
160 1 227282 0.00 99.96
163 2 227284 0.00 99.96
165 1 227285 0.00 99.96
166 1 227286 0.00 99.96
179 1 227287 0.00 99.96
181 1 227288 0.00 99.96
185 2 227290 0.00 99.96
192 2 227292 0.00 99.96
195 3 227295 0.00 99.96
202 3 227298 0.00 99.96
203 1 227299 0.00 99.96
204 1 227300 0.00 99.96
211 1 227301 0.00 99.96
213 1 227302 0.00 99.96
219 1 227303 0.00 99.97
225 1 227304 0.00 99.97
227 1 227305 0.00 99.97
232 1 227306 0.00 99.97
237 1 227307 0.00 99.97
238 1 227308 0.00 99.97
246 1 227309 0.00 99.97
252 1 227310 0.00 99.97
257 1 227311 0.00 99.97
260 1 227312 0.00 99.97
269 1 227313 0.00 99.97
270 2 227315 0.00 99.97
272 1 227316 0.00 99.97
273 1 227317 0.00 99.97
274 1 227318 0.00 99.97
275 2 227320 0.00 99.97
279 1 227321 0.00 99.97
286 1 227322 0.00 99.97
308 1 227323 0.00 99.97
314 1 227324 0.00 99.97
320 2 227326 0.00 99.98
321 1 227327 0.00 99.98
322 1 227328 0.00 99.98
329 1 227329 0.00 99.98
345 1 227330 0.00 99.98
350 1 227331 0.00 99.98
352 1 227332 0.00 99.98
353 1 227333 0.00 99.98
360 4 227337 0.00 99.98
366 1 227338 0.00 99.98
375 2 227340 0.00 99.98
380 1 227341 0.00 99.98
382 1 227342 0.00 99.98
389 1 227343 0.00 99.98
398 1 227344 0.00 99.98
401 7 227351 0.00 99.99
409 1 227352 0.00 99.99
424 1 227353 0.00 99.99
446 9 227362 0.00 99.99
450 2 227364 0.00 99.99
456 1 227365 0.00 99.99
465 1 227366 0.00 99.99
466 1 227367 0.00 99.99
493 2 227369 0.00 99.99
515 1 227370 0.00 99.99
519 1 227371 0.00 100.00
542 1 227372 0.00 100.00
545 1 227373 0.00 100.00
562 1 227374 0.00 100.00
573 1 227375 0.00 100.00
583 1 227376 0.00 100.00
621 1 227377 0.00 100.00
673 2 227379 0.00 100.00
674 3 227382 0.00 100.00
spam counts
value #times cumm % cumm%
0 104332 104332 45.88 45.88
1 108911 213243 47.90 93.78
2 7225 220468 3.18 96.96
3 2368 222836 1.04 98.00
4 1190 224026 0.52 98.52
5 692 224718 0.30 98.83
6 486 225204 0.21 99.04
7 305 225509 0.13 99.18
8 280 225789 0.12 99.30
9 183 225972 0.08 99.38
10 152 226124 0.07 99.45
11 127 226251 0.06 99.50
12 76 226327 0.03 99.54
13 73 226400 0.03 99.57
14 76 226476 0.03 99.60
15 74 226550 0.03 99.63
16 45 226595 0.02 99.65
17 58 226653 0.03 99.68
18 52 226705 0.02 99.70
19 38 226743 0.02 99.72
20 42 226785 0.02 99.74
21 26 226811 0.01 99.75
22 20 226831 0.01 99.76
23 35 226866 0.02 99.77
24 23 226889 0.01 99.78
25 20 226909 0.01 99.79
26 20 226929 0.01 99.80
27 24 226953 0.01 99.81
28 13 226966 0.01 99.82
29 14 226980 0.01 99.82
30 6 226986 0.00 99.83
31 11 226997 0.00 99.83
32 19 227016 0.01 99.84
33 9 227025 0.00 99.84
34 6 227031 0.00 99.85
35 11 227042 0.00 99.85
36 9 227051 0.00 99.85
37 10 227061 0.00 99.86
38 5 227066 0.00 99.86
39 5 227071 0.00 99.86
40 3 227074 0.00 99.86
41 7 227081 0.00 99.87
42 10 227091 0.00 99.87
43 5 227096 0.00 99.87
44 7 227103 0.00 99.88
45 7 227110 0.00 99.88
46 1 227111 0.00 99.88
47 4 227115 0.00 99.88
48 6 227121 0.00 99.89
49 4 227125 0.00 99.89
50 5 227130 0.00 99.89
51 1 227131 0.00 99.89
52 2 227133 0.00 99.89
53 1 227134 0.00 99.89
54 2 227136 0.00 99.89
55 4 227140 0.00 99.89
56 4 227144 0.00 99.90
57 5 227149 0.00 99.90
58 5 227154 0.00 99.90
59 8 227162 0.00 99.90
60 3 227165 0.00 99.90
61 2 227167 0.00 99.91
62 4 227171 0.00 99.91
63 5 227176 0.00 99.91
64 2 227178 0.00 99.91
65 2 227180 0.00 99.91
66 2 227182 0.00 99.91
68 5 227187 0.00 99.91
69 1 227188 0.00 99.91
70 1 227189 0.00 99.92
71 1 227190 0.00 99.92
72 4 227194 0.00 99.92
74 2 227196 0.00 99.92
75 4 227200 0.00 99.92
76 1 227201 0.00 99.92
77 3 227204 0.00 99.92
78 2 227206 0.00 99.92
79 5 227211 0.00 99.92
80 4 227215 0.00 99.93
81 4 227219 0.00 99.93
84 1 227220 0.00 99.93
85 3 227223 0.00 99.93
86 1 227224 0.00 99.93
87 3 227227 0.00 99.93
88 3 227230 0.00 99.93
89 3 227233 0.00 99.93
90 3 227236 0.00 99.94
91 1 227237 0.00 99.94
92 1 227238 0.00 99.94
93 2 227240 0.00 99.94
94 1 227241 0.00 99.94
95 2 227243 0.00 99.94
97 2 227245 0.00 99.94
98 1 227246 0.00 99.94
100 1 227247 0.00 99.94
101 3 227250 0.00 99.94
102 2 227252 0.00 99.94
103 1 227253 0.00 99.94
104 2 227255 0.00 99.94
105 3 227258 0.00 99.95
107 4 227262 0.00 99.95
109 2 227264 0.00 99.95
111 1 227265 0.00 99.95
113 2 227267 0.00 99.95
115 1 227268 0.00 99.95
120 1 227269 0.00 99.95
121 1 227270 0.00 99.95
123 2 227272 0.00 99.95
126 1 227273 0.00 99.95
128 1 227274 0.00 99.95
130 1 227275 0.00 99.95
131 1 227276 0.00 99.95
132 1 227277 0.00 99.95
138 3 227280 0.00 99.96
140 2 227282 0.00 99.96
141 2 227284 0.00 99.96
145 1 227285 0.00 99.96
147 1 227286 0.00 99.96
148 3 227289 0.00 99.96
151 1 227290 0.00 99.96
152 4 227294 0.00 99.96
154 1 227295 0.00 99.96
157 1 227296 0.00 99.96
158 1 227297 0.00 99.96
159 1 227298 0.00 99.96
163 2 227300 0.00 99.96
164 1 227301 0.00 99.96
167 7 227308 0.00 99.97
176 11 227319 0.00 99.97
178 1 227320 0.00 99.97
181 1 227321 0.00 99.97
183 1 227322 0.00 99.97
185 1 227323 0.00 99.97
187 1 227324 0.00 99.97
191 2 227326 0.00 99.98
194 2 227328 0.00 99.98
198 1 227329 0.00 99.98
205 1 227330 0.00 99.98
216 1 227331 0.00 99.98
218 1 227332 0.00 99.98
234 1 227333 0.00 99.98
236 1 227334 0.00 99.98
246 1 227335 0.00 99.98
259 11 227346 0.00 99.98
261 1 227347 0.00 99.98
263 1 227348 0.00 99.99
268 1 227349 0.00 99.99
288 2 227351 0.00 99.99
291 1 227352 0.00 99.99
296 1 227353 0.00 99.99
298 2 227355 0.00 99.99
301 1 227356 0.00 99.99
308 1 227357 0.00 99.99
322 1 227358 0.00 99.99
328 1 227359 0.00 99.99
331 1 227360 0.00 99.99
333 1 227361 0.00 99.99
355 1 227362 0.00 99.99
364 1 227363 0.00 99.99
381 1 227364 0.00 99.99
394 2 227366 0.00 99.99
398 2 227368 0.00 99.99
418 1 227369 0.00 99.99
420 1 227370 0.00 99.99
434 1 227371 0.00 100.00
474 1 227372 0.00 100.00
488 1 227373 0.00 100.00
517 1 227374 0.00 100.00
562 1 227375 0.00 100.00
566 1 227376 0.00 100.00
571 1 227377 0.00 100.00
606 1 227378 0.00 100.00
615 1 227379 0.00 100.00
618 1 227380 0.00 100.00
620 1 227381 0.00 100.00
621 1 227382 0.00 100.00
Database is Berkeley, disk size 20,062,208 bytes. Ironically <wink>, the
plain-text db export file is a third that size.
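To answer the original question directly, the number of tokens with a count above 255 can be read off the cumulative column: take the total token count and subtract the cumulative figure at the last row whose value is 255 or less. A minimal sketch of that arithmetic follows, using only figures from the two tables above; the variable names are illustrative.

```python
# Figures copied from the tables above.
total_tokens = 227382          # final cumulative count in both tables

# Last row with value <= 255 in each table:
ham_cum_le_255 = 227310        # ham counts, row for value 252
spam_cum_le_255 = 227335       # spam counts, row for value 246

# Tokens whose count exceeds 255 are whatever is left over.
ham_above = total_tokens - ham_cum_le_255
spam_above = total_tokens - spam_cum_le_255

print("ham  > 255:", ham_above, "of", total_tokens,
      "(%.4f%%)" % (ham_above * 100.0 / total_tokens))
print("spam > 255:", spam_above, "of", total_tokens,
      "(%.4f%%)" % (spam_above * 100.0 / total_tokens))
```

That comes to 72 tokens for ham and 47 for spam, each roughly 0.02% to 0.03% of the 227,382 tokens.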
[Tony Meyer]
> I'm happy to give you data for my setup, but I'm lazy <wink>. Would
> you mind sending me/the list a copy of the Excel formulae for these?
Here's what I used:
f = file('/code/spambayes/db.txt') # change to your export file
nham, nspam = map(int, f.readline().split(',')[:-1])
print 'nham', nham, 'nspam', nspam

hamcounts = []
spamcounts = []
for line in f:
    h, s = map(int, line.split('`')[1:3])
    hamcounts.append(h)
    spamcounts.append(s)

def hist(tag, data):
    count = {}
    for x in data:
        count[x] = count.get(x, 0) + 1
    totalcount = sum(count.itervalues())
    sofar = 0
    counts = count.items()
    counts.sort()
    print tag, "counts"
    print "value #times cumm % cumm%"
    for value, count in counts:
        sofar += count
        print "%6d %6d %6d %6.2f %6.2f" % (
            value,
            count,
            sofar,
            count * 1e2 / totalcount,
            sofar * 1e2 / totalcount)

hist("ham", hamcounts)
hist("spam", spamcounts)
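If only the "greater than 255" figure is wanted rather than the full histogram, the same export can be scanned with a simple threshold count. The following is a rough sketch, not part of the script above: it is written to run under Python 3, the helper names `parse_export` and `count_above` are made up for illustration, and the export layout (a first line of comma-separated totals, then one backtick-separated line per token with the ham and spam counts in fields 1 and 2) is assumed from how the script above reads the file. The sample lines are invented.

```python
def parse_export(lines):
    """Return (nham, nspam, hamcounts, spamcounts) from export lines.

    Assumed layout, inferred from the script above: the first line
    starts with "nham,nspam,..." and every later line is
    token`hamcount`spamcount`... (backtick-separated).
    """
    lines = iter(lines)
    # Take the first two comma-separated fields as the message totals.
    nham, nspam = map(int, next(lines).split(',')[:2])
    hamcounts, spamcounts = [], []
    for line in lines:
        h, s = map(int, line.split('`')[1:3])
        hamcounts.append(h)
        spamcounts.append(s)
    return nham, nspam, hamcounts, spamcounts


def count_above(data, threshold=255):
    """Number of entries in data strictly greater than threshold."""
    return sum(1 for x in data if x > threshold)


# Invented sample standing in for a real export file.
sample = [
    "700,600,\n",
    "alpha`1`0`\n",
    "beta`300`2`\n",
    "gamma`0`256`\n",
    "delta`255`255`\n",
]

nham, nspam, hamcounts, spamcounts = parse_export(sample)
print("tokens:", len(hamcounts))
print("ham  > 255:", count_above(hamcounts))
print("spam > 255:", count_above(spamcounts))
```

To run it against a real export, pass an open file object instead of `sample`, for example `parse_export(open('db.txt'))` with the path changed to your own export file.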