[Spambayes] Seeking a giant idle machine w/ a miserable corpus
Rob Hooft
rob@hooft.net
Sun Nov 17 08:07:59 2002
Tim Peters wrote:
> I ran my fat c.l.py test w/ the hash space clamped at 256K buckets. That
> was clearly a bad idea for that test, since there are about 330K unique
> unigrams in that corpus (let alone bigrams and trigrams).
I collected some unigram statistics yesterday. I trained hammie on the
2x10 sets in my corpus one by one and, after each set of 1600 ham + 580
spam, ran a program that reports the would-be collisions under Python's
32-bit hash function. Each line gives the number of distinct unigrams so
far, with the number of clashes in parentheses:
Set1 : 109280
Set2 : 183560
Set3 : 227699 (2 clashes)
Set4 : 277253 (3 clashes)
Set5 : 329662 (5 clashes)
Set6 : 362847 (7 clashes)
Set7 : 394585 (12 clashes)
Set8 : 422898 (12 clashes)
Set9 : 448767 (16 clashes)
Set10: 481393 (22 clashes)
clash: [('1156', 0.027), ('607.80', 0.142)]
clash: [('url:2516', 0.838), ('>beautiful', 0.142)]
clash: [("erhc's", 0.964), ('27.7-0.144', 0.142)]
clash: [('19271', 0.0841), ('richtig', 0.722)]
clash: [('geleefd.', 0.142), ('20:10:05', 0.142)]
clash: [('#000000', 0.905), ('from:name:jean richelle', 0.142)]
clash: [('*lunit,', 0.142), ('.2635', 0.084)]
clash: [("aminggs'", 0.142), ('m"f\'^', 0.142)]
clash: [('02-6203-3010', 0.838), ('arona,', 0.838)]
clash: [('dislin.graf.', 0.142), ('(inquires', 0.905)]
clash: [('/9?!o_(jz?\\`', 0.142), ('arnhemse', 0.084)]
clash: [('1075,1079', 0.142), ('from:name:c31', 0.838)]
clash: [('1096377', 0.142), ('url:baoding', 0.838)]
clash: [('url:bible', 0.565), ('scis', 0.084)]
clash: [('334.8', 0.0596), ('\xc0\xd6\xbd\xc0\xb4\xcf\xb4\xd9.*', 0.905)]
clash: [('d8/apex', 0.142), ('3\xb8\xb89\xc3\xb5\xbf\xf8\xc0\xbb', 0.838)]
clash: [('subject:!!! ', 0.905), ('from:addr:lll2002', 0.838)]
clash: [('constitutes', 0.848), ('roast)', 0.142)]
clash: [('>madison,', 0.017), ('subject:dison', 0.059)]
clash: [('(powerpc)', 0.142), ('url:table', 0.849)]
clash: [('subject:Complaint', 0.142), ('-24.727', 0.142)]
clash: [('om=-96.953', 0.142), ('line-with', 0.142)]
The experienced spambayeser can see that I didn't use the standard
parameters. That is because I was running an optimization with
simplexloop in the background at the same time.
Here the number of hash collisions is still fairly low, but take bits
away from the hash and watch it explode.
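
To get a feel for how fast that happens, here is a rough sketch using the
birthday approximation. It assumes the hash spreads keys uniformly, which
a real string hash only approximates, and the function name is made up for
this illustration. It counts expected colliding pairs, which is close to
but not the same as the clash lines printed above. For the Set10 count
and 32 bits it gives about 27 pairs, the same order as the 22 clashes
observed.

```python
def expected_colliding_pairs(n_keys, hash_bits):
    """Birthday approximation: n*(n-1)/2 key pairs, each of which
    collides with probability 2**-hash_bits under a uniform hash."""
    return n_keys * (n_keys - 1) / 2 / 2 ** hash_bits

n = 481393  # distinct unigrams after Set10, taken from the listing above

# Shrink the hash a few bits at a time; 18 bits corresponds to 256K buckets.
for bits in (32, 28, 24, 20, 18):
    pairs = expected_colliding_pairs(n, bits)
    print("%2d bits: about %.0f colliding pairs" % (bits, pairs))
```

Every bit removed doubles the expected number of colliding pairs, so at
256K buckets the estimate is already in the hundreds of thousands.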
Another thing I learned from this is that the number of distinct words
in this test does not grow as the square root of the number of
messages.
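
The listing above is enough to check this. Going from one set to ten
multiplies the word count by about 4.4, where square-root growth would
give about 3.2. The sketch below fits a power law to all ten points; it
assumes the message count is proportional to the number of sets trained
(each set has the same size), and the helper name is just for
illustration.

```python
import math

# Distinct-unigram counts after each training set, copied from the listing.
counts = [109280, 183560, 227699, 277253, 329662,
          362847, 394585, 422898, 448767, 481393]

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x), i.e. the exponent p
    in a fitted power law y ~ x**p."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

sets_trained = list(range(1, len(counts) + 1))
exponent = loglog_slope(sets_trained, counts)
print("fitted exponent: %.2f (square-root growth would be 0.50)" % exponent)
```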
Here is clash.py:
-----
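from hammie import DBDict
from Options import options

# Open the trained hammie database read-only.
d = DBDict(options.persistent_storage_file, 'r', ('saved state',))
h = {}   # hash value -> list of (word, spamprob) pairs with that hash
n = 0    # number of words seen
for k in d.iterkeys():
    n += 1
    #print k,type(d[k])
    hs = hash(k)
    if h.has_key(hs):
        # Another word already produced this hash value: report the clash.
        h[hs].append((k, d[k].spamprob))
        print "clash: ", h[hs]
    else:
        h[hs] = [(k, d[k].spamprob)]
print n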
-----
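
For anyone who wants to try the "subtract bits" experiment without a
hammie database, here is a standalone sketch of the same idea in current
Python. It is not the script above: the function name and the stand-in
token list are invented for the example, and modern Python string hashes
are wider and randomized per process, so the counts will differ from run
to run and will not reproduce the numbers in this post.

```python
from collections import defaultdict

def find_clashes(keys, bits=32):
    """Group keys by their hash truncated to `bits` bits and return
    only the groups that contain more than one key."""
    mask = (1 << bits) - 1
    buckets = defaultdict(list)
    for key in keys:
        # Masking also maps negative hash values to non-negative buckets.
        buckets[hash(key) & mask].append(key)
    return [group for group in buckets.values() if len(group) > 1]

# Hypothetical stand-in for the unigram keys of a trained database.
tokens = ["token%d" % i for i in range(20000)]

for bits in (32, 16, 12):
    groups = find_clashes(tokens, bits)
    # Count every key that landed in a bucket some earlier key already used.
    extra = sum(len(g) - 1 for g in groups)
    print("%2d bits: %d keys clash with an earlier key" % (bits, extra))
```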
Regards,
Rob
--
Rob W.W. Hooft || rob@hooft.net || http://www.hooft.net/people/rob/