[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Rob Hooft rob@hooft.net
Sun Nov 17 08:07:59 2002


Tim Peters wrote:
> I ran my fat c.l.py test w/ the hash space clamped at 256K buckets.  That
> was clearly a bad idea for that test, since there are about 330K unique
> unigrams in that corpus (let alone bigrams and trigrams).

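I collected some unigram statistics yesterday: I trained hammie on my 
2x10 sets in the corpus one by one and, after each set of 1600 ham + 580 
spam, ran a program that reports the would-be collisions under Python's 
32-bit hash function. Below is the number of distinct unigrams after each 
set, with the cumulative number of clashes in parentheses: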

Set1 : 109280
Set2 : 183560
Set3 : 227699 (2 clashes)
Set4 : 277253 (3 clashes)
Set5 : 329662 (5 clashes)
Set6 : 362847 (7 clashes)
Set7 : 394585 (12 clashes)
Set8 : 422898 (12 clashes)
Set9 : 448767 (16 clashes)
Set10: 481393 (22 clashes)

clash:  [('1156', 0.027), ('607.80', 0.142)]
clash:  [('url:2516', 0.838), ('>beautiful', 0.142)]
clash:  [("erhc's", 0.964), ('27.7-0.144', 0.142)]
clash:  [('19271', 0.0841), ('richtig', 0.722)]
clash:  [('geleefd.', 0.142), ('20:10:05', 0.142)]
clash:  [('#000000', 0.905), ('from:name:jean richelle', 0.142)]
clash:  [('*lunit,', 0.142), ('.2635', 0.084)]
clash:  [("aminggs'", 0.142), ('m"f\'^', 0.142)]
clash:  [('02-6203-3010', 0.838), ('arona,', 0.838)]
clash:  [('dislin.graf.', 0.142), ('(inquires', 0.905)]
clash:  [('/9?!o_(jz?\\`', 0.142), ('arnhemse', 0.084)]
clash:  [('1075,1079', 0.142), ('from:name:c31', 0.838)]
clash:  [('1096377', 0.142), ('url:baoding', 0.838)]
clash:  [('url:bible', 0.565), ('scis', 0.084)]
clash:  [('334.8', 0.0596), ('\xc0\xd6\xbd\xc0\xb4\xcf\xb4\xd9.*', 0.905)]
clash:  [('d8/apex', 0.142), ('3\xb8\xb89\xc3\xb5\xbf\xf8\xc0\xbb', 0.838)]
clash:  [('subject:!!!                          ', 0.905), ('from:addr:lll2002', 0.838)]
clash:  [('constitutes', 0.848), ('roast)', 0.142)]
clash:  [('>madison,', 0.017), ('subject:dison', 0.059)]
clash:  [('(powerpc)', 0.142), ('url:table', 0.849)]
clash:  [('subject:Complaint', 0.142), ('-24.727', 0.142)]
clash:  [('om=-96.953', 0.142), ('line-with', 0.142)]

The experienced spambayeser can see that I didn't use the standard 
parameters; that is because I was running an optimization with 
simplexloop in the background at the same time.

Here the number of hash collisions is still fairly low, but take away a 
few bits and watch it explode.
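
A rough way to put numbers on that is the usual birthday approximation. 
The sketch below assumes the hash spreads keys uniformly over 2**bits 
values, which is an idealization and not something I measured; the 
function name is made up for illustration. It estimates the number of 
colliding key pairs for the Set10 vocabulary at several hash widths.

```python
def expected_colliding_pairs(n_keys, bits):
    """Birthday approximation: expected number of key pairs sharing a hash.

    Assumes a uniform hash over 2**bits values (an idealization).
    """
    pairs = n_keys * (n_keys - 1) / 2.0
    return pairs / float(2 ** bits)


n_keys = 481393  # distinct unigrams after Set10, from the listing above

# 18 bits corresponds to the 256K buckets mentioned in the quoted message.
for bits in (32, 28, 24, 20, 18):
    estimate = expected_colliding_pairs(n_keys, bits)
    print("%2d bits: ~%.0f colliding pairs expected" % (bits, estimate))
```

At 32 bits the estimate is of the same order as the 22 clashes observed, 
and every bit removed doubles it. Note that this counts pairs, so for 
small hash widths it can exceed the number of keys.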

Another thing I learned from this is that the number of distinct words 
in this test does not grow as the square root of the number of messages; 
it grows faster.
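
As a quick check on that, one can ask which exponent b in 
count ~ messages**b is implied by the first and last rows of the listing. 
This is only a two-point estimate, not a proper fit, and the helper name 
is invented for this sketch.

```python
import math

# Each set adds 1600 ham + 580 spam, and training is cumulative.
MSGS_PER_SET = 1600 + 580

# Distinct-unigram counts after each set, copied from the listing above.
counts = [109280, 183560, 227699, 277253, 329662,
          362847, 394585, 422898, 448767, 481393]


def growth_exponent(count_a, count_b, msgs_a, msgs_b):
    """Exponent b in count ~ msgs**b implied by two (msgs, count) points."""
    return math.log(float(count_b) / count_a) / math.log(float(msgs_b) / msgs_a)


b = growth_exponent(counts[0], counts[-1], 1 * MSGS_PER_SET, 10 * MSGS_PER_SET)
print("implied exponent: %.2f (square-root growth would give 0.50)" % b)
```

The implied exponent comes out between 0.6 and 0.7, so the vocabulary 
grows noticeably faster than the square root over this range.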

Here is clash.py:
-----
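from hammie import DBDict
from Options import options

# Open the trained hammie database read-only.
d = DBDict(options.persistent_storage_file, 'r', ('saved state',))

h = {}   # hash value -> list of (word, spamprob) pairs with that hash

n = 0    # number of words seen
for k in d.iterkeys():
    n += 1
    #print k,type(d[k])
    hs = hash(k)
    entry = (k, d[k].spamprob)
    if hs in h:
        # Another word already has this hash value: report the clash.
        h[hs].append(entry)
        print "clash: ", h[hs]
    else:
        h[hs] = [entry]

print n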
-----
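
For anyone who wants to try the "subtract bits" experiment without a 
hammie database, here is a self-contained sketch of the same idea. It is 
not the script above: it uses zlib.crc32 as a stand-in 32-bit hash 
(the built-in hash() differs between Python versions), a synthetic token 
list instead of a real corpus, and a made-up function name.

```python
import zlib


def find_clashes(tokens, bits=32):
    """Group tokens by a hash truncated to `bits` bits.

    Returns the groups that contain more than one token.
    """
    mask = (1 << bits) - 1
    buckets = {}
    for tok in tokens:
        # zlib.crc32 is a stand-in for hash() (an assumption of this sketch);
        # "& 0xffffffff" keeps the value unsigned on every Python version.
        h = zlib.crc32(tok.encode("utf-8")) & 0xffffffff & mask
        buckets.setdefault(h, []).append(tok)
    return [group for group in buckets.values() if len(group) > 1]


# Synthetic stand-in vocabulary; a real run would use the trained words.
tokens = ["token%d" % i for i in range(20000)]

for bits in (32, 16, 12):
    clashes = find_clashes(tokens, bits)
    print("%2d bits: %d buckets hold more than one token" % (bits, len(clashes)))
```

With 20000 tokens and only 12 bits (4096 buckets), clashes are 
unavoidable, which is the same effect as clamping the hash space below 
the vocabulary size.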

Regards,

Rob

-- 
Rob W.W. Hooft  ||  rob@hooft.net  ||  http://www.hooft.net/people/rob/



