[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Tim Peters tim.one@comcast.net
Sat Nov 16 06:36:04 2002


[Tim]
> ...
> Skip?).  Big caution:  this is a memory hog.  I don't have enough
> RAM to run my full c.l.py test, or even half of it.

So the new patch attached plays hash games to slash the memory use.  Changing
MASK to boost the number of hash codes may help; it's set for 256K max hash
codes as-is.
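
For readers without the patch handy, here is a minimal sketch of the kind of
hash game meant, not the patch itself.  The names bucket() and train() are
made up for illustration, zlib.crc32 is only a stand-in hash function, and the
word list is synthetic.  It shows why the table can never grow beyond MASK+1
entries, and why more distinct words than codes forces conflation.

```python
import zlib

MASK = (1 << 18) - 1   # 18 bits -> 262,144 ("256K") possible hash codes


def bucket(token, mask=MASK):
    """Map a token string to a small int code (hypothetical helper)."""
    return zlib.crc32(token.encode("utf-8")) & mask


def train(counts, tokens, is_spam):
    """Count one message's tokens under their hash codes, not their strings.

    Tokens that share a code within one message are counted once.
    """
    for code in {bucket(tok) for tok in tokens}:
        ham, spam = counts.get(code, (0, 0))
        counts[code] = (ham, spam + 1) if is_spam else (ham + 1, spam)


# Pigeonhole check on a synthetic vocabulary of 330,000 distinct "words"
# (roughly the size quoted for the c.l.py test; the words are fake).
n_words = 330_000
codes = {bucket("word%d" % i) for i in range(n_words)}
forced = n_words - (MASK + 1)      # conflations that cannot be avoided
conflated = n_words - len(codes)   # conflations that actually happened
print("distinct codes used:", len(codes))
print("words conflated: %d (at least %d were forced)" % (conflated, forced))
```

The dict keys are small ints and there can be at most MASK+1 of them, which
is where the memory saving comes from.  The price is that the counts stored
under a code belong to every word that maps to it.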

On my full c.l.py test (which has over 330K distinct words, so squashing
into 256K hash codes necessarily conflates many words):

filename:       cv     tri
ham:spam:  20000:14000
                   20000:14000
fp total:        3       0
fp %:         0.01    0.00
fn total:        0       1
fn %:         0.00    0.01
unsure t:      103     586
unsure %:     0.30    1.72
real cost:  $50.60 $118.20
best cost:  $21.40  $32.40
h mean:       0.24    1.69
h sdev:       2.76    5.70
s mean:      99.93   99.68
s sdev:       1.59    3.37
mean diff:   99.69   97.99
k:           22.92   10.80

The Unsure rate zoomed.  I'm not sure why.  The lowest-scoring spam was
absurd, a giant multi-level marketing spam written in German:

prob = 0.0580526384697
prob('*H*') = 1
prob('*S*') = 0.116105
prob('haben sie schon') = 0.00185261
prob('gegeben finanziell') = 0.00405771
prob('... ich habe') = 0.00413223
prob('die power') = 0.00418173
prob('skip:d 10 wurde mir') = 0.00464396
prob('und adresse die') = 0.00530035
prob('skip:a 10 passierte') = 0.00570342
prob('#6".') = 0.0065312
prob('ein produkt,') = 0.00715421
prob('weiteren schwung') = 0.00764007
prob('sie bei') = 0.00790861
prob('zealand ich') = 0.00872423
prob('beste') = 0.00884086
prob('sich ein fenster') = 0.00920245
prob('100 bestellungen (oder') = 0.00959488

etc.  Of course it's never seen most of those phrases at all in ham, but
hash codes don't know that.

The ham message that quotes a Nigerian-scam spam in full fell from
off-the-charts spam to middling Unsure.  Again hash collisions must account
for it:

Data/Ham/Set5/74506.txt
prob = 0.580354361406
prob('*H*') = 0.839291
prob('*S*') = 1
prob('report the existence') = 0.00238221
...
prob('identified the amount') = 0.00455005
prob('country. please note') = 0.00693374
prob('numbers your reply') = 0.00715421
prob('process. because the') = 0.00959488
...
prob('duties, have') = 0.0328367
prob('foreign partner.') = 0.0503757
prob('solicit your strict') = 0.0724398
prob('which chairman') = 0.0757576
prob('skip:w 10 "abass kabiru"') = 0.0812396
...
prob('25% for') = 0.0907928
prob('subject::  subject: ') = 0.0937339
prob('would use') = 0.102003
prob('for skip:m 10 intend') = 0.103881
prob('complex,') = 0.107769
prob('matter trust') = 0.108386
prob('more details this') = 0.123444
prob('subject: ( subject:)') = 0.127565
prob('present authorities, they') = 0.12886
...
prob('housing federal secretariat') = 0.207914

So like previous gimmicks using hash codes, the mistakes are unfathomable to
human eyes, although you're unlikely to see any unless you've got a lot of
training and testing data (in which case wild mistakes become more certain
the more you've got).  When the hash space is too small (as it surely was in
this test), what *would* have been mild-prob hapaxes get associated with
strong-probability phrases by accident.
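
A small sketch of that accident, under loudly stated assumptions: the
probability formula is a simplified per-clue estimate with a Bayesian
adjustment toward 0.5, the constants UNKNOWN_STRENGTH and UNKNOWN_PROB are
assumed values rather than quotes from the code, zlib.crc32 stands in for
whatever hash the patch uses, MASK is deliberately tiny so a collision is easy
to find, and both phrases are invented.

```python
import itertools
import zlib

UNKNOWN_STRENGTH = 0.45   # assumed weight given to the prior
UNKNOWN_PROB = 0.5        # assumed prior for a never-seen clue
NHAM = NSPAM = 100        # pretend training set sizes
MASK = 0xFF               # deliberately tiny: only 256 hash codes


def bucket(token, mask=MASK):
    """Stand-in hash: CRC-32 of the token, masked down to a small code."""
    return zlib.crc32(token.encode("utf-8")) & mask


def spamprob(ham, spam, nham=NHAM, nspam=NSPAM):
    """Adjusted spam probability of one clue from its message counts."""
    n = ham + spam
    if n == 0:
        return UNKNOWN_PROB
    hamratio, spamratio = ham / nham, spam / nspam
    p = spamratio / (hamratio + spamratio)
    return (UNKNOWN_STRENGTH * UNKNOWN_PROB + n * p) / (UNKNOWN_STRENGTH + n)


def train(table, key, is_spam, times=1):
    ham, spam = table.get(key, (0, 0))
    table[key] = (ham, spam + times) if is_spam else (ham + times, spam)


common = "thanks for the patch"   # invented phrase, seen in 40 ham
code = bucket(common)
# Search for an invented spam phrase that lands on the same hash code.
hapax = next(s for s in ("spam phrase %d" % i for i in itertools.count())
             if bucket(s) == code)

exact, hashed = {}, {}            # keyed by string vs. keyed by hash code
train(exact, common, False, 40)
train(hashed, bucket(common), False, 40)
train(exact, hapax, True)         # the hapax: seen once, in spam
train(hashed, bucket(hapax), True)

exact_p = spamprob(*exact[hapax])
hashed_p = spamprob(*hashed[bucket(hapax)])
print("hapax keyed by string:    %.4f" % exact_p)
print("hapax keyed by hash code: %.4f" % hashed_p)
```

Keyed by its own string, the once-seen spam phrase gets a mildly spammy
probability.  Keyed by hash code, it inherits the 40 ham sightings of the
phrase it collides with and turns into a strong ham clue.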

Aha!  "On average" you can expect those accidents to cancel out, but
chi-combining tends to Unsure in the presence of cancellation.  I bet that
explains the bulk of the Unsure rate boost.  Sometimes the accidents will
pile up in one direction or the other, though, likely accounting for the
examples above (especially the German example, where the hash code of
virtually every phrase is an accident).
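
To see the cancellation effect in isolation, here is a sketch of chi-squared
combining as I understand the scheme: S and H are computed from the same clue
probabilities and the final score is (S - H + 1) / 2.  The function name
chi_combine() and the clue lists are invented for illustration; this is not
the classifier's actual code.

```python
import math


def chi2Q(x2, v):
    """Prob(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)   # guard against roundoff pushing past 1


def chi_combine(probs):
    """Return (S, H, score) for a list of per-clue spam probabilities."""
    n = len(probs)
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return S, H, (S - H + 1.0) / 2.0


# Ten strong ham clues and ten strong spam clues: the accidents cancel.
mixed = [0.01] * 10 + [0.99] * 10
# Twenty strong ham clues: the accidents pile up in one direction.
one_sided = [0.01] * 20

for name, clues in [("mixed", mixed), ("one-sided", one_sided)]:
    S, H, score = chi_combine(clues)
    print("%-10s S=%.6f H=%.6f score=%.6f" % (name, S, H, score))
```

In the mixed case both S and H come out near 1, so the score lands very close
to 0.5, squarely in Unsure territory.  In the one-sided case H is near 1 and
S is near 0, so the score is near 0.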
-------------- next part --------------
A non-text attachment was scrubbed...
Name: tri2.patch
Type: application/octet-stream
Size: 5256 bytes
Desc: not available
Url : http://mail.python.org/pipermail/spambayes/attachments/20021116/dd2c93b0/tri2.exe

