[Spambayes] defaults vs. chi-square

Mon, 14 Oct 2002 17:22:45 -0400

[T. Alexander Popiel]
>>> (The false positives I get from it are fairly hopeless cases:
>>> FDIC informing customers that NextBank died, a contractor's bid
>>> containing only an encoded .pdf,

[Tim]
>> That one surprises me:  assuming we threw the body away unlooked-at (we
>> ignore MIME sections that aren't of text/* type), it's hard to get
>> enough other clues to force a spam score so high.  If possible, I'd
>> like to see the list of clues (the "prob('word') = 0.432" thingies in
>> the main output file, assuming you have show_false_positives enabled).

[Alex]
> Data/Ham/Set5/2745
> prob = 0.685540245196

How did this end up getting counted as an FP?  A score of 0.69 was very
solidly in your middle ground.

> prob('*H*') = 0.535842
> prob('*S*') = 0.906922
> prob('content-type:application/pdf') = 0.0918367
> prob('filename:fname piece:pdf') = 0.0918367
> prob('subject:Electrical') = 0.155172
> prob('content-type:text/plain') = 0.389566
> prob('header:Received:5') = 0.389918
> prob('content-type:multipart/mixed') = 0.737422
> prob('content-type:multipart/alternative') = 0.948917
> prob('&nbsp;') = 0.959269
> prob('content-type:text/html') = 0.986282
>
> That's the whole list of probabilities.

Right, that's what I expected:  if we skipped the .pdf attachment, there's
very little left, and it's hard for very little to get a killer-strong spam
score.
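
For anyone following along, here is a minimal sketch of how a clue list like
that turns into *H*, *S* and the final score.  It assumes chi-squared
combining in the Fisher style, with the final score taken as (S - H + 1)/2;
that last relation is at least consistent with the numbers quoted above,
since (0.906922 - 0.535842 + 1)/2 is about 0.68554.  The function names are
illustrative, not necessarily those in the code base.

```python
import math

def chi2Q(x2, v):
    """Chance that a chi-squared variate with v degrees of freedom
    exceeds x2.  Only valid for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    # Guard against roundoff pushing the sum a hair above 1.
    return min(total, 1.0)

def chi_combine(probs):
    """Return (score, H, S) for a list of per-clue spam probabilities."""
    n = len(probs)
    if n == 0:
        return 0.5, 0.0, 0.0  # no clues: no evidence either way
    S = 1.0 - chi2Q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    H = 1.0 - chi2Q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (S - H + 1.0) / 2.0, H, S

# The nine clues listed for Data/Ham/Set5/2745 (excluding *H* and *S*).
clues = [0.0918367, 0.0918367, 0.155172, 0.389566, 0.389918,
         0.737422, 0.948917, 0.959269, 0.986282]
score, H, S = chi_combine(clues)
print("H=%g S=%g score=%g" % (H, S, score))
```

With only nine clues, three of them hammy, neither H nor S can get close
to 0 or 1, so under these assumptions the score has to stay well short of
the extremes.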

> I did fib slightly: in addition to the bid.pdf, there's a
> one-space-character message body represented in both plain text
> and HTML.  Effectively null, but the classifier doesn't see it that
> way.  It's that dual-body that's killing it.

As above, this just doesn't *have* a high spam score.  I think you must have
confused this with some other FP.  The tokenizer should probably get
rid of "&nbsp;" anyway, but that's a different experiment.

>>> The false negatives are a bunch of particularly chatty spams, and
>>> one or two with empty bodies.  Again, fairly hopeless.)

>> Long chatty spam has been pretty reliably scoring near 0.5 for
>> me, which has been a real advantage of chi combining.  So again I'd
>> really like to see the list of clues.

> My error... I was looking at the fn output without paying attention
> to the listed probs.  Since the fn output is based on the single
> cutoff (set at 0.56),

Ah, that would also explain why the 0.69 msg above was mistaken for an FP
rather than a middle-ground msg.
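
To make the difference concrete, here is a small sketch of the two ways of
reading a score.  The 0.56 single cutoff is the one Alex mentions; the
ham_cutoff and spam_cutoff values used below are hypothetical, picked only
to illustrate a middle ground, and are not his actual settings.

```python
def classify_single(score, cutoff):
    """Two-way decision: everything at or above the cutoff counts as spam."""
    return "spam" if score >= cutoff else "ham"

def classify_three_way(score, ham_cutoff, spam_cutoff):
    """Three-way decision with an 'unsure' middle ground between the cutoffs."""
    if score < ham_cutoff:
        return "ham"
    if score >= spam_cutoff:
        return "spam"
    return "unsure"

score = 0.685540245196  # Data/Ham/Set5/2745

# Single cutoff at 0.56: the ham message is reported as a false positive.
print(classify_single(score, 0.56))

# Hypothetical middle-ground cutoffs: the same message is merely unsure.
print(classify_three_way(score, ham_cutoff=0.20, spam_cutoff=0.90))
```

The score is the same in both cases; only the bookkeeping on top of it
changes which output file the message lands in.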

> it was getting some of the chatty stuff.  The real fns are pretty
> short, and generally in odd languages or binary.
>
> This one looks like a worm:
>
> Data/Spam/Set3/32
> prob = 0.000317545970781
> prob('*H*') = 0.999926
> prob('*S*') = 0.000560844
> prob('skip:b 70') = 0.0412844
> prob('skip:a 70') = 0.0505618
> prob('skip:d 70') = 0.0505618
> prob('skip:e 70') = 0.0505618
> prob('email name:debian-java-request') = 0.0547407
> prob('email addr:lists.debian.org') = 0.0594895
> prob('email name:listmaster') = 0.0599834
> prob("control: couldn't decode") = 0.0652174
> prob('from:email addr:t-online.de>') = 0.0652174
> prob('skip:c 70') = 0.0652174
> prob('skip:i 70') = 0.0652174
> prob('skip:y 70') = 0.0652174
> prob('skip:z 70') = 0.0652174

An odd thing is that you must have a lot of 'skip:z 70' (etc) tokens in your
ham too, else these spamprobs wouldn't be so small.  Any idea where they
come from?  It suggests the tokenizer is giving up on something it should
really be picking apart -- but I don't have many of these in my ham, so I'm
at a loss to guess where they come from.
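
For reference, a rough sketch of how a token like 'skip:z 70' can arise.
It assumes the tokenizer passes words of 3 to 12 characters through as-is
and replaces anything longer with a summary of the first character plus the
length rounded down to a multiple of 10.  The 12-character limit and the
function name are assumptions for illustration; the real tokenizer does
more (it treats long things that look like email addresses separately, for
one).

```python
MAX_WORD_SIZE = 12  # assumed limit; longer "words" are summarized

def word_tokens(word, maxword=MAX_WORD_SIZE):
    """Yield the token(s) for one whitespace-delimited word."""
    n = len(word)
    if 3 <= n <= maxword:
        yield word
    elif n > maxword:
        # Record only the first character and the length bucket.
        yield "skip:%c %d" % (word[0], n // 10 * 10)

# A 76-character unbroken run starting with 'z' becomes 'skip:z 70'.
print(list(word_tokens("z" + "A" * 75)))
```

Under that assumption, any unbroken run of 70 to 79 characters produces a
"70" skip token, so lines of undecoded encoded data are one plausible
source.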

> prob('trouble?') = 0.0753369
> prob('skip:" 10') = 0.277389
> prob('skip:a 20') = 0.295202
> prob('content-type:text/plain') = 0.388944
> prob('header:Message-Id:1') = 0.6167
> prob('email') = 0.787497
> prob('x-mailer:microsoft outlook express 5.50.4133.2400') = 0.791262
> prob('message-id:@lists.debian.org') = 0.844828
> prob('skip:5 70') = 0.844828
>
> And again:
>
> Data/Spam/Set3/2472
> prob = 0.0029549796705
> prob('*H*') = 0.999949
> prob('*S*') = 0.00585924
> prob('header:In-Reply-To:1') = 0.000449595
> prob('skip:s 70') = 0.0412844
> prob('skip:d 70') = 0.0505618
> prob('skip:o 70') = 0.0505618
> prob('skip:t 70') = 0.0505618
> prob("control: couldn't decode") = 0.0652174
> prob('skip:c 70') = 0.0652174
> prob('skip:i 70') = 0.0652174
> prob('skip:l 70') = 0.0652174
> prob('skip:z 70') = 0.0652174

As above, you must have an awful lot of low-spamprob skip tokens in your
ham.

> prob('from:email addr:mail.com>') = 0.23545
> prob('charset:us-ascii') = 0.317057
> prob('skip:n 30') = 0.355072
> prob('content-type:text/plain') = 0.388944
> prob('header:Message-Id:1') = 0.6167
> prob('content-disposition:inline') = 0.661659
> prob('content-type:multipart/mixed') = 0.696645
> prob('x-mailer:microsoft outlook, build 10.0.2616') = 0.97619
>
>
> This one actually wasn't too long and chatty, but it seemed
> to hit a bunch of good words, and was half in french:

You must have more French in your ham, then (else the French words wouldn't
have low spamprobs).
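
A minimal sketch of where such spamprobs come from, assuming the usual
Robinson-style adjustment with an unknown-word probability x = 0.5 and
strength s = 0.45.  Those two values are assumptions, but they reproduce
several numbers that recur in these clue lists: 0.155172, 0.0918367 and
0.0652174 are what the formula gives for a token seen in 1, 2 and 3 ham
messages and no spam, and 0.844828 is a token seen in 1 spam and no ham.

```python
def spamprob(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Adjusted spam probability for one token.

    spamcount/hamcount: number of spam/ham messages containing the token.
    nspam/nham: total number of spam/ham messages trained on.
    s, x: assumed strength and probability given to an unknown word.
    """
    n = spamcount + hamcount
    if n == 0:
        return x  # never seen: fall back to the unknown-word probability
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    p = spamratio / (hamratio + spamratio)
    # Shrink the raw estimate toward x; the fewer sightings, the stronger the pull.
    return (s * x + n * p) / (s + n)

# Token seen only in ham, in 1, 2 and 3 messages (corpus sizes are arbitrary here).
for hamcount in (1, 2, 3):
    print(hamcount, round(spamprob(0, hamcount, 1000, 1000), 6))
```

So a French word needs only a handful of ham sightings, and none in spam,
to become a fairly strong ham clue.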

> Data/Spam/Set6/2011
> prob = 0.00173950022128
> prob('*H*') = 0.99774
> prob('*S*') = 0.00121919
> prob('forum') = 0.0121951
> prob('url:be') = 0.0302013
> prob('email name:debian-java-request') = 0.0341451
> prob('email addr:lists.debian.org') = 0.0441114
> prob('email name:listmaster') = 0.044487
> prob('trouble?') = 0.0604856
> prob('des') = 0.0652174
> prob('cross') = 0.117486
> prob('avec') = 0.155172
> prob('est') = 0.155172
> prob('firmwares') = 0.155172
> prob('progress,') = 0.155172
> prob('toute') = 0.155172
> prob('...') = 0.180314
> prob('occasionally') = 0.184814
> prob('still') = 0.237895
> prob('but') = 0.249098
> prob('skip:" 10') = 0.278104
> prob('site') = 0.295343
> prob('already') = 0.301798
> prob('charset:us-ascii') = 0.308681
> prob('after') = 0.341657
> prob('x-mailer:microsoft outlook express 6.00.2600.0000') = 0.347036
> prob('content-type:text/plain') = 0.390599
> prob('header:Reply-To:1') = 0.60073
> prob('from') = 0.604083
> prob('subject:.') = 0.605015
> prob('available') = 0.637633
> prob('header:Mime-Version:1') = 0.646706
> prob('email') = 0.785132
> prob('please') = 0.83219
> prob('subject:skip:W 10') = 0.908163
> prob('url:') = 0.936848
>
> I don't know what happened to the other fn < 0.03.  Close, but not
> quite, is a nigerian spam (!!!):
>
> Data/Spam/Set7/352
> prob = 0.0344593026264
> prob('*H*') = 0.999908
> prob('*S*') = 0.0688269
> prob('indeed') = 0.00556242
> prob('aim') = 0.012894
> prob('(my') = 0.0145631
> prob('manner') = 0.0180723
> prob('wrote') = 0.0211545
> prob('reminder') = 0.0238095
> prob('nigerian') = 0.0266272

You have a lot of ham containing "Nigerian"?  If so, that may be my fault for
talking about my Nigerian-scam FP every chance I get <wink>.

> prob('december') = 0.0266272
> prob('so.') = 0.0281933
> prob('okay') = 0.0302013
> prob('although') = 0.0350768
> prob('numbered') = 0.0412844
> prob('ratio') = 0.0446266
> prob('opposed') = 0.0481336
> prob('apparently,') = 0.0505618
> prob('revert') = 0.0505618
> prob('officer') = 0.0505618
> prob('subsequently') = 0.0505618
> prob('patience') = 0.0505618
> prob('however') = 0.0524146
> prob('overcome') = 0.0599022
> prob('fixed') = 0.0617239
> prob('infer') = 0.0652174
> prob('presumed') = 0.0652174
> prob('filename:fname piece:txt') = 0.0652174
> prob('therefore') = 0.0838752
> prob('attempts') = 0.0874263
> prob('expert,') = 0.0918367
> prob('calendar') = 0.0918367
> prob('travelling') = 0.0918367
> prob('nigeria.') = 0.0918367
> prob('apparently') = 0.0929593
> prob('forwarding') = 0.106987
> prob('saw') = 0.107116
> prob('thus') = 0.110275
> prob('did') = 0.112618
> prob('concern') = 0.114396
> prob('especially') = 0.125537
> prob('finally,') = 0.126719
> prob('shall') = 0.135258
> prob('worked') = 0.138554
> prob('point') = 0.154593
> prob('totaling') = 0.155172
> prob('proposition') = 0.155172
> prob('6th') = 0.155172
> prob('actively') = 0.165428
> prob('since') = 0.166612
> prob('knows') = 0.169148
> prob('which') = 0.172635
> prob('necessary') = 0.182854
> prob('source') = 0.183395
> prob('routine') = 0.189922
> prob('driven') = 0.205305
> prob('got') = 0.206143
> prob('reality') = 0.206601
> prob('light') = 0.207284
> prob('skip:h 20') = 0.211375
> prob('some') = 0.214937
> prob('there') = 0.219934
> prob('same') = 0.227242
> prob('still') = 0.238027
> prob('but') = 0.254404
> prob('according') = 0.254563
> prob('very') = 0.256327
> prob('skip:m 10') = 0.258633
> prob('stand') = 0.260226
> prob('died') = 0.263314
> prob('branch') = 0.263314
> prob('zero') = 0.26593
> prob('number') = 0.267526
> prob('them') = 0.274205
> prob('large') = 0.27431
> prob('his') = 0.276565
> prob('transaction') = 0.281659
> prob('consultant') = 0.283198
> prob('reason') = 0.288324
> prob('dead') = 0.288434
> prob('trace') = 0.29021
> prob('mr.') = 0.292388
> prob('part') = 0.294772
> prob('when') = 0.297739
> prob('ask') = 0.299886
> prob('already') = 0.299963
> prob('listing') = 0.310964
> prob('given') = 0.311411
> prob('down') = 0.311983
> prob('charset:us-ascii') = 0.312457
> prob('being') = 0.312739
> prob('federal') = 0.695627
> prob('president') = 0.697044
> prob('safely') = 0.700267
> prob('notification') = 0.700364
> prob('information') = 0.703131
> prob('skip:r 10') = 0.706302
> prob('inform') = 0.707612
> prob('brought') = 0.70783
> prob('your') = 0.710937
> prob('complete') = 0.711206
> prob('content-type:application/octet-stream') = 0.718341

Eh?  A Nigerian scam with an octet-stream attachment?!  That's unique!

> prob('country.') = 0.718341
> prob('immediately') = 0.727163
> prob('further') = 0.728674
> prob('obtained') = 0.732221
> prob('risk') = 0.747156
> prob('content-type:multipart/mixed') = 0.751609
> prob('contract') = 0.754669
> prob('informed') = 0.75788
> prob('business') = 0.761283
> prob('internet') = 0.768097
> prob('phone') = 0.774467
> prob('questions') = 0.795045
> prob('money,') = 0.796192
> prob('bank') = 0.801151
> prob('succeed') = 0.805677
> prob('settled') = 0.810078
> prob('month') = 0.811997
> prob('claim') = 0.812913
> prob('confidential') = 0.815186
> prob('money.') = 0.8156
> prob('our') = 0.820323
> prob('please') = 0.828641
> prob('months,') = 0.829218
> prob('fund') = 0.83557
> prob('national') = 0.835796
> prob('sent') = 0.837147
> prob('blood') = 0.843797
> prob('asked,') = 0.844828
> prob('treasury') = 0.844828
> prob('address') = 0.860353
> prob('reply') = 0.864689
> prob('achieving') = 0.87037
> prob('money') = 0.878353
> prob('70%') = 0.880818
> prob('million') = 0.885051
> prob('corporation') = 0.891198
> prob('free') = 0.90477
> prob('approval') = 0.904949
> prob('x-mailer:microsoft outlook express 5.00.2919.6900 dm') = 0.908163
> prob('modalities') = 0.908163
> prob('employment') = 0.912574
> prob('claim.') = 0.915225
> prob('skip:y 10') = 0.922406
> prob('deposit') = 0.929253
> prob('wish') = 0.930416
> prob('credit') = 0.941699
> prob('valued') = 0.950726
> prob('guaranteed') = 0.956906
> prob('honored') = 0.958716
> prob('message-id:@ucsu.colorado.edu') = 0.965116
> prob('conservative') = 0.983271
>
> All you folks _talking_ about the nigerian spams has turned them
> into ham for me! ;-)

That could be.  I hardly ever mention modalities here <wink>.