[Spambayes] Chi-squared perl port problems
Matt Sergeant
msergeant@startechgroup.co.uk
Thu Nov 7 15:03:43 2002
Skip Montanaro said the following on 07/11/02 14:42:
>
> Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
> Matt> some reason it's producing bizarre results.
>
> I think
>
> $S = log($S) + $Sexp + LN2;
> $H = log($H) + $Hexp + LN2;
>
> should be
>
> $S = log($S) + $Sexp * LN2;
> $H = log($H) + $Hexp * LN2;
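
The reason the exponent is multiplied rather than added can be seen in a small Python sketch of the same rescaling trick. This is an illustration only, not the Spambayes source: the function name `log_product` and the `floor` parameter are made up for the example. It assumes every input is strictly greater than 0.

```python
import math

LN2 = math.log(2.0)

def log_product(values, floor=1e-200):
    """Return ln(product(values)) without letting the product underflow.

    Hypothetical helper; assumes every value is > 0.
    """
    mantissa, exponent = 1.0, 0
    for v in values:
        mantissa *= v
        if mantissa < floor:
            # frexp splits x into m * 2**e with 0.5 <= m < 1.
            mantissa, e = math.frexp(mantissa)
            exponent += e
    # product == mantissa * 2**exponent, so
    # ln(product) == ln(mantissa) + exponent * ln(2).
    return math.log(mantissa) + exponent * LN2

probs = [0.01] * 500              # the naive product underflows to 0.0
expected = 500 * math.log(0.01)   # what the log of the product should be
result = log_product(probs)
```

With `+ LN2` in place of `* LN2` the exponent would be counted in the wrong units (powers of two rather than natural-log units), so the result would be wrong whenever any rescaling took place.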
Thanks. That was one difference. However, I still get odd results. Here's
another set of tokens, which scores 1.0 under Graham but about 0.03 with
my chi-squared code:
2161361384acrd-zgwm => 1.00000
FREE?! => 1.00000
Schoolgirl => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/spacer.gif => 1.00000
received-by:mail2.studiocev.com => 1.00000
amateuryouth.com => 1.00000
bgcolor=#525D94 => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_09.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_07.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_05.jpg => 1.00000
received-ip:216.136.138.4 => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_04.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_03.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_01.jpg => 1.00000
skip:21613 19 => 1.00000
from:<sforeman@studiocev.com> => 1.00000
height=188 => 1.00000
href=http://amateuryouth.com/enter.html => 1.00000
href=http://www.studiocev.com/unsubscribe.html => 1.00000
width=345 => 1.00000
width=376 => 1.00000
height=73 => 0.97424
height=141 => 0.96062
size=+1 => 0.95901
height=62 => 0.95276
free!! => 0.91652
content-type:text/html => 0.90486
rowspan=4 => 0.89331
width=298 => 0.88645
size=5 => 0.88634
width=375 => 0.84656
align=center => 0.78637
color=#FFFFFF => 0.77172
remove => 0.76079
width=30 => 0.74293
width=153 => 0.73757
rowspan=2 => 0.73211
height=47 => 0.71467
color=WHITE => 0.70927
border=0 => 0.69548
target=_blank => 0.69225
width=47 => 0.69081
width=1 => 0.67389
cellpadding=0 => 0.65914
colspan=6 => 0.65661
cellspacing=0 => 0.65439
colspan=5 => 0.64894
width=15 => 0.64689
width=130 => 0.64543
height=1 => 0.63505
colspan=2 => 0.61052
from:Foreman" => 0.59412
face=Verdana => 0.58404
sites => 0.57298
enter => 0.52994
colspan=4 => 0.52631
colspan=3 => 0.52302
absolutely => 0.51730
width=77 => 0.49915
here => 0.46375
yourself => 0.43398
offer => 0.33481
face=arial => 0.33373
http-equiv=Content-Type => 0.27761
years => 0.27318
models => 0.25215
least => 0.24571
time => 0.22573
this => 0.21638
within => 0.19113
content=text/html; charset=iso-8859-1 => 0.16949
from => 0.12558
come => 0.10411
listed => 0.07929
please => 0.07193
service => 0.02372
from:"Susan => 0.02048
limited => 0.01139
And here are the S and H values at each stage this time:
S1=1e-10; H1=1.36351472450952e-22
S2=-23.0258509299405; H2=-50.3468063235703
S3=0.00083341211626875; H3=0.941170965180294
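
As a sanity check on the middle stage, the reported numbers can be compared directly. The sketch below assumes that stage 2 is the log step and that no rescaling exponent was accumulated for this message. Under that assumption, stage 2 should simply be the natural log of stage 1.

```python
import math

# Values copied from the stage dump above.
S1, H1 = 1e-10, 1.36351472450952e-22
S2, H2 = -23.0258509299405, -50.3468063235703

# With a zero exponent, ln(stage 1) should reproduce stage 2.
s_gap = abs(math.log(S1) - S2)
h_gap = abs(math.log(H1) - H2)
print(s_gap, h_gap)
```

Both gaps come out negligibly small, which suggests the log step itself is behaving for this message. Any remaining discrepancy would then lie either in how S1 and H1 are accumulated or in the chi-squared step that produces stage 3.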
Every single email I throw at this gives me a high H and a low S. I'm
really not sure what I'm doing wrong here...
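
For comparison, here is a minimal Python sketch of Fisher-style chi-squared combining. It is written from an understanding of the general approach, not copied from the Spambayes source. The names `chi2Q` and `combine` and the `eps` clamp are assumptions of this example. The clamp is there because a probability of exactly 1.0 makes `1 - p` zero, whose log is undefined. The sketch sums logs directly instead of using frexp rescaling.

```python
import math

def chi2Q(x2, v):
    """P(chi-squared with v degrees of freedom >= x2), for even v."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)

def combine(probs, eps=1e-10):
    """Return (S, H, score) for a list of per-token spam probabilities."""
    # Assumption of this sketch: clamp so neither p nor 1 - p is ever 0.
    probs = [min(max(p, eps), 1.0 - eps) for p in probs]
    n = len(probs)
    if n == 0:
        return 0.5, 0.5, 0.5
    ln_S = sum(math.log(1.0 - p) for p in probs)  # S is built from 1 - p
    ln_H = sum(math.log(p) for p in probs)        # H is built from p
    S = 1.0 - chi2Q(-2.0 * ln_S, 2 * n)
    H = 1.0 - chi2Q(-2.0 * ln_H, 2 * n)
    return S, H, (S - H + 1.0) / 2.0

spam_S, spam_H, spam_score = combine([0.99] * 20 + [0.6] * 5)
ham_S, ham_H, ham_score = combine([0.01] * 20)
```

With this arrangement a message dominated by spammy tokens gives S near 1 and H near 0. A consistently high H and low S on spam would be consistent with the two accumulators being swapped, or with the degrees of freedom not matching the number of tokens multiplied in, though those are only possibilities.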
Matt.