[Spambayes] Chi-squared perl port problems

Matt Sergeant msergeant@startechgroup.co.uk
Thu Nov 7 15:03:43 2002


Skip Montanaro said the following on 07/11/02 14:42:
> 
>     Matt> OK, I've tried to convert your chi-squared stuff to Perl, but for
>     Matt> some reason it's producing bizarre results. 
> 
> I think
> 
>           $S = log($S) + $Sexp + LN2;
>           $H = log($H) + $Hexp + LN2;
> 
> should be
> 
>           $S = log($S) + $Sexp * LN2;
>           $H = log($H) + $Hexp * LN2;
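
The correction above follows from how frexp splits a float: x = m * 2**e, so ln(x) = ln(m) + e * ln(2). A minimal Python sketch of that identity (Python being the language of the code under port; the helper name log_via_frexp is hypothetical and not taken from either code base):

```python
import math

LN2 = math.log(2.0)

def log_via_frexp(x):
    """Natural log of x rebuilt from its mantissa and binary exponent."""
    m, e = math.frexp(x)            # x == m * 2**e, with 0.5 <= m < 1
    return math.log(m) + e * LN2    # multiply the exponent by ln 2

x = 1e-250
m, e = math.frexp(x)
wrong = math.log(m) + e + LN2       # the "+ LN2" form from the first attempt
print(log_via_frexp(x), math.log(x), wrong)
```

The first two printed values should agree to floating-point precision, while the "+ LN2" form is off by a large margin whenever the accumulated exponent is non-zero.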

Thanks. That was one difference. However, I still get odd results. Here's 
another set of tokens, which scores 1.0 under Graham, but 0.03-ish with 
my chi-squared code:

2161361384acrd-zgwm                                => 1.00000
FREE?!                                             => 1.00000
Schoolgirl                                         => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/spacer.gif => 1.00000
received-by:mail2.studiocev.com                    => 1.00000
amateuryouth.com                                   => 1.00000
bgcolor=#525D94                                    => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_09.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_07.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_05.jpg => 1.00000
received-ip:216.136.138.4                          => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_04.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_03.jpg => 1.00000
src=http://www.studiocev.com/stop/images/amateuryouth/images/index_01.jpg => 1.00000
skip:21613 19                                      => 1.00000
from:<sforeman@studiocev.com>                      => 1.00000
height=188                                         => 1.00000
href=http://amateuryouth.com/enter.html            => 1.00000
href=http://www.studiocev.com/unsubscribe.html     => 1.00000
width=345                                          => 1.00000
width=376                                          => 1.00000
height=73                                          => 0.97424
height=141                                         => 0.96062
size=+1                                            => 0.95901
height=62                                          => 0.95276
free!!                                             => 0.91652
content-type:text/html                             => 0.90486
rowspan=4                                          => 0.89331
width=298                                          => 0.88645
size=5                                             => 0.88634
width=375                                          => 0.84656
align=center                                       => 0.78637
color=#FFFFFF                                      => 0.77172
remove                                             => 0.76079
width=30                                           => 0.74293
width=153                                          => 0.73757
rowspan=2                                          => 0.73211
height=47                                          => 0.71467
color=WHITE                                        => 0.70927
border=0                                           => 0.69548
target=_blank                                      => 0.69225
width=47                                           => 0.69081
width=1                                            => 0.67389
cellpadding=0                                      => 0.65914
colspan=6                                          => 0.65661
cellspacing=0                                      => 0.65439
colspan=5                                          => 0.64894
width=15                                           => 0.64689
width=130                                          => 0.64543
height=1                                           => 0.63505
colspan=2                                          => 0.61052
from:Foreman"                                      => 0.59412
face=Verdana                                       => 0.58404
sites                                              => 0.57298
enter                                              => 0.52994
colspan=4                                          => 0.52631
colspan=3                                          => 0.52302
absolutely                                         => 0.51730
width=77                                           => 0.49915
here                                               => 0.46375
yourself                                           => 0.43398
offer                                              => 0.33481
face=arial                                         => 0.33373
http-equiv=Content-Type                            => 0.27761
years                                              => 0.27318
models                                             => 0.25215
least                                              => 0.24571
time                                               => 0.22573
this                                               => 0.21638
within                                             => 0.19113
content=text/html; charset=iso-8859-1              => 0.16949
from                                               => 0.12558
come                                               => 0.10411
listed                                             => 0.07929
please                                             => 0.07193
service                                            => 0.02372
from:"Susan                                        => 0.02048
limited                                            => 0.01139

And here are the S and H values at each stage this time:

S1=1e-10; H1=1.36351472450952e-22
S2=-23.0258509299405; H2=-50.3468063235703
S3=0.00083341211626875; H3=0.941170965180294
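
As a quick consistency check on the numbers above, the sketch below (Python) assumes that stage 2 is the natural log of stage 1 with no frexp rescaling having kicked in, and that the final score is formed as (S - H + 1) / 2. Both are assumptions about the port, which is not shown here.

```python
import math

# Values copied from the three stages reported above.
S1, H1 = 1e-10, 1.36351472450952e-22
S2, H2 = -23.0258509299405, -50.3468063235703
S3, H3 = 0.00083341211626875, 0.941170965180294

# Stage 2 should be ln(stage 1) if the accumulated exponents are zero.
print(math.log(S1) - S2, math.log(H1) - H2)

# Assumed final combination of the two tail values.
score = (S3 - H3 + 1.0) / 2.0
print(score)
```

Under those assumptions the log step reproduces the stage 2 values, and the combined score comes out at about 0.0298, which matches the "0.03-ish" result.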

Every single email I throw at this gives me a high H and a low S. I'm 
really not sure what I'm doing wrong here...
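
For comparison, here is a minimal Python sketch of the general chi-squared combining scheme as discussed in this thread. It is not the Perl port and not a verbatim copy of the Spambayes source; the function names chi2Q and chi2_score are illustrative. It assumes S accumulates the product of (1 - p), H accumulates the product of p, and every token probability lies strictly between 0 and 1 (a probability of exactly 0 or 1 would drive a product to zero and make the log undefined).

```python
import math

LN2 = math.log(2.0)

def chi2Q(x2, v):
    """Upper-tail probability of a chi-squared variable with even v dof."""
    assert v % 2 == 0
    m = x2 / 2.0
    total = term = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i               # builds m**i / i! incrementally
        total += term
    return min(total, 1.0)          # guard against rounding above 1

def chi2_score(probs):
    """Combine per-token spam probabilities (each strictly in (0, 1))."""
    S = H = 1.0
    Sexp = Hexp = 0
    for p in probs:
        S *= 1.0 - p
        H *= p
        if S < 1e-200:              # rescale before the product underflows
            S, e = math.frexp(S)
            Sexp += e
        if H < 1e-200:
            H, e = math.frexp(H)
            Hexp += e
    n = len(probs)
    if n == 0:
        return 0.5
    lnS = math.log(S) + Sexp * LN2
    lnH = math.log(H) + Hexp * LN2
    S = 1.0 - chi2Q(-2.0 * lnS, 2 * n)
    H = 1.0 - chi2Q(-2.0 * lnH, 2 * n)
    return (S - H + 1.0) / 2.0

print(chi2_score([0.99] * 20))      # strongly spammy clues
print(chi2_score([0.01] * 20))      # strongly hammy clues
print(chi2_score([0.5] * 10))       # uninformative clues
```

With these conventions, a message dominated by high-probability tokens should end up with a high S, a low H, and a score near 1, which is the opposite of the pattern reported above.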

Matt.



