[Spambayes] Seeking a giant idle machine w/ a miserable corpus

Tim Peters tim.one@comcast.net
Mon Nov 18 06:57:12 2002


[T. Alexander Popiel, tries "exact bigrams"]
> I haven't been able to do a big run of this, but here's my
> results:

Thank you!

> filename:      org  orgbix
> ham:spam:  1000:1000
>                    1000:1000
> fp total:        3       2
> fp %:         0.30    0.20
> fn total:       10       7
> fn %:         1.00    0.70
> unsure t:       27      28
> unsure %:     1.35    1.40
> real cost:  $45.40  $32.60
> best cost:  $24.00  $24.20
> h mean:       0.43    0.50
> h sdev:       5.64    5.95
> s mean:      97.94   98.28
> s sdev:      11.59   10.45
> mean diff:   97.51   97.78
> k:            5.66    5.96
>
> This is from a five-fold cross validation run.  Looks very nice.
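
The "real cost" row above can be reproduced from the error counts. The sketch below assumes a charge of $10 per false positive, $1 per false negative and $0.20 per unsure message; those weights are not stated in this post, and the function name is made up for illustration. Amounts are kept in whole cents to avoid floating-point rounding.

```python
# Assumed per-message costs, in cents (not stated in the post itself).
FP_CENTS = 1000      # one false positive
FN_CENTS = 100       # one false negative
UNSURE_CENTS = 20    # one message left in the unsure range

def real_cost(fp, fn, unsure):
    """Return the total cost as a dollar string, e.g. '$45.40'."""
    cents = fp * FP_CENTS + fn * FN_CENTS + unsure * UNSURE_CENTS
    return f"${cents // 100}.{cents % 100:02d}"

print(real_cost(3, 10, 27))   # "org" column:    $45.40
print(real_cost(2, 7, 28))    # "orgbix" column: $32.60
```

Under these assumed weights, most of the drop from $45.40 to $32.60 comes from the one fewer false positive.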

Yet the "best cost" measure increased; add that to the list of mysteries.
I'd be keener about it if it were clearer how to make the time and database
burdens reasonable.  A less anal way of searching for the strongest unigrams
and bigrams would probably take care of time (Gary suggested something
cheaper to begin with, but that could miss some high-strength bigrams in
favor of lower-value unigrams, and I wanted more to see the ultimate
potential here).  The database bloat is jaw-dropping, though, and I'm still
unsure why that is.  Hash codes are right out, IMO -- the goofy mistakes
they lead to are intolerable.
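
As a rough illustration of what a cheaper search could look like, here is a minimal greedy sketch. It is not the code used for the run above and not necessarily what Gary proposed. It assumes each feature's strength is its distance from 0.5, and that a token position may be covered by at most one chosen feature. The function name, the `prob` mapping, and the space-joined bigram key are all hypothetical.

```python
def pick_features(tokens, prob, default=0.5):
    """Greedily pick non-overlapping unigrams and bigrams, strongest first.

    tokens: list of token strings, in message order.
    prob:   dict mapping a feature string to its spam probability
            (hypothetical; bigrams are keyed as "tok1 tok2").
    """
    candidates = []
    for i, tok in enumerate(tokens):
        # Unigram covering position i.
        candidates.append((abs(prob.get(tok, default) - 0.5), i, 1, tok))
        # Bigram covering positions i and i+1.
        if i + 1 < len(tokens):
            bigram = tok + " " + tokens[i + 1]
            candidates.append(
                (abs(prob.get(bigram, default) - 0.5), i, 2, bigram))

    # Strongest first; ties go to the earlier position, then the shorter feature.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    used = set()      # token positions already covered
    chosen = []
    for strength, start, width, feature in candidates:
        span = range(start, start + width)
        if any(pos in used for pos in span):
            continue  # overlaps a stronger feature already chosen
        used.update(span)
        chosen.append(feature)
    return chosen

probs = {"free": 0.7, "money": 0.6, "now": 0.5,
         "free money": 0.99, "money now": 0.8}
print(pick_features(["free", "money", "now"], probs))
# ['free money', 'now']
```

Because candidates are considered strictly in strength order, a strong bigram is never displaced by a weaker unigram. The price is that the chosen set is not guaranteed to be the best overall tiling of the message.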



