[Spambayes] Seeking a giant idle machine w/ a miserable corpus
Tim Peters
tim.one@comcast.net
Mon Nov 18 06:57:12 2002
[T. Alexander Popiel, tries "exact bigrams"]
> I haven't been able to do a big run of this, but here's my
> results:
Thank you!
> filename:         org     orgbix
> ham:spam:   1000:1000  1000:1000
> fp total:           3          2
> fp %:            0.30       0.20
> fn total:          10          7
> fn %:            1.00       0.70
> unsure t:          27         28
> unsure %:        1.35       1.40
> real cost:     $45.40     $32.60
> best cost:     $24.00     $24.20
> h mean:          0.43       0.50
> h sdev:          5.64       5.95
> s mean:         97.94      98.28
> s sdev:         11.59      10.45
> mean diff:      97.51      97.78
> k:               5.66       5.96
>
> This is from a five-fold cross validation run. Looks very nice.
Yet the "best cost" measure increased; add that to the list of mysteries.
I'd be keener about it if it were clearer how to make the time and database
burdens reasonable. A less anal way of searching for the strongest unigrams
and bigrams would probably take care of time (Gary suggested something
cheaper to begin with, but that could miss some high-strength bigrams in
favor of lower-value unigrams, and I wanted more to see the ultimate
potential here). The database bloat is jaw-dropping, though, and I'm still
unsure why that is. Hash codes are right out, IMO -- the goofy mistakes
they lead to are intolerable.
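
For readers who want something concrete, here is a minimal sketch of one way to search for the strongest non-overlapping unigrams and bigrams. It is not the code used in the run above. It assumes "strength" means distance of a feature's spam probability from 0.5, and the names strongest_features, spamprob, and unknown are hypothetical. Every unigram and adjacent bigram is scored, the candidates are sorted strongest-first, and a candidate is kept only if none of its token positions is already covered, so a strong bigram always beats the weaker unigrams it overlaps.

```python
def strongest_features(tokens, spamprob, unknown=0.5):
    """Pick non-overlapping unigrams and bigrams, strongest first.

    tokens   -- list of token strings, in message order
    spamprob -- dict mapping a unigram (str) or bigram (2-tuple of str)
                to a spam probability in [0, 1]  (hypothetical layout)
    unknown  -- probability assumed for features not in spamprob
    """
    candidates = []
    for i, tok in enumerate(tokens):
        p = spamprob.get(tok, unknown)
        candidates.append((abs(p - 0.5), i, 1, tok))
        if i + 1 < len(tokens):
            bigram = (tok, tokens[i + 1])
            p = spamprob.get(bigram, unknown)
            candidates.append((abs(p - 0.5), i, 2, bigram))

    # Strongest first; ties go to the earlier position, then the shorter span.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    covered = [False] * len(tokens)
    chosen = []
    for strength, start, length, feature in candidates:
        span = range(start, start + length)
        if any(covered[j] for j in span):
            continue  # a stronger feature already claimed one of these tokens
        for j in span:
            covered[j] = True
        chosen.append((start, feature))

    # Report the surviving features in message order.
    chosen.sort(key=lambda c: c[0])
    return [feature for _, feature in chosen]


if __name__ == "__main__":
    probs = {
        "click": 0.60,
        "here": 0.55,
        ("click", "here"): 0.99,
        ("here", "now"): 0.70,
    }
    print(strongest_features(["click", "here", "now"], probs))
```

Scoring and sorting every candidate is where the time goes, and storing a probability for every bigram seen in training is one plausible source of the database growth; a cheaper left-to-right pass would save time at the risk described above.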