[Spambayes] Chi True results

Tim Peters <tim.one@comcast.net>
Sat, 12 Oct 2002 16:59:45 -0400


[Brad Clements]
> Oh, I reached nirvana a few weeks ago.

Cool -- I hope to join you there soon <wink>.

> Any of these schemes seems like a big win for me, though I did
> like the central limit schemes well enough.

Because?  That is, what about them was attractive to you, in contrast to the
others?

> That is, the original Graham method didn't have "sure, mostly
> sure" (ham x spam), which I like to have.
>
> I can appreciate Gary's interest in numerical purity, but the
> absolute difference between 1% FN and 2% FN is, in my case, only
> 1 spam message a day.

All of the remaining schemes beyond the current default (the 3 clt schemes,
tim combining, and chi combining) haven't been about numerical purity, but
about refining "the middle ground":  isolating as many of the mistakes as
possible into as small a group of "unsure" msgs as possible, with the least
touchy set of cutoff values we can get away with.  On my test data, chi
combining blows the others out of the water by these measures, and
python.org:

1. Deals with many more msgs than any individual deals with.

and

2. Has a mail admin notorious for whining about currently reviewing a
   measly 20 msgs per day <wink>.

Cutting an error rate in half means half the work, and probably a quarter of
the whining, in that context.
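
To make "cutoff values" concrete, the whole decision boils down to
something like the sketch below.  The names and numbers are made up for
illustration only; the real cutoffs are options you tune against your own
mail mix:

    # Hypothetical cutoffs, for illustration only.
    HAM_CUTOFF = 0.20
    SPAM_CUTOFF = 0.90

    def bucket(score):
        """Map a combined score in [0.0, 1.0] to one of three buckets."""
        if score < HAM_CUTOFF:
            return "ham"
        if score > SPAM_CUTOFF:
            return "spam"
        return "unsure"     # the "middle ground" a human gets to review

The game is making that "unsure" bucket as small as possible while trapping
nearly all of the mistakes inside it, without having to fiddle those two
numbers endlessly.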

> At this point, I'm working to put the rubber on the road and
> tackle deployment issues ..
> Like how could you implement this scheme for 300 users on an IMAP
> server?

There you go:  cut an error rate in half there, and your "1 msg per day"
instantly turns into 300.

> Not with a 20 megabyte pickle per user!

Things to look at:  we shouldn't need an 8-byte timestamp per word; the
killcount may not be useful at all when we stop *comparing* schemes; about
half of all words will be found only once in the whole database (this is an
Invariant Truth across all computer indexing applications -- "hapax
legomena"(*) is what it's called in the literature), so half the words in
your database can be expected to be useless because unique; work needs to be
done on pruning the database over time; and these are all related.
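
For example, a pruning pass could be as simple as the sketch below.  It
assumes the per-user database is just a mapping from word to
(spamcount, hamcount) pairs -- the real per-word records and pickle layout
differ, so take it as the shape of the idea, not the code:

    def prune_hapaxes(wordinfo):
        """Drop hapax legomena:  words whose total training count is 1.
        They're roughly half the database, yet carry almost no evidence."""
        for word, (spamcount, hamcount) in list(wordinfo.items()):
            if spamcount + hamcount <= 1:
                del wordinfo[word]

A real version would also want to leave recently trained words alone, which
is where the per-word timestamp question comes back in.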

Note that incremental adjustments to the clt schemes bristle with problems
the non-clt schemes don't have, due to the third training pass unique to the
clt schemes.

> if tim_combining works "nearly as well" as chi, but takes 1/4 the
> processor time.. I'd probably choose the former.

Processor time won't be a factor here -- tokenization and I/O times dominate
all schemes so far, and the combining method is an expense distinct from
those (note that all the variations discussed here are purely variations in
the combining method:  they all see the same token streams and word counts;
the differences are in how they *use* the evidence).  I barely noticed the
time difference as-is, yet chi combining is invoking log about 50x more
often than necessary now, and computing chi2Q() to about 14 significant
digits is way more precision than necessary too.
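
For the curious, here's a rough sketch of the shape of chi combining.  It
is not the project's code -- for one thing, it calls log once per clue,
which is exactly the waste complained about above -- and it assumes the
per-word probabilities have already been clamped away from 0.0 and 1.0:

    import math

    def chi2Q(x2, v):
        """Return prob(chi-squared with v degrees of freedom >= x2).
        v must be even; the series below is exact in that case."""
        assert v % 2 == 0
        m = x2 / 2.0
        total = term = math.exp(-m)
        for i in range(1, v // 2):
            term *= m / i
            total += term
        # Rounding error can nudge the sum a hair past 1.0.
        return min(total, 1.0)

    def chi_combine(probs):
        """Combine per-word spam probabilities into one score in [0.0, 1.0]:
        0.0 looks like ham, 1.0 looks like spam, near 0.5 means "unsure"."""
        n = len(probs)
        if not n:
            return 0.5
        # If the clues were random noise, -2*sum(ln p) would follow a
        # chi-squared distribution with 2*n degrees of freedom.  A faster
        # version multiplies the probs and calls log once at the end.
        sum_ln_p = sum_ln_1mp = 0.0
        for p in probs:
            sum_ln_p += math.log(p)
            sum_ln_1mp += math.log(1.0 - p)
        # S is near 1.0 when the evidence looks spammy, H near 1.0 when it
        # looks hammy; both high (or both low) lands near 0.5:  "unsure".
        S = 1.0 - chi2Q(-2.0 * sum_ln_1mp, 2 * n)
        H = 1.0 - chi2Q(-2.0 * sum_ln_p, 2 * n)
        return (S - H + 1.0) / 2.0

Since the chi2Q series terminates on its own for even degrees of freedom,
trading some of those 14 digits for speed would mean cutting the loop short
or approximating -- not that it would matter next to tokenization and I/O.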

> Sorry, guess I haven't answered your question.

Indeed not, but you answered other interesting questions I didn't think to
ask <wink>.


(*) For our grammarians, the plural is hapaxes, as in

     31.6% of English hapaxes have corresponding Lithuanian hapaxes.

    and

     Among the evangelists, Luke is the most capable of apparently
     writing “uncharacteristically” since he has the largest vocabulary,
     the greatest number of hapax legomena, and a disturbing habit of
     varying his synonyms.  Paffenroth does not engage, for example,
     with Michael Goulder’s claim that Luke introduces more hapaxes
     into Mark than he takes over.

And you thought we were getting academic *here* <wink>.