[spambayes-dev] Re: Idea to re-energize corpus learning

Martin Stone Davis m0davis at pacbell.net
Mon Nov 17 20:03:12 EST 2003


Tim Peters wrote:

> [Martin Stone Davis]
> 
>>...
>>So why not soften the blow?  That's what my proposal amounts to:
>>achieving some sort of middle ground between the status quo and
>>starting over.  After performing a "Soften training SEVERELY" (where
>>the counts are all set to their square roots), messages would still
>>be classified in more-or-less the same way.
> 
> 
> You can't know that without running serious tests, and it sounds like
> something tests would prove wrong.  SpamBayes effectively computes spamprobs
> from ratios, and sqrt(x)/sqrt(y) = sqrt(x/y):  the effective relative ratios
> would also get "square rooted", and that's likely to cause massive changes
> in scoring.

Yes, the ratios in my system would get pushed closer to 1, which means it 
should act a little more "unsure" about all the words.  I don't see 
anything so terrible about that, but it's something to keep in mind.
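
Here's a rough Python sketch of what I mean (toy counts and a toy
probability, not SpamBayes' actual chi-squared math): square-rooting both
counts square-roots their ratio, which drags the implied spamprob toward
the unsure middle.

import math

def toy_spamprob(spam_count, ham_count):
    # Toy probability from raw counts; the real classifier does far more.
    return float(spam_count) / (spam_count + ham_count)

spam_count, ham_count = 100, 4                 # hypothetical token counts
soft_spam, soft_ham = math.sqrt(spam_count), math.sqrt(ham_count)

print(toy_spamprob(spam_count, ham_count))     # ~0.96
print(toy_spamprob(soft_spam, soft_ham))       # ~0.83 -- closer to 0.5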

> 
> "The usual" way (in many fields) to diminish counts that have grown "too
> large" is to add 1, then shift right by a bit.  The purpose of adding 1
> first is to prevent an original count of 1 from becoming 0.  Other than
> that, it's basically "cut all the counts in half".  Then (x/2)/(y/2) = x/y,
> so that relative ratios aren't affected (much; counts 2*i+1 and 2*i+2, for
> any i >= 0, are both reduced to i+1, so relative ratios can still change
> some, and especially for small i).

This way would be fine too.  As long as the counts are reduced somehow, 
I'd achieve the goal of making further training more effective.  I will 
try it though, so thanks for the tip.
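
For reference, here's a rough sketch of the halving trick as I understand
it (my own toy code, nothing taken from SpamBayes):

def halve_count(n):
    # The "usual" softening: add 1, then shift right one bit.
    # A count of 1 survives as 1; larger counts are roughly halved.
    return (n + 1) >> 1

for spam_count, ham_count in [(100, 4), (7, 3), (1, 1)]:
    halved = (halve_count(spam_count), halve_count(ham_count))
    print((spam_count, ham_count), "->", halved)
# (100, 4) -> (50, 2)   ratio 25.0 preserved exactly
# (7, 3)   -> (4, 2)    ratio drifts from 2.33 to 2.0 (small counts, as noted)
# (1, 1)   -> (1, 1)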

> 
> 
>>However, further training would then be far more effective, since the
>>counts would be lower.
>>
>>Doesn't that sound like a good idea?
> 
> 
> If test results say that it is, yes; otherwise no.  A problem with
> artificially mangling token counts is that you'll probably lose the ability
> to meaningfully untrain a message again (the relationship between token
> counts and total number of ham and spam trained on is destroyed by reducing
> only one of them, but if you reduce the total counts too then you've got
> more messages you *could* untrain on than the (reduced) total count believes
> is possible; untraining anyway will then lead to worsening inaccuracy until
> the reduced total count "goes negative", at which point the code will
> probably blow up, or start to deliver pure nonsense results).

True, but the whole point of my system is that I don't want to have to 
go over previously trained stuff to try to make it work better.  So the 
fact that it's tough to meaningfully untrain messages after softening is 
no problem for me.

(Hmmm, you might still do it: train A, soften, train B, harden, untrain 
A.  That should be kinda meaningful, if a little confusing.  But again, 
it's not a big issue for me.)
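
Just to convince myself I follow the bookkeeping problem, here's a toy
illustration (made-up numbers, nothing like SpamBayes' real storage): if
the stored total is softened but the old messages are still around to
untrain, the total eventually goes negative.

nspam_total = 4             # softened total, e.g. sqrt of 16 real spam trainings
previously_trained = 16     # messages I could still ask to untrain

for _ in range(previously_trained):
    nspam_total -= 1        # naive untraining just decrements the total

print(nspam_total)          # -12: the total has "gone negative", as Tim warns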

> 
> 
>>-Martin
>>
>>P.S. I'm also sure that POPfile learns just as quickly as SpamBayes,
>>since they are based on the same principle.
> 
> 
> Sorry, but unless you've tested this, you have no basis for such a claim.
> May be true, may be false, but "same principle" doesn't determine it a
> priori (overlooking that the ways in which SpamBayes and POPfile determine a
> category actually have very little in common).

True, but I was just expressing my confidence in Skip's assertion to the 
same effect.  I'll be more careful next time.  :P

-Martin

P.S. Someone posted a hack to POPfile which will let me test this idea. 
So that makes one tester...  I'll try both the "square root" method and 
the "cut all the counts in half" method.
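
Roughly, the experiment will apply something like this hypothetical helper
to the stored token counts (the dict layout here is an assumption for the
test, not POPfile's or SpamBayes' actual storage):

import math

def soften(wordinfo, method="halve"):
    # wordinfo maps token -> (spam_count, ham_count); layout assumed for the test.
    out = {}
    for token, (spam_count, ham_count) in wordinfo.items():
        if method == "sqrt":
            out[token] = (int(round(math.sqrt(spam_count))),
                          int(round(math.sqrt(ham_count))))
        else:  # "halve": add 1, shift right one bit
            out[token] = ((spam_count + 1) >> 1, (ham_count + 1) >> 1)
    return out

counts = {"viagra": (120, 2), "meeting": (1, 80)}
print(soften(counts, method="sqrt"))    # {'viagra': (11, 1), 'meeting': (1, 9)}
print(soften(counts, method="halve"))   # {'viagra': (60, 1), 'meeting': (1, 40)}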
