[spambayes-dev] Give up on experimental_ham_spam_imbalance_adjustment?

Sat Sep 13 20:24:24 EDT 2003

[Tony Meyer]
> Since around April/May, I've had this option off,

Why did you turn it off?

> and I generally run with an imbalance of roughly 1:10 ham to spam [1] -
> it's 418:4660 at the moment.  I've been happy with the results, both
> in terms of correct classification and training speed.

It's good to know that the classification is OK for you.  The option has no
effect on training speed (the database built is identical regardless of the
option's value -- anyone thinking of switching it off or on should know that
you don't need to retrain -- you can switch as often as you like without
retraining).  Scoring is provably (although perhaps not measurably!) a
little slower with the option True.
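
To make the training/scoring split concrete, here is a minimal sketch of a
scoring-time adjustment of this kind.  It is not the SpamBayes source: the
WordRecord type, the function signature, the 0.45/0.5 constants, and the exact
form of the adjustment (scaling the evidence count by the smaller corpus
ratio) are all assumptions for illustration.  The point is only that the
stored counts are read and never modified, so flipping the flag needs no
retraining.

```python
from dataclasses import dataclass


@dataclass
class WordRecord:
    # Hypothetical per-word record; training only ever updates these counts.
    spamcount: int = 0
    hamcount: int = 0


def probability(record, nham, nspam, adjust_imbalance,
                unknown_word_strength=0.45, unknown_word_prob=0.5):
    """Sketch of a per-word spam probability with an optional
    imbalance adjustment applied at scoring time only.

    All names and default values here are illustrative assumptions.
    """
    if record.hamcount == 0 and record.spamcount == 0:
        # No evidence at all: fall back to the prior.
        return unknown_word_prob

    nham = float(nham or 1)
    nspam = float(nspam or 1)
    hamratio = record.hamcount / nham
    spamratio = record.spamcount / nspam
    prob = spamratio / (hamratio + spamratio)

    if adjust_imbalance:
        # Assumed form: discount evidence from the over-represented class.
        spam2ham = min(nspam / nham, 1.0)
        ham2spam = min(nham / nspam, 1.0)
    else:
        spam2ham = ham2spam = 1.0

    # Effective amount of evidence behind `prob`.
    n = record.hamcount * spam2ham + record.spamcount * ham2spam

    # Blend the raw estimate with the prior, weighted by the evidence.
    s = unknown_word_strength
    return (s * unknown_word_prob + n * prob) / (s + n)
```

Under these assumptions, a word seen only in spam scores closer to 0.5 with
the flag on when spam heavily outnumbers ham (as in the 418:4660 case),
because the adjustment shrinks n.  With balanced corpora both ratios are 1.0
and the flag changes nothing.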

> Do you want people like me, who have it off, to turn it on for a week?

Nope.  Since it's the default in the Outlook client, and I'm suggesting we
change that default (or remove the code entirely), the most interesting
question is whether changing it True -> False hurts anyone.

> (If I do, I'll turn off the mixed uni/bigram scheme for the week,
> too).
>
> I think the option tends to help with little imbalances (up to 1:5,
> for example),

It was tested earlier and the results were mixed.  Unfortunately, that was
around the time I got yanked from the project, and it was left hanging in
that ambiguous state.  We've been lax since then about getting loser code
out of the codebase.

> and then starts to confuse people.

That part is demonstrably true <wink>.

> Unfortunately, in real life this teaches people the wrong thing - they
> train and things improve, so they keep doing it, and then it starts to
> go wrong again.  If this is true (that it's good up to a certain
> imbalance) then the plug-in could be smart enough to disable the option
> if the imbalance reached a certain level - or it could warn the user
> that their training method isn't that good (I know, I should test
> whether I'm right, but I don't have the time at the moment).

Tests were run on imbalance before this option existed, and we already know
imbalance hurts, at least for cross-validation kinds of tests.  But those
differ from real-life training patterns in ways covered last time.  Rob and
I started running tests closer to real-life use (like modeling time-ordered
mistake-based training), and at least I was surprised at how well they
performed.  I didn't run any tests like that with an eye on imbalance,
though.
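
For readers unfamiliar with the term, a time-ordered mistake-based run can be
sketched roughly as below.  This is not the test driver Rob and I used: the
function name, the (timestamp, text, is_spam) message shape, the score/train
callables, and the 0.9/0.2 cutoffs are hypothetical.  It also assumes that an
"unsure" score counts as a mistake worth training on.

```python
def mistake_based_run(messages, score, train,
                      spam_cutoff=0.9, ham_cutoff=0.2):
    """Replay messages in arrival order, training only on mistakes.

    messages: iterable of (timestamp, text, is_spam) tuples (assumed shape).
    score:    callable text -> float in [0, 1], higher means spammier.
    train:    callable (text, is_spam) -> None, updates the classifier.
    Returns (number of mistakes, per-class training counts).
    """
    trained = {"ham": 0, "spam": 0}
    mistakes = 0
    for _, text, is_spam in sorted(messages, key=lambda m: m[0]):
        s = score(text)
        # Correct means confidently on the right side of its cutoff;
        # anything else (wrong or unsure) is treated as a mistake.
        correct = s >= spam_cutoff if is_spam else s <= ham_cutoff
        if not correct:
            mistakes += 1
            train(text, is_spam)
            trained["spam" if is_spam else "ham"] += 1
    return mistakes, trained
```

The returned training counts give the ham:spam imbalance such a regime
produces, which is the quantity a test with an eye on imbalance would need to
track.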

> Or if the plug-in was even smarter ;) then it could auto-manage the
> corpora.  If the user is training on too many spam, start
> automatically training on all messages that are replied to.  If the
> user is training on too many ham, start subscribing her to junk lists
> <wink>.
>
> In the meantime, I think the default could be changed to False.  At
> least the reason for things going 'wrong' is then more obvious to the
> people that have no idea how it works.

If switching True -> False doesn't generate any "whoa! it's killing me!"
reports from people currently using True, I'm more inclined to purge the
code supporting experimental_ham_spam_imbalance_adjustment.  It's limited to
Classifier.probability(), and getting rid of the code would speed all
scoring (albeit perhaps not measurably).