[spambayes-dev] A spectacular false positive

Thu Nov 27 07:01:26 EST 2003

    >> What I'd like to know is which message, if added to my training
    >> database, would have the greatest effect on the scores of the other
    >> unsure messages.  That would help me decide which ones yield the most
    >> benefit.

    Tim> If you can define what "greatest effect on the scores of the other
    Tim> unsure messages" means, exactly, then it should be easy to automate
    Tim> that decision (for each unsure: train on it, score all the other
    Tim> unsures, compute "the effect" on their scores (whatever that means
    Tim> to you), untrain it; then pick the one with the greatest
    Tim> whatever-it-is you measured).

I mean "pushes the remaining unsures the furthest away from their current
scores".  I guess I want to maximize:

    sum(abs(old - new) for old, new in zip(oldprobs, newprobs))
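
Putting that metric together with the train/score/untrain loop Tim describes, a rough sketch might look like the following. Everything here is an assumption for illustration: ToyClassifier is a made-up stand-in (not the SpamBayes classifier), most_influential and total_shift are hypothetical names, and each unsure is assumed to come with the label I would give it if I trained on it.

```python
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a trainable classifier (not the SpamBayes API)."""

    def __init__(self):
        self.spam = Counter()
        self.ham = Counter()

    def learn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).update(set(tokens))

    def unlearn(self, tokens, is_spam):
        (self.spam if is_spam else self.ham).subtract(set(tokens))

    def spamprob(self, tokens):
        # Mean of smoothed per-token spam ratios; 0.5 for an empty message.
        toks = set(tokens)
        if not toks:
            return 0.5
        ratios = [(self.spam[t] + 1) / (self.spam[t] + self.ham[t] + 2)
                  for t in toks]
        return sum(ratios) / len(ratios)


def total_shift(oldprobs, newprobs):
    # The quantity to maximize: total absolute movement of the scores.
    return sum(abs(old - new) for old, new in zip(oldprobs, newprobs))


def most_influential(classifier, unsures):
    """unsures: list of (tokens, is_spam) pairs. Returns (index, shift)."""
    best_index, best_shift = None, -1.0
    for i, (tokens, is_spam) in enumerate(unsures):
        others = [t for j, (t, _) in enumerate(unsures) if j != i]
        oldprobs = [classifier.spamprob(t) for t in others]
        classifier.learn(tokens, is_spam)
        try:
            newprobs = [classifier.spamprob(t) for t in others]
        finally:
            # Always untrain, so the database is left as it was found.
            classifier.unlearn(tokens, is_spam)
        shift = total_shift(oldprobs, newprobs)
        if shift > best_shift:
            best_index, best_shift = i, shift
    return best_index, best_shift


clf = ToyClassifier()
unsures = [
    (["viagra", "cheap"], True),
    (["viagra", "pills"], True),
    (["cheap", "pills", "viagra"], True),
    (["meeting"], False),
]
best_index, best_shift = most_influential(clf, unsures)
```

This costs one train/untrain plus a rescoring of every other unsure for each candidate, so it is quadratic in the number of unsures. That is probably fine for the handful I usually have lying around.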

    Tim> Google on

    Tim>     "active learning" classification

    Tim> to get a warm fuzzy feeling that this may be a fine thing to do
    Tim> <wink>.

Thanks.  When I get a chance, I may.  On the other hand, I may just take
your word for it.

    Tim> I train on "the worst" Unsure first (lowest-scoring spam or
    Tim> highest-scoring ham), then rescore Unsures, and repeat until
    Tim> they're all gone.  A number of Unsures usually get resolved on
    Tim> their own this way, especially near-duplicates of a new spam.

I've been doing this sort of thing, though perhaps not consistently enough.
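
For the record, the worst-first loop Tim describes could be sketched roughly as below. This is only one reading of it: the scorer is a made-up toy (not the SpamBayes classifier), train_worst_first is a hypothetical name, "worst" is taken to mean the score furthest from the correct end, and the 0.4/0.6 cutoffs are picked just so the toy scorer shows a near-duplicate resolving on its own.

```python
from collections import Counter

# Hypothetical toy scorer, for illustration only (not the SpamBayes API).
spam_counts, ham_counts = Counter(), Counter()


def learn(tokens, is_spam):
    (spam_counts if is_spam else ham_counts).update(set(tokens))


def spamprob(tokens):
    # Mean of smoothed per-token spam ratios; 0.5 for an empty message.
    ratios = [(spam_counts[t] + 1) / (spam_counts[t] + ham_counts[t] + 2)
              for t in set(tokens)]
    return sum(ratios) / len(ratios) if ratios else 0.5


def wrongness(item):
    # Lowest-scoring spam and highest-scoring ham both count as "worst".
    tokens, is_spam = item
    score = spamprob(tokens)
    return (1.0 - score) if is_spam else score


def train_worst_first(unsures, ham_cutoff, spam_cutoff):
    """unsures: list of (tokens, is_spam). Returns how many were trained on."""
    pending = list(unsures)
    trained = 0
    while pending:
        worst = max(pending, key=wrongness)
        pending.remove(worst)
        learn(*worst)
        trained += 1
        # Rescore the rest; keep only those still between the cutoffs.
        pending = [m for m in pending
                   if ham_cutoff < spamprob(m[0]) < spam_cutoff]
    return trained


demo = [
    (["cheap", "pills"], True),
    (["cheap", "pills", "now"], True),   # near-duplicate of the first
    (["lunch", "friday"], False),
]
trained = train_worst_first(demo, ham_cutoff=0.4, spam_cutoff=0.6)
```

In this toy run only two of the three messages get trained on; the near-duplicate moves past the spam cutoff once the first spam is trained.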

    Tim> I don't spend any time any more trying to guess whether a message
    Tim> "really is" ham or spam -- if it's not obvious after 5 seconds, I
    Tim> toss it without training on it at all.

Ditto.

Skip