[Spambayes] Just for fun

Paul Moore lists@morpheus.demon.co.uk
Mon Nov 18 23:15:44 2002


Tim Peters <tim.one@comcast.net> writes:

>> - I see most internet headers as good spam clues, which is mildly
>> worrying, although hasn't caused any real issues yet.
>
> If your spam comes from the internet, it's appropriate <wink>.

A good chunk of ham comes from the Internet, too, but that chunk isn't
available in my training set. It could be (to an extent) but see below.

>> The obvious implication is that getting a really good training corpus
>> is *hard*. Probably beyond the means of the average user.
>
> The best possible training corpus is the email they actually get, correctly
> classified.  If they know their own judgment about ham vs spam, all the rest
> should happen by magic.  It's still hard for clients to do that, though.

Agreed (on both points - it's the best and it's hard).

In practice, I'm not completely comfortable with the approach of
starting from nothing and training only on new mail [1]. But collecting a
truly representative corpus isn't easy. The overhead of religiously
collecting and manually classifying all mail for a reasonable period
is prohibitive, and any attempt to just grab existing filed mail will
always introduce bias [2].

I'm really just trying to get to grips with what can be done to ease
the "entry cost" of the system.

Paul.

[1] It works (pretty much any training method works remarkably well),
    but as has been reported here before, unsures are surprising. Worse
    than that, in my experience, is that training on an error or unsure
    and then rescoring it can still show it as unsure. This is *very*
    off-putting - you just told the system it is spam, so how come the
    system ignored you? (I know the answer, but it's almost impossible
    to make it feel like reasonable behaviour.)
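
    To make the "train it, rescore it, still unsure" effect concrete,
    here is a toy sketch. It is not the Spambayes code: the class, the
    token names, the corpus sizes and the constants (prior 0.5 with
    strength 0.45, cutoffs 0.20 and 0.90) are all assumptions chosen
    for illustration, and the combining step is a simplified
    chi-squared scheme of the kind discussed on this list.

```python
import math
from collections import Counter

# Assumed constants, for illustration only.
UNKNOWN_PROB = 0.5        # prior spam probability for a token
UNKNOWN_STRENGTH = 0.45   # weight given to that prior
HAM_CUTOFF, SPAM_CUTOFF = 0.20, 0.90


def chi2q(x2, dof):
    """Survival function of chi-squared for an even number of degrees of freedom."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


class ToyClassifier:
    """Hypothetical per-token counter; not the real Spambayes classifier."""

    def __init__(self):
        self.nham = self.nspam = 0
        self.ham = Counter()
        self.spam = Counter()

    def learn(self, tokens, is_spam):
        # Each distinct token is counted once per message.
        if is_spam:
            self.nspam += 1
            self.spam.update(set(tokens))
        else:
            self.nham += 1
            self.ham.update(set(tokens))

    def token_prob(self, token):
        h, s = self.ham[token], self.spam[token]
        n = h + s
        if n == 0:
            return UNKNOWN_PROB
        ham_ratio = h / self.nham if self.nham else 0.0
        spam_ratio = s / self.nspam if self.nspam else 0.0
        p = spam_ratio / (ham_ratio + spam_ratio)
        # Shrink towards the prior; with few sightings the prior dominates.
        return (UNKNOWN_STRENGTH * UNKNOWN_PROB + n * p) / (UNKNOWN_STRENGTH + n)

    def score(self, tokens):
        probs = [self.token_prob(t) for t in set(tokens)]
        if not probs:
            return 0.5
        dof = 2 * len(probs)
        s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), dof)
        h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), dof)
        return (s - h + 1.0) / 2.0


# Made-up corpus: 1000 ham, 200 spam, three words each seen in 40 ham.
clf = ToyClassifier()
clf.nham, clf.nspam = 1000, 200
for word in ("meeting", "python", "thanks"):
    clf.ham[word] = 40

# A spam that reuses those hammy words plus three never-seen ones.
msg = ["meeting", "python", "thanks", "refinance", "viagra", "unsubscribe"]

before = clf.score(msg)        # scored as ham: a false negative
clf.learn(msg, is_spam=True)   # "I told the system it is spam"
after = clf.score(msg)         # higher, yet still short of the spam cutoff
print(before, after)
```

    With these made-up numbers the message starts well inside the ham
    range and, after a single round of training, climbs only into the
    unsure band. One spam sighting cannot outweigh 40 ham sightings of
    the same words, and the newly learned words have been seen just
    once, so the prior still holds them back.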

[2] The main forms of bias I see with my mail are, on the one hand, a
    massive imbalance in numbers, because I keep all sorts of ancient
    junk whereas I (used to) delete spam instantly, and, on the other
    hand, the fact that taking just my inbox excludes almost all ham
    which originated from the Internet (as a simple example). Tomorrow,
    I'm hoping to try your new option to compensate for imbalance.
    Let's hit it with a truly massive ratio and see how it goes!

-- 
This signature intentionally left blank


